To understand the evolutionary imperative for a fluid intelligent cognitive system, it is necessary to examine how artificial neural networks (ANNs) function today. Broadly defined, ANNs are models of sets of neurons used to estimate or approximate functions that may depend on a large number of inputs and are generally unknown.

This approach has so far resulted in a standard ANN design that persists as a two-dimensional model, and this fundamental structure is shared by all variants of the neural network family, including deep learning and convolutional models.

This approach is fundamentally restrictive in the sense that all learned attributes lie on the same plane: every regressive learned attribute, when compared mathematically, persists as a function of a single dimensionality. The system is therefore limited to a single type of learned regression, with strong biases against learning new regressions.

The capacity for fluid intelligent intuition in humans allows us to compartmentalize these discrete learned attributes and fluidly find relations between them. This capacity is especially critical for extracting unsupervised intelligence from polymorphic, unstructured data. Simply put, if we as humans learned with the same characteristics as an existing ANN model, the result would be an intrinsically stovepiped way of learning. Humans, however, have a far more sophisticated fluid intelligent capacity. This project is an attempt at creating a fundamentally new way of designing cognitive systems: one that attempts to mimic and enhance human learning patterns.

Idea Disparity

The process of node generation from unstructured data requires a foundation for finding statistical distributions of words over a set A consisting of the aggregated documents. The dynamic set A is a finite, elastic set of documents that serves as the first layer of temporal memory, without any sub-categorizations.
Using a hybrid version of the collapsed Gibbs sampler, we are able to integrate out a set of variables to which we can assign distributions of words. Hierarchical Bayesian models yield multimodal distributions of words.
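As a concrete illustration, the sketch below shows a plain collapsed Gibbs sampler for an LDA-style model over the documents in A. The hybrid modifications described in this post are not specified in enough detail to reproduce, so this is intuition-level code only; the hyperparameters (alpha, beta, the number of node types K) and function names are placeholders.

```python
# Minimal collapsed Gibbs sampler for an LDA-style model (illustrative only).
# docs: list of documents, each a list of integer word ids; V: vocabulary size;
# K: number of node types (assumed).
import numpy as np

def collapsed_gibbs(docs, V, K, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # topic assignments
    ndk = np.zeros((len(docs), K))                          # doc-topic counts
    nkw = np.zeros((K, V))                                  # topic-word counts
    nk = np.zeros(K)                                        # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]; ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1  # remove this token
                # Collapsed conditional p(z = k | everything else), up to a constant.
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + beta * V)
                k = rng.choice(K, p=p / p.sum())            # resample assignment
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1  # add it back
    # Posterior-mean word distribution per node type.
    phi = (nkw + beta) / (nk[:, None] + beta * V)
    return phi, z
```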

[Figure: hybrid LDA classifier deriving the probability of a statistical element within a subset Z of the document set A]

This bag-of-words approach allows us to view the words of each subset distribution as statistical members of a larger set rather than lexical members of a semantic set. The equivalence is set up as x ~ y between a permutation of possible node types. We begin by tokenizing the documents within A as inputs to our Bayesian Gibbs sampler. As an initial dimension to work from, the derived distributions function similarly to those generated by Latent Dirichlet allocation (LDA). We adapt an LDA model previously used to find topic distributions in social media data; in essence, this approach is a hybrid of the LDA classifier method. Instead of topic distributions, we are able to find the probability of each word given each node type. The sampler finds these conditional probabilities using the bag-of-words approach, in which each word is viewed as a statistical element within a vocabulary rather than a coherent part of a larger context.

In the figure above, we demonstrate the hybrid Latent Dirichlet allocation classifier as it finds the probability of a statistical element within a subset, Z, of the population set of documents, A.

[Figure: conditional probabilities derived by the sampler]
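For readers who want to reproduce the general effect with off-the-shelf tools, the following sketch stands in for the hybrid classifier: it tokenizes the documents of A into a bag-of-words matrix, fits a standard scikit-learn LDA model, and normalizes the component counts into P(word | node type). The number of node types, the preprocessing choices, and the function name are assumptions, not the post's actual pipeline.

```python
# Stand-in for the hybrid classifier using stock scikit-learn LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def word_given_node_type(documents, n_node_types=20, seed=0):
    # Tokenize the documents of A into a bag-of-words count matrix.
    vectorizer = CountVectorizer(lowercase=True, stop_words="english")
    counts = vectorizer.fit_transform(documents)
    # Fit a standard LDA model; each component plays the role of a node type.
    lda = LatentDirichletAllocation(n_components=n_node_types,
                                    learning_method="batch",
                                    random_state=seed).fit(counts)
    # components_ holds unnormalized word pseudo-counts per component;
    # normalizing each row yields P(word | node type).
    p_w_given_t = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    vocab = vectorizer.get_feature_names_out()
    return vocab, p_w_given_t
```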

Each significant subset, Z, of our document collection, A, now becomes a candidate node within our graph.
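A hedged sketch of this promotion step is given below. The significance test (a simple probability-mass threshold), the use of networkx, and the data structure holding subset scores are all assumptions, since the post does not specify how candidacy is decided.

```python
# Promote statistically significant subsets of A to candidate graph nodes.
import networkx as nx

def promote_subsets(subset_scores, threshold=0.05):
    """subset_scores: dict mapping subset id -> probability mass under its node type (assumed)."""
    graph = nx.Graph()
    for subset_id, mass in subset_scores.items():
        if mass >= threshold:                  # significant subsets become candidate nodes
            graph.add_node(subset_id, mass=mass)
    return graph
```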

Unsupervised Multinetwork

The topic distributions of the current snapshot of nodes (of intermixed types) are then forwarded to an unsupervised neural network with between 10 and 20 hidden layers. A flexible, preconditioned version of conjugate-gradient back-propagation is used:

[Figure: preconditioned conjugate-gradient update for alpha]
Alpha is the next optimal location vector relative to its position in the gradient of the linearized distribution sets; the trained value is a set of magnitude vectors determining the distance of each distribution from the others in the subset. The hybrid gradient-descent algorithm helps minimize the cross-entropy values during classification. A separate network is trained and maintained for each subset of the original document set.
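The sketch below is a simplified stand-in for one of these per-subset networks: a single-hidden-layer sigmoid autoencoder over the topic distributions, trained with SciPy's nonlinear conjugate-gradient optimizer on a cross-entropy reconstruction loss. The depth (the post uses 10 to 20 hidden layers), the preconditioning, the hidden size, and the function names are simplified or assumed.

```python
# Minimal sketch: unsupervised autoencoder trained with conjugate gradient.
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(theta, n_in, n_hid):
    # Split the flat parameter vector into weights and biases.
    i = 0
    W1 = theta[i:i + n_in * n_hid].reshape(n_in, n_hid); i += n_in * n_hid
    b1 = theta[i:i + n_hid]; i += n_hid
    W2 = theta[i:i + n_hid * n_in].reshape(n_hid, n_in); i += n_hid * n_in
    b2 = theta[i:i + n_in]
    return W1, b1, W2, b2

def loss_and_grad(theta, X, n_hid):
    # X: rows are per-node topic distributions (values in [0, 1]).
    n_in = X.shape[1]
    W1, b1, W2, b2 = unpack(theta, n_in, n_hid)
    H = sigmoid(X @ W1 + b1)              # hidden activations
    Y = sigmoid(H @ W2 + b2)              # reconstruction
    eps = 1e-9
    # Cross-entropy between the input and its reconstruction.
    loss = -np.mean(X * np.log(Y + eps) + (1 - X) * np.log(1 - Y + eps))
    # Backpropagation: sigmoid output + cross-entropy gives a simple delta.
    m = X.shape[0]
    d_out = (Y - X) / (m * n_in)
    gW2 = H.T @ d_out
    gb2 = d_out.sum(axis=0)
    d_hid = (d_out @ W2.T) * H * (1 - H)
    gW1 = X.T @ d_hid
    gb1 = d_hid.sum(axis=0)
    grad = np.concatenate([gW1.ravel(), gb1, gW2.ravel(), gb2])
    return loss, grad

def train_autoencoder(X, n_hid=32, seed=0):
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    theta0 = rng.normal(scale=0.01, size=2 * n_in * n_hid + n_hid + n_in)
    res = minimize(loss_and_grad, theta0, args=(X, n_hid),
                   jac=True, method="CG", options={"maxiter": 500})
    return unpack(res.x, n_in, n_hid)
```

One such model would be trained per subset of the document set, as described above.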
The distributions with the greatest distances are then passed to a further clustering algorithm built around minimizing the Davies–Bouldin index between cluster components while still maintaining the statistical significance between the cluster distributions derived in the LDA phase.

DB = \frac{1}{n} \sum_{x=1}^{n} \max_{y \neq x} \frac{\sigma_x + \sigma_y}{d(c_x, c_y)}

Where n is the number of clusters, c_x is the centroid of cluster x, sigma_x is the average distance of all elements in cluster x to centroid c_x, and d(c_x, c_y) is the distance between centroids c_x and c_y.
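A minimal version of this cluster-selection step can be sketched with stock scikit-learn components: cluster the candidate distributions with k-means and keep the number of clusters that minimizes the Davies–Bouldin index. The custom constraint of preserving the statistical significance from the LDA phase is not reproduced here, and the k range is an assumption.

```python
# Choose the clustering that minimizes the Davies-Bouldin index.
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def best_clustering(X, k_range=range(2, 11), seed=0):
    best = None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = davies_bouldin_score(X, labels)   # lower is better
        if best is None or score < best[0]:
            best = (score, k, labels)
    return best  # (db_index, n_clusters, labels)
```

Lower Davies–Bouldin values indicate tighter, better-separated clusters, which is why the minimum over k is kept.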
