TEDx Talk

If you didn’t get a chance to see my TEDx talk live, the video has just been produced and uploaded to the TEDx channel on YouTube (below).

The talk covers some of my work in artificial intelligence, specifically the results we have observed in our research on synthetic neurointerfaces. Our goal was to functionally and synthetically model the human neocortical column in an artificial intelligence, giving a finer-grained insight into the cognitive behaviors we, as humans, exhibit on a daily basis.

If you would like to know more, I have published the working paper here.

Please let me know what you think in the comments section below or on YouTube; I would love all the feedback I can get!


Fluid Intelligence: Introduction

 

Fluid intelligence: the capacity to think logically and solve problems in novel situations, independent of acquired knowledge

Psychology locates the basis of fluid intelligence in the interplay of layered memory and its application, essentially the ability to “connect two fluid ideas with an abstractly analogous property”. Such a mathematical design would therefore have to derive temporal relationships, with weighted bonds between two coherently disparate concepts, by way of shared properties. These properties would have to be self-defined and self-propagated within each node and idea type.

Why?

In the pursuit of a truly dynamic artificial intelligence, it is necessary to establish a recurrent method for detecting concrete yet abstract entities (“ideas”) independent of a related and coherent topic set.
A considerable amount of work in this field has culminated in the prevalence of statistical methods that extract probabilistic models from large amounts of unstructured data. These Bayesian analytic techniques often produce an understanding that is superficial when measured against a truly relational one. Furthermore, this “bag-of-words” approach to unstructured data (quantifiable by the number of correct relationships derived between idea nodes) tends to yield only a one-dimensional understanding of the topics at hand. Traditionally, once these topics are transformed, it is difficult to extract hierarchy and queryable relations from the derived data set using matrix transformations.

The project I will be describing in the subsequent posts is an effort to change the approach by which dynamic fluid intelligence is derived, using streaming big data as its backbone. Ideally, this model would take a layered, multi-dimensional approach to autonomously identifying the properties of dynamically changing ideas from portions of that data set. It would also discover types of relationships, deriving a set of previously undefined relational schemas through unsupervised machine learning techniques and ultimately allowing for a queryable graph whose nodes and properties are not defined up front.
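To make this concrete, a minimal sketch of the kind of idea graph described above might look like the following. The class and member names (IdeaNode, reinforceBond, and so on) are illustrative placeholders of mine rather than part of any finished design: nodes accumulate self-defined properties, and weighted bonds between nodes are strengthened as shared properties are observed.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: an idea node with runtime-discovered properties
// and weighted bonds to other ideas. Names are placeholders, not a spec.
public class IdeaNode {
    private final String label;
    // Properties are discovered and attached at runtime, not fixed in a schema.
    private final Map<String, Object> properties = new HashMap<>();
    // Weighted bonds to other ideas, keyed by the related node.
    private final Map<IdeaNode, Double> bonds = new HashMap<>();

    public IdeaNode(String label) {
        this.label = label;
    }

    public void setProperty(String key, Object value) {
        properties.put(key, value);
    }

    // Strengthen (or create) a bond whenever a shared property is observed.
    public void reinforceBond(IdeaNode other, double delta) {
        bonds.merge(other, delta, Double::sum);
    }

    public Map<IdeaNode, Double> getBonds() {
        return bonds;
    }

    public String getLabel() {
        return label;
    }
}
```

A queryable graph layer would then sit on top of a collection of such nodes, with the relational schemas emerging from the unsupervised learning step rather than being declared up front.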

Big Data Correlation: Hadoop Stack

Hive

The Apache Hive project gives a Hadoop developer a structured view of the data in the Hadoop Distributed File System; in effect, it is a SQL-style query layer for Hadoop. Using a SQL-like language (HiveQL), Hive lets you create summarizations of your data, perform ad-hoc queries, and analyze large datasets in the Hadoop cluster. The overall approach with Hive is to project a table structure onto the dataset and then manipulate it with HiveQL; the table structure effectively imposes a structured view on unstructured data. Because we are using data in HDFS, our operations can be scaled across all the data nodes and we can manipulate huge datasets.
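As a rough sketch of that workflow, the snippet below projects an external table onto files already sitting in HDFS and then runs an ad-hoc HiveQL query over JDBC against HiveServer2. The host, port, table name, and HDFS location are placeholder assumptions, not values from this project.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: project a table onto raw HDFS data and query it via JDBC.
// Host, port, table name, and path are illustrative placeholders.
public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Project a table structure onto files already stored in HDFS.
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS raw_ideas "
                    + "(id INT, body STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
                    + "LOCATION '/data/raw_ideas'");

            // An ad-hoc summarization query, executed as distributed work on the cluster.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT COUNT(*) AS total FROM raw_ideas")) {
                while (rs.next()) {
                    System.out.println("rows: " + rs.getLong("total"));
                }
            }
        }
    }
}
```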


HCatalog

The function of Apache HCatalog is to hold the location and metadata of the data in a single-node Hadoop system or cluster. This lets scripts and MapReduce jobs be decoupled from the physical data location and its metadata. In essence, this project catalogs the data and keeps pointers to the data bits spread across different nodes. In our “Hello World” analogy, HCatalog would record where and on which node “Hello” lives, and the same for “World”. Since HCatalog can be used with other Hadoop technologies like Pig and Hive, it can also help those tools catalog and index their data. For our purposes, we can now reference data by name, and location and metadata can be shared or inherited between nodes and Hadoop sub-units.
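One concrete benefit is that a MapReduce job can ask HCatalog for a table by database and table name instead of hard-coding HDFS paths. The sketch below shows that idea using HCatInputFormat; the database and table names are placeholder assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

// Minimal sketch of "referencing data by name": HCatalog resolves the table's
// location and schema, so the job never hard-codes an HDFS path.
public class HCatalogInputSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "read-via-hcatalog");

        // Look up the "raw_ideas" table in the "default" database.
        HCatInputFormat.setInput(job, "default", "raw_ideas");
        job.setInputFormatClass(HCatInputFormat.class);

        // Mapper, reducer, and output settings would follow here exactly as in
        // any other MapReduce job.
    }
}
```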

Pig

Apache Pig is a high-level scripting language that expresses data analysis and infrastructure processes. When a Pig script is executed, it is translated into a series of MapReduce jobs that are then sent to the Hadoop infrastructure (single node or cluster) through the MapReduce framework. Pig’s user-defined functions can be written in Java. This is the final layer of the cake on top of MapReduce, giving the developer more control and a higher level of precision in creating the MapReduce jobs that ultimately become data processing in a Hadoop cluster.
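Since the posts don’t include one, here is a minimal sketch of such a Java user-defined function: a trivial EvalFunc that upper-cases a chararray field. The class name is a placeholder of mine.

```java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Minimal Pig UDF sketch: returns the first field of the input tuple in upper case.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return input.get(0).toString().toUpperCase();
    }
}
```

In a Pig script the compiled jar would be pulled in with a REGISTER statement and the function called like any built-in; Pig then compiles the surrounding script into the MapReduce jobs described above.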

Ambari

Apache Ambari is an operational framework for provisioning and managing Hadoop clusters, whether single-node or multi-node. Ambari is an effort to clean up Hadoop’s messy scripts and views, giving a clean interface for managing and monitoring the stack.

YARN

YARN is essentially the new version of MapReduce in Hadoop 2.0. It is the Hadoop operating system that is overlaid on top of each machine’s base operating system (CentOS in our case). In its newest iteration, YARN provides a global ResourceManager and a per-application ApplicationMaster. The idea behind this newer version of MapReduce is to split the functions of the old JobTracker into these two separate parts, which results in tighter control of the system and ultimately in more efficiency and ease of use. The illustration shows that an application running natively in Hadoop can use YARN as its cluster resource management tool, with the MapReduce 2.0 features acting as a bridge to HDFS.

[Diagram: YARN as the cluster resource management layer between applications and HDFS]
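For reference, the canonical WordCount job below illustrates the shape of a MapReduce 2.0 application that YARN schedules across the cluster: a mapper that runs next to the data blocks, a reducer that aggregates the emitted counts, and a driver that submits the job. Input and output paths come from the command line; everything else is the standard Hadoop API.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The classic WordCount example: map emits (word, 1) pairs on the nodes that
// hold the data, and reduce sums the counts for each word.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```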

Oozie

Apache Oozie is effectively a calendar for running Hadoop processes. It manages workflows for Hadoop, with the Oozie Coordinator triggering workflow jobs such as MapReduce or YARN applications. Like Hadoop and its other sub-products, Oozie scales with the cluster. Its workflow scheduler runs on top of the Hadoop “operating system” (YARN) and takes commands from user programs.
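The workflow definition itself lives in HDFS as XML, and client code only needs to point Oozie at it. The sketch below uses the Oozie Java client; the Oozie URL, application path, and NameNode/ResourceManager addresses are placeholder assumptions.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

// Minimal sketch of submitting a workflow through the Oozie client API.
public class OozieSubmitSketch {
    public static void main(String[] args) throws Exception {
        OozieClient client = new OozieClient("http://localhost:11000/oozie");

        Properties conf = client.createConfiguration();
        // Points at a workflow.xml already stored in HDFS.
        conf.setProperty(OozieClient.APP_PATH,
                "hdfs://localhost:8020/user/hadoop/workflows/wordcount");
        conf.setProperty("nameNode", "hdfs://localhost:8020");
        conf.setProperty("jobTracker", "localhost:8032");

        String jobId = client.run(conf);   // submit and start the workflow
        System.out.println("submitted workflow: " + jobId);
    }
}
```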



Big Data Correlation: Hadoop

Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System is the foundation for any Hadoop cluster or single-node implementation. HDFS is the underlying difference between a normal MySQL database and a Hadoop implementation, and this small change in how the data is approached makes all the difference.

A standard MySQL server serves the purpose for most small endeavors and can support an infrastructure roughly the size of Apple’s database without problems. The method for processing data usually follows a linear thought pattern. Take the example of the phrase “Hello world”. In a very rough representation, a MySQL server would save the entire phrase on one hard disk. Then, when the data is needed, the CPU sends a request for it, the hard disk spins, and the data is read and processed.


This traditional approach to managing a database hits a few key problems with no rational and affordable solution. The largest problem in this system is a mechanical one: at a certain point of complexity and size, a single hard disk can no longer physically spin fast enough to keep up with the seek capabilities of a single CPU. This problem leads to two possible solutions: build a better hard disk, or rethink the way data is processed. Hadoop takes the second route and rethinks the problem in a radical new way. A Hadoop cluster implements parallel computing using inexpensive, standard pieces of hardware, distributed among many servers running in parallel. The philosophy behind Hadoop is essentially to bring the computing to the data. To implement this, the system splits the data into blocks and distributes them among multiple servers, so each data node holds part of the overall data and can process the little that it holds. The benefit of this scale-out scheme becomes obvious when the system grows to an infrastructure of Google’s size: there is no longer a physical barrier of spinning disks, only a question of raw storage capacity (which is a very solvable and good problem to have).
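The sketch below shows the data-distribution half of that philosophy through the HDFS Java API: it writes a small file into HDFS and then asks the NameNode which hosts hold each of its blocks, which is exactly the information a scheduler uses to move computation to the data. The fs.defaultFS address and file path are placeholder assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write a file into HDFS, then list which data nodes hold
// each block of that file.
public class HdfsBlocksSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:8020");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/data/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("Hello world\n");   // stored as replicated blocks
        }

        // Each block reports the hosts storing a replica; compute can then be
        // scheduled on those same machines.
        FileStatus status = fs.getFileStatus(path);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + block.getOffset()
                    + " on hosts " + String.join(",", block.getHosts()));
        }
    }
}
```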



Big Data Correlation (Research P1)

Big Data

What is big data?

Big data is commonly defined as “the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”. In a sense this is true if we consider the internet to be that collection of large data sets. This emerging industry already has a few key miners that have developed technologies to fit their own purposes of either sifting or crunching through data. The tools we will be using in this exercise were developed by Apache, Google, and Hortonworks, but the engine that uses these tools in unison will be the proprietary idea created in this exercise.


Big Data Correlation: Purpose

Question

About 1.8 zettabytes (1.8 trillion gigabytes) of data are being created every year. Somewhere in all this data are answers to problems we have been wondering about for ages; the question is how to process the information efficiently and derive correlations from the complexity of the data on the internet. You may not be able to prove anything scientifically, but you may be able to support hypotheses statistically with the huge amounts of data hidden somewhere in this intimidating data set. So is it possible to mine hidden information at these huge scales? Can one use existing technologies such as Apache Hadoop, Nutch, MapReduce, and the Google APIs to develop an engine that can derive comprehensible correlational data autonomously and efficiently?

Purpose

With all this data being produced every year, a radical and innovative way of processing large and complex data sets is a need that remains unfulfilled. For any computer, processing unstructured data is a very arduous and long process (and nearly all of the internet’s data is unstructured). This exercise in engine implementation is an attempt to combine multiple high-end technologies that work in unison to crunch and sift through large and complex data sets.

Big Data Clustering: Introduction & Topic

The past few years have brought new problems to the advancement of human intelligence. Trillions of gigabytes of data are produced every year, and the total cumulative power of all the computers in existence today can process barely half that amount using a traditional database system to crunch the raw data. This very problem has created a new industry we now know as Big Data. According to Wikipedia, the term is used “for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”.


GA Utilizing Efficient Operators in TSP

From the data collected in the above two pages, it can reasonably be concluded that center inverse mutation, in unison with inversely linear roulette wheel selection and a random crossover point, yields the best result as the number of generations grows. We decided to test a combination of all of these genetic operators and record the weight of the shortest path it produced. The same input graph used for the other tests was used here, with 6000 chromosomes in the initial population, 5000 generations, and a cutoff percentage of 30%.

The results for the top path after 5000 generations are as follows:
Weight = 238
Path: {A, X, C, P, S, G, E, U, Q, Y, B, V, N, T, W, I, F, H, Z, O, D, R, M, L, K, J, A}

[Figure: input graph with all edges and weights present]

In the Comparison of Genetic Operators For Solving the Traveling Salesman Problem: Selection

When comparing selection methods, it was in our best interest to leave as little as possible to randomness outside of the selection method itself. The mutation method was center inverse mutation throughout all the trials, and a center mutation point was chosen every time. The cutoff percentage was the same (30%) for each trial and the number of generations was fixed at 5000.

The numbers displayed below are the average of 10 trials conducted with the same input graph but a different initial population for each trial.

[Figure: selection operator comparison]
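The posts do not include the selection code itself, so the sketch below is one reasonable reading of “inversely linear roulette wheel selection” for a minimization problem: each tour (an array of city indices) gets a slice of the wheel that shrinks linearly as its path weight grows, so shorter tours are chosen more often. The exact scaling is an assumption of mine, not the original implementation.

```java
import java.util.List;
import java.util.Random;

// Sketch of inversely linear roulette wheel selection for TSP (minimization):
// lighter tours receive proportionally larger slices of the wheel.
public class RouletteSelection {

    public static int[] select(List<int[]> population, double[] weights, Random rng) {
        double max = Double.NEGATIVE_INFINITY;
        double min = Double.POSITIVE_INFINITY;
        for (double w : weights) {
            max = Math.max(max, w);
            min = Math.min(min, w);
        }

        // Invert linearly: the lightest tour gets the biggest slice; the small
        // additive term keeps the heaviest tour from having a zero slice.
        double total = 0.0;
        double[] slices = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            slices[i] = (max - weights[i]) + (max - min) / weights.length + 1e-9;
            total += slices[i];
        }

        // Spin the wheel and walk the slices until the spin lands in one.
        double spin = rng.nextDouble() * total;
        double running = 0.0;
        for (int i = 0; i < slices.length; i++) {
            running += slices[i];
            if (running >= spin) {
                return population.get(i);
            }
        }
        return population.get(population.size() - 1); // numerical fallback
    }
}
```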

In the Comparison of Genetic Operators For Solving the Traveling Salesman Problem: Mutation

In an attempt to compare the operators statistically, the input graph was kept the same for each trial while the initial population was regenerated. The numbers displayed below are the average of 10 trials conducted with the same input graph but a different initial population. The algorithm was run with an input graph consisting of 26 static nodes and approximately 4.03E26 possible combinations. Each trial ran 5000 generations with an input population of 5000 chromosomes. The cutoff percentage was 30% throughout every trial.

Mutation Operators and Crossover Point

In this trial the selection method was kept standard, using the percentage cutoff method, to avoid any influence from selection on the comparison.

Average best path weight (10 trials):

                            Random Crossover Point   Center Crossover Point
Reverse Sequence Mutation            336                      414
Center Inverse Mutation              253                      310
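For completeness, here is a sketch of the two mutation operators in the table, applied to a tour stored as an array of city indices. The posts do not give the exact implementation, so the handling of the crossover point (in particular splitting at tour.length / 2 for the center variant) reflects my reading of the operator names rather than the original code.

```java
import java.util.Random;

// Sketches of the two mutation operators compared above.
public class MutationOperators {

    // Reverse Sequence Mutation: reverse the cities between two positions,
    // leaving the rest of the tour untouched.
    public static void reverseSequenceMutation(int[] tour, int from, int to) {
        while (from < to) {
            int tmp = tour[from];
            tour[from] = tour[to];
            tour[to] = tmp;
            from++;
            to--;
        }
    }

    // Center Inverse Mutation: split the tour at a crossover point and reverse
    // each of the two halves independently.
    public static void centerInverseMutation(int[] tour, int crossoverPoint) {
        reverseSequenceMutation(tour, 0, crossoverPoint - 1);
        reverseSequenceMutation(tour, crossoverPoint, tour.length - 1);
    }

    // A random crossover point draws the split position at random; the center
    // variant always splits at tour.length / 2.
    public static int randomCrossoverPoint(int tourLength, Random rng) {
        return 1 + rng.nextInt(tourLength - 1);   // strictly inside the tour
    }
}
```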

The performance of each mutation operator over successive iterations was also tested with a constant center crossover point.

[Figure: mutation operator comparison over iterations]