About 1.8 zettabytes (1.8 trillion gigabytes) of data is being created every year. In all this data there are answers to problems we have been wondering about for ages. It’s just how you can process the information most efficiently and derive correlations from the complexity of the data on the internet. You may not be able to prove anything scientifically, but you may be able to prove hypotheses statistically with huge amounts of data which is hidden somewhere in this intimidating data set. So is it possible to mine hidden information from these huge scales? Can one use existing technologies such as Apache Hadoop, Nutch, Map Reduce, and Google API to develop an engine that can derive comprehendible correlational data autonomously and efficiently?
With all this data being produced every year, finding a radical and innovative way of processing large and complex data sets is a need that is unfulfilled. For any computer, processing unstructured data is a very arduous and long process (all the internet’s data is unstructured). This exercise of an engine implementation is an attempt at combining multiple high-end technologies to work in unison to crutch and sift through large and complex data sets to autonomously find correlative information. In simplification, this engine will take in an input of a work or phrase and will return the most correlative information sets in the form of a structured and readable conclusion. This can be tested by using this engine for very simple executions that have already been proven and is usually part of the common knowledge. This engine will merely prove that the data to unlock the answers to some of our oldest questions is somewhere out there on the internet and that it is possible to derive comprehendible correlational data from the unstructured format of the internet to prove something faster and more efficiently than the current process of the scientific method.
If a single-node Hadoop infrastructure is used in unison with a Hortonworks sandbox environment to process data collected from the Google API, then it will be possible to make correlative conclusions and effectively extract intelligence from the unstructured data sets found on the internet.