What is big data?
Big data is commonly defined as “the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”2. In a sense this is true if we consider the internet to be the collection of large data sets. This emerging industry already has a few key players that have developed technologies fitting their purposes of either sifting or crunching through data. The tools we will be using in this exercise were developed by Apache, Google, and Hortonworks, but the engine that uses these tools in unison will be the proprietary idea created in this exercise.
Many tools have been developed since the founding of this industry, including a few stable and scalable “super-tools” that will be used extensively in this exercise. The tools can be divided into two types: sifters and crunchers. Sifters are technologies like Apache Pig (distributed commercially by Hortonworks) that help one sift through data, find the data relevant to the question at hand, and later comprehend it. Crunchers are technological infrastructures (usually storage and processing systems) like Hadoop that help to crunch, or process, massive amounts of unstructured data. Unstructured data refers to any data without a predefined model or schema. This paragraph, for instance, is an unstructured piece of data in the eyes of a computer. An Excel file or a table, on the other hand, is a structured piece of data that a computer can easily process. The main problem that initiated the data race is that the internet is almost entirely unstructured, and all these tools were created to aid in the comprehension of such unstructured data sets.
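The structured/unstructured distinction above can be made concrete with a small sketch. This is not tied to any particular big-data tool, and the sales figures and field names are invented purely for illustration:

```python
# Unstructured: free text, which a program sees only as a run of characters.
report = "Q1 sales were 120 units in the north region and 95 in the south."

# Structured: the same facts recast as rows with named columns, as in a
# spreadsheet or database table.
rows = [
    {"quarter": "Q1", "region": "north", "units": 120},
    {"quarter": "Q1", "region": "south", "units": 95},
]

# The structured form answers questions with a one-line query...
total = sum(r["units"] for r in rows)

# ...while the free text would first need to be parsed (sifted) before
# the numbers could be recovered and totaled.
```

The work of a sifter, in essence, is producing the second form from the first.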
Big Data Crunching
There are many crunchers available on the open source and consumer markets. The most well-known of these is a sub-project of Apache Hadoop3, whose design Google had a great hand in shaping. The Apache Hadoop Distributed File System (referred to from here on as HDFS) is a distributed file system that offers scalability and efficiency in the way data is stored and processed. The algorithms and implementation all come free under the Apache license, but the true genius is in the idea and implementation itself.
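HDFS itself handles storage by splitting large files into blocks spread across many machines; the processing model commonly paired with it in Hadoop is MapReduce, where each block is processed where it lives and the partial results are then combined. A minimal single-machine sketch of that idea, using two strings to stand in for two blocks:

```python
from collections import defaultdict

def map_phase(chunk):
    """Map step: emit (word, 1) pairs from one chunk of raw text."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce step: sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Two strings stand in for two HDFS blocks stored on different nodes.
blocks = ["big data big tools", "big data small data"]

# Each block is mapped independently (in Hadoop, on the node holding it)...
mapped = [pair for block in blocks for pair in map_phase(block)]

# ...and the partial results are merged in the reduce step.
counts = reduce_phase(mapped)
```

The scalability comes from the map step requiring no coordination between blocks, so it parallelizes across as many machines as there are blocks.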
Sifting Through Big Data
Apache Hadoop is a set of projects that enables you to solve many problems in big data crunching. Aside from HDFS, the Hadoop ecosystem includes projects such as Pig, HBase, Hive, Mahout, and ZooKeeper that help sift through large amounts of data and later derive meaning from the structured data extracted from the internet.
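To show what sifting looks like in practice, the sketch below mimics in plain Python what a Pig script does with its FILTER and GROUP operations. The log records and field names are hypothetical, invented only for illustration:

```python
# Hypothetical web-log records standing in for data loaded from HDFS.
records = [
    {"user": "alice", "page": "/home",   "ms": 120},
    {"user": "bob",   "page": "/search", "ms": 340},
    {"user": "alice", "page": "/search", "ms": 95},
    {"user": "carol", "page": "/home",   "ms": 60},
]

# FILTER: keep only the slow requests (over 100 ms).
slow = [r for r in records if r["ms"] > 100]

# GROUP + COUNT: tally the slow hits per page.
hits = {}
for r in slow:
    hits[r["page"]] = hits.get(r["page"], 0) + 1
```

In Pig the same sift would be a few declarative lines, and the engine would distribute the filtering and grouping across the cluster rather than running it on one machine.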