Big Data Correlation: Hadoop Stack


The Apache Hive project gives a Hadoop developer a table-oriented view of the data in the Hadoop Distributed File System (HDFS). It is basically a data warehouse layer for Hadoop. Using a SQL-like language, HiveQL, Hive lets you summarize your data, run ad-hoc queries, and analyze large datasets in the Hadoop cluster. The overall approach with Hive is to project a table structure onto the dataset and then manipulate it with HiveQL. The structure is applied when the data is read, so the underlying, unstructured files stay as they are. Because we are using data in HDFS, our operations can be scaled across all the data nodes and we can manipulate huge datasets.
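The idea of projecting a table structure onto raw data can be sketched in a few lines of Python. This is only an illustration of the concept, not Hive itself: the sample lines, the `SCHEMA` list, and the function names are all made up for the example, and the query in the comment is just the rough HiveQL equivalent.

```python
from collections import defaultdict

# Raw, schema-less lines as they might sit in a file on HDFS (made-up sample data).
RAW_LINES = [
    "2013-01-01,books,12.50",
    "2013-01-01,music,7.25",
    "2013-01-02,books,3.00",
]

# The hypothetical "table structure" projected onto the raw text:
# column names and the type each field is cast to when it is read.
SCHEMA = [("day", str), ("category", str), ("amount", float)]


def project(line):
    """Apply the schema to one raw line at read time."""
    fields = line.split(",")
    return {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}


def total_by_category(lines):
    """Roughly: SELECT category, SUM(amount) FROM sales GROUP BY category."""
    totals = defaultdict(float)
    for row in map(project, lines):
        totals[row["category"]] += row["amount"]
    return dict(totals)


print(total_by_category(RAW_LINES))
```

The raw lines are never rewritten; the structure exists only in how they are read, which is the point of the approach.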



The function of Apache HCatalog is to hold location and metadata about the data in a Hadoop single-node system or cluster. This allows scripts and MapReduce jobs to be decoupled from data location and metadata. Basically, this project catalogs the datasets and keeps pointers to where they are stored. In our “Hello World” analogy, HCatalog would record that the data exists under a given name, how it is structured, and where in HDFS it lives; which data node physically holds “Hello” and which holds “World” is tracked by HDFS itself. Since HCatalog can be used with other Hadoop technologies like Pig and Hive, those tools can share the same table definitions. For our purposes, we can now reference data by name, and we can share or inherit the location and metadata between nodes and Hadoop sub-units.
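Referencing data by name instead of by location can be pictured as a simple lookup table. The sketch below is a toy model in Python, not the HCatalog API: the `CATALOG` dictionary, the `greetings` table, its path, and the `describe` function are all hypothetical.

```python
# Hypothetical catalog: table name -> storage location and column metadata.
CATALOG = {
    "greetings": {
        "location": "/data/greetings",   # assumed HDFS path, for illustration only
        "format": "text",
        "columns": [("word", "string")],
    },
}


def describe(table_name):
    """Return the storage location and column names registered for a table."""
    entry = CATALOG[table_name]
    return entry["location"], [name for name, _type in entry["columns"]]


# A script only needs the table name; the catalog supplies the rest.
location, columns = describe("greetings")
print(location, columns)
```

If the files behind `greetings` were moved, only the catalog entry would change and the scripts that use the name would stay the same.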


Apache Pig is a high-level scripting platform whose language, Pig Latin, expresses data analysis and infrastructure processes. When a Pig script is executed, it is translated into a series of MapReduce jobs, which are then sent to the Hadoop infrastructure (single node or cluster) through the MapReduce framework. Pig’s user-defined functions can be written in Java. This is the final layer of the cake on top of MapReduce: it gives the developer more control and a higher-level way to create the MapReduce jobs that ultimately process data in a Hadoop cluster.
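To make the MapReduce jobs that Pig generates less abstract, here is a minimal single-process sketch of the map, shuffle, and reduce steps for a word count. It is written in plain Python rather than Pig Latin or the Hadoop API, and the function names are chosen for this example only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["Hello world", "hello Hadoop"]))
```

On a real cluster the map and reduce steps run on many nodes at once; a higher-level script spares the developer from writing each phase by hand.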


Apache Ambari is an operational framework for provisioning and managing Hadoop clusters, whether multiple nodes or a single node. Ambari is an effort to clean up the messy scripts and views of Hadoop and provide a clean interface for management and monitoring.


YARN is basically the next generation of MapReduce in Hadoop 2.0. It acts as the Hadoop operating system, overlaid on top of the system’s base operating system (CentOS). YARN provides a global ResourceManager and a per-application ApplicationMaster. The idea behind this newer architecture is to split the functions of the JobTracker into two separate parts: cluster resource management, and per-job scheduling and monitoring. This results in tighter control of the system and ultimately in more efficiency and ease of use. The illustration shows that an application run natively in Hadoop can use YARN as a cluster resource management tool, with its MapReduce 2.0 features as a bridge to HDFS.



Apache Oozie is effectively a calendar and scheduler for running Hadoop processes. It manages workflows, and its Oozie Coordinator triggers workflow jobs, such as MapReduce jobs running on YARN, based on time or data availability. Like Hadoop and its other sub-products, Oozie is scalable. Its workflow scheduler system runs on top of the Hadoop operating system (YARN) and takes commands from user programs.




Big Data Correlation: Hadoop

Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System is the foundation for any Hadoop cluster or single-node implementation. HDFS is the underlying difference between a normal MySQL database and a Hadoop implementation. This change in how the data is approached makes all the difference.

A standard MySQL server serves the purpose for small endeavors and can even support an infrastructure about the size of Apple’s database with no problems. The method for processing data usually follows a linear thought pattern. Take the example of the phrase “Hello world”. In a very rough representation, a MySQL server would save the entire phrase on one hard disk. Then, when the data is needed, the CPU sends a request for it, the hard disk spins, and the data is read and processed.


This traditional approach to managing a database hits a few key problems that have no rational and affordable solution. The largest problem in this system is a mechanical one. At a certain point of complexity and size, a single hard disk can no longer physically spin fast enough to keep up with the requests of a single CPU. This problem leads to two possible solutions: make a better hard disk, or rethink the way data is processed.

Hadoop takes the second route. A Hadoop cluster implements a parallel computing cluster using inexpensive, standard pieces of hardware, distributed among many servers running in parallel. The philosophy behind Hadoop is basically to bring the computing to the data. To implement this successfully, the system has to distribute pieces of the same dataset among multiple servers, so each data node holds part of the overall data and processes only the small part that it holds. The benefit of this scheme becomes visible when the system is scaled up to an infrastructure of Google’s size. The system no longer has the physical barrier of the spinning disks, but rather a problem of storage capacity (which is a very solvable and good problem to have).
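The “Hello world” example can be carried through as a toy Python sketch of bringing the computing to the data. Everything here is an assumption made for illustration: the node names, the tiny block size of six characters, and the round-robin placement. Real HDFS uses far larger blocks and also replicates them, which this sketch leaves out.

```python
def split_into_blocks(data, block_size):
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes):
    """Assign blocks to nodes round-robin (a simplifying assumption)."""
    placement = {node: [] for node in nodes}
    for index, block in enumerate(blocks):
        placement[nodes[index % len(nodes)]].append(block)
    return placement


def count_char_locally(local_blocks, char):
    """Work done on one node, using only the blocks that node holds."""
    return sum(block.count(char) for block in local_blocks)


data = "Hello world"
blocks = split_into_blocks(data, 6)
placement = place_blocks(blocks, ["node-a", "node-b"])

# Each node counts the letter "o" in its own blocks; only the small
# per-node results are combined at the end.
total = sum(count_char_locally(local, "o") for local in placement.values())
print(placement, total)
```

Each node reads only its own share of the data, so adding nodes adds both storage and processing capacity.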



Big Data Correlation: Purpose


About 1.8 zettabytes (1.8 trillion gigabytes) of data is being created every year. In all this data there are answers to problems we have been wondering about for ages. The question is how to process the information most efficiently and derive correlations from the complexity of the data on the internet. You may not be able to prove anything scientifically, but you may be able to support hypotheses statistically with the huge amounts of data hidden somewhere in this intimidating data set. So is it possible to mine hidden information at these huge scales? Can one use existing technologies such as Apache Hadoop, Nutch, MapReduce, and Google API to develop an engine that can derive comprehensible correlational data autonomously and efficiently?


With all this data being produced every year, finding a radical and innovative way of processing large and complex data sets is an unfulfilled need. For any computer, processing unstructured data is a long and arduous process (and most of the internet’s data is unstructured). This engine implementation is an attempt at combining multiple high-end technologies to work in unison to crunch and sift through large and complex data sets to …

Big Data Clustering: Introduction & Topic

The past few years have brought new problems for the advancement of human intelligence. Trillions of gigabytes of data are being produced every year, and the total cumulative power of all the computers in existence today can process merely half that amount when using a traditional database system to crunch the raw data. This very problem has created a new industry we now know as Big Data. According to Wikipedia, big data is used “for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”.
