Big Data Correlation: Hadoop Stack


The Apache Hive project gives a Hadoop developer a table-oriented view of the data in the Hadoop Distributed File System (HDFS). It is less a file manager than a data warehouse layer for Hadoop. Using a SQL-like language, HiveQL, Hive lets you summarize your data, run ad-hoc queries, and analyze large datasets in the Hadoop cluster. The overall approach with Hive is to project a table structure onto the dataset and then manipulate it with HiveQL. The table structure is applied when the data is read, so files that were stored without a schema can be queried as if they were structured. Because our data lives in HDFS, the operations can be scaled across all the data nodes, which lets us manipulate huge datasets.
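To make the idea of projecting a table structure onto raw data more concrete, here is a minimal sketch in plain Python. It is not Hive and does not use any Hive API; the sample lines, the `SCHEMA` list, and the function names are all made up for illustration. It only shows the principle: the schema is applied when a line is read, and a summarization is then computed over the resulting rows.

```python
from collections import defaultdict

# Raw comma-separated lines, as they might sit in a file on HDFS (made-up data).
RAW_LINES = [
    "2013-01-01,web,120",
    "2013-01-01,mobile,80",
    "2013-01-02,web,200",
]

# Hypothetical table definition: column names and types applied at read time.
SCHEMA = [("day", str), ("channel", str), ("hits", int)]


def project(line, schema=SCHEMA):
    """Turn one raw line into a row dict by applying the schema."""
    fields = line.split(",")
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}


def total_hits_by_channel(lines):
    """Rough analogue of: SELECT channel, SUM(hits) FROM logs GROUP BY channel."""
    totals = defaultdict(int)
    for row in map(project, lines):
        totals[row["channel"]] += row["hits"]
    return dict(totals)


print(total_hits_by_channel(RAW_LINES))  # {'web': 320, 'mobile': 80}
```

In Hive itself the same steps would be expressed as a table definition and a HiveQL query, and the work would be distributed across the data nodes rather than run in a single process.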



The function of Apache HCatalog is to hold location and metadata about the data in a Hadoop single-node system or cluster. This allows scripts and MapReduce jobs to be decoupled from data location and metadata. In effect, this project catalogs the data and keeps pointers to where it is stored. In our “Hello World” analogy, HCatalog would record where “Hello” is stored and where “World” is stored. Since HCatalog can be used with other Hadoop technologies like Pig and Hive, those tools can rely on the same catalog instead of each keeping its own record of the data. For our purposes, we can now reference data by name, and we can share or inherit the location and metadata between nodes and Hadoop sub-units.
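The "reference data by name" idea can be sketched as a small lookup table. This is a toy model, not the HCatalog API: the `TableEntry` class, the `CATALOG` contents, the `lookup` function, and the HDFS path are all hypothetical and exist only to show how a job can ask for a table by name instead of hard-coding its location and layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntry:
    """What the catalog records for one table: where it lives and its columns."""
    location: str
    columns: tuple


# Hypothetical catalog contents; the path below is invented for illustration.
CATALOG = {
    "greetings": TableEntry(
        location="hdfs://namenode.example/data/greetings",
        columns=(("word", "string"), ("count", "int")),
    ),
}


def lookup(table_name, catalog=CATALOG):
    """Resolve a table name to its location and metadata."""
    try:
        return catalog[table_name]
    except KeyError:
        raise LookupError(f"no table named {table_name!r} in the catalog") from None


entry = lookup("greetings")
print(entry.location)
print([name for name, _type in entry.columns])
```

A script written against `lookup("greetings")` keeps working if the data is moved, as long as the catalog entry is updated, which is the separation described above.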


Apache Pig is a high-level scripting platform. Its language, Pig Latin, expresses data analysis and data-flow processes. When a Pig script is executed, it is translated into a series of MapReduce jobs, which are then submitted to the Hadoop infrastructure (single node or cluster) through MapReduce. Pig’s user-defined functions can be written in Java. This is the final layer of the cake on top of MapReduce: it gives the developer a higher-level way to define the MapReduce jobs that later carry out the data processing in a Hadoop cluster.
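To show what kind of job a high-level script is ultimately turned into, the following sketch runs the three classic MapReduce stages (map, shuffle, reduce) for a word count in a single Python process. It is an illustration of the MapReduce pattern only; it does not use Pig or Hadoop, and the function names are made up.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))


print(word_count(["Hello World", "hello Hadoop"]))
# {'hello': 2, 'world': 1, 'hadoop': 1}
```

In Pig, a comparable word count is a handful of load, group, and count statements, and the translation into map and reduce stages is done for the developer.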


Apache Ambari is an operational framework for provisioning and managing Hadoop clusters, whether they have multiple nodes or a single node. Ambari is an effort to replace Hadoop’s messy scripts and views with a clean interface for management and monitoring.


YARN is basically the new version of MapReduce’s resource management in Hadoop 2.0. It acts as the Hadoop operating system, overlaid on top of the system’s base operating system (CentOS). YARN provides a global ResourceManager and a per-application ApplicationMaster. The idea behind this newer design is to split the functions of the JobTracker into two separate parts: cluster-wide resource management, and per-application scheduling and monitoring. This results in tighter control of the system and ultimately in more efficiency and ease of use. The illustration shows that an application run natively in Hadoop can utilize YARN as a cluster resource management tool, along with its MapReduce 2.0 features, as a bridge to HDFS.
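The split of the JobTracker's duties can be pictured with a toy model. The two classes below borrow the names ResourceManager and ApplicationMaster, but everything else (the methods, the notion of a simple container count, the way tasks "run") is an assumption made for illustration and does not reflect YARN's real interfaces.

```python
class ResourceManager:
    """Cluster-wide role: knows only how many containers are free."""

    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, requested):
        """Grant as many containers as are available, up to the request."""
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


class ApplicationMaster:
    """Per-application role: tracks this application's own tasks."""

    def __init__(self, name, tasks):
        self.name = name
        self.pending = list(tasks)
        self.done = []

    def run(self, resource_manager):
        """Request containers, 'run' a batch of tasks, and give the containers back."""
        while self.pending:
            granted = resource_manager.allocate(len(self.pending))
            if granted == 0:
                raise RuntimeError("no free containers in the cluster")
            batch = self.pending[:granted]
            self.pending = self.pending[granted:]
            self.done.extend(batch)  # stand-in for actually executing the tasks
            resource_manager.release(granted)
        return self.done


rm = ResourceManager(total_containers=2)
am = ApplicationMaster("word-count", ["map-1", "map-2", "reduce-1"])
print(am.run(rm))  # ['map-1', 'map-2', 'reduce-1']
```

The point of the sketch is the division of labor: the resource manager never looks at tasks, and the application master never tracks cluster-wide capacity.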



Apache Oozie is effectively a calendar and scheduler for running Hadoop processes. It is a system for managing workflows: the Oozie Coordinator triggers workflow jobs, which in turn run on MapReduce or YARN. Oozie is also a scalable system, like Hadoop and its other sub-products. Its workflow scheduler runs on top of Hadoop’s cluster operating system (YARN) and takes commands from user programs.
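The calendar-like behavior of a coordinator can be sketched as a function that lists the times at which a workflow would be launched. This is only an illustration of time-based triggering: `trigger_times` is a made-up helper, not part of Oozie, and treating the end time as exclusive is an assumption of this sketch.

```python
from datetime import datetime, timedelta


def trigger_times(start, end, frequency):
    """Yield each time at which the workflow would be launched.

    The start is included and the end is excluded (a choice made for this sketch).
    """
    current = start
    while current < end:
        yield current
        current += frequency


# Launch a hypothetical workflow every six hours over one day.
for moment in trigger_times(
    datetime(2013, 1, 1), datetime(2013, 1, 2), timedelta(hours=6)
):
    print("launch workflow at", moment.isoformat())
```

In Oozie the start, end, and frequency are declared in the coordinator's configuration rather than computed in user code, and each trigger starts a workflow job on the cluster.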

