The Apache Hive project gives a Hadoop developer a SQL-like view of the data in the Hadoop Distributed File System. Rather than a file manager, Hive is a data-warehouse layer for Hadoop: using a SQL-like language, Hive lets you create summarizations of your data, perform ad-hoc queries, and analyze large datasets in the Hadoop cluster. The overall approach with Hive is to project a table structure onto the dataset and then manipulate it with HiveQL. The table structure effectively projects a structured schema onto unstructured data. Because the data lives in HDFS (which ours does), these operations can be scaled across all the data nodes, and we can manipulate huge datasets.
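The "project a table structure onto the data" idea can be sketched in plain Python. This is a conceptual illustration only, with made-up column names and in-memory data; real Hive runs HiveQL over files in HDFS:

```python
import csv
import io

# Raw, schema-less data as it might sit in a file: one record per line.
raw = "alice,30\nbob,25\ncarol,35\n"

# "Projecting" a table structure: declare column names and types at read
# time (schema-on-read), much as a Hive CREATE TABLE statement would.
schema = [("name", str), ("age", int)]

def project(raw_text, schema):
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append({col: typ(val) for (col, typ), val in zip(schema, record)})
    return rows

table = project(raw, schema)

# An ad-hoc "query": average age, analogous to SELECT AVG(age) FROM people.
avg_age = sum(r["age"] for r in table) / len(table)
print(avg_age)  # 30.0
```

The point is that the underlying bytes never change; the table structure is imposed only when the data is read.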
The function of Apache HCatalog is to hold location and metadata about the data in a Hadoop single-node system or cluster. This allows scripts and MapReduce jobs to be decoupled from the details of data location and metadata. In essence, this project catalogs and sets pointers to the bits of data held on different nodes. In our “Hello World” analogy, HCatalog would tell us where and on which node “Hello” is and where and on which node “World” is. Since HCatalog can be used with other Hadoop technologies like Pig and Hive, HCatalog can also help those tools catalog and index their data. For our purposes, we can now reference data by name, and the location and metadata can be shared or inherited between nodes and Hadoop sub-units.
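A toy catalog can make the idea concrete: datasets are registered by name along with their location and metadata, so a job never hard-codes physical paths. This is a conceptual sketch only; the node names, paths, and schema values below are invented, not HCatalog's actual API:

```python
# A toy "catalog": map a dataset name to its location and metadata, so
# jobs can refer to data by name instead of hard-coding paths and nodes.
catalog = {}

def register(name, node, path, schema):
    catalog[name] = {"node": node, "path": path, "schema": schema}

def lookup(name):
    return catalog[name]

# In the "Hello World" analogy, record where each piece lives.
register("hello", node="node1", path="/data/hello.txt", schema=["text"])
register("world", node="node2", path="/data/world.txt", schema=["text"])

# A script no longer needs to know the physical layout:
print(lookup("hello")["node"])  # node1
print(lookup("world")["node"])  # node2
```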
Apache Pig is a high-level scripting platform whose language, Pig Latin, expresses data analysis and infrastructure processes. When a Pig Latin script is executed, it is translated into a series of MapReduce jobs which are then sent to the Hadoop infrastructure (single node or cluster) through the MapReduce framework. Pig’s user-defined functions can be written in Java. This is the final layer of the cake on top of MapReduce, giving the developer more control and a higher level of abstraction for creating the MapReduce jobs which later translate into data processing in a Hadoop cluster.
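The kind of job a short Pig Latin script compiles down to can be sketched as explicit map and reduce phases. The miniature word count below is a conceptual illustration in plain Python, with no actual Hadoop or Pig involved:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (key, 1) pair for every word seen.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/sort: group identical keys together, then sum their values.
    counts = {}
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        counts[key] = sum(v for _, v in group)
    return counts

lines = ["Hello World", "hello Hadoop"]
print(reduce_phase(map_phase(lines)))  # {'hadoop': 1, 'hello': 2, 'world': 1}
```

A Pig Latin script expressing the same job would be a few lines of LOAD, GROUP, and FOREACH; the translation into phases like these is what Pig does for the developer.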
Apache Ambari is an operational framework for provisioning and managing Hadoop clusters, whether single-node or multi-node. Ambari is an effort to clean up the messy scripts and scattered views of Hadoop and give a clean interface for management.
YARN is basically the new version of MapReduce in Hadoop 2.0. It is the Hadoop operating system that is overlaid on top of the system’s base operating system (CentOS in this setup). YARN provides a global ResourceManager and a per-application master in its newest iteration. The idea behind this newer version of MapReduce is to split up the functions of the JobTracker into these two separate parts. This results in tighter control of the system and ultimately in more efficiency and ease of use. The illustration shows that an application run natively in Hadoop can utilize YARN as a cluster resource-management tool, along with its MapReduce 2.0 features as a bridge to the HDFS.
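The JobTracker split can be sketched as two cooperating roles: one global ResourceManager tracking cluster capacity, and one ApplicationMaster per application requesting resources from it. This is a conceptual toy, not the YARN API; the container counts and application names are invented:

```python
class ResourceManager:
    """Global view of cluster capacity, measured in containers."""
    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, n):
        granted = min(n, self.free)
        self.free -= granted
        return granted

class ApplicationMaster:
    """Per-application manager: negotiates containers for one job."""
    def __init__(self, name, rm):
        self.name, self.rm, self.containers = name, rm, 0

    def request(self, n):
        self.containers += self.rm.allocate(n)

rm = ResourceManager(total_containers=10)
app1 = ApplicationMaster("wordcount", rm)
app2 = ApplicationMaster("log-scan", rm)
app1.request(6)
app2.request(6)   # only 4 containers remain cluster-wide
print(app1.containers, app2.containers, rm.free)  # 6 4 0
```

The split means per-job bookkeeping no longer burdens the global scheduler, which is the efficiency gain the text describes.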
Apache Oozie is effectively a calendar for running Hadoop processes. For Hadoop, it is a system to manage workflows, with the Oozie Coordinator triggering workflow jobs made up of MapReduce or YARN actions. Oozie is also scalable, along with Hadoop and its other sub-products. Its workflow scheduler runs on top of the Hadoop layer (YARN) and takes commands from user programs.
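The core of such a workflow system is dependency ordering: each action runs only after everything it depends on has finished. The sketch below illustrates that idea in plain Python; real Oozie workflows are defined in XML, and the action names here are invented:

```python
def run_workflow(actions):
    """actions: {name: [names it depends on]} -> execution order."""
    done, order = set(), []
    while len(done) < len(actions):
        ready = [a for a, deps in actions.items()
                 if a not in done and all(d in done for d in deps)]
        if not ready:
            raise ValueError("cycle in workflow")
        for action in sorted(ready):   # deterministic order for the demo
            order.append(action)
            done.add(action)
    return order

workflow = {
    "ingest":    [],
    "clean":     ["ingest"],
    "aggregate": ["clean"],
    "report":    ["aggregate"],
}
print(run_workflow(workflow))  # ['ingest', 'clean', 'aggregate', 'report']
```

An Oozie Coordinator adds the calendar part on top: triggering a whole workflow like this on a schedule or when input data arrives.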
Apache HBase is a tool for efficiently accessing a very large table, similar in layout to an Excel spreadsheet. A spreadsheet consisting of billions of rows and billions of columns would have trouble randomly accessing specific cells and pointers to information efficiently, but HBase offers a streamlined solution for efficiently reading a large structured dataset. HBase can give a developer fast random access to any cell in a table stored on HDFS. This tool utilizes the HDFS, gives real-time feedback, and is modeled after Google’s BigTable, the design Google uses to manage its data pointers and indexed file systems. HBase can also be used with HDFS and other Hadoop sub-products. The diagram below illustrates HBase’s place in the HDFS relative to the nodes and data layer.
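The data model behind that fast random access can be sketched as a sparse map keyed by row key and column name. This is a conceptual toy only, not the HBase client API; the row keys and "family:qualifier" column names below are invented:

```python
class SparseTable:
    """Toy HBase-style table: only cells that exist take up space."""
    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        # Random access: two dictionary lookups, regardless of table size.
        return self.rows.get(row_key, {}).get(column)

t = SparseTable()
t.put("user#42", "info:name", "Ada")
t.put("user#42", "info:city", "London")
t.put("user#7",  "info:name", "Alan")

print(t.get("user#42", "info:name"))   # Ada
print(t.get("user#7",  "info:city"))   # None (cell never written)
```

Unlike a spreadsheet, empty cells cost nothing, which is what makes a table of billions of rows and columns practical.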
Apache ZooKeeper is exactly what its name implies: a keeper of the Hadoop “zoo,” coordinating the many technologies in the stack. ZooKeeper is a general-purpose coordination service rather than a Hadoop-specific one, but it works naturally with Hadoop. Formally, ZooKeeper is a tool for “maintaining configuration information, naming, providing distributed synchronization, and providing group services” for distributed applications. ZooKeeper is usually implemented on separate servers but will be implemented on the virtual machine in this exercise. The diagram below shows how ZooKeeper manages its nodes and pushes information from master to slave to manage a client in real time.
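ZooKeeper's data model is a tree of small nodes ("znodes") holding configuration values, with watches that notify clients when a node changes. The sketch below illustrates that pattern in plain Python; it is a conceptual toy, not the real ZooKeeper client API, and the paths and node names are invented:

```python
class ZNodeStore:
    """Toy znode store: path -> value, with one-shot change watches."""
    def __init__(self):
        self.data = {}       # path -> value
        self.watches = {}    # path -> list of callbacks

    def set(self, path, value):
        self.data[path] = value
        for callback in self.watches.pop(path, []):
            callback(path, value)   # watches fire once, as in ZooKeeper

    def get(self, path, watch=None):
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return self.data.get(path)

events = []
store = ZNodeStore()
store.set("/config/master", "node1")
store.get("/config/master", watch=lambda p, v: events.append((p, v)))
store.set("/config/master", "node2")   # failover: watchers are notified
print(events)  # [('/config/master', 'node2')]
```

This notify-on-change pattern is how coordinated services learn about configuration changes and master failovers without polling.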
The Hadoop infrastructure sets up redundancies to keep data from ever being lost. Data is stored in blocks (typically 64 MB or 128 MB, depending on the Hadoop version), and each block is usually replicated on 3 nodes simultaneously. The pointers to these three replicas of the same block of data are kept by the HDFS NameNode. The replicas can be read simultaneously for quicker speeds, since 3 nodes and disks can serve the same data, and they also prevent failure and loss of data. HDFS keeps the replicas balanced across the cluster. The small cluster that will be run on a laptop for this experiment will use these redundancies to speed up the system rather than to prevent data loss. Google uses similar systems to hold its data in multiple locations and access it more quickly, because at that scale hardware failure is not an uncommon occurrence. The scheduling of jobs over this replicated data is handled by a job tracker such as YARN in this system.
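The replication scheme can be sketched as splitting a file into fixed-size blocks and placing each block on 3 distinct nodes. This is a conceptual illustration only; the 128 MB block size is one common HDFS default, the node names are invented, and the simple round-robin placement stands in for HDFS's actual rack-aware placement policy:

```python
BLOCK_SIZE = 128          # in MB; common HDFS defaults are 64 MB or 128 MB
REPLICATION = 3           # each block stored on 3 distinct nodes
NODES = ["node1", "node2", "node3", "node4", "node5"]

def place_blocks(file_size_mb):
    """Return {block index: [nodes holding a replica of that block]}."""
    n_blocks = -(-file_size_mb // BLOCK_SIZE)   # ceiling division
    placement = {}
    for b in range(n_blocks):
        # Simple round-robin placement across distinct nodes.
        placement[b] = [NODES[(b + r) % len(NODES)] for r in range(REPLICATION)]
    return placement

layout = place_blocks(file_size_mb=300)   # 300 MB -> 3 blocks
print(layout)
# Any one of the 3 replicas can serve a read, so losing a single node
# loses no data, and reads can be spread across disks for speed.
```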