Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is the foundation of any Hadoop deployment, whether a multi-node cluster or a single-node installation. HDFS is the underlying difference between a conventional relational database such as MySQL and a Hadoop implementation, and this change in how the data is approached makes all the difference.
A standard MySQL server serves the purpose for small and medium endeavors and can support a sizable infrastructure without problems. Its method for processing data usually follows a linear pattern. Take the example of the phrase "Hello world". In a rough representation, a MySQL server would save the entire phrase on one hard disk. Then, when the data is needed, the CPU sends a request for it, the hard disk spins, and the data is read and processed.
This traditional approach to managing a database hits a few key problems with no rational and affordable solution. The largest problem in this system is a mechanical one: at a certain level of complexity and size, a single hard disk can no longer physically spin fast enough to keep up with the seek demands of a single CPU. This problem leads to two possible solutions: make a better hard disk, or rethink the way data is processed. Hadoop takes the second path in a radical new way. A Hadoop cluster implements parallel computing using inexpensive, standard pieces of hardware, with the work distributed among many servers running in parallel. The philosophy behind Hadoop is essentially to bring the computation to the data. To implement this, the system distributes pieces of the same block of data among multiple servers, so each data node holds part of the overall data and can process the small portion that it holds. The advantage of this scheme becomes clear when the system is scaled up to an infrastructure of Google's size: the system no longer has the physical barrier of the spinning disks, but rather a problem of raw storage capacity (which is a very solvable, and good, problem to have).
So, back to the "Hello World" analogy. In that case, the phrase "hello world" would be split into two parts, "hello" and "world", saved on two separate nodes. When the data needs to be processed, a server will, in a rough sense, point both nodes at the work, and each node will seek half the data, roughly doubling the effective seek speed. Therefore, the more nodes an infrastructure has, the more a Hadoop cluster benefits, with throughput scaling roughly in proportion to the number of nodes.
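The "split the data, process each piece where it lives" idea above can be sketched in a few lines of Python. This is a toy illustration, not real HDFS: the "nodes" here are just local worker processes, and the block-splitting logic is an assumption made for the example.

```python
from concurrent.futures import ProcessPoolExecutor

def count_chars(block: str) -> int:
    """Each simulated node processes only the block of data it holds."""
    return len(block)

def process_distributed(phrase: str, num_nodes: int = 2) -> int:
    # Split the data into roughly equal blocks, one per "node".
    size = -(-len(phrase) // num_nodes)  # ceiling division
    blocks = [phrase[i:i + size] for i in range(0, len(phrase), size)]
    # Each node works on its own block at the same time; the results
    # are then combined, instead of one disk reading everything serially.
    with ProcessPoolExecutor(max_workers=num_nodes) as pool:
        return sum(pool.map(count_chars, blocks))

if __name__ == "__main__":
    print(process_distributed("hello world"))  # 11
```

With two nodes, each block is about half the phrase, so each worker touches half the data, which is the "roughly doubled seek speed" described above.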
The processing framework that runs on top of this distributed file system is called MapReduce. MapReduce is a programming model that aids in processing large data sets in parallel across a cluster. MapReduce jobs are broken down into map procedures, which filter and sort the data, and reduce procedures, which perform summary operations. MapReduce lies at the heart of all Hadoop queries.
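The map/shuffle/reduce stages can be sketched as a minimal, single-process word count in Python. This is a sketch of the programming model only; the function names are illustrative and are not Hadoop's actual Java API.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document: str):
    """Map: emit a (key, value) pair for every word in a document."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: perform a summary operation (here, a sum) per key."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["hello world", "hello hadoop"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'hello': 2, 'world': 1, 'hadoop': 1}
```

In a real cluster, the map calls run on the nodes that hold each block of data, and only the grouped intermediate pairs move across the network to the reducers, which is how MapReduce brings the computation to the data.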