Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System is the foundation for any Hadoop Cluster and/or single-node implementations. The HDFS is the underlying difference between a normal MySQL6 database and a Hadoop implementation. This small change in approaching the data makes all the difference.
A standard MySQL server serves the purpose for any small endeavors and can support an infrastructure about the size of Apple’s database with no problems. The method for processing data usually follows a linear though pattern.Take an example of a phrase “Hello world”. In a very rough representation a MySQL server would save the entire phrase on one hard disk. Then, when the data would be needed the CPU would send a request for the data, the hard disk would spin, and the data would be read/processed.
This traditional approach to managing a database hits a few, key problems with no rational and affordable solution. The largest problem that is faced in this system is a mechanical one. At a certain point of complexity and size, a single hard disk can no longer physically spin fast enough to keep up with the seek capabilities of a single CPU. This problem can lead two solutions: make a better hard disk or rethink the way data is processed in the world today. Hadoop offers a solution to rethink the way this problem is dealt with in a radical new way. A Hadoop cluster implements a parallel computing cluster using inexpensive and standard pieces of hardware. The cluster is distributed among many servers running in parallel. The philosophy behind Hadoop is basically to bring the computing to the data. To successfully implement this, the system has to distribute pieces of the same block of data among multiple servers. So basically each data node holds part of the overall data and can process the little data that it holds. This pyramid scheme is visible when the system is scaled up to an infrastructure of Google’s size. The system no longer has the physical barrier of the spinning disks but rather a problem of just storage capacity (which is a very solvable and good problem to have).