Big Data Clustering: Introduction & Topic

The past few years have posed new challenges to the advancement of human intelligence. Trillions of gigabytes of data are produced every year, far more than traditional database systems can keep pace with, even given all the computing power in existence today. This very problem has created a new industry we now know as Big Data. According to Wikipedia, the term big data is used “for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.”

Now you might be asking, “why would anyone in their right mind want to crunch all this data?”

Well, in the grand scheme of things, a lot of the data on the Internet hasn’t found a purpose… yet. But what big data organizations such as Google, Facebook, the Apache Software Foundation, and the NSA (National Security Agency) are trying to achieve is a radically new way to efficiently process zettabytes (a zettabyte is a trillion gigabytes) of information with the limited resources they have, separating the useful parts of a data set from the useless ones. From this data, each can derive information that serves its own purposes. For the NSA, drawing conclusions from big data may mean maintaining public welfare through surveillance; for Google, it might mean a better understanding of its customers and therefore better advertisements. Whatever its purpose, information has become the new gold of the information age, and mining it has become a ceaseless race among the brightest minds of the future. The tools are all freely available in the open source market; the real challenge is to use them efficiently enough to produce the best and most accurate correlative information at a reasonable cost and scale.

The technologies being developed right now can be divided into two broad categories: sifters and crunchers. Sifters are technologies, like Apache Pig, that help one sift through data, find the data relevant to the question at hand, and make sense of it. Crunchers are the underlying processing infrastructures, like Hadoop, that work through massive amounts of unstructured data. The goal of this experiment will be to use these two technologies in unison to reach a correlative conclusion using the data on the Internet.
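To make the “cruncher” idea concrete, here is a toy sketch of the MapReduce pattern that Hadoop implements at scale. This is an in-memory illustration only, with made-up sample data; a real Hadoop job distributes the same map and reduce phases across many machines and disks:

```python
from collections import Counter
from itertools import chain

def map_phase(line):
    # Map step: emit a (word, 1) pair for every word in a line of text.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Reduce step: sum the counts for each distinct word.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Hypothetical sample data standing in for a huge unstructured corpus.
lines = ["big data needs big tools", "data tools crunch data"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
word_counts = reduce_phase(mapped)
print(word_counts["data"])  # "data" appears three times across the lines
```

A sifter like Pig sits on top of this machinery: instead of writing map and reduce functions by hand, one writes a short query describing the data to keep, and it is compiled down into jobs like the one above.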

