
The Butterfly Effect

for the idea in all of us

Big Data Correlation (Research P1)

Big Data

What is big data?

Big data is commonly defined as “the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”2. In a sense this holds if we consider the internet itself to be such a collection of large data sets. This emerging industry already has a few key players that have developed technologies suited to either sifting or crunching through data. The tools we will be using in this exercise were developed by Apache, Google, and Hortonworks, but the engine that makes these tools work in unison will be the original idea created in this exercise.



Big Data Correlation: Purpose

Question

About 1.8 zettabytes (1.8 trillion gigabytes) of data are created every year. Somewhere in all this data lie answers to problems we have wondered about for ages; the challenge is processing the information efficiently and deriving correlations from the complexity of the data on the internet. You may not be able to prove anything scientifically, but with huge amounts of data you may be able to support hypotheses statistically, hidden as they are in this intimidating data set. So is it possible to mine hidden information at these huge scales? Can one use existing technologies such as Apache Hadoop, Nutch, MapReduce, and the Google APIs to develop an engine that can derive comprehensible correlational data autonomously and efficiently?
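As a toy illustration only (this is not the proposed engine; the corpus, function names, and co-occurrence measure are all placeholder assumptions), the Python sketch below mimics the MapReduce pattern mentioned above: a map step emits term-pair counts per document and a reduce step sums them, which is the kind of primitive a correlation engine on Hadoop would build on.

    from collections import Counter
    from itertools import combinations

    def map_cooccurrences(document):
        # Map step: emit ((term_a, term_b), 1) for every unordered
        # pair of distinct terms in a single document.
        terms = sorted(set(document.lower().split()))
        return [((a, b), 1) for a, b in combinations(terms, 2)]

    def reduce_counts(mapped_pairs):
        # Reduce step: sum the counts emitted for each term pair.
        totals = Counter()
        for pair, count in mapped_pairs:
            totals[pair] += count
        return totals

    # Placeholder corpus; a real engine would crawl pages with
    # Nutch and run the map/reduce steps on a Hadoop cluster.
    corpus = [
        "rising temperatures correlate with ice melt",
        "ice melt drives rising sea levels",
        "sea levels and temperatures are rising",
    ]

    mapped = [pair for doc in corpus for pair in map_cooccurrences(doc)]
    for (a, b), n in reduce_counts(mapped).most_common(3):
        print(a, "co-occurs with", b, "in", n, "documents")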

Purpose

With all this data being produced every year, finding a radical and innovative way of processing large and complex data sets is an unfulfilled need. For any computer, processing unstructured data is an arduous and lengthy process, and virtually all of the internet’s data is unstructured. This exercise in engine implementation is an attempt at combining multiple high-end technologies to work in unison to crunch and sift through large and complex data sets to …

Big Data Clustering: Introduction & Topic

The past few years have brought new problems to the advancement of human intelligence. Trillions of gigabytes of data are produced every year, and the total cumulative power of all the computers in existence today can process barely half that amount using traditional database systems to crunch the sheer volume of data. This very problem has created a new industry we now know as Big Data. According to Wikipedia, big data is the term used “for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”.


GA Utilizing Efficient Operators in TSP

From the data collected in the two pages above, it can reasonably be concluded that center inverse mutation, in unison with inversely linear roulette wheel selection and a random crossover point, yields the best result as the number of generations grows. We decided to test a combination of all of these genetic operators and record the lowest path weight it yielded. The same input graph used for the other tests was used in this case, with 6,000 chromosomes in the initial population, 5,000 generations, and a cutoff percentage of 30%.

The results for the top path after 5,000 generations are as follows:
Weight = 238
Path: {A, X, C, P, S, G, E, U, Q, Y, B, V, N, T, W, I, F, H, Z, O, D, R, M, L, K, J, A}

[Figure: the input graph, with all edges and weights present]

In the Comparison of Genetic Operators For Solving the Traveling Salesman Problem: Selection

In comparing selection methods, it was in our best interest to leave as little as possible to randomness outside of the selection method itself. The mutation method was center inverse mutation throughout all trials, and a center mutation point was chosen every time. The cutoff percentage was the same (30%) for each trial, and the number of generations was fixed at 5,000.

The numbers displayed below are the average of 10 trials conducted with the same input graph but a different initial population for each trial.

[Figure: selection method comparison]

In the Comparison of Genetic Operators For Solving the Traveling Salesman Problem: Mutation

In an attempt to compare the operators statistically, the input graph was kept the same for each trial. The numbers displayed below are the average of 10 trials conducted with the same input graph but a different initial population for each trial. The algorithm was run on an input graph consisting of 26 static nodes and approximately 4.03E26 possible combinations. Each trial ran for 5,000 generations with an initial population of 5,000 chromosomes. The fitness cutoff percentage was 30% throughout every trial.

Mutation Operators and Crossover Point

In this trial the method of selection was kept constant, using the percentage cutoff method, to avoid any influence from the selection operator on the comparison.

Average best path weight (10 trials):

                               Random Crossover Point    Center Crossover Point
    Reverse Sequence Mutation           336                       414
    Center Inverse Mutation             253                       310

The performance of each mutation operator over the generations was also tested with a constant center crossover point.

[Figure: mutation operator comparison over generations]

Genetic Algorithm: Selection

In every generation, a selection agent comes into play and sifts the fit chromosomes from the unfit. The selection agent “kills off” a user-specified percentage of organisms in the population. However, it is at the discretion of the selection agent to determine which chromosomes to kill. As mentioned earlier, fitness under the TSP is defined by having the lowest path weight; however, selection need not be based solely on that. This can be seen when comparing the two most prevalent types of selection operators: the percentage cutoff method and roulette wheel selection, both sketched below.
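A minimal Python sketch of both operators, assuming organisms are (weight, path) pairs where a lower weight is fitter; the function names and the inverse-linear weighting formula are our assumptions, not the project’s actual code.

    import random

    def cutoff_selection(population, cutoff=0.30):
        # Percentage cutoff (truncation) selection: sort by path
        # weight (lower is fitter for the TSP) and kill off the
        # worst `cutoff` fraction of the population.
        ranked = sorted(population, key=lambda org: org[0])
        keep = max(1, round(len(ranked) * (1 - cutoff)))
        return ranked[:keep]

    def roulette_selection(population, survivors):
        # Roulette wheel selection: survival probability is
        # inversely related to weight, so short paths are favoured
        # while long ones still get a small slice of the wheel.
        max_w = max(org[0] for org in population)
        fitness = [max_w - org[0] + 1 for org in population]
        return random.choices(population, weights=fitness, k=survivors)

    # Hypothetical population of (weight, path) organisms.
    pop = [(238, "A..A"), (310, "A..A"), (414, "A..A"), (520, "A..A")]
    print(cutoff_selection(pop))       # drops the weight-520 organism
    print(roulette_selection(pop, 3))  # weighted random sample

Note that the cutoff method is deterministic given the weights, while the roulette wheel leaves some unfit organisms alive, which is the trade-off the comparison pages above measure.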

Genetic Algorithm: Mutation

During the progression of a genetic algorithm, the population can get stuck at a local optimum (or extremum). Nature copes with local optima by adding random genetic diversity to the population “every so often” through mutation. Our genetic algorithm accomplishes this via the mutation operator. Although there is a plethora of mutation types, our GA focused on a select two (a sketch of both follows the examples below):

1. Reverse Sequence Mutation – In the reverse sequence mutation operator, we take a random point in the sequence of the organism. We split the path (P1) at the selected point. The second half of the split path (P1H2) is then inverted and appended to the end of the first half (P1H1), with the necessary corrections made to ensure the last node is the same as the start node, yielding the final mutated path (M1).

P1: {A, C, J | D, G, H, E, B, F, I, A}  ⇒ M1: {A, C, J, I, F, B, E, H, G, D, A}

2. Center Inverse Mutation – The chromosome (path or organism) is divided into two sections at the middle of the chromosome. Each of these sections is then inverted and added to a new path. The order of the halves remains constant, meaning the inverted first half remains the first half of the mutated path. The necessary corrections are made so that the mutated path is still a viable path for the TSP.

P1: {A, C, J, D, G, H, E | B, F, I, A}  ⇒ M1: {A, E, H, G, D, J, C, I, F, B, A}
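A minimal Python sketch of both mutation operators, assuming a path is a list whose first and last entries are the same start node; the optional point argument is our addition so the blog’s examples above can be reproduced exactly.

    import random

    def reverse_sequence_mutation(path, point=None):
        # Split the tour's interior at a (random) point and invert
        # everything after it; the fixed start/end node stays put.
        interior = path[1:-1]
        if point is None:
            point = random.randint(1, len(interior) - 1)
        mutated = interior[:point] + interior[point:][::-1]
        return [path[0]] + mutated + [path[-1]]

    def center_inverse_mutation(path, point=None):
        # Invert each half of the tour's interior about a split
        # point (the middle by default), keeping the halves in
        # their original order.
        interior = path[1:-1]
        if point is None:
            point = len(interior) // 2
        mutated = interior[:point][::-1] + interior[point:][::-1]
        return [path[0]] + mutated + [path[-1]]

    p1 = list("ACJDGHEBFIA")
    print(reverse_sequence_mutation(p1, point=2))  # A C J I F B E H G D A
    print(center_inverse_mutation(p1, point=6))    # A E H G D J C I F B A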

Genetic Algorithms: Crossover

The method of crossover remains fairly constant regardless of the problem and scope. Crossover is achieved by first selecting a crossover point within a pair of defined and unique organisms P1 and P2 (the equivalent of parents for the crossover children). The chromosomes are then split at the selected crossover point. The second half of P2 (P2H2) is then appended to the first half of P1 (P1H1) to make one child chromosome (C1). The second child (C2) is made by appending the second half of P1 (P1H2) to the first half of P2 (P2H1). A sketch of this splice follows.
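A minimal Python sketch of this one-point crossover, assuming both parents start and end at the same city; since the excerpt cuts off before any correction step, the duplicate-repair pass below is our assumption about how the children are made into valid tours.

    import random

    def _repair(child, donor):
        # Keep the first occurrence of each city, then append the
        # cities the splice lost, in the order the donor parent
        # visits them (this repair pass is our assumption).
        seen, fixed = set(), []
        for city in child:
            if city not in seen:
                seen.add(city)
                fixed.append(city)
        fixed += [city for city in donor if city not in seen]
        return fixed

    def one_point_crossover(p1, p2, point=None):
        # C1 = P1H1 + P2H2 and C2 = P2H1 + P1H2, spliced at one
        # shared crossover point on the tour interiors.
        t1, t2 = p1[1:-1], p2[1:-1]
        if point is None:
            point = random.randint(1, len(t1) - 1)
        c1 = _repair(t1[:point] + t2[point:], t1)
        c2 = _repair(t2[:point] + t1[point:], t2)
        depot = p1[0]  # both parents start and end at the same city
        return [depot] + c1 + [depot], [depot] + c2 + [depot]

    p1 = list("ACJDGHEBFIA")
    p2 = list("AJIHGFEDCBA")
    child1, child2 = one_point_crossover(p1, p2, point=4)
    print(child1)  # a valid tour mixing a prefix of P1 with P2
    print(child2)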
