Genetic Algorithms in Data Science

Optimization is often thought of as a subject belonging to Operations Research. But as a Data Science student, I got the opportunity to explore Genetic Algorithms in the field of data mining and machine learning. It is interesting that Darwin's theory of evolution describes an optimization process, and a great deal of research and bio-inspired computing work has been done in the field of Genetic Algorithms.

 

This blog talks about what a Genetic Algorithm (GA) is, why it acts as an optimization process, and how it is used in data science.

 

What is a Genetic Algorithm?

A genetic algorithm is an adaptive search procedure based on the mechanics of natural genetics and natural selection. It has two essential components:

  1. Survival of the fittest
  2. Genetic Diversity

The algorithm was first developed by John Holland in 1975. It is a search heuristic that mimics the process of natural evolution, using the concepts of 'natural selection' and 'genetic inheritance' from Darwin's theory.

 

Nature’s Diversity

Every animal is adapted to the environment it lives in. We can learn from nature, devising similar materials and approaches to solve our own problems.

 

Evolution in the Real World

To understand these biological processes properly, it helps to start with the cell. The human body contains trillions of cells.

Every cell contains a core structure (the nucleus) that holds the chromosomes. Humans have 23 pairs of chromosomes, each made up of tightly coiled strands of DNA.

Genes are segments of DNA that determine specific traits, such as hair color. Humans have more than 20,000 genes, and every gene determines some aspect of the organism. The collection of an organism's genes is called its genotype, and the collection of observable traits (like hair color) is called its phenotype.

A gene mutation is an alteration of DNA; it can be inherited, or acquired during a lifetime as cells age or are exposed to certain chemicals. Some changes in genes result in genetic disorders.

Reproduction involves recombination of genes from both parents, plus small amounts of mutation introduced while copying. For example, a female has XX chromosomes and a male XY: a daughter is XX, with one X from the mother and the other from the father, while a son is XY, with the X from the mother and the Y from the father.

The fitness of an organism relates to how much it can reproduce before it dies.

Natural Selection – This concept comes from Darwin's theory of evolution: only those organisms that adapt well to their environment survive and increase their numbers through reproduction, while the rest are eliminated. Analogously, a Genetic Algorithm maintains a population of candidate solutions and makes it evolve by iteratively applying a set of random operators.

 

Interesting things about Natural Evolution

While evolving, organisms adapt and improve through a largely random biological process, rather than by trying to understand the entire environment around them; they simply cope with the demands the environment makes. This observation inspired the branch of Bio-Inspired Computing.

 

The bio-inspired approach takes an evolutionary view of learning. In traditional AI, the logic is usually supplied by the programmer; in the bio-inspired approach, only simple rules for how to learn and synthesize are supplied, and the intelligence is left to evolve.

 

Problem Solving using GA and its association with Data Science

For a practical problem, the brute-force method of finding a solution usually does not work, because of the enormous number of possible solutions that would have to be generated and tested before the best one is found.

Genetic Algorithms can be used instead.

The steps of a Genetic Algorithm are as follows:

  • STEP 1: Initialize the population
  • STEP 2: Evaluate every individual's fitness
  • STEP 3: Check whether the fittest individual satisfies the constraints
  • STEP 4a: If yes, output the results and stop
  • STEP 4b: If no, select survivors, randomly vary every survivor, and go back to STEP 2
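The loop above can be sketched generically in Python. Everything here (init_population, fitness, satisfies, select, vary) is a caller-supplied placeholder, not part of any particular library, and lower fitness scores are treated as fitter:

```python
import random

def genetic_algorithm(init_population, fitness, satisfies, select, vary,
                      max_generations=100):
    """Generic GA loop mirroring STEPs 1-4 above (lower score = fitter)."""
    population = init_population()                           # STEP 1
    for _ in range(max_generations):
        scored = [(fitness(ind), ind) for ind in population]  # STEP 2
        scored.sort(key=lambda pair: pair[0])
        best_score, best = scored[0]
        if satisfies(best_score):                            # STEP 3 / 4a
            return best
        survivors = select(scored)                           # STEP 4b: selection
        population = [vary(random.choice(survivors))         # random variation...
                      for _ in range(len(population))]       # ...back to STEP 2
    return best
```

Here `select` receives (score, individual) pairs sorted best-first, and `vary` produces a randomly perturbed copy of a survivor.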

Simple Regression Problem

 

Behind every regression problem are optimized weights that fit the training and testing data well. These weights can also be obtained through Genetic Algorithms.

Let's say the aim is to find optimized Theta0 and Theta1 for the regression equation Demand = Theta0 + Theta1*Price.

 

Step 1: The population here is the set of possible combinations of theta0 and theta1 values, where an individual is one {theta0, theta1} pair. For example, {theta0, theta1} = {0.5, 0.9} is one individual in a sample population such as {0.5, 0.9}, {0.55, 0.8}, {0.34, 0.44}, and so on.

Steps 2 & 3: We define a cost function (fitness function) to evaluate the fitness of every individual. In OLS regression, the cost function aims to minimize error, so the fitness function here is the error function. Every individual's theta values are fed into the cost function, and we check which individual drives the error closest to zero: the closer the error is to zero, the fitter that individual. The constraint, then, is that the fitness function should return a value close to zero.
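As a sketch, the error-based fitness function for the Demand = Theta0 + Theta1*Price model might look like this. The price/demand numbers are made up for illustration (roughly following Demand = 11 - 2*Price):

```python
import numpy as np

# Made-up illustrative data: five price points and observed demand.
prices = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
demand = np.array([9.1, 7.0, 5.2, 2.9, 1.1])

def fitness(individual):
    """Error function: mean squared error of Demand = theta0 + theta1*Price.
    The closer the returned value is to zero, the fitter the individual."""
    theta0, theta1 = individual
    predicted = theta0 + theta1 * prices
    return float(np.mean((demand - predicted) ** 2))

fitness((0.5, 0.9))    # a poor individual: large error
fitness((11.0, -2.0))  # close to the underlying line: error near zero
```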

Step 4: Now, having obtained the fittest thetas, there may still be scope to improve these theta values so that the error reduces further. From the fittest individuals, the thetas are varied by employing natural selection and reproduction with mutation, and the results are tested again for fitness. By this I mean that the fittest {theta0, theta1} individual is selected along with another, random individual (natural selection). Then a crossover point is chosen: say theta0 is taken from the first individual and theta1 from the second, and both are placed in the child combination (the crossover point lies between theta0 and theta1). This child combination is then checked for fitness.
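The crossover and mutation just described can be sketched as follows; the mutation width sigma is an assumed illustrative value, not something prescribed by the method:

```python
import random

def crossover(parent1, parent2):
    # Crossover point between theta0 and theta1: take theta0 from the
    # first parent and theta1 from the second.
    return (parent1[0], parent2[1])

def mutate(individual, sigma=0.1):
    # Small random "copying errors" applied to each theta.
    return tuple(theta + random.gauss(0, sigma) for theta in individual)

# Cross the fittest individual with a randomly chosen one, then mutate.
child = mutate(crossover((0.5, 0.9), (0.55, 0.8)))  # roughly (0.5, 0.8)
```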

 

Over repeated applications of this process, the Genetic Algorithm usually converges to the fittest theta values.
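Putting the steps together, a minimal end-to-end run for the regression example might look like this. The population size, mutation width, error threshold, and synthetic price/demand data are all assumptions for illustration:

```python
import random
import numpy as np

random.seed(42)

# Illustrative data, assumed to follow roughly Demand = 11 - 2*Price.
prices = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
demand = np.array([9.1, 7.0, 5.2, 2.9, 1.1])

def error(ind):
    """Cost/fitness function: mean squared error of the individual's fit."""
    theta0, theta1 = ind
    return float(np.mean((demand - (theta0 + theta1 * prices)) ** 2))

def crossover(a, b):
    # Crossover point between theta0 and theta1, direction chosen at random.
    return (a[0], b[1]) if random.random() < 0.5 else (b[0], a[1])

def mutate(ind, sigma=0.5):
    # Small random copying errors in each theta.
    return (ind[0] + random.gauss(0, sigma), ind[1] + random.gauss(0, sigma))

# STEP 1: initialize a random population of (theta0, theta1) individuals.
population = [(random.uniform(-20, 20), random.uniform(-20, 20))
              for _ in range(50)]

for generation in range(500):
    population.sort(key=error)       # STEP 2: lower error = fitter
    best = population[0]
    if error(best) < 0.05:           # STEP 3 / 4a: constraint satisfied?
        break
    survivors = population[:10]      # STEP 4b: natural selection
    children = list(survivors[:2])   # keep the two fittest unchanged (elitism)
    while len(children) < len(population):
        p1, p2 = random.sample(survivors, 2)
        children.append(mutate(crossover(p1, p2)))
    population = children

print("best thetas:", best, "with error:", error(best))
```

With these settings the run typically ends with theta0 near 11 and theta1 near -2, i.e. close to the OLS solution for this data, though the exact values depend on the random seed.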

 
