This repository describes the theory behind “Population Methods” and then applies these methods in practical projects.
The theory is drawn from chapter 9 of Kochenderfer, M. J. & Wheeler, T. A. (2019). Algorithms for Optimization. The MIT Press. The various sections have been supplemented with other material found on the web.
Population methods perform optimization using a collection of design points, called individuals. Maintaining a large number of individuals distributed throughout the design space helps the algorithm avoid becoming stuck in a local minimum (an improvement over methods that move a single design point incrementally toward a minimum).
Information at different points in the design space can be shared between individuals to globally optimize the objective function. Most population methods are stochastic in nature, and it is generally easy to parallelize the computation.
Unlike “trajectory-based” methods (like Gradient Descent) that follow a single path, these algorithms throw a whole crowd of potential solutions into the search space and let them interact to find the best answer.
It’s essentially “survival of the fittest” applied to data.
Why not just use standard calculus (gradients)? Because real-world objective functions are often messy. Population methods offer:
Global Exploration: They are less likely to get stuck in a “local trap” (a small hill that looks like a mountain until you see the real Everest further away).
No Gradient Needed: They work on “black box” problems where we don’t have a neat formula for the derivative.
Parallelism: Since you have many candidates, you can calculate their fitness simultaneously on multiple processors.
Note: The trade-off is speed. These methods usually require more “fitness evaluations” (calculations) than gradient-based methods.
Population methods begin with an initial population, just as descent methods require an initial design point. The initial population should be spread over the design space to increase the chances that the samples are close to the best regions.
We can often constrain the design variables to a region of interest consisting of a hyperrectangle defined by lower and upper bounds a and b. Initial populations can be sampled from a uniform distribution for each coordinate:
\[x_i^{(j)} \sim U(a_i, b_i)\]where $x^{(j)}$ is the $j$th individual in the population.
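This initialization can be sketched in Python as follows (the book's own listings are in Julia; the function name `rand_population_uniform` is illustrative):

```python
import random

def rand_population_uniform(m, a, b):
    """Sample m individuals uniformly within the hyperrectangle [a, b]."""
    return [[random.uniform(lo, hi) for lo, hi in zip(a, b)]
            for _ in range(m)]
```

For example, `rand_population_uniform(10, [-2.0, -2.0], [2.0, 2.0])` produces ten individuals spread over the square with corners at (−2, −2) and (2, 2).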
A comparison of the normal distribution with standard deviation 1 and the Cauchy distribution with scale 1. Although σ is sometimes used for the scale parameter in the Cauchy distribution, this should not be confused with the standard deviation, since the standard deviation of the Cauchy distribution is undefined. The Cauchy distribution is heavy-tailed, allowing it to cover the design space more broadly.
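A heavy-tailed initialization can be sampled with the inverse-CDF transform, since Python's standard library has no Cauchy sampler. This is a hedged sketch; `rand_population_cauchy` is an illustrative name:

```python
import math
import random

def rand_population_cauchy(m, mu, sigma):
    """Sample m individuals from independent per-coordinate Cauchy
    distributions centered at mu with scale sigma (note: sigma is a
    scale parameter, not a standard deviation, which is undefined
    for the Cauchy distribution)."""
    return [[mui + si * math.tan(math.pi * (random.random() - 0.5))
             for mui, si in zip(mu, sigma)]
            for _ in range(m)]
```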
Genetic algorithms borrow inspiration from biological evolution, where fitter individuals are more likely to pass on their genes to the next generation. An individual’s fitness for reproduction is inversely related to the value of the objective function at that point. The design point associated with an individual is represented as a chromosome. At each generation, the chromosomes of the fitter individuals are passed on to the next generation after undergoing the genetic operations of crossover and mutation.
There are several ways to represent chromosomes. The simplest is the binary string chromosome, a representation similar to the way DNA is encoded. A random binary string of length d can be generated in Julia using bitrand(d).
Binary strings are often used due to the ease of expressing crossover and mutation. It is often more natural to represent a chromosome using a list of real values. Such real-valued chromosomes are vectors in $\mathbb{R}^d$ that directly correspond to points in the design space.
A chromosome represented as a binary string.
Genetic algorithms start with a random initial population. Binary string chromosomes are typically initialized using random bit strings.
Selection is the process of choosing chromosomes to use as parents for the next generation. For a population with m chromosomes, a selection method will produce a list of m parental pairs for the m children of the next generation.
There are several approaches for biasing the selection toward the fittest:
Truncation selection — parents are chosen from among the best k chromosomes in the population.
Tournament selection — each parent is the fittest out of k randomly chosen chromosomes.

Crossover combines the chromosomes of parents to form children. There are several crossover schemes:

Single-point crossover — splits the parents at a single random crossover point.
Two-point crossover — uses two random crossover points.
Uniform crossover — each bit has a fifty percent chance of coming from either parent.
For real-valued chromosomes, values can also be linearly interpolated between the parents:
\[x \leftarrow (1 - \lambda)x_a + \lambda x_b\]where $\lambda$ is a scalar parameter typically set to one-half.
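The selection and crossover operations above can be sketched in Python as follows (the helper names `tournament_select`, `single_point_crossover`, and `interpolate` are illustrative, not from the repository):

```python
import random

def tournament_select(population, fitness, k=2):
    """Return the individual with the best (lowest) objective value
    among k randomly chosen chromosomes."""
    contenders = random.sample(range(len(population)), k)
    return population[min(contenders, key=lambda i: fitness[i])]

def single_point_crossover(a, b):
    """Split both bit-string parents at one random point and splice."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

def interpolate(xa, xb, lam=0.5):
    """Real-valued crossover: linear interpolation between parents."""
    return [(1 - lam) * va + lam * vb for va, vb in zip(xa, xb)]
```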
If new chromosomes were produced only through crossover, many traits not present in the initial random population could never occur, and the most-fit genes could saturate the population. Mutation allows new traits to spontaneously appear, allowing the genetic algorithm to explore more of the state space. Child chromosomes undergo mutation after crossover.
Each bit in a binary-valued chromosome typically has a small probability of being flipped. For a chromosome with m bits, this mutation rate is typically set to 1/m, yielding an average of one mutation per child chromosome. Mutation for real-valued chromosomes is more commonly implemented by adding zero-mean Gaussian noise.
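Both mutation schemes can be sketched as follows (illustrative helper names, assuming bit flips at rate 1/m and additive Gaussian noise as described above):

```python
import random

def mutate_bits(chromosome, rate=None):
    """Flip each bit independently with a small probability
    (defaulting to 1/m for a chromosome of m bits)."""
    m = len(chromosome)
    rate = 1 / m if rate is None else rate
    return [1 - bit if random.random() < rate else bit
            for bit in chromosome]

def mutate_real(x, sigma=0.1):
    """Real-valued mutation: add zero-mean Gaussian noise to each
    coordinate."""
    return [v + random.gauss(0.0, sigma) for v in x]
```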
Mutation for binary string chromosomes gives each bit a small probability of flipping.
A genetic algorithm with truncation selection, single-point crossover, and Gaussian mutation with σ = 0.1 applied to the Michalewicz function.
Differential evolution attempts to improve each individual in the population by recombining other individuals according to a simple formula. It is parameterized by a crossover probability p and a differential weight w. Typically, w is between 0.4 and 1. For each individual x:

Choose three random, distinct individuals a, b, and c.
Construct a candidate individual $z = a + w(b - c)$.
Form the new design by taking each dimension from z with probability p (with at least one dimension always taken from z) and otherwise from x, keeping it only if it improves the objective.

Differential evolution thus takes three individuals a, b, and c and combines them to form the candidate individual z, as the following image demonstrates:
Differential evolution with p = 0.5 and w = 0.2 applied to Ackley’s function.
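One generation of this scheme might look like the following Python sketch (minimizing f; the function name and default parameters are illustrative):

```python
import random

def differential_evolution_step(population, f, w=0.8, p=0.9):
    """Apply one generation of differential evolution to a population
    of real-valued vectors, minimizing objective f."""
    new_population = []
    for i, x in enumerate(population):
        # Choose three distinct individuals other than x.
        others = [ind for j, ind in enumerate(population) if j != i]
        a, b, c = random.sample(others, 3)
        z = [ai + w * (bi - ci) for ai, bi, ci in zip(a, b, c)]
        # Binomial crossover: take each dimension from z with
        # probability p, forcing dimension j to come from z.
        j = random.randrange(len(x))
        candidate = [zi if (k == j or random.random() < p) else xi
                     for k, (xi, zi) in enumerate(zip(x, z))]
        # Keep the candidate only if it improves on x.
        new_population.append(candidate if f(candidate) < f(x) else x)
    return new_population
```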
Particle swarm optimization introduces momentum to accelerate convergence toward minima. Each individual, or particle, in the population keeps track of its current position, velocity, and the best position it has seen so far. Momentum allows an individual to accumulate speed in a favorable direction, independent of local perturbations.
At each iteration, each individual is accelerated toward both the best position it has seen and the best position found thus far by any individual. The update equations are:
\[x^{(i)} \leftarrow x^{(i)} + v^{(i)}\] \[v^{(i)} \leftarrow wv^{(i)} + c_1 r_1 \left(x^{(i)}_\text{best} - x^{(i)}\right) + c_2 r_2 \left(x_\text{best} - x^{(i)}\right)\]where $x_\text{best}$ is the best location found so far over all particles; $w$, $c_1$, and $c_2$ are parameters; and $r_1$ and $r_2$ are random numbers drawn from $U(0, 1)$.
A common strategy is to allow the inertia $w$ to decay over time.
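The update equations above can be sketched as one iteration in Python (names and default coefficients are illustrative; `x_bests` tracks each particle's best position and `f` is the objective to minimize):

```python
import random

def particle_swarm_step(xs, vs, x_bests, f, w=0.7, c1=1.5, c2=1.5):
    """One particle swarm iteration: move each particle, then update
    its velocity toward its personal best and the global best."""
    g_best = min(x_bests, key=f)  # best position over all particles
    for i in range(len(xs)):
        xs[i] = [x + v for x, v in zip(xs[i], vs[i])]
        r1, r2 = random.random(), random.random()
        vs[i] = [w * v + c1 * r1 * (pb - x) + c2 * r2 * (gb - x)
                 for v, x, pb, gb in zip(vs[i], xs[i], x_bests[i], g_best)]
        if f(xs[i]) < f(x_bests[i]):
            x_bests[i] = list(xs[i])
    return xs, vs, x_bests
```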
The firefly algorithm was inspired by the manner in which fireflies flash their lights to attract mates. In the firefly algorithm, each individual in the population is a firefly and can flash to attract other fireflies. At each iteration, all fireflies are moved toward all more attractive fireflies. A firefly a is moved toward a firefly b with greater attraction according to:
\[a \leftarrow a + \beta I(\|b - a\|)(b - a) + \alpha \epsilon\]where $I$ is the intensity of the attraction and $\beta$ is the source intensity. A random walk component is included as well, where $\epsilon$ is drawn from a zero-mean, unit covariance multivariate Gaussian, and $\alpha$ scales the step size.
The intensity $I$ decreases as the distance $r$ between the two fireflies increases and is defined to be 1 when $r = 0$. Several models are available:
Inverse square law: $I(r) = \dfrac{1}{r^2}$
Exponential decay (absorbed in medium): $I(r) = e^{-\gamma r}$
Gaussian drop-off (recommended — avoids singularity at $r = 0$): $I(r) = e^{-\gamma r^2}$
Firefly search with α = 0.5, β = 1, and γ = 0.1 applied to the Branin function.
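A single iteration using the Gaussian drop-off intensity model could be sketched as follows (a hedged sketch, minimizing f, with sequential in-place moves; the function name is illustrative):

```python
import math
import random

def firefly_step(population, f, alpha=0.5, beta=1.0, gamma=0.1):
    """Move each firefly toward every more attractive firefly, with a
    Gaussian intensity drop-off and a scaled Gaussian random walk."""
    for a in population:
        for b in population:
            if f(b) < f(a):  # b is more attractive (lower objective)
                r2 = sum((bi - ai) ** 2 for ai, bi in zip(a, b))
                intensity = math.exp(-gamma * r2)
                for k in range(len(a)):
                    a[k] += (beta * intensity * (b[k] - a[k])
                             + alpha * random.gauss(0.0, 1.0))
    return population
```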
Cuckoo search is another nature-inspired algorithm named after the cuckoo bird, which engages in a form of brood parasitism. Cuckoos lay their eggs in the nests of other birds; the host bird may detect the invasive egg and destroy it or establish a new nest elsewhere, or may accept and raise it.
In cuckoo search, each nest represents a design point. New design points are produced using Lévy flights from nests — random walks with step-lengths from a heavy-tailed distribution. A new design point can replace a nest if it has a better objective function value.
The core rules are:

Each cuckoo lays one egg (a new design point) at a time in a randomly chosen nest.
The best nests carry over to the next generation.
A fraction of the worst nests is abandoned and replaced with new, randomly generated design points.

Cuckoo search uses a Cauchy distribution for random flights, which has a heavier tail than uniform or Gaussian distributions and is more representative of animal movement patterns in the wild.
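The flight-and-replace step can be sketched in Python, sampling Cauchy steps via the inverse-CDF transform (function names are illustrative; replacement happens only when the new design point is better):

```python
import math
import random

def cauchy_sample(scale=1.0):
    """Sample a zero-centered Cauchy variate via the inverse CDF."""
    return scale * math.tan(math.pi * (random.random() - 0.5))

def cuckoo_search_step(nests, f, scale=1.0):
    """One step: each nest proposes a heavy-tailed random flight and
    keeps the new design point only if it improves the objective."""
    for i, x in enumerate(nests):
        candidate = [xi + cauchy_sample(scale) for xi in x]
        if f(candidate) < f(x):
            nests[i] = candidate
    return nests
```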
Other nature-inspired algorithms include the artificial bee colony, the gray wolf optimizer, the bat algorithm, glowworm swarm optimization, intelligent water drops, and harmony search. There has been some criticism of the proliferation of methods that make analogies to nature without fundamentally contributing novel methods and understanding.
Many population methods perform well in global search, being able to avoid local minima and finding the best regions of the design space. However, these methods do not perform as well in local search compared to descent methods. Several hybrid methods (also referred to as memetic algorithms or genetic local search) have been developed to extend population methods with descent-based features.
Cuckoo search applied to the Branin function
There are two general approaches:
Lamarckian learning — the population method is extended with a local search method that locally improves each individual. The original individual and its objective function value are replaced by the individual’s optimized counterpart.
Baldwinian learning — the same local search method is applied to each individual, but the results are used only to update the individual’s perceived objective function value. Individuals are not replaced but are merely associated with optimized objective function values. Baldwinian learning can help prevent premature convergence.
Consider optimizing $f(x) = -e^{-x^2} - 2e^{-(x-3)^2}$ using a population of individuals initialized near $x = 0$.
A Lamarckian local search update applied to this population would move the individuals toward the local minimum, reducing the chance that future individuals escape and find the global optimum near $x = 3$.
A Baldwinian approach computes the same update but leaves the original designs unchanged. The selection step then values each design according to the result of its local search, preserving diversity and improving the chances of finding the global optimum.
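The contrast between the two learning schemes can be sketched on this one-dimensional objective (the crude grid-based `local_search` and the starting population are illustrative assumptions, not from the source):

```python
import math

def f(x):
    """The objective from the example: local minimum near x = 0,
    global minimum near x = 3."""
    return -math.exp(-x ** 2) - 2 * math.exp(-(x - 3) ** 2)

def local_search(x, step=0.05, iters=20):
    """A crude descent: repeatedly move to the best neighboring point
    on a fixed grid (an illustrative stand-in for a real local method)."""
    for _ in range(iters):
        x = min((x - step, x, x + step), key=f)
    return x

population = [-0.2, 0.0, 0.1, 0.3]  # initialized near x = 0

# Lamarckian learning: each individual is replaced by its locally
# optimized counterpart, pulling the whole population into the
# nearby local minimum.
lamarckian = [local_search(x) for x in population]

# Baldwinian learning: designs stay where they are, but selection
# scores each one by the objective value its local search would reach.
baldwinian_fitness = [f(local_search(x)) for x in population]
```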