# Parallel Standard Particle Swarm Optimization
Particle Swarm Optimization (PSO) @PSO95 is a meta-heuristic inspired by the behavior of bird flocks. At every iteration, each particle moves through the search space driven by three components:
- Social: this component contributes to the exploitation of the algorithm; it guides the search toward the best solutions found by the swarm.
- Individual: this component guides the particle toward the best region that the particle itself has found.
- Inertia: this component increases the exploration of the algorithm; it is responsible for keeping the particle moving in its previous direction and for avoiding abrupt changes of direction.
The position of a particle is a set of variable values, i.e., a solution to an optimization problem. The algorithm then evaluates this solution and associates a fitness value with it. Based on the fitness value, the Social and Individual components are updated, and these components are used to compute the particle's next position (another candidate solution).
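For reference, the three components combine in the widely used inertia-weight form of the PSO update (the notation here is ours):

$$
\begin{aligned}
v_i^{t+1} &= \omega\, v_i^{t} + c_1 r_1 \left(p_i - x_i^{t}\right) + c_2 r_2 \left(g_i - x_i^{t}\right),\\
x_i^{t+1} &= x_i^{t} + v_i^{t+1},
\end{aligned}
$$

where $x_i^t$ and $v_i^t$ are the position and velocity of particle $i$ at iteration $t$, $p_i$ is its personal best (Individual component), $g_i$ is the best position found in its neighborhood (Social component), $\omega$ is the inertia weight, $c_1$ and $c_2$ are acceleration coefficients, and $r_1$ and $r_2$ are uniform random numbers in $[0, 1]$.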
Since its first publication, PSO has received several adaptations and improvements. In 2006, the first version of a Standard Particle Swarm Optimization (SPSO) was created @SPSO. The SPSO was not proposed to be the best PSO available, but to establish a common benchmark as a baseline for assessing the PSO variants in the literature.
Since its first publication (2006), there have been three versions of SPSO: 2006, 2007, and 2011. The latest version works as follows (and is shown in Algorithm [alg:spso2011]): the first population (set of solutions) is initialized randomly in the search space. Then, the algorithm also initializes the velocities randomly. The suggested population size is
In the previous versions of PSO, the velocity was updated dimension by dimension. In SPSO2011, however, the velocity is updated in a geometrical way that does not depend on the system of coordinates.
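In outline, following the SPSO2011 description @SPSO (the notation here is ours): let $p_i$ be the personal best of particle $i$ and $l_i$ the best in its neighborhood; then

$$
\begin{aligned}
G_i &= \frac{x_i + \left(x_i + c\,(p_i - x_i)\right) + \left(x_i + c\,(l_i - x_i)\right)}{3},\\
x_i' &\sim \mathcal{U}\!\left(\mathcal{H}\!\left(G_i,\ \lVert G_i - x_i \rVert\right)\right),\\
v_i^{t+1} &= \omega\, v_i^{t} + x_i' - x_i^{t},\qquad x_i^{t+1} = x_i^{t} + v_i^{t+1},
\end{aligned}
$$

where $G_i$ is the center of gravity of the current position, a point beyond the personal best, and a point beyond the neighborhood best, and the new point $x_i'$ is drawn uniformly from the hypersphere $\mathcal{H}$ centered at $G_i$ with radius $\lVert G_i - x_i \rVert$. Because this construction uses only points and distances, it is invariant to rotations of the coordinate system.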
In the literature, it is possible to find different proposals of Parallel PSOs. One of the most recent is the GPU-based Asynchronous Particle Swarm Optimization @GPUPSO, which implements a simple ring topology. A review of Particle Swarm Optimization on GPU is presented in @ReviewGPUPSO; it covers 28 references on PSO on GPU from 2009 to 2014, including multi-objective versions and variations of PSO. One of the papers in the review implements the Standard Particle Swarm Optimization 2007 on GPU @GPUSPSO2007; the GPU version was 11 times faster than the CPU, mainly with large populations and high-dimensional problems.
In this work, we use the SPSO2011, which uses the Adaptive Random Neighborhood topology. During the literature review, no other GPU implementation of the SPSO2011 was found. The parallel implementation associates each particle with one thread (Algorithm [alg:pspso2011]). In this way, all `foreach particle i ∈ swarm` loops were removed and executed in parallel, based on the thread id inside the block (`threadIdx.x`).
The first particle copies the best fitness for later comparison. Inside the main loop, each particle updates its velocity, position, and fitness, and then updates its personal-best information. Before the neighbor-best information is set, a synchronization is necessary, because each thread must compare its fitness against the updated values of its neighbors. Then, the neighbor-best information is updated. The best fitness is also updated after a `__syncthreads()`, so that all threads have updated the best fitness before its value is used. If the best fitness has not improved, a new neighborhood matrix is generated.
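A minimal CUDA sketch of this structure is given below. It is not the authors' code: the `spso` and `sphere` names, the constants `N`, `DIM`, and `ITER`, the array layout, the coefficient values, and the classic velocity update (used here in place of the full SPSO2011 geometric update above) are all illustrative assumptions; only the one-thread-per-particle mapping and the two synchronization points per iteration follow the description.

```cuda
#include <curand_kernel.h>

#define N    32    // swarm size = block size = warp size (assumption)
#define DIM  10    // problem dimension (assumption)
#define ITER 1000  // number of iterations (assumption)

// Assumed benchmark function (Sphere).
__device__ float sphere(const float *x) {
    float s = 0.0f;
    for (int d = 0; d < DIM; ++d) s += x[d] * x[d];
    return s;
}

// One thread per particle; all arrays are assumed initialized on the host.
__global__ void spso(float *pos, float *vel, float *pbest, float *pbestFit,
                     const int *adj, unsigned long long seed)
{
    int i = threadIdx.x;               // particle index = thread id in block
    curandState rng;
    curand_init(seed, i, 0, &rng);

    float lbest[DIM];                  // private copy of the neighborhood best
    for (int d = 0; d < DIM; ++d) lbest[d] = pbest[i * DIM + d];

    for (int t = 0; t < ITER; ++t) {
        // Move the particle (classic inertia/individual/social update;
        // the real SPSO2011 uses the geometric update shown earlier).
        for (int d = 0; d < DIM; ++d) {
            int k = i * DIM + d;
            vel[k] = 0.72f * vel[k]
                   + 1.19f * curand_uniform(&rng) * (pbest[k] - pos[k])
                   + 1.19f * curand_uniform(&rng) * (lbest[d] - pos[k]);
            pos[k] += vel[k];
        }
        // Evaluate the new position and update the personal best.
        float f = sphere(&pos[i * DIM]);
        if (f < pbestFit[i]) {
            pbestFit[i] = f;
            for (int d = 0; d < DIM; ++d) pbest[i * DIM + d] = pos[i * DIM + d];
        }
        // First sync: every personal best is written before neighbors read it.
        __syncthreads();
        // Pick the best personal best among my neighbors (adjacency matrix
        // row i) and copy its position into private storage.
        int best = i;
        for (int j = 0; j < N; ++j)
            if (adj[i * N + j] && pbestFit[j] < pbestFit[best]) best = j;
        for (int d = 0; d < DIM; ++d) lbest[d] = pbest[best * DIM + d];
        // Second sync: all copies finish before pbest is overwritten next iteration.
        __syncthreads();
    }
}
```

Copying the neighborhood best into private storage between the two synchronizations is one way to keep exactly two `__syncthreads()` per iteration while avoiding a read/write race on the personal-best arrays.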
The first implementation did not use shared memory; a second implementation did, and the execution time improved. Another important point is the two `__syncthreads()` calls inside the loop. The use of `__syncthreads()` generally slows down the execution of a CUDA program. However, in this implementation that was not a problem, because the block size equals the warp size. Since the threads in the same warp execute in true (hardware) parallelism, there is no situation where threads need to wait for others in the same block.
The first experiments were executed using only one block. However, the experiments run the SPSO several times, and those executions can be run in parallel using different, independent blocks. In the experiments using only one block, the CPU implementation had the better execution time; when running multiple executions at once, however, the GPU implementation achieved the better execution time (the two launch configurations are sketched below). The next section details the experiments and results.
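As a hedged sketch, reusing the illustrative `spso` kernel above, the two configurations differ only in the launch geometry:

```cuda
// Single run: one block of 32 threads (one thread per particle).
spso<<<1, 32>>>(pos, vel, pbest, pbestFit, adj, seed);

// 51 independent runs at once: one block per run. Each block would have to
// index its own slice of correspondingly larger arrays via blockIdx.x;
// that per-run offset is an assumption not shown in the kernel above.
spso<<<51, 32>>>(pos, vel, pbest, pbestFit, adj, seed);
cudaDeviceSynchronize();  // wait for all runs to finish
```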
The first experiment was used to validate the equivalence of the GPU and CPU implementations regarding solution quality. Both implementations showed similar convergence along the execution.
For this and all other experiments, the Sphere function with dimension 10 and a swarm of 32 particles was used.
Table [tbl:gpuinfo] presents information about the GPU used, and Table [tbl:cpuinfo] presents the CPU information.
Table 1. GPU information (device 0).

| Property | Value |
|---|---|
| **General information** | |
| Name | GeForce GTX 680 |
| Compute capability | 3.0 |
| Clock rate | 1058500 kHz |
| Device copy overlap | Enabled |
| Kernel execution timeout | Disabled |
| **Memory information** | |
| Total global memory | 2095382528 bytes |
| Total constant memory | 65536 bytes |
| Max memory pitch | 2147483647 bytes |
| Texture alignment | 512 bytes |
| **Multiprocessor (MP) information** | |
| Multiprocessor count | 8 |
| Shared memory per MP | 49152 bytes |
| Registers per MP | 65536 |
| Threads per warp | 32 |
| Max thread dimensions | (1024, 1024, 64) |
| Max threads per block | 1024 |
| Max grid dimensions | (2147483647, 65535, 65535) |
Table 2. CPU information (fields as reported by /proc/cpuinfo).

| Field | Value |
|---|---|
| vendor_id | GenuineIntel |
| cpu family | 6 |
| model | 63 |
| model name | Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz |
| stepping | 2 |
| microcode | 0x36 |
| cpu MHz | 3599.941 |
| cache size | 15360 KB |
Figures [fig:loglog_convergence] and [fig:semilogy_convergence] present the average convergence of the algorithms (GPU and CPU) over the iterations, where similar behavior can be observed.
loglog of mean CPU and GPU convergence
semilogy of mean CPU and GPU convergence
Then, we also implemented a version using shared memory. The best-known fitness, the previous best, the population (or swarm), the fitness values, and the adjacency matrix use shared memory. We compared the three versions (CPU, GPU, and GPU with shared memory). In terms of convergence, the results were similar, as they should be (Figure [fig:sphere10_32particles_fitness]). The experiments were repeated 51 times.
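A hedged sketch of how these declarations might look is given below; the sizes reuse the assumed `N = 32` and `DIM = 10` from the earlier kernel (about 8 KB in total, well under the 49152 bytes of shared memory per multiprocessor in Table 1), and the variable names are ours:

```cuda
__global__ void spso_shared(/* global-memory inputs and outputs elided */)
{
    __shared__ float s_bestFit;            // best-known fitness
    __shared__ float s_pbest[N * DIM];     // previous (personal) bests
    __shared__ float s_pbestFit[N];
    __shared__ float s_pos[N * DIM];       // the population (swarm)
    __shared__ float s_vel[N * DIM];
    __shared__ float s_fit[N];             // fitness values
    __shared__ int   s_adj[N * N];         // adjacency (neighborhood) matrix

    // Load the initial data from global memory once, run the main loop
    // entirely in shared memory, and write the results back at the end.
}
```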
Convergence of different implementations (single block - 51 executions)
Execution time of different implementations (single block - 51 executions)
To use the CUDA parallelism capabilities and exploit the characteristics of SPSO, the GPU implementation executes the 51 independent runs in parallel, one block per run.
Convergence of different implementations (GPU 51-blocks, 30 repetitions)
Execution time of different implementations (GPU 51-blocks, 30 repetitions)
In this project, we implemented a Parallel Standard Particle Swarm Optimization. First, we compared the implementations regarding convergence quality, and the GPU and CPU implementations were similar. After different implementations and experiments, it was possible to achieve a speedup of 12.054: the execution time of the CPU version for 51 runs was 12.84892 seconds (on average over 30 repetitions), while the execution time of the GPU version was 1.065946 seconds.
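The reported speedup follows directly from these two measured times:

$$
\text{speedup} = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{GPU}}} = \frac{12.84892\ \mathrm{s}}{1.065946\ \mathrm{s}} \approx 12.054.
$$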