We benchmarked our implementation of BEAST (1.6.1) on the BEAGLE framework. All benchmarking runs were performed on the Trestles supercomputer.
To determine how to run BEAST with the BEAGLE library efficiently, we asked the users who requested BEAST to contribute exemplar data sets. The four data sets we received had too few unique sites per partition to benefit from using GPUs, much like the benchmark1.xml data set distributed with BEAST. Accordingly, we implemented BEAST with BEAGLE on Trestles at SDSC using the beagle_SSE option, which runs on CPUs rather than GPUs.
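For reference, a minimal CPU-only invocation with the SSE kernel might look like the sketch below. The input file name is a placeholder, and while -beagle_CPU and -beagle_SSE are standard BEAST 1.x command-line flags, please confirm the options available in your installation with beast -help.

```
# Minimal sketch (not our exact job script): run BEAST with the BEAGLE CPU/SSE kernel.
# "mydata.xml" is a placeholder for your own input file.
beast -beagle_CPU -beagle_SSE mydata.xml
```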
Two types of threaded parallelization are available, BEAST threads and BEAGLE threads (see the corresponding columns in the table below), and we used them together to optimize resource use.
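As a sketch of how the two levels map onto the BEAST 1.x command line, we assume here that -threads sets the number of BEAST threads and -beagle_instances sets the number of BEAGLE instances per thread (the "BEAGLE threads" column in the table below), so the total number of cores used is the product of the two. Check beast -help for the exact options in your version.

```
# Sketch only: 2 BEAST threads x 4 BEAGLE instances = 8 cores,
# matching one of the eight-core DS 1 runs in the table below.
# The flag-to-column mapping is an assumption; verify with "beast -help".
beast -beagle_CPU -beagle_SSE -threads 2 -beagle_instances 4 mydata.xml
```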
The table below shows the run times we measured on Trestles with the native BEAST kernel and the beagle_SSE kernel for various combinations of threads and CPU cores. The best performance was generally obtained with the beagle_SSE kernel, assigning cores to BEAST threads first and to BEAGLE threads second. Because run speed does not increase linearly with the number of cores used, there is a tradeoff between shortening the run time and consuming more resources. The eight-core runs in the table below appeared to balance these two criteria best, giving speedups of 2- to 7-fold depending on the data set. Higher speedups are possible, but for the data sets we examined they came only at much higher cost (see the Cost column in the table for examples).
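In the table, Speedup is measured relative to the single-core native run for each data set, and Cost is the run time multiplied by the number of cores used. For DS 3, for example, the 32-core beagle_SSE run is about 9.6 times faster than the baseline (141.22 min / 14.66 min) but costs roughly 32 × 14.66 ≈ 469 cpu min, nearly three times the cost of the eight-core run (8 × 20.37 ≈ 163 cpu min).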
Run times and speedups of BEAST/BEAGLE on Trestles for four user data sets

| Data set | ntax | nchar | Partitions | Unique sites/partition | Time steps | Kernel | BEAST threads | BEAGLE threads | Cores | Run time (min) | Speedup | Cost (cpu min) |
|----------|------|-------|------------|------------------------|------------|------------|---------------|----------------|-------|----------------|---------|----------------|
| DS 1 | 48 | 1,577 | 2 | 237–276 | 100k | native | 1 | n/a | 1 | 2.16 | 1.00 | 2.16 |
| | | | | | | native | 2 | n/a | 2 | 1.95 | 1.11 | 3.90 |
| | | | | | | beagle_SSE | 1 | 1 | 1 | 2.60 | 0.83 | 2.60 |
| | | | | | | beagle_SSE | 2 | 4 | 8 | 1.06 | 2.03 | 8.48 |
| | | | | | | beagle_SSE | 2 | 8 | 16 | 1.03 | 2.10 | 16.48 |
| DS 2 | 131 | 3,095 | 4 | 122–752 | 10k | native | 1 | n/a | 1 | 6.15 | 1.00 | 6.15 |
| | | | | | | native | 4 | n/a | 4 | 3.55 | 1.73 | 14.20 |
| | | | | | | beagle_SSE | 1 | 1 | 1 | 3.45 | 1.78 | 3.45 |
| | | | | | | beagle_SSE | 4 | 2 | 8 | 1.14 | 5.41 | 9.09 |
| | | | | | | beagle_SSE | 4 | 4 | 16 | 0.87 | 7.04 | 13.07 |
| DS 3 | 348 | 6,954 | 16 | 37–814 | 10k | native | 1 | n/a | 1 | 141.22 | 1.00 | 141.22 |
| | | | | | | native | 16 | n/a | 16 | 33.09 | 4.27 | 529.22 |
| | | | | | | beagle_SSE | 1 | 1 | 1 | 66.83 | 2.11 | 66.83 |
| | | | | | | beagle_SSE | 8 | 1 | 8 | 20.37 | 6.93 | 162.93 |
| | | | | | | beagle_SSE | 16 | 2 | 32 | 14.66 | 9.63 | 469.05 |
| DS 4 | 271 | 11,440 | 27 | 26–245 | 10k | native | 1 | n/a | 1 | 45.07 | 1.00 | 45.07 |
| | | | | | | native | 27 | n/a | 27 | 10.81 | 4.17 | 291.81 |
| | | | | | | beagle_SSE | 1 | 1 | 1 | 31.02 | 1.45 | 31.02 |
| | | | | | | beagle_SSE | 8 | 1 | 8 | 11.40 | 3.95 | 91.18 |
| | | | | | | beagle_SSE | 27 | 1 | 27 | 9.81 | 4.59 | 264.89 |
If you feel your data set differs dramatically from those given above, you can send us a copy, and we will look at possible new configurations for your data set. We are always happy to receive input on the speedups you see using our BEAST implementation, and advice on how to make BEAST more useful to the community.
If there is a tool or a feature you need, please let us know.