BEAST on BEAGLE Benchmarks

We benchmarked our implementation of BEAST (1.6.1) on the BEAGLE framework. All benchmarking runs were performed on the Trestles supercomputer at SDSC.
To determine how to run BEAST with the BEAGLE library efficiently, we used the benchmarks provided with the BEAST release, as well as data sets provided by users who requested the BEAST program. Benchmark2 shows some benefit from using GPUs, but benchmark1 and all of the data sets we received from users have too few unique sites per partition to make good use of GPUs. Accordingly, we implemented BEAST with BEAGLE on Trestles using the beagle_SSE option (which uses CPUs instead of GPUs).

Two types of threaded parallelization are available, and we used these together to improve performance.
•     BEAST allows a separate thread for each partition.
•     BEAGLE allows an arbitrary number of threads within a given partition.
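
As a rough illustration of how the two thread types might be requested together, the sketch below assembles a BEAST command line. It assumes the flag names used in later BEAST 1.x documentation: `-beagle_SSE` to select the SSE CPU kernel, `-threads` for BEAST threads, and `-beagle_instances` for the number of BEAGLE instances per partition. Whether these flags and this mapping match BEAST 1.6.1 or the Trestles installation is not confirmed here, so check the output of `beast -help` first. The helper name `build_beast_args` is hypothetical.

```python
def build_beast_args(xml_path, beast_threads, beagle_threads):
    """Return an argument list for a BEAST run on the BEAGLE SSE (CPU) kernel.

    Assumption: -threads controls BEAST (per-partition) threads and
    -beagle_instances controls BEAGLE threads within a partition.
    """
    if beast_threads < 1 or beagle_threads < 1:
        raise ValueError("thread counts must be at least 1")
    return [
        "beast",
        "-beagle_SSE",                             # CPU kernel, no GPU
        "-threads", str(beast_threads),            # assumed: BEAST threads
        "-beagle_instances", str(beagle_threads),  # assumed: BEAGLE threads
        xml_path,
    ]


if __name__ == "__main__":
    # Single-partition data set run with 8 BEAGLE threads
    print(" ".join(build_beast_args("benchmark2.xml", 1, 8)))
```

The list form can be passed directly to `subprocess.run` if the command is to be launched from a script.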

The table below shows run times on Trestles using the native BEAST kernel and the beagle_SSE kernel for various combinations of threads and processor cores.  The first two data sets are benchmark1 and benchmark2.  They have only a single partition, whereas most user data sets have multiple partitions. For partitioned data sets, using BEAST threads is typically more effective than using BEAGLE threads.

Since the speed of a given run increases less than linearly with the number of cores, there is a tradeoff between decreasing the run time and increasing the cost of the run.  For runs on Trestles via the CIPRES gateway, we decided that a reasonable compromise is to use eight cores in all cases.

•     For a single partition, 8 BEAGLE threads are used.
•     For 2 or 3 partitions, 2 BEAST threads and 4 BEAGLE threads are used.
•     For 4 or more partitions, 4 BEAST threads and 2 BEAGLE threads are used.
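
The three rules above can be written as a small lookup. This is only a sketch of the rules as stated, not the gateway's actual code; the function name `choose_threads` is hypothetical, and the single-partition case is assumed to use 1 BEAST thread, as in the table rows for single-partition data sets.

```python
TOTAL_CORES = 8  # every run uses eight cores under these rules


def choose_threads(partitions):
    """Return (beast_threads, beagle_threads) for a given partition count."""
    if partitions < 1:
        raise ValueError("a data set needs at least one partition")
    if partitions == 1:
        return (1, 8)   # single partition: 8 BEAGLE threads
    if partitions <= 3:
        return (2, 4)   # 2 or 3 partitions
    return (4, 2)       # 4 or more partitions


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 16, 27):
        beast, beagle = choose_threads(n)
        # The product of the two thread counts is always the core count
        assert beast * beagle == TOTAL_CORES
        print(n, "partitions:", beast, "BEAST x", beagle, "BEAGLE threads")
```

In each case the product of BEAST threads and BEAGLE threads equals the eight cores allocated to the run.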

These rules give speedups relative to the native kernel of between about 2 and 7 for the eight-core runs in the table below, depending upon the data set. Higher speedups are possible, but for the data sets we looked at, they come only at much higher cost (see the Cost column, the last column in the table).
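
The tradeoff can be made concrete with a little arithmetic. The sketch below assumes that speedup is the native single-core run time divided by the run time of the configuration, and that cost is run time multiplied by the number of cores; both definitions are consistent with the table values but are inferred rather than stated. The function names are hypothetical, and the run times are the Benchmark 2 values from the table.

```python
def speedup(native_min, run_min):
    """Speedup relative to the native kernel on one core."""
    return native_min / run_min


def cost(run_min, cores):
    """Cost in CPU minutes: wall-clock minutes times cores used."""
    return run_min * cores


if __name__ == "__main__":
    native = 8.41  # Benchmark 2, native kernel, 1 core (minutes)
    # (cores, run time in minutes) for the beagle_SSE kernel
    for cores, run_min in ((8, 1.84), (16, 1.63)):
        s = speedup(native, run_min)
        c = cost(run_min, cores)
        print(f"{cores} cores: speedup {s:.2f}, cost {c:.1f} cpu min")
```

Going from 8 to 16 cores on Benchmark 2 raises the cost from about 14.7 to about 26.1 CPU minutes while the run time drops only from 1.84 to 1.63 minutes. Values recomputed from the rounded run times can differ slightly from those printed in the table.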

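Run times and speedups of BEAST/BEAGLE on Trestles

| Data set | Data type | ntax | nchar | Partitions | Unique sites/partition | Time steps | Kernel | BEAST threads | BEAGLE threads | Cores | Run time (min) | Speedup | Cost (cpu min) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Benchmark 1 | DNA | 1,441 | 987 | 1 | 593 | 10k | native | 1 | - | 1 | 14.64 | 1.00 | 15 |
| | | | | | | | beagle_SSE | 1 | 1 | 1 | 5.53 | 2.65 | 6 |
| | | | | | | | beagle_SSE | 1 | 8 | 8 | 5.57 | 2.63 | 45 |
| Benchmark 2 | DNA | 62 | 10,869 | 1 | 5,565 | 10k | native | 1 | - | 1 | 8.41 | 1.00 | 8 |
| | | | | | | | beagle_SSE | 1 | 1 | 1 | 9.27 | 0.91 | 9 |
| | | | | | | | beagle_SSE | 1 | 8 | 8 | 1.84 | 4.57 | 15 |
| | | | | | | | beagle_SSE | 1 | 16 | 16 | 1.63 | 5.17 | 26 |
| DS 3 | DNA | 219 | 1,956 | 1 | 1,314 | 10k | native | 1 | - | 1 | 3.29 | 1.00 | 3 |
| | | | | | | | beagle_SSE | 1 | 1 | 1 | 3.74 | 0.88 | 4 |
| | | | | | | | beagle_SSE | 1 | 8 | 8 | 0.97 | 3.40 | 8 |
| | | | | | | | beagle_SSE | 1 | 16 | 16 | 0.94 | 3.48 | 15 |
| DS 4 | DNA | 48 | 1,577 | 2 | 237 - 276 | 100k | native | 1 | - | 1 | 2.16 | 1.00 | 2.16 |
| | | | | | | | native | 2 | - | 2 | 1.95 | 1.11 | 3.90 |
| | | | | | | | beagle_SSE | 1 | 1 | 1 | 2.60 | 0.83 | 2.60 |
| | | | | | | | beagle_SSE | 2 | 4 | 8 | 1.06 | 2.03 | 8.48 |
| | | | | | | | beagle_SSE | 2 | 8 | 16 | 1.03 | 2.10 | 16.48 |
| DS 5 | AA | 131 | 3,095 | 4 | 122 - 752 | 10k | native | 1 | - | 1 | 6.15 | 1.00 | 6.15 |
| | | | | | | | native | 4 | - | 4 | 3.55 | 1.73 | 14.20 |
| | | | | | | | beagle_SSE | 1 | 1 | 1 | 3.45 | 1.78 | 3.45 |
| | | | | | | | beagle_SSE | 4 | 2 | 8 | 1.14 | 5.41 | 9.09 |
| | | | | | | | beagle_SSE | 4 | 4 | 16 | 0.87 | 7.04 | 13.07 |
| DS 6 | DNA | 348 | 6,954 | 16 | 37 - 814 | 10k | native | 1 | - | 1 | 141.22 | 1.00 | 141.22 |
| | | | | | | | native | 16 | - | 16 | 33.09 | 4.27 | 529.22 |
| | | | | | | | beagle_SSE | 1 | 1 | 1 | 66.83 | 2.11 | 66.83 |
| | | | | | | | beagle_SSE | 8 | 1 | 8 | 20.37 | 6.93 | 162.93 |
| | | | | | | | beagle_SSE | 16 | 2 | 32 | 14.66 | 9.63 | 469.05 |
| DS 7 | DNA | 271 | 11,440 | 27 | 26 - 245 | 10k | native | 1 | - | 1 | 45.07 | 1.00 | 45.07 |
| | | | | | | | native | 27 | - | 27 | 10.81 | 4.17 | 291.81 |
| | | | | | | | beagle_SSE | 1 | 1 | 1 | 31.02 | 1.45 | 31.02 |
| | | | | | | | beagle_SSE | 8 | 1 | 8 | 11.40 | 3.95 | 91.18 |
| | | | | | | | beagle_SSE | 27 | 1 | 27 | 9.81 | 4.59 | 264.89 |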

If you feel your data set differs dramatically from those given above, you can send us a copy, and we will look at possible new configurations for it. We are always happy to receive reports on the speedups you see using our BEAST implementation, and advice on how to make BEAST more useful to the community.

If there is a tool or a feature you need, please let us know.