We did some benchmarking of our implementation of BEAST (1.6.1) on the BEAGLE framework. All benchmarking runs were performed on the Trestles Supercomputer.
To determine how to run BEAST with the BEAGLE library efficiently, we used the benchmarks provided with the BEAST release, along with data sets submitted by users who requested the BEAST program. Benchmark2 shows some benefit from using GPUs, but benchmark1 and all of the data sets we received from users have too few unique sites per partition to make good use of GPUs. Accordingly, we implemented BEAST with BEAGLE on Trestles at SDSC using the beagle_SSE option (that is, using CPUs instead of GPUs).
Two types of threaded parallelization are available, and we used these together to improve performance.
• BEAST allows a separate thread for each partition.
• BEAGLE allows an arbitrary number of threads within a given partition.
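As a sketch of how these two levels of parallelism are combined on the command line, the following uses flag names as they appear in typical BEAST 1.6.x usage (`-beagle_SSE` for the SSE/CPU kernel, `-threads` for BEAST threads, `-beagle_instances` for parallelism within a partition); check `beast -help` on your installation, and note that `analysis.xml` is a placeholder input file, not one of our benchmarks:

```shell
# Hypothetical 8-core launch for a 2-partition data set:
# 2 BEAST threads (one per partition) x 4 BEAGLE threads per partition.
BEAST_THREADS=2        # one BEAST thread per partition
BEAGLE_THREADS=4       # parallelism within each partition

# Build the command line; -beagle_SSE selects the CPU (SSE) kernel.
CMD="beast -beagle -beagle_SSE -threads ${BEAST_THREADS} -beagle_instances ${BEAGLE_THREADS} analysis.xml"
echo "$CMD"
```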
The table below shows run times on Trestles using the native BEAST kernel and the beagle_SSE kernel for various combinations of threads and processor cores. The first two data sets are benchmark1 and benchmark2. They have only a single partition, whereas most user data sets have multiple partitions. For partitioned data sets, using BEAST threads is typically more effective than using BEAGLE threads.
Since the speed of a given run increases less than linearly with the number of cores, there is a tradeoff between decreasing the run time and increasing the cost of the run. For runs on Trestles via the CIPRES gateway, we decided that a reasonable compromise is to use eight cores in all cases.
• For a single partition, 8 BEAGLE threads are used.
• For 2 or 3 partitions, 2 BEAST threads and 4 BEAGLE threads are used.
• For 4 or more partitions, 4 BEAST threads and 2 BEAGLE threads are used.
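The three rules above can be expressed as a small helper that maps a partition count to a thread split; this is only an illustration of the allocation policy, not part of the BEAST distribution:

```shell
# Given the number of partitions, emit "BEAST_threads BEAGLE_threads"
# for an 8-core run, following the three rules above.
pick_threads() {
  partitions=$1
  if [ "$partitions" -le 1 ]; then
    echo "1 8"      # single partition: all 8 threads go to BEAGLE
  elif [ "$partitions" -le 3 ]; then
    echo "2 4"      # 2-3 partitions: 2 BEAST x 4 BEAGLE = 8 cores
  else
    echo "4 2"      # 4+ partitions: 4 BEAST x 2 BEAGLE = 8 cores
  fi
}

pick_threads 1    # -> 1 8
pick_threads 3    # -> 2 4
pick_threads 16   # -> 4 2
```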
These rules give speedups relative to the native kernel of between 2- and 7-fold, depending upon the data set. Higher speedups are possible, but for the data sets we looked at, they come only at much higher cost (see the Cost column in the table below).
Run times and speedups of BEAST/BEAGLE on Trestles

| Data set | Data type | ntax | nchar | Partitions | Unique sites | Time steps | Kernel | BEAST threads | BEAGLE threads | Cores | Run time | Speedup | Cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Benchmark 1 | DNA | 1,441 | 987 | 1 | 593 | 10k | native | 1 | | 1 | 14.64 | 1.00 | 15 |
| | | | | | | | beagle_SSE | 1 | 1 | 1 | 5.53 | 2.65 | 6 |
| | | | | | | | beagle_SSE | 1 | 8 | 8 | 5.57 | 2.63 | 45 |
| Benchmark 2 | DNA | 62 | 10,869 | 1 | 5,565 | 10k | native | 1 | | 1 | 8.41 | 1.00 | 8 |
| | | | | | | | beagle_SSE | 1 | 1 | 1 | 9.27 | 0.91 | 9 |
| | | | | | | | beagle_SSE | 1 | 8 | 8 | 1.84 | 4.57 | 15 |
| | | | | | | | beagle_SSE | 1 | 16 | 16 | 1.63 | 5.17 | 26 |
| DS 3 | DNA | 219 | 1,956 | 1 | 1,314 | 10k | native | 1 | | 1 | 3.29 | 1.00 | 3 |
| | | | | | | | beagle_SSE | 1 | 1 | 1 | 3.74 | 0.88 | 4 |
| | | | | | | | beagle_SSE | 1 | 8 | 8 | 0.97 | 3.40 | 8 |
| | | | | | | | beagle_SSE | 1 | 16 | 16 | 0.94 | 3.48 | 15 |
| DS 4 | DNA | 48 | 1,577 | 2 | 237-276 | 100k | native | 1 | | 1 | 2.16 | 1.00 | 2.16 |
| | | | | | | | native | 2 | | 2 | 1.95 | 1.11 | 3.90 |
| | | | | | | | beagle_SSE | 1 | 1 | 1 | 2.60 | 0.83 | 2.60 |
| | | | | | | | beagle_SSE | 2 | 4 | 8 | 1.06 | 2.03 | 8.48 |
| | | | | | | | beagle_SSE | 2 | 8 | 16 | 1.03 | 2.10 | 16.48 |
| DS 5 | AA | 131 | 3,095 | 4 | 122-752 | 10k | native | 1 | | 1 | 6.15 | 1.00 | 6.15 |
| | | | | | | | native | 4 | | 4 | 3.55 | 1.73 | 14.20 |
| | | | | | | | beagle_SSE | 1 | 1 | 1 | 3.45 | 1.78 | 3.45 |
| | | | | | | | beagle_SSE | 4 | 2 | 8 | 1.14 | 5.41 | 9.09 |
| | | | | | | | beagle_SSE | 4 | 4 | 16 | 0.87 | 7.04 | 13.07 |
| DS 6 | DNA | 348 | 6,954 | 16 | 37-814 | 10k | native | 1 | | 1 | 141.22 | 1.00 | 141.22 |
| | | | | | | | native | 16 | | 16 | 33.09 | 4.27 | 529.22 |
| | | | | | | | beagle_SSE | 1 | 1 | 1 | 66.83 | 2.11 | 66.83 |
| | | | | | | | beagle_SSE | 8 | 1 | 8 | 20.37 | 6.93 | 162.93 |
| | | | | | | | beagle_SSE | 16 | 2 | 32 | 14.66 | 9.63 | 469.05 |
| DS 7 | DNA | 271 | 11,440 | 27 | 26-245 | 10k | native | 1 | | 1 | 45.07 | 1.00 | 45.07 |
| | | | | | | | native | 27 | | 27 | 10.81 | 4.17 | 291.81 |
| | | | | | | | beagle_SSE | 1 | 1 | 1 | 31.02 | 1.45 | 31.02 |
| | | | | | | | beagle_SSE | 8 | 1 | 8 | 11.40 | 3.95 | 91.18 |
| | | | | | | | beagle_SSE | 27 | 1 | 27 | 9.81 | 4.59 | 264.89 |

Cost is the run time multiplied by the number of cores used; Speedup is relative to the native kernel on one core of the same data set.
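As a quick check on the arithmetic, the Speedup column is the single-core native run time divided by a row's run time, and the Cost column is the run time multiplied by the cores used. For the 8-core Benchmark 2 row:

```shell
# Benchmark 2, beagle_SSE with 8 BEAGLE threads on 8 cores:
native=8.41     # native kernel, 1 core
runtime=1.84    # this row's run time
cores=8

# Speedup = native / runtime; Cost = runtime x cores (rounded in the table).
speedup=$(awk -v n="$native" -v t="$runtime" 'BEGIN { printf "%.2f", n/t }')
cost=$(awk -v t="$runtime" -v c="$cores" 'BEGIN { printf "%.0f", t*c }')

echo "speedup=$speedup cost=$cost"   # speedup=4.57 cost=15
```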
If you feel your data set differs dramatically from those given above, you can send us a copy, and we will look at possible new configurations for your data set. We are always happy to receive input on the speedups you see using our BEAST implementation, and advice on how to make BEAST more useful to the community.
If there is a tool or a feature you need, please let us know.