BEST 2.3

The input file for BEST 2.3 is in NEXUS format and consists of two major blocks: a data block and a mrbayes block. There is no difference between the
input files of BEST 2.3 and MrBayes in terms of the data block, so create the data block in the same way as for the regular MrBayes. For
example, the data block in the example file test.nex (available at http://www.stat.osu.edu/~dkp/BEST/examples/test.nex) is

begin data;
dimensions ntax=4 nchar=3000;
format datatype=DNA interleave missing=? gap=-;
matrix
H TTTCGGGTAT GATTGAACCG .....
C TTTCGGGTAT GATTGAACCG .....
G TTTCGGGTAT GATTGAACCG .....
O TTTCGGGTAT GATTGAACCG .....
.......
;
end;

The multilocus sequences should be concatenated across loci in the data block. Missing nucleotides or sequences are replaced by question marks.
Although BEST 2.3 can still estimate the species tree when whole sequences are missing for some genes, users should remain cautious about the result because the placement of the taxa in the tree is derived purely from the prior distribution. If the dataset contains both diploid genes (nuclear DNA)
and haploid genes (e.g. mtDNA), users may duplicate the haploid genes to make them compatible with the diploid genes, or randomly choose one of the two
sequences for the diploid genes to make them compatible with the haploid genes.
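
The concatenation step described above can be sketched in Python. This is only an illustrative helper, not part of BEST or MrBayes: the function name concatenate_loci and the toy sequences are assumptions made for the example. It joins per-locus alignments in a fixed taxon order, fills a taxon's missing locus with question marks, and records the 1-based position range of each locus.

```python
def concatenate_loci(loci, taxa):
    """Concatenate per-locus alignments; fill absent sequences with '?'.

    loci: list of dicts mapping taxon name -> aligned sequence (one dict per locus).
    taxa: ordered list of taxon names.
    Returns (matrix, ranges): matrix maps taxon -> concatenated sequence,
    ranges holds the 1-based inclusive (start, end) positions of each locus.
    """
    parts = {name: [] for name in taxa}
    ranges = []
    start = 1
    for locus in loci:
        lengths = {len(seq) for seq in locus.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a locus must have equal length")
        length = lengths.pop()
        for name in taxa:
            # A taxon with no sequence for this locus gets a run of '?'.
            parts[name].append(locus.get(name, "?" * length))
        ranges.append((start, start + length - 1))
        start += length
    matrix = {name: "".join(chunks) for name, chunks in parts.items()}
    return matrix, ranges


# Toy data (hypothetical): taxon G has no sequence for the second locus.
loci = [
    {"H": "ACGT", "C": "ACGA", "G": "ACGG", "O": "ACTT"},
    {"H": "TTG", "C": "TTA", "O": "TCA"},
]
matrix, ranges = concatenate_loci(loci, ["H", "C", "G", "O"])
```

The returned ranges are the values that would go into the CHARSET definitions of the mrbayes block.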

BEST 2.3 is an extension of the regular MrBayes, revised for the purpose of estimating species phylogenies. The commands for specifying substitution
models and prior distributions at the gene tree level in the regular MrBayes remain valid in BEST 2.3. BEST 2.3 adds some new commands
to set the prior distributions for the species tree, population sizes, and variable mutation rates across genes. A typical mrbayes
block for BEST 2.3 specifies

1. the locations of genes along sequences.
2. the sequence-species relationship, i.e., which sequences belong to which species.
3. substitution models for genes.
4. priors for the parameters in the substitution model.
5. priors for the species tree, mutation rates across genes, and population sizes.
6. haploid genes.

Each item is explained below using the example file test.nex, whose mrbayes block is:

begin mrbayes;
set autoclose=yes nowarn=yes;
outgroup 4;
taxset H = 1;
taxset C = 2;
taxset G = 3;
taxset O =4;
CHARSET gene1 = 1 - 500;
CHARSET gene2 = 501 - 1000;
CHARSET gene3 = 1001 - 1500;
CHARSET gene4 = 1501 - 2000;
CHARSET gene5 = 2001 - 2500;
CHARSET gene6 = 2501 - 3000;
partition Genes = 6: gene1, gene2, gene3, gene4, gene5, gene6;
set partition=Genes;
prset thetapr=invgamma(3,0.003) GeneMuPr=uniform(0.5,1.5) BEST=1;
unlink topology=(all) brlens=(all) genemu=(all);
mcmc ngen=1000 nrun = 2 nchain = 2 samplefreq=100;
sumt nrun=2 filename=test.nex.sptree;
end;

Section 3.1 set genes
The location of each gene is defined by CHARSET. For example, CHARSET gene1 = 1 - 500; says that the first 500 nucleotides belong to the gene gene1. The command "partition" divides the sequences into the genes specified by CHARSET. The partition must be activated by the command "set partition=Genes", where Genes is the partition name defined in the example file.
It is quite common in gene tree estimation to use codon models that group the nucleotides in triplets and assume different evolutionary models for the three groups of nucleotides. While users are still allowed to use a codon model to group nucleotides in triplets, we do not recommend dividing the data by triplets and treating each of the three groups of nucleotides as a "gene". It would be more appropriate to define a codon model within each gene specified by CHARSET, but the current version of BEST is unable to do so.
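
For datasets with many loci, the CHARSET, partition, and set partition lines can be generated rather than typed by hand. The following Python sketch is a hypothetical convenience script (the function charset_block and the gene1, gene2, ... naming are assumptions chosen to mirror the example file); it assumes the loci are concatenated in order with no gaps between them.

```python
def charset_block(gene_lengths, partition_name="Genes"):
    """Build CHARSET, partition and 'set partition' lines for contiguous genes.

    gene_lengths: number of sites of each gene, in concatenation order.
    """
    lines = []
    names = []
    start = 1
    for i, length in enumerate(gene_lengths, start=1):
        if length <= 0:
            raise ValueError("each gene must contain at least one site")
        name = f"gene{i}"
        end = start + length - 1
        lines.append(f"CHARSET {name} = {start} - {end};")
        names.append(name)
        start = end + 1
    lines.append(f"partition {partition_name} = {len(names)}: {', '.join(names)};")
    lines.append(f"set partition={partition_name};")
    return lines


# Six genes of 500 sites each, as in test.nex.
lines = charset_block([500] * 6)
print("\n".join(lines))
```

With six genes of 500 sites, the generated lines reproduce the CHARSET and partition commands of the example mrbayes block.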

Section 3.2 set the sequence-species relationship
The command "TAXSET" tells the program which sequences belong to which species. In the example file, the list of TAXSETs implies that there are four
species and each species has only one sequence (single allele data). Although the species names (H, C, G, O) coincide with the sequences' names (H, C, G, O) in the example file, it is totally valid to use other names for species, for instance,
taxset s1 = 1;
taxset s2 = 2;
taxset s3 = 3;
taxset s4 =4;
For multiple allele data, multiple sequences may belong to the same species.
For example,
taxset s1 = 1-4;
taxset s2 = 5,7,9;
taxset s3 = 6,8;
taxset s4 =10;
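
When many alleles are sampled, the taxset lines can be produced from a species-to-sequence mapping. The Python sketch below is illustrative only: taxset_lines and format_indices are hypothetical helper names, and the check that each sequence is assigned to exactly one species is an assumption of this example rather than a rule stated in this manual. Consecutive sequence numbers are collapsed into ranges such as 1-4.

```python
def format_indices(indices):
    """Render sequence numbers as '1-4' style ranges and comma-separated singles."""
    ordered = sorted(set(indices))
    parts = []
    run_start = prev = ordered[0]
    for value in ordered[1:]:
        if value == prev + 1:
            prev = value
            continue
        parts.append(str(run_start) if run_start == prev else f"{run_start}-{prev}")
        run_start = prev = value
    parts.append(str(run_start) if run_start == prev else f"{run_start}-{prev}")
    return ",".join(parts)


def taxset_lines(species_to_sequences):
    """Build one taxset line per species from a {species: [sequence numbers]} dict."""
    owner = {}
    lines = []
    for species, indices in species_to_sequences.items():
        if not indices:
            raise ValueError(f"species {species} has no sequences")
        for index in indices:
            # Assumption of this sketch: a sequence belongs to one species only.
            if index in owner:
                raise ValueError(f"sequence {index} assigned to both {owner[index]} and {species}")
            owner[index] = species
        lines.append(f"taxset {species} = {format_indices(indices)};")
    return lines


lines = taxset_lines({"s1": [1, 2, 3, 4], "s2": [5, 7, 9], "s3": [6, 8], "s4": [10]})
print("\n".join(lines))
```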

Section 3.3 set substitution models for genes
The substitution model for each partition is specified by the command "lset". Users may refer to the MrBayes manual for information about the
command "lset".

Section 3.4 set priors for the parameters in the substitution model
Please refer to the mrbayes manual for the information about how to specify prior distributions for the parameters in the substitution model.

Section 3.5 set priors for the species tree, mutation rates across genes, and population sizes
The priors for the species tree, mutation rates, and population sizes are set in the command "prset".

Options:

BEST: setting BEST=1 initiates the Bayesian analysis for estimating species trees. If BEST=0, the regular MrBayes analysis is run.

speciestreepr: this parameter sets the prior distribution for the species tree.
Two options: the uniform distribution when speciestreepr=0 and birth and death prior when speciestreepr=1.
The birth and death prior is not available in the current version of BEST.

thetapr: this parameter sets the prior distribution for population sizes. There is only one option, the inverse gamma distribution with parameters alpha and beta. The mean of the inverse gamma distribution is beta/(alpha-1), which is finite only when alpha > 1. Users should choose values for alpha and beta such that the prior mean of the population size theta falls in a reasonable range. In the example file, thetapr=invgamma(3,0.003) implies that the prior mean of theta is 0.003/(3-1) = 0.0015.
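
The prior mean can be checked, or a value of beta chosen for a target mean, with a few lines of Python. This is a minimal sketch using the standard inverse gamma mean beta/(alpha-1); the function names and the target mean of 0.01 are hypothetical and are not part of BEST.

```python
def invgamma_mean(alpha, beta):
    """Mean of an inverse gamma(alpha, beta) distribution; finite only for alpha > 1."""
    if alpha <= 1:
        raise ValueError("the mean is finite only when alpha > 1")
    return beta / (alpha - 1)


def beta_for_mean(alpha, target_mean):
    """Value of beta that gives the requested prior mean for a fixed alpha > 1."""
    if alpha <= 1:
        raise ValueError("the mean is finite only when alpha > 1")
    return target_mean * (alpha - 1)


# thetapr=invgamma(3,0.003) from the example file.
prior_mean = invgamma_mean(3, 0.003)

# Hypothetical target: a prior mean of 0.01 with alpha kept at 3.
beta = beta_for_mean(3, 0.01)
print(prior_mean, beta)
```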

genemupr: this parameter sets the prior distribution for the mutation rates across genes. Two options: a uniform prior, written as in the example file (GeneMuPr=uniform(0.5,1.5)), or a fixed rate, genemupr=fixed(a).

Default model settings for BEST

Parameter        Options          Current setting
BEST             0/1              0
speciestreepr    0/1              0
thetapr          invgamma         invgamma
genemupr         uniform/fixed    fixed(1.0)

To run BEST 2.3, the command "unlink" must be used to unlink topology, branch lengths, and mutation rates across loci, i.e.,
unlink topology=(all) brlens=(all) genemu=(all); as in the example file. Users may unlink other parameters in the model if necessary, but doing so is optional.

Section 3.6 set haploid genes

Use "lset Ploidy=haploid" to define haploid genes.
For example, "lset applyto=(1,2) ploidy=haploid" implies that the first two genes are haploid while other genes are diploid (diploid is the default setting for ploidy).

Section 4. Output files

Like the regular MrBayes, BEST 2.3 produces .p, .mcmc and .t files, which are described in the MrBayes manual. The species trees sampled from the posterior distribution are saved in the .sptree file. For multiple runs (for example, nruns=2), BEST 2.3 produces a .sptree file for each run. The names in the translation table in the .sptree file match the species names specified by the command "TAXSET" in the mrbayes block of the input file. In the tree block, the number right after the pound sign is the population size theta.
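
As a rough illustration of reading theta values from a tree line, the Python sketch below pulls out every number that follows a pound sign. The tree string is made up for this example and may not match the exact layout BEST writes to the .sptree file; the function name extract_thetas is likewise hypothetical. Only the convention stated above (theta follows "#") is taken from this manual.

```python
import re

# A number (optionally in scientific notation) immediately after '#'.
THETA_PATTERN = re.compile(r"#\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def extract_thetas(tree_line):
    """Return the values that follow each '#' in a tree string, in order."""
    return [float(match) for match in THETA_PATTERN.findall(tree_line)]


# Hypothetical tree line; the real .sptree layout may differ.
tree_line = "((H:0.01#0.002,C:0.01#0.003):0.02#0.004,O:0.03#0.001)#0.005;"
thetas = extract_thetas(tree_line)
print(thetas)
```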

Section 5 Summarizing the posterior distribution of the species tree

The estimated posterior distribution of the species tree can be summarized by the command "sumt". For single allele data such as the example file
test.nex, if the species names match the sequences' names, the command "sumt" can be executed directly in the input data file. For multiple-allele
data or when the species names are not identical to the sequences' names, users must create a new input file for summarizing the species trees in the
.sptree file. The sequences' names in the new input file must match the names in the translation table in the .sptree file. BEST 2.3 can produce such an input
file automatically after the MCMC run. This file is named xxx.sumt. To summarize the estimated posterior distribution of the species tree, type
"execute xxx.sumt" at the BEST command line or "./mbbest -i xxx.sumt" outside BEST. The command "sumt" produces .con, .trprobs, and .part files.

The consensus tree and the estimates of the population sizes (after "#") are written to the .con file. The population sizes are estimated by their posterior
means. The posterior mean and standard deviation of the population size and divergence time for each population are saved in the .part file.


If you use POY, please cite:

Hello Everyone,

Notice that the citation of POY has changed. Please cite POY version 4
as:

Varón, A., L. S. Vinh, W. C. Wheeler. 2010. POY version 4: phylogenetic analysis using dynamic homologies. Cladistics. 26: in press. http://research.amnh.org/scicomp/projects/poy.php.

If there is a tool or a feature you need, you can add it yourself or let us know.