Phylip format provides only two kinds of information. Line 1 gives the number of taxa and the number of characters in your matrix. Line 2 and each subsequent line give the data in a rigid format: a taxon identifier (up to 10 characters) followed by the characters for that taxon. The following rules govern Phylip format:
Line 1 gives the number of taxa, then one space (tabs are not allowed), then the number of characters in each taxon (every taxon has the same number of characters). An extra carriage return (that is, an empty line between Line 1 and Line 2, or between any other lines) will cause failure.
Line 2 and each subsequent line give a taxon identifier and its data. The taxon identifier can be up to 10 characters; numbers, underscores, and spaces are all allowed.
Strict Phylip expects the first character state to appear in column 11 for each and every sequence, no ifs, ands, or buts. Identifiers shorter than 10 characters must therefore be padded with spaces out to column 10.
Relaxed Phylip format is used by some tools (RAxML, for example); these adhere to the other aspects of Phylip but permit longer taxon names.
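The rules above can be sketched as a small writer. This is a minimal illustration in Python, not the code of any particular tool; the function name `format_strict_phylip` and the choice to raise `ValueError` on bad input are assumptions made for this example.

```python
def format_strict_phylip(records):
    """Return strict Phylip text for a list of (name, sequence) pairs.

    Hypothetical helper: the name and error handling are illustrative only.
    """
    if not records:
        raise ValueError("need at least one taxon")
    n_chars = len(records[0][1])
    # Line 1: number of taxa, one space, number of characters.
    lines = [f"{len(records)} {n_chars}"]
    for name, seq in records:
        if len(name) > 10:
            raise ValueError(f"name longer than 10 characters: {name!r}")
        if len(seq) != n_chars:
            raise ValueError(
                f"{name!r} has {len(seq)} characters, expected {n_chars}"
            )
        # Pad the name to exactly 10 columns so the data starts in column 11.
        lines.append(name.ljust(10) + seq)
    # No empty lines anywhere; a single newline ends the last data line.
    return "\n".join(lines) + "\n"


text = format_strict_phylip(
    [("Alpha", "AACGTGGCCACAT"), ("Beta", "AAGGTCGCCACAC")]
)
print(text, end="")
```

A writer for the relaxed variant would differ mainly in dropping the 10-character limit and separating the name from the data with whitespace; check the target tool's documentation for its exact expectations.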
Sample of Phylip Format Data
5 13
Alpha     AACGTGGCCACAT
Beta      AAGGTCGCCACAC
Gamma     CAGTTCGCCACAA
Delta     GAGATTTCCGCCT
Epsilon   GAGATCTCCGCCC
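Reading such a file back can be sketched as follows. This is a hedged example rather than a reference parser: it assumes strict layout (names in columns 1 to 10, data from column 11), a non-interleaved file with one line per taxon, and a hypothetical function name `parse_strict_phylip`. The sample is rebuilt in code with padded names so the column positions are explicit.

```python
def parse_strict_phylip(text):
    """Parse strict, sequential Phylip text into a list of (name, sequence).

    Hypothetical helper; assumes one line per taxon and no interleaving.
    """
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty input")
    # Line 1: taxa count, exactly one space, character count.
    header = lines[0].split(" ")
    if len(header) != 2 or not all(part.isdigit() for part in header):
        raise ValueError(f"bad header line: {lines[0]!r}")
    n_taxa, n_chars = int(header[0]), int(header[1])

    records = []
    for number, line in enumerate(lines[1:], start=2):
        if not line.strip():
            raise ValueError(f"line {number} is empty")
        name = line[:10].strip()          # columns 1-10 hold the identifier
        seq = line[10:].replace(" ", "")  # data starts in column 11
        if len(seq) != n_chars:
            raise ValueError(
                f"line {number}: {len(seq)} characters, expected {n_chars}"
            )
        records.append((name, seq))
    if len(records) != n_taxa:
        raise ValueError(f"found {len(records)} taxa, expected {n_taxa}")
    return records


sample_rows = [
    ("Alpha", "AACGTGGCCACAT"),
    ("Beta", "AAGGTCGCCACAC"),
    ("Gamma", "CAGTTCGCCACAA"),
    ("Delta", "GAGATTTCCGCCT"),
    ("Epsilon", "GAGATCTCCGCCC"),
]
sample = "5 13\n" + "".join(f"{name:<10}{seq}\n" for name, seq in sample_rows)
records = parse_strict_phylip(sample)
```

The sketch rejects empty lines outright, mirroring the rule that an extra carriage return causes failure.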
Most tools also expect the characters in the file to follow a convention. You must check each tool's documentation to be sure; for reference, the Phylip programs follow these conventions:
Input for DNA sequence programs: (shamelessly stolen from Dr. Felsenstein's site, thanks Joe!).
The input format for the DNA sequence programs is standard: the data have A's, G's, C's and T's (or U's). Each base is one of the letters A, B, C, D, G, H, K, M, N, O, R, S, T, U, V, W, X, Y, ?, or - (a period was previously allowed but is no longer allowed, because it sometimes is used in different senses in other programs). Blanks and numerical digits are ignored. Characters can be either upper or lower case. The characters constitute the IUPAC (IUB) nucleic acid code plus some slight extensions. They enable input of nucleic acid sequences taking full account of any ambiguities in the sequence.
Symbol | Meaning |
A | Adenine |
G | Guanine |
C | Cytosine |
T | Thymine |
U | Uracil |
Y | pYrimidine (C or T) |
R | puRine (A or G) |
W | "Weak" (A or T) |
S | "Strong" (C or G) |
K | "Keto" (T or G) |
M | "aMino" (C or A) |
B | not A (C or G or T) |
D | not C (A or G or T) |
H | not G (A or C or T) |
V | not T (A or C or G) |
X,N,? | unknown (A or C or G or T) |
O | deletion |
- | deletion |
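The table above translates directly into a lookup. The following Python sketch is an illustration under stated assumptions: the names `IUPAC_DNA` and `clean_dna` are hypothetical, deletions are represented by "-", and the function simply applies the stated rules (blanks and digits ignored, case ignored, anything else outside the table rejected, including the period).

```python
# Symbol -> set of bases it may stand for, following the table above.
# "-" is used here as an assumed marker for the deletion state.
IUPAC_DNA = {
    "A": "A", "G": "G", "C": "C", "T": "T", "U": "U",
    "Y": "CT", "R": "AG", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "X": "ACGT", "N": "ACGT", "?": "ACGT",
    "O": "-", "-": "-",
}


def clean_dna(raw):
    """Drop blanks and digits, upper-case the rest, reject unknown symbols."""
    kept = [ch.upper() for ch in raw if ch not in " 0123456789"]
    bad = sorted({ch for ch in kept if ch not in IUPAC_DNA})
    if bad:
        raise ValueError(f"symbols not in the DNA code: {bad}")
    return "".join(kept)


cleaned = clean_dna("aacgt ggcca 10 caty")
possible = [IUPAC_DNA[ch] for ch in cleaned[-2:]]
```

Other tools may accept a different symbol set, so a table like this should be adjusted to match each tool's documentation.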
Input for the Protein Sequence Programs
The first line contains the number of species and the number of amino acid positions (counting any stop codons that you want to include). The sequences can have internal blanks but there must be no extra blanks at the end of the terminated line. Note that a blank is not a valid symbol for a deletion. The protein sequences are given by the one-letter code used by the late Margaret Dayhoff's group in the Atlas of Protein Sequences, and consistent with the IUB standard abbreviations. In the present version it is:
Symbol | Stands for |
A | ala |
B | asx |
C | cys |
D | asp |
E | glu |
F | phe |
G | gly |
H | his |
I | ileu |
J | (not used) |
K | lys |
L | leu |
M | met |
N | asn |
O | (not used) |
P | pro |
Q | gln |
R | arg |
S | ser |
T | thr |
U | (not used) |
V | val |
W | trp |
X | unknown amino acid |
Y | tyr |
Z | glx |
* | nonsense (stop) |
? | unknown amino acid or deletion |
- | deletion |
where "nonsense" and "unknown" mean, respectively, a nonsense (chain termination) codon and an amino acid whose identity has not been determined. The state "asx" means "either asn or asp", the state "glx" means "either gln or glu", and the state "deletion" means that alignment studies indicate a deletion has happened in the ancestry of this position, so that it is no longer present. Note that if two polypeptide chains are being used that are of different lengths owing to one terminating before the other, they can be coded as (say)
HIINMA*????
HIPNMGVWABT
since after the stop codon we do not definitely know that there has been a deletion, and do not know what amino acid would have been there. If DNA studies tell us that there is DNA sequence in that region, then we could use "X" rather than "?". Note that "X" means an unknown amino acid, but definitely an amino acid, while "?" could mean either that or a deletion. Otherwise one will usually want to use "?" after a stop codon, if one does not know what amino acid is there. If the DNA sequence has been observed there, one probably ought to resist putting in the amino acids that this DNA would code for, and one should use "X" instead, because under the assumptions implicit in either the parsimony or the distance methods, changes to any noncoding sequence are much easier than changes in a coding region that change the amino acid.
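The padding convention described here can be sketched in a few lines of Python. The helper name `pad_terminated_chain` and its `fill` parameter are hypothetical, introduced only for this illustration; the choice between "?" and "X" follows the reasoning in the text and remains the user's call.

```python
def pad_terminated_chain(seq, length, fill="?"):
    """Pad a chain that ends in a stop codon ("*") out to the alignment length.

    fill="?" says the position may be an unknown amino acid or a deletion;
    fill="X" says an amino acid is present but its identity is unknown.
    """
    if fill not in ("?", "X"):
        raise ValueError("fill must be '?' or 'X'")
    if not seq.endswith("*"):
        raise ValueError("sequence does not end in a stop codon")
    if len(seq) > length:
        raise ValueError("sequence is longer than the alignment")
    return seq + fill * (length - len(seq))


full_chain = "HIPNMGVWABT"
short_chain = pad_terminated_chain("HIINMA*", len(full_chain))
short_chain_x = pad_terminated_chain("HIINMA*", len(full_chain), fill="X")
```

Both padded strings have the same length as the longer chain, which is what the character count on Line 1 requires.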