FASTA format:

A sequence in FASTA format begins with a single-line description, followed by a line break and then any number of lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. A simple example of one sequence in FASTA format:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]  
    LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV  
    EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG  
    LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL  
    GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX  
    IENY  
      

Header line

The header line must begin with '>' and end with a line break. Everything else is optional. By convention, this line gives a name and/or a unique identifier for the sequence, and in databases such as GenBank it often carries a good deal of additional information. Many sequence databases use standardized headers, which helps when extracting information from the header automatically. The header line may contain more than one header, separated by a ^A (Control-A) character (as in [1]).
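
As a rough illustration of the rules above, the following Python sketch splits a header line into its individual headers. The function name `split_header` is hypothetical, and treating the first whitespace-delimited word as the identifier is an assumption based on common convention, not something the format requires.

```python
def split_header(line):
    """Split a FASTA header line into (identifier, description) pairs."""
    if not line.startswith(">"):
        raise ValueError("a FASTA header line must begin with '>'")
    entries = []
    # Control-A (\x01) separates multiple headers on a single line.
    for part in line[1:].rstrip("\r\n").split("\x01"):
        # Assumption: the first word is the identifier, the rest is free text.
        identifier, _, description = part.strip().partition(" ")
        entries.append((identifier, description.strip()))
    return entries
```

For the single-sequence example above, this would return one pair whose identifier is `gi|5524211|gb|AAD44166.1|` and whose description is `cytochrome b [Elephas maximus maximus]`.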

In the original Pearson FASTA format, one or more comments, each distinguished by a semicolon at the beginning of the line, may occur after the header. Almost no databases or bioinformatics applications recognize these comments, so avoid using them. In practice, the NCBI FASTA specification is followed instead.

An example of a multiple sequence FASTA file follows:

>SEQUENCE_1  
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK  
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL  
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
 

>SEQUENCE_2  
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI  
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH  
      

Sequence representation

After the header line and comments, one or more lines of sequence data may follow. Each line of a sequence should have fewer than 80 characters. Sequences may be protein sequences or nucleic acid sequences, and they can contain gaps or alignment characters (see sequence alignment). Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap character; and in amino acid sequences, U and * are acceptable letters. Numerical digits are not allowed, although some databases use them to indicate the position in the sequence.
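
The character rules above can be checked mechanically. The sketch below is one possible reading of them: the two alphabets are assumptions (the 20 standard amino acids plus B, Z, and X, extended with U, * and the gap character, and the IUPAC nucleotide codes plus the gap character), the function names are hypothetical, and the default width of 70 is an arbitrary value below the 80-character limit.

```python
# Assumed alphabets; adjust them to whatever the target program accepts.
PROTEIN_CODES = set("ABCDEFGHIKLMNPQRSTVWXYZ") | set("U*-")
NUCLEOTIDE_CODES = set("ACGTURYSWKMBDHVN") | set("-")

def normalize_sequence(seq, alphabet=PROTEIN_CODES):
    """Upper-case a sequence and reject characters outside the alphabet."""
    cleaned = "".join(seq.split()).upper()  # drop whitespace and line breaks
    invalid = sorted(set(cleaned) - alphabet)
    if invalid:
        # Digits end up here, since they are not part of either alphabet.
        raise ValueError("unexpected characters: " + "".join(invalid))
    return cleaned

def wrap_sequence(seq, width=70):
    """Break a sequence into lines shorter than 80 characters."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]
```

Rejecting digits outright is the strict choice. A more lenient reader could strip position numbers before validating, at the risk of hiding malformed input.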

Aligned FASTA

Aligned FASTA format is a simple variant of standard FASTA format. To make an aligned FASTA file, individual sequences in FASTA format are concatenated together and made the same length by including characters for leading, trailing, and gap positions. Depending on the program, any non-alphanumeric characters may be tolerated in these positions.
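
The equal-length requirement is easy to verify in code. The following sketch assumes records are (header, sequence) pairs. The helper names are hypothetical, and using "-" as the padding character is an assumption, since the accepted characters depend on the program. Note that padding only equalizes lengths; it does not compute an alignment.

```python
def is_aligned(records):
    """Return True if all sequences in (header, sequence) pairs share one length."""
    lengths = {len(seq) for _, seq in records}
    return len(lengths) <= 1

def pad_trailing(records, gap="-"):
    """Right-pad shorter sequences with a gap character to a common length."""
    records = list(records)
    width = max((len(seq) for _, seq in records), default=0)
    return [(header, seq.ljust(width, gap)) for header, seq in records]
```

For example, padding the pairs `("a", "ACGT")` and `("b", "AC")` leaves the first unchanged and turns the second into `AC--`.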
