Metagenomic and rDNA Taxonomic Assignment (MARTA)
This Java-based software blasts each sequence that you provide, and then looks for a consensus taxon among the top-hits returned from blast.
Briefly, MARTA uses NCBI’s megablast program to align your sequence(s) against a local installation of blast. Then MARTA uses GenInfo Identifiers from the top-hits to retrieve taxonomic information from NCBI’s taxonomy database. Using your thresholds/cutoffs, MARTA ‘votes’ to find a taxonomic assignment by consensus; MARTA might resolve some sequences to species level, and others to kingdom or to no level, depending on the taxonomic information held within your tag or sequence.
MARTA takes as input a tab-delimited text file in which each line contains an id and a sequence to blast. Using the options that you choose, MARTA asks NCBI’s ‘megablast’ program for the top-n (default: 100) alignments and descriptions. MARTA then parses the output from megablast to retrieve the bitscores, e-values and GenInfo Identifiers (GIs) for each of the top-n hits. Using the GI values, MARTA retrieves the taxonomic information from a local database (described below), built from NCBI’s taxonomy database dump. Taxonomic information is retrieved only for hits that cover (overlap) at least 80% of the query sequence; the other hits are filtered out, and if no hits remain after this ‘query coverage’ check, the read is labeled ‘Filtered’.
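The ‘query coverage’ check can be sketched in a few lines. This is a minimal Python illustration, not MARTA’s own Java code: the Hit fields, the function names, and the way coverage is computed (aligned query span divided by query length) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    gi: str          # GenInfo Identifier of the subject sequence
    bitscore: float
    evalue: float
    q_start: int     # 1-based, inclusive start of the alignment on the query
    q_end: int       # 1-based, inclusive end of the alignment on the query


def filter_by_coverage(hits, query_length, min_coverage=0.80):
    """Keep hits whose alignment spans at least min_coverage of the query."""
    kept = []
    for hit in hits:
        aligned = abs(hit.q_end - hit.q_start) + 1
        if aligned / query_length >= min_coverage:
            kept.append(hit)
    return kept


def coverage_label(hits, query_length, min_coverage=0.80):
    """Return 'Filtered' when hits exist but none passes the check, else None."""
    if hits and not filter_by_coverage(hits, query_length, min_coverage):
        return "Filtered"
    return None
```

A read with no hits at all is not handled here; it receives a different label than ‘Filtered’.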
MARTA determines the taxonomic assignment of your blasted read by sorting the bitscores and then ‘voting’; that is, MARTA looks for consensus in the determined taxonomy of your top-hits. If you ask MARTA to use a best score strategy, then the program will sort the hits by bitscore and vote on the top-hit(s) only (note that ties are common). If you ask the program to use the best bitscore and there is no taxonomic information for the top-hit, or hits in the case of ties, then the sequence will not have a taxonomic assignment (the call is: ‘Uncertain’). If no sequences are returned from the BLAST utility, the assignment is “No significant similarity found”.
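As a rough sketch of the best score strategy described above (in Python, with hypothetical function names and a simplified hit representation of (bitscore, taxon) pairs, where taxon is None when no taxonomic information is available), the candidate voters could be selected like this:

```python
def best_score_candidates(hits):
    """Pick the voters for a best-score strategy.

    hits: list of (bitscore, taxon) pairs; taxon is None when the hit has
    no taxonomic information. Returns (label, voters): label is a final
    call when no vote is possible, otherwise None.
    """
    if not hits:
        return "No significant similarity found", []

    # Ties are common, so keep every hit that shares the top bitscore.
    top = max(score for score, _ in hits)
    voters = [taxon for score, taxon in hits if score == top and taxon is not None]

    if not voters:
        # The top-hit (or every tied top-hit) lacks taxonomic information.
        return "Uncertain", []
    return None, voters
```

The returned voters would then go through the threshold-based vote; that step is left out of this sketch.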
Many BLAST users will argue that there is taxonomically important information after the best scoring hit. If you do not require the program to use the blast-result with the highest bitscore, the program will use a ‘slippage tolerance’ to try to find the taxonomic assignment before stopping at the calculated ‘stop-score’ floor.
Consider the following example:
1.) You blast a sequence and get 100 alignments and descriptions. By default, your hits are sorted in NCBI’s Blast browser.
2.) At first glance, there seems to be some support for the assignment ‘Tetracladium’; however, the top scoring hit lacks a phylogenetic classification. If you do not ask MARTA to use a best-score strategy:
- MARTA will then use the slippage tolerance to define the set of candidates that will be considered. By default, the slippage tolerance is ‘98’ (percent), and the lowest score that will be tolerated is a function of the ‘top-score’ (here 431). The ‘stop-score’ in this example (using the default slippage tolerance) is 431 * 0.98 = 422.38.
- After reviewing the ‘Uncultured’ read at 431, the program will step to the shelf of hits whose bitscores are 425. By default, only Cultured hits vote; here, Tetracladium sp. wins (see voting thresholds below).
- Since this is a ‘spuh’ (no species epithet), the assignment level is ‘GENUS’.
- One can easily distinguish between assignments whose bitscores were the top scoring hits, and assignments that ‘slipped’ to some extent.
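The stop-score arithmetic and the ‘shelves’ of tied bitscores in this example can be sketched as follows. This is an illustrative Python sketch, not MARTA’s implementation: the function names are hypothetical, hits are simplified to (bitscore, name) pairs, and treating a hit exactly at the stop-score as still tolerated is an assumption.

```python
from itertools import groupby


def stop_score(top_score, slippage_tolerance=98):
    """Lowest tolerated bitscore: the top score scaled by the tolerance (in percent)."""
    return top_score * slippage_tolerance / 100.0


def candidate_shelves(hits, slippage_tolerance=98):
    """Group the tolerated hits into shelves of equal bitscore, best shelf first.

    hits: list of (bitscore, name) pairs.
    """
    if not hits:
        return []
    floor = stop_score(max(score for score, _ in hits), slippage_tolerance)
    # Assumption: a hit exactly at the stop-score is still considered.
    kept = sorted((h for h in hits if h[0] >= floor), key=lambda h: h[0], reverse=True)
    return [
        (score, [name for _, name in group])
        for score, group in groupby(kept, key=lambda h: h[0])
    ]
```

With a top score of 431, the shelf at 425 is above the stop-score of 422.38 and is therefore kept, while a hit at 420 would be dropped.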
To search for the consensus taxon among the considered taxa, one imposes a threshold… or some sort of taxonomic-level cutoff. The default cutoffs that MARTA uses are:
- Superkingdom: 1
- Kingdom: 1
- Phylum: 1
- Class: 1
- Order: 1
- Family: 1
- Genus: 2/3
- Species: 2/3
These cutoffs are set with a command-line argument and can be altered at runtime. MARTA tries to “succeed early” by voting among bitscore ties at the Species level first, then the Genus, Family, Order, Class, Phylum and Kingdom levels, finishing at the Superkingdom (Domain) level. If there is no agreement at any of the levels, given the threshold, then the read is labeled ‘Uncertain’.
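A “succeed early” vote with the default cutoffs might look like the sketch below. It is written in Python for illustration only and rests on several assumptions that the text does not settle: each voting hit is represented as a dict from rank to name, a hit that lacks a rank counts against agreement at that rank, and a cutoff is met when the winning share is greater than or equal to it.

```python
from collections import Counter
from fractions import Fraction

# Most specific rank first, so the vote can "succeed early".
RANKS = ["species", "genus", "family", "order", "class", "phylum", "kingdom", "superkingdom"]

DEFAULT_CUTOFFS = {
    "superkingdom": Fraction(1), "kingdom": Fraction(1), "phylum": Fraction(1),
    "class": Fraction(1), "order": Fraction(1), "family": Fraction(1),
    "genus": Fraction(2, 3), "species": Fraction(2, 3),
}


def vote(lineages, cutoffs=DEFAULT_CUTOFFS):
    """Return (rank, name) for the first rank that reaches its cutoff.

    lineages: one dict per voting hit, mapping rank -> taxon name.
    Returns (None, "Uncertain") when no rank reaches its cutoff.
    """
    for rank in RANKS:
        names = [lineage[rank] for lineage in lineages if lineage.get(rank)]
        if not names:
            continue
        name, count = Counter(names).most_common(1)[0]
        # Assumption: hits without a name at this rank still count in the denominator.
        if Fraction(count, len(lineages)) >= cutoffs[rank]:
            return rank, name
    return None, "Uncertain"
```

Exact fractions are used so that a 2-out-of-3 vote meets a 2/3 cutoff without floating-point rounding surprises.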
MARTA builds a directory for each sequence that you provide it. If you’re blasting 1,000,000 sequences, this might be surprising, but I try to take steps to lessen the sysadmin’s shock. For example, if you run the program in parallel mode, the program splits your input file into a bunch of batches, each with its own directory structure. The collect tool will navigate through these trees to get all of your taxonomic assignments. In a future version of MARTA, the directory tree will be avoided, and MARTA will just build a series of voting files that will be parsed after the fact.
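The batching idea can be illustrated with a short Python sketch. It only shows how a tab-delimited input (one id and one sequence per line) could be parsed and divided into batches; the function names and the batch size are hypothetical, and MARTA’s actual batch layout and directory naming are not reproduced here.

```python
def parse_records(lines):
    """Parse tab-delimited lines of 'id<TAB>sequence', skipping blank lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        seq_id, sequence = line.split("\t", 1)
        records.append((seq_id, sequence))
    return records


def split_into_batches(records, batch_size):
    """Split records into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]
```

Each batch could then be given its own directory tree and submitted as a separate job.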
For now, each sequence has a directory of its supporting files. This structure will allow you to quickly review the files that MARTA used to determine the taxonomy of your sequence. This structure also allows you to quickly Revote (without blasting) when you decide to change your voting strategy.
The sequence-id is: S169917 and based on our voting thresholds (above), this sequence was determined to be ‘Plectosphaerella’ at the Genus level. When you’re done voting, you need to ask MARTA to collect the assignments from your output directory. Instructions are shown below.
There are several strategies for voting. MARTA can align your sequences against the blast installation, but it can also ‘revote’, leveraging your previous blast result (the bottleneck in assigning a taxonomic label to a sequence/read is the alignment itself).
MARTA is written in JAVA and should run on any OS. Although the program was written on a laptop running Mac OS X (> 10.4), for large datasets you should probably run the software on a cluster of several *nix or Unix-like machines, each having 4 or more cpus. MARTA can run your sequence file in-tandem, or in-parallel (preferable); in the latter case MARTA outputs .sh files that it runs using qsub. Parallel mode therefore needs Sun’s Grid Engine (SGE) installed; the parameters for SGE may have been modified by your administrator, and you might prefer to recompile MARTA with your parameters in mind.
NCBI’s megablast application seems to do quite well when you give it several cpus to work with (with the -a option). My experience is that blasting large #s of sequences (>>10k) runs in a timely fashion when I use an ‘a’ option (cpu-number) of >= 3 for each instance of megablast. Please collaborate with your system administrator to make sure that your use of MARTA is neighborly for your grid system.
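For orientation, a megablast invocation with several cpus could be assembled as in the Python sketch below. Only the -a option is named in this document; the other flags (-i, -d, -o, -v, -b) are the legacy megablast options for query file, database, output file, number of descriptions and number of alignments, and should be checked against your installed version. The function name and file names are hypothetical, and this is not necessarily the command line that MARTA itself builds.

```python
def megablast_command(query_fasta, database, output_path, cpus=3, top_n=100):
    """Build an argument list for legacy megablast (no shell quoting needed)."""
    return [
        "megablast",
        "-i", query_fasta,     # query sequences
        "-d", database,        # local blast database name
        "-o", output_path,     # where to write the report
        "-a", str(cpus),       # number of processors for this instance
        "-v", str(top_n),      # number of one-line descriptions to report
        "-b", str(top_n),      # number of alignments to report
    ]
```

The list can be passed to subprocess.run once megablast is on your PATH; on a shared grid, keep cpus in line with what your administrator allows per job.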
MARTA is written in JAVA and will run provided that you have a JVM to launch it. You can either install blast locally or ask your system administrator to put it in a place accessible to all users. If your administrator is generous enough to install NCBI’s blast database and programs for you, ask her or him if s/he would also write a CRON job to update the files some (innocuous) night during the week.
To run MARTA you will need the following:
- Some Unix-like system should do. If you have a Windows PC, consider running Linux in a virtual machine or from a USB drive.
- An installation of R.
- A current java installation: http://www.java.com/en/download/manual.jsp
- Instructions for downloading the blast database(s) and blast program(s) are found here: http://www.ncbi.nlm.nih.gov/BLAST/download.shtml (ask your administrator to wrap his/her CRON job around the perl updater). Don’t forget that Blast expects an .ncbirc file in your home directory! Currently, MARTA is only configured to use megablast.