Tutorial
In this tutorial we will run TCUP on a sample input file that is available for download from here:
http://bioinformatics.math.chalmers.se/tcup/tutorial/tcup_tutorial_sample.zip
The input file contains 1500 peptides in FASTA format, randomly subsampled from a real tandem mass spectrometry sample that was searched using X!Tandem against a large peptide database of bacterial proteins.
Preparations
Follow the installation instructions in Installation to install TCUP. When you have completed the installation, download the following databases that are required to run TCUP:
- Reference genome database. BLAST or BLAT reference databases [FASTA or BLASTDB]. About 21 GB uncompressed.
- Taxonomy (“taxref”) database. Links reference genome sequences to taxonomy structure [sqlite3]. About 340 MB uncompressed.
- Annotation database. Contains annotations for all genome sequences included in the reference genome database [sqlite3]. About 5 GB uncompressed.
- Resistance gene database. Based on ResFinder [FASTA or BLASTDB and sqlite3]. About 1 MB uncompressed.
Example versions of these databases can be downloaded from the links in the list above. Note that the databases are quite large. The total download size is approximately 10.3 GB. There are instructions on how to create your own databases in Preparing databases for use with TCUP. Do not forget to download the sample (linked in the section above).
You also need to install either BLAT (on Linux) or BLAST (on Windows), depending on your OS. Please refer to the installation instructions of the respective tool.
Running TCUP
A simple way to run TCUP is to use the included run_tcup.py script that is installed as part of the TCUP package. Running run_tcup without any arguments (or with -h/--help) produces this help output:
$ run_tcup.py
usage: run_tcup.py [-h] -t TAXREF_DB -a ANNOTATION_DB -r RESISTANCE_DB
                   SAMPLE GENOME_DB RESISTANCE_DB

TCUP wrapper; align peptides to reference databases in parallel and run TCUP
on alignment results. Fredrik Boulund 2016

positional arguments:
  SAMPLE                FASTA file with peptides from tandem MS.
  GENOME_DB             Reference bacterial genome db (FASTA or blastdb format
                        depending on OS).
  RESISTANCE_DB         Antibiotic resistance gene db (FASTA or blastdb format
                        depending on OS).

optional arguments:
  -h, --help            show this help message and exit

Taxonomic composition:
  -t TAXREF_DB, --taxref-db TAXREF_DB
                        Path to taxref db (sqlite3).
  -a ANNOTATION_DB, --annotation-db ANNOTATION_DB
                        Path to annotation db (sqlite3).

Antibiotic resistance:
  -r RESISTANCE_DB, --resistance-db RESISTANCE_DB
                        Path to resistance db (sqlite3).
As indicated by the help text, we need to supply a number of arguments to run_tcup in order to run TCUP. The program requires the SAMPLE file containing the sample peptides as the first argument. The second argument is the path to the GENOME_DB. The third argument is the path to the RESISTANCE_DB. In addition to these positional arguments, three additional arguments are required: -t specifies the path to the taxref.sqlite3 file, -a is the path to the annotation_db.sqlite3 file, and -r is the path to the resfinder.sqlite3 file.
The commands below assume that you have downloaded and extracted all the databases listed above, together with the tutorial sample FASTA file, into a folder structured like this:
tutorial/
    tcup_tutorial_sample.fasta
    databases/
        annotation_db.sqlite3
        reference_genomes.00.nhr
        reference_genomes.00.nin
        reference_genomes.00.nsq
        reference_genomes.01.nhr
        reference_genomes.01.nin
        reference_genomes.01.nsq
        reference_genomes.01.fasta
        reference_genomes.02.nhr
        reference_genomes.02.nin
        reference_genomes.02.nsq
        reference_genomes.02.fasta
        reference_genomes.03.nhr
        reference_genomes.03.nin
        reference_genomes.03.nsq
        reference_genomes.04.nhr
        reference_genomes.04.nin
        reference_genomes.04.nsq
        reference_genomes.nal
        resfinder.fasta
        resfinder.phr
        resfinder.pin
        resfinder.psq
        resfinder.sqlite3
        taxref.sqlite3
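Before starting a run that may take hours, it can be worth checking that the files named on the command line are actually in place. The following is a minimal sketch, not part of TCUP: the helper missing_files and the list REQUIRED_FILES are hypothetical names, and the paths are taken from the folder listing above, so adjust them if your download is organised differently.

```python
from pathlib import Path

# Paths relative to the tutorial folder, copied from the listing above.
# This list is an assumption of the sketch; extend it to match your setup.
REQUIRED_FILES = [
    "tcup_tutorial_sample.fasta",
    "databases/taxref.sqlite3",
    "databases/annotation_db.sqlite3",
    "databases/resfinder.sqlite3",
]


def missing_files(root, required=REQUIRED_FILES):
    """Return the required paths (relative to root) that do not exist."""
    root = Path(root)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required files found.")
```

Run the script from inside the tutorial folder; an empty result means the sqlite3 databases and the sample are where the commands in this section expect them.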
To run TCUP on Windows, type the following command line (without linebreaks):
> run_tcup.exe
-t databases\taxref.sqlite3
-a databases\annotation_db.sqlite3
-r databases\resfinder.sqlite3
tcup_tutorial_sample.fasta
databases\reference_genomes
databases\resfinder
To run TCUP on Linux, type the following command line (without linebreaks):
$ run_tcup
-t databases/taxref.sqlite3
-a databases/annotation_db.sqlite3
-r databases/resfinder.sqlite3
tcup_tutorial_sample.fasta
databases/reference_genomes
databases/resfinder.fasta
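If you prefer to launch the run from a script, the two command lines above can be assembled as an argument list. This is only a sketch under the assumptions of this tutorial: the executable names and paths are copied from the commands above, while build_command and its system parameter are hypothetical helpers, not part of TCUP.

```python
import platform


def build_command(system=None):
    """Return the run_tcup argument list for the tutorial folder layout.

    `system` defaults to the current OS name as reported by
    platform.system(); it is a parameter of this sketch only.
    """
    system = system or platform.system()
    if system == "Windows":
        # The Windows command above passes database prefixes without extension.
        return [
            "run_tcup.exe",
            "-t", r"databases\taxref.sqlite3",
            "-a", r"databases\annotation_db.sqlite3",
            "-r", r"databases\resfinder.sqlite3",
            "tcup_tutorial_sample.fasta",
            r"databases\reference_genomes",
            r"databases\resfinder",
        ]
    # The Linux command above passes the resistance database as a FASTA file.
    return [
        "run_tcup",
        "-t", "databases/taxref.sqlite3",
        "-a", "databases/annotation_db.sqlite3",
        "-r", "databases/resfinder.sqlite3",
        "tcup_tutorial_sample.fasta",
        "databases/reference_genomes",
        "databases/resfinder.fasta",
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected before running it.
    print(" ".join(build_command()))
```

The returned list can be passed to subprocess.run(build_command(), check=True) from inside the tutorial folder. Passing the arguments as a list avoids the line-break problem mentioned above.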
Running TCUP on the tutorial sample will take some time, sometimes up to a couple of hours, depending on your computer’s speed and amount of memory. When it finishes, TCUP will have produced the following output files:
tcup_tutorial_sample.fasta.genomes.blast8
tcup_tutorial_sample.fasta.ar.blast8
tcup_tutorial_sample.fasta.antibiotic_resistance.txt
tcup_tutorial_sample.fasta.taxonomic_composition.txt
tcup_tutorial_sample.fasta.taxonomic_composition.xlsx
The mapping output files (*.blast8) contain the raw mapping results in BLAST tabular format (BLAST itself calls this output format 6). The *.txt and *.xlsx files contain the output from TCUP.
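To inspect the raw mapping results yourself, the tabular files can be read line by line. The sketch below assumes the standard 12-column, tab-separated BLAST tabular layout; the field names, the Hit tuple, the parse_blast8 helper and the sample line are all hypothetical and chosen for this example, not taken from TCUP.

```python
from collections import namedtuple

# Column order of standard BLAST tabular output (12 tab-separated columns).
# The field names are labels chosen for this sketch.
BLAST8_FIELDS = [
    "query", "subject", "identity", "length", "mismatches", "gap_openings",
    "q_start", "q_end", "s_start", "s_end", "evalue", "bitscore",
]
Hit = namedtuple("Hit", BLAST8_FIELDS)


def parse_blast8(lines):
    """Yield one Hit per non-empty, non-comment line."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        if len(f) != len(BLAST8_FIELDS):
            raise ValueError("expected 12 columns, got %d" % len(f))
        yield Hit(
            f[0], f[1], float(f[2]),
            int(f[3]), int(f[4]), int(f[5]),
            int(f[6]), int(f[7]), int(f[8]), int(f[9]),
            float(f[10]), float(f[11]),
        )


# Made-up line for illustration only; not real TCUP output.
SAMPLE = "# hypothetical example\npep1\tseqA\t100.00\t12\t0\t0\t1\t12\t3001\t3036\t1e-05\t30.1\n"

if __name__ == "__main__":
    for hit in parse_blast8(SAMPLE.splitlines()):
        print(hit.query, hit.subject, hit.identity)
```

For a real file, pass an open file handle instead of SAMPLE.splitlines(), for example parse_blast8(open("tcup_tutorial_sample.fasta.genomes.blast8")).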
Note
TCUP is actually not intended to be run via the ‘run_tcup’ script as described in this section. The script is provided as a convenient way to try out TCUP and see how it works. For real-world use of TCUP, please refer to Running TCUP.
In the next section we will analyze the output from TCUP.
Analysis of the results
Note
NCBI BLAST produces more false positives than BLAT, and TCUP has only been optimized for use with BLAT at this time. The use of BLAST together with TCUP to determine taxonomic composition or expressed antibiotic resistance peptides is currently not recommended. Thus, if you are running TCUP on Windows, keep in mind that the results likely will contain a higher number of false positive assignments, both for taxonomic affiliation and antibiotic resistance detection.
Complete details on how to interpret TCUP output are available in Example output.
Taxonomic composition
First off, let’s have a look at the taxonomic composition of the sample. The taxonomic composition estimation is presented in two formats: plain text and as an Excel spreadsheet. They both contain the same information regarding the taxonomic composition estimation of the sample, but the Excel file also includes a sheet with information on hits to annotated regions of the reference sequences.
The table in the first sheet of tcup_tutorial_sample.fasta.taxonomic_composition.xlsx has the following columns:
Cumulative Count Percentage Rank Spname
The leftmost column, Cumulative, shows the number of peptides that are discriminative at the taxonomic rank specified in the Rank column. This forms a cumulative sum as you look at ranks higher up in the taxonomic hierarchy. For example, if the superkingdom rank were included in the results, it would contain the total number of discriminative peptides across all taxa in the bacterial tree.
The Percentage column shows the relative proportion of peptides classified to the species given in the Spname column. This number is relative to all other entries of the same taxonomic rank, e.g. the percentages across all species sum to 100%.
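The relationship between the cumulative sums and the per-rank percentages described above can be illustrated with a small worked example. This is a sketch of the arithmetic only, not TCUP's implementation: the peptide counts are invented, the helper names are hypothetical, and it assumes that each taxon has a count of peptides discriminative at exactly that taxon and that percentages are computed from the cumulative values.

```python
from collections import defaultdict

# Invented data: taxon -> (rank, parent taxon, peptides discriminative
# at exactly this taxon). The numbers are for illustration only.
TAXA = {
    "Streptococcus": ("genus", None, 10),
    "Streptococcus pneumoniae": ("species", "Streptococcus", 30),
    "Streptococcus mitis": ("species", "Streptococcus", 10),
    "Escherichia": ("genus", None, 5),
    "Escherichia coli": ("species", "Escherichia", 50),
}


def cumulative_counts(taxa):
    """Add each taxon's own count to itself and to all of its ancestors."""
    totals = defaultdict(int)
    for name, (_rank, _parent, count) in taxa.items():
        node = name
        while node is not None:
            totals[node] += count
            node = taxa[node][1]  # move up to the parent taxon
    return dict(totals)


def percentages(taxa, totals):
    """Express each cumulative count relative to all taxa of the same rank."""
    rank_sums = defaultdict(int)
    for name, (rank, _parent, _count) in taxa.items():
        rank_sums[rank] += totals[name]
    return {
        name: 100.0 * totals[name] / rank_sums[taxa[name][0]]
        for name in taxa
    }


if __name__ == "__main__":
    totals = cumulative_counts(TAXA)
    pct = percentages(TAXA, totals)
    for name, (rank, _parent, _count) in TAXA.items():
        print(f"{totals[name]:5d} {pct[name]:6.2f} {rank:8s} {name}")
```

In this example the genus Streptococcus accumulates 10 + 30 + 10 = 50 peptides, and the three species percentages (30, 10 and 50 out of 90) add up to 100%.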
The Excel format makes it easy to use the filtering functions in Excel to look at the most interesting parts of the results, e.g. to filter out only matches to the genus or species levels.
The second sheet in the Excel file contains a listing of all hits to regions in the reference genome sequences that were matched by any discriminative peptide.
Antibiotic resistance
Second, let’s have a look at the antibiotic resistance results. These are presented in a text file. The output contains four columns:
Disc. Hits % Family
The first column, Disc., shows the number of discriminative peptides that matched to the resistance gene family listed in the Family column. The Hits column shows how many separate matches the discriminative peptides produced to the family in question. The % column shows the proportion of peptides that matched to each family.
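The difference between the Disc. and Hits columns is easiest to see with a toy example. The sketch below shows one plausible reading of the description above, not TCUP's actual code: the matches and family names are invented, summarize is a hypothetical helper, and the % value is assumed to be each family's share of all discriminative peptides.

```python
from collections import defaultdict

# Invented (peptide, family) matches. pep1 matches family A twice,
# for instance to two different reference sequences in that family.
MATCHES = [
    ("pep1", "family A"),
    ("pep1", "family A"),
    ("pep2", "family A"),
    ("pep3", "family B"),
]


def summarize(matches):
    """Return {family: (disc, hits, percent)} for a list of matches."""
    peptides = defaultdict(set)  # distinct peptides per family (Disc.)
    hits = defaultdict(int)      # total number of matches per family (Hits)
    for peptide, family in matches:
        peptides[family].add(peptide)
        hits[family] += 1
    total = sum(len(p) for p in peptides.values())
    return {
        family: (
            len(peptides[family]),
            hits[family],
            100.0 * len(peptides[family]) / total,
        )
        for family in peptides
    }


if __name__ == "__main__":
    for family, (disc, n_hits, pct) in summarize(MATCHES).items():
        print(f"{disc:5d} {n_hits:5d} {pct:6.2f} {family}")
```

Here family A has two discriminative peptides but three hits, because one peptide produced two separate matches.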
Congratulations, you have now completed the tutorial. There is more detailed information on how to use TCUP in the Running TCUP section.