Tutorial

In this tutorial we will run TCUP on a sample input file that is available for download from here:

http://bioinformatics.math.chalmers.se/tcup/tutorial/tcup_tutorial_sample.zip

The input file contains 1500 peptides in FASTA format, randomly subsampled from a real tandem mass spectrometry sample that was searched using X!Tandem against a large peptide database of bacterial proteins.
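After unpacking the archive, a quick sanity check is to count the FASTA header lines; the tutorial sample should contain 1500 records. The following is a minimal sketch and not part of TCUP. The helper name is hypothetical and the example peptides are made up for illustration.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


# Tiny in-memory example; these peptide sequences are invented.
example = [
    ">peptide_1\n",
    "MKVLAAGIVGLK\n",
    ">peptide_2\n",
    "TTPEDLAR\n",
]
print(count_fasta_records(example))

# With the real sample (file name as used later in this tutorial):
# with open("tcup_tutorial_sample.fasta") as handle:
#     print(count_fasta_records(handle))
```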

Preparations

Follow the installation instructions in Installation to install TCUP. When you have completed the installation, download the following databases that are required to run TCUP:

  • Reference genome database. BLAST or BLAT reference databases [FASTA or BLASTDB]. About 21 GB uncompressed.
  • Taxonomy (“taxref”) database. Links reference genome sequences to taxonomy structure [sqlite3]. About 340 MB uncompressed.
  • Annotation database. Contains annotations for all genome sequences included in the reference genome database [sqlite3]. About 5 GB uncompressed.
  • Resistance gene database. Based on ResFinder [FASTA or BLASTDB and sqlite3]. About 1 MB uncompressed.

Example versions of these databases can be downloaded from the links in the list above. Note that the databases are quite large. The total download size is approximately 10.3 GB. There are instructions on how to create your own databases in Preparing databases for use with TCUP. Do not forget to download the sample (linked in the section above).

You also need to install either BLAT (Linux) or BLAST (Windows), depending on your operating system. Refer to each tool's own installation instructions.

Running TCUP

A simple way to run TCUP is to use the run_tcup.py script that is installed as part of the TCUP package. Running run_tcup.py without any arguments (or with -h/--help) prints the following help text:

$ run_tcup.py
usage: run_tcup.py [-h] -t TAXREF_DB -a ANNOTATION_DB -r RESISTANCE_DB
                   SAMPLE GENOME_DB RESISTANCE_DB

TCUP wrapper; align peptides to reference databases in parallel and run TCUP
on alignment results. Fredrik Boulund 2016

positional arguments:
  SAMPLE                FASTA file with peptides from tandem MS.
  GENOME_DB             Reference bacterial genome db (FASTA or blastdb format
                        depending on OS).
  RESISTANCE_DB         Antibiotic resistance gene db (FASTA or blastdb format
                        depending on OS).

optional arguments:
  -h, --help            show this help message and exit

Taxonomic composition:
  -t TAXREF_DB, --taxref-db TAXREF_DB
                        Path to taxref db (sqlite3).
  -a ANNOTATION_DB, --annotation-db ANNOTATION_DB
                        Path to annotation db (sqlite3).

Antibiotic resistance:
  -r RESISTANCE_DB, --resistance-db RESISTANCE_DB
                        Path to resistance db (sqlite3).

As indicated by the help text, we need to supply a number of arguments to run_tcup in order to run TCUP. The program requires the SAMPLE file containing the sample peptides as the first argument. The second argument is the path to the GENOME_DB. The third argument is the path to the RESISTANCE_DB. In addition to these positional arguments, three additional arguments are required: -t specifies the path to the taxref.sqlite3 file, -a is the path to the annotation_db.sqlite3 file, and -r is the path to the resfinder.sqlite3 file.

The commands below assume that you have downloaded and extracted all the databases listed above, and downloaded the tutorial sample FASTA file, into a folder like this:

tutorial/
    tcup_tutorial_sample.fasta
    databases/
        annotation_db.sqlite3
        reference_genomes.00.nhr
        reference_genomes.00.nin
        reference_genomes.00.nsq
        reference_genomes.01.nhr
        reference_genomes.01.nin
        reference_genomes.01.nsq
        reference_genomes.01.fasta
        reference_genomes.02.nhr
        reference_genomes.02.nin
        reference_genomes.02.nsq
        reference_genomes.02.fasta
        reference_genomes.03.nhr
        reference_genomes.03.nin
        reference_genomes.03.nsq
        reference_genomes.04.nhr
        reference_genomes.04.nin
        reference_genomes.04.nsq
        reference_genomes.nal
        resfinder.fasta
        resfinder.phr
        resfinder.pin
        resfinder.psq
        resfinder.sqlite3
        taxref.sqlite3
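Before starting a long run it can be worth confirming that the files are where the commands expect them. The sketch below is not part of TCUP; it checks only the sample and the three sqlite3 files, since those are used on both operating systems, and the function name is hypothetical.

```python
from pathlib import Path

# File names taken from the folder listing above.
REQUIRED = [
    "tcup_tutorial_sample.fasta",
    "databases/annotation_db.sqlite3",
    "databases/resfinder.sqlite3",
    "databases/taxref.sqlite3",
]


def missing_files(root):
    """Return the required paths that do not exist as files below root."""
    root = Path(root)
    return [name for name in REQUIRED if not (root / name).is_file()]


# An empty list means all four files were found in the tutorial folder.
print(missing_files("tutorial"))
```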

To run TCUP on Windows, type the following command line (without linebreaks):

> run_tcup.exe
     -t databases\taxref.sqlite3
     -a databases\annotation_db.sqlite3
     -r databases\resfinder.sqlite3
     tcup_tutorial_sample.fasta
     databases\reference_genomes
     databases\resfinder

To run TCUP on Linux, type the following command line (without linebreaks):

$ run_tcup
     -t databases/taxref.sqlite3
     -a databases/annotation_db.sqlite3
     -r databases/resfinder.sqlite3
     tcup_tutorial_sample.fasta
     databases/reference_genomes
     databases/resfinder.fasta

Running TCUP on the tutorial sample will take some time, sometimes up to a couple of hours, depending on your computer’s speed and amount of memory. When it finishes, TCUP will have produced the following output files:

tcup_tutorial_sample.fasta.genomes.blast8
tcup_tutorial_sample.fasta.ar.blast8
tcup_tutorial_sample.fasta.antibiotic_resistance.txt
tcup_tutorial_sample.fasta.taxonomic_composition.txt
tcup_tutorial_sample.fasta.taxonomic_composition.xlsx

The mapping output files (*.blast8) contain the raw mapping results in BLAST tabular format (BLAST itself calls this output format 6). The *.txt and *.xlsx files contain the output from TCUP.
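If you want to inspect the raw mapping files yourself, the tabular format is tab-separated with twelve standard fields per hit. The sketch below reads such a file into dictionaries; it is not part of TCUP, the dictionary key names are my own labels for the standard fields, and the example hit is invented.

```python
import csv
import io

# Labels for the twelve standard fields of BLAST tabular output.
BLAST8_COLUMNS = [
    "query", "subject", "identity", "length", "mismatches", "gap_opens",
    "q_start", "q_end", "s_start", "s_end", "evalue", "bitscore",
]


def read_blast8(handle):
    """Yield one dict per hit, skipping blank lines and '#' comment lines."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        yield dict(zip(BLAST8_COLUMNS, row))


# Invented example line; real files come from the TCUP run.
example = io.StringIO(
    "pep1\tgenomeA\t100.00\t12\t0\t0\t1\t12\t301\t336\t1e-05\t28.1\n"
)
hits = list(read_blast8(example))
print(hits[0]["query"], hits[0]["subject"], float(hits[0]["identity"]))
```

All values are kept as strings; convert fields such as identity or evalue with float() where you need numbers.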

Note

TCUP is actually not intended to be run via the ‘run_tcup’ script as described in this section. The script is provided as a convenience to easily try out TCUP to see how it works, but for real world use of TCUP, please refer to Running TCUP.

In the next section we will analyze the output from TCUP.

Analysis of the results

Note

NCBI BLAST produces more false positives than BLAT, and TCUP has only been optimized for use with BLAT at this time. The use of BLAST together with TCUP to determine taxonomic composition or expressed antibiotic resistance peptides is currently not recommended. Thus, if you are running TCUP on Windows, keep in mind that the results will likely contain a higher number of false positive assignments, both for taxonomic affiliation and antibiotic resistance detection.

Complete details on how to interpret TCUP output are available in Example output.

Taxonomic composition

First off, let’s have a look at the taxonomic composition of the sample. The taxonomic composition estimation is presented in two formats: plain text and as an Excel spreadsheet. They both contain the same information regarding the taxonomic composition estimation of the sample, but the Excel file also includes a sheet with information on hits to annotated regions of the reference sequences.

The table in the first sheet of tcup_tutorial_sample.fasta.taxonomic_composition.xlsx shows columns containing:

Cumulative  Count   Percentage  Rank    Spname

The leftmost column, Cumulative, shows the number of peptides that are discriminative at the taxonomic rank specified in the Rank column. This forms a cumulative sum as you look at ranks higher up in the taxonomic hierarchy. If, for example, the superkingdom rank were included in the results, it would contain the cumulative sum of discriminative peptides across all taxa in the bacterial tree.
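To make the idea of a cumulative sum concrete, the sketch below propagates per-taxon peptide counts up a small parent map, so that each taxon's total includes everything below it. This illustrates the principle only and is not TCUP's implementation; the function name and the counts are invented.

```python
def cumulative_counts(own_counts, parent_of):
    """Add each taxon's own count to itself and to all of its ancestors."""
    taxa = set(own_counts) | set(parent_of) | set(parent_of.values())
    cumulative = dict.fromkeys(taxa, 0)
    for taxon, count in own_counts.items():
        node = taxon
        while node is not None:
            cumulative[node] += count
            node = parent_of.get(node)  # None once the top is reached
    return cumulative


# Child -> parent links for a tiny part of the taxonomy.
parent_of = {
    "Streptococcus pneumoniae": "Streptococcus",
    "Streptococcus mitis": "Streptococcus",
    "Streptococcus": "Streptococcaceae",
}
# Invented numbers of peptides discriminative at exactly each taxon.
own_counts = {
    "Streptococcus pneumoniae": 30,
    "Streptococcus mitis": 6,
    "Streptococcus": 4,
}
totals = cumulative_counts(own_counts, parent_of)
print(totals["Streptococcus"], totals["Streptococcaceae"])
```

Here the genus total is the sum of its own peptides and those of both species, and the family inherits the same total because it has no peptides of its own.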

The Percentage column shows the relative proportion of peptides classified to the taxon given in the Spname column. This number is relative to all other entries of the same taxonomic rank, e.g. the percentages across all species sum to 100%.
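The within-rank normalization can be illustrated with a few lines of code. The sketch below takes one count per row and divides it by the total for that row's rank; which column TCUP actually uses for this is not stated here, so treat the input as an assumption. The function name and the counts are invented.

```python
from collections import defaultdict


def percentages_within_rank(rows):
    """rows: iterable of (count, rank, name). Return {(rank, name): percent}."""
    rows = list(rows)
    totals = defaultdict(int)
    for count, rank, _name in rows:
        totals[rank] += count
    return {
        (rank, name): (100.0 * count / totals[rank]) if totals[rank] else 0.0
        for count, rank, name in rows
    }


# Invented example rows.
rows = [
    (40, "genus", "Streptococcus"),
    (10, "genus", "Staphylococcus"),
    (30, "species", "Streptococcus pneumoniae"),
    (6, "species", "Streptococcus mitis"),
    (4, "species", "Staphylococcus aureus"),
]
percent = percentages_within_rank(rows)
for (rank, name), value in percent.items():
    print(f"{rank:8s} {name:26s} {value:6.2f}")
```

Each rank is normalized separately, so the genus rows and the species rows each add up to 100 on their own.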

The Excel format makes it easy to use the filtering functions in Excel to look at the most interesting parts of the results, e.g. to filter out only matches to the genus or species levels.

The second sheet in the Excel file contains a listing of all hits to regions in the reference genome sequences that were matched by any discriminative peptide.

Antibiotic resistance

Second, let’s have a look at the antibiotic resistance results. These are presented in a text file. The output contains four columns:

Disc.  Hits   %    Family

The first column, Disc., shows the number of discriminative peptides that matched to the resistance gene family listed in the Family column. The Hits column shows how many separate matches the discriminative peptides produced to the family in question. The % column shows the proportion of peptides that matched to each family.
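For further processing, the four columns can be read into a simple data structure. The sketch below assumes that the text file is whitespace-separated with the family name as the last field, which is a guess based on the header shown above rather than a documented format. The function name, the family names, and the numbers are placeholders.

```python
def parse_resistance_table(text):
    """Parse rows of 'Disc. Hits % Family'; skip the header and blank lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 3)  # family name may contain spaces
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        disc, hits, percent, family = fields
        rows.append({
            "disc": int(disc),
            "hits": int(hits),
            "percent": float(percent.rstrip("%")),
            "family": family.strip(),
        })
    return rows


# Placeholder content in the assumed layout.
example_text = (
    "Disc.  Hits   %    Family\n"
    "12     15     75.00  family_A\n"
    "4      4      25.00  family_B\n"
)
table = parse_resistance_table(example_text)
best = max(table, key=lambda row: row["disc"])
print(best["family"], best["disc"], best["percent"])
```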

Congratulations, you have now completed the tutorial. There is more detailed information on how to use TCUP in the Running TCUP section.