Tutorial

Before starting, make sure you have followed the instructions in Installation and Setup.

Also see the Command Line Interface page for complete documentation of all GAMBIT subcommands and options.

Example data set

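The examples on this page make use of the following genome assembly files:

16AC1611138-CAP.fasta.gz
17AC0001410A.fasta.gz
17AC0006310.fasta.gz
17AC0006313-1.fasta.gz
19AC0011210.fasta.gz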

From here on we’ll pretend they’ve been downloaded to a directory called genomes/.

These are from genome set 3 in the initial GAMBIT publication, derived from clinical samples.

Telling GAMBIT where the database files are

The query command requires the GAMBIT reference database files. You can let GAMBIT know which directory contains these files in one of two ways:

Using a command line option

The first is to explicitly pass it via the command line using the --db option, like so:

gambit --db /path/to/database/ COMMAND ...

Note that this option must appear immediately after gambit and before the command name.

Using an environment variable

The second is to use the GAMBIT_DB_PATH environment variable. This can be done by running the following command at the beginning of your shell session:

export GAMBIT_DB_PATH="/path/to/database/"

Alternatively, you can add this line to your .bashrc to have it apply to all future sessions (make sure to restart your current session after doing so).
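Before running a long query it can help to confirm that the variable is set and points to an existing directory. The following is a minimal sketch; check_gambit_db is a hypothetical helper name, not part of GAMBIT, and it only checks that the directory exists, not that it contains valid database files.

```shell
# Hypothetical helper (not part of GAMBIT): verify that GAMBIT_DB_PATH
# is set and refers to an existing directory.
check_gambit_db() {
    if [ -z "${GAMBIT_DB_PATH:-}" ]; then
        echo "GAMBIT_DB_PATH is not set" >&2
        return 1
    fi
    if [ ! -d "$GAMBIT_DB_PATH" ]; then
        echo "GAMBIT_DB_PATH is not a directory: $GAMBIT_DB_PATH" >&2
        return 1
    fi
    echo "GAMBIT_DB_PATH points to $GAMBIT_DB_PATH"
}
```

Calling check_gambit_db at the top of a script returns a non-zero status when the variable is missing or wrong, so the script can stop before invoking gambit.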

Genome input

Genome assemblies used as input must be in FASTA format, optionally compressed with gzip.
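As a rough pre-flight check, you can test whether a file starts with the ">" character that opens a FASTA header, decompressing first when the name ends in .gz. This is only a sketch: looks_like_fasta is a hypothetical helper, not a GAMBIT command, and it inspects a single byte rather than validating the whole file.

```shell
# Hypothetical helper (not part of GAMBIT): succeed if the first byte of
# the file, after optional gzip decompression, is the FASTA header marker.
looks_like_fasta() (
    file="$1"
    case "$file" in
        # Files ending in .gz are assumed to be gzip-compressed.
        *.gz) first=$(gzip -dc -- "$file" 2>/dev/null | head -c 1) ;;
        *)    first=$(head -c 1 -- "$file" 2>/dev/null) ;;
    esac
    [ "$first" = ">" ]
)
```

For example, looks_like_fasta genomes/17AC0006310.fasta.gz || echo "not FASTA" would print a warning for a file that fails the check.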

Most commands accept a list of genome files as positional arguments, e.g.:

gambit COMMAND [OPTIONS] genomes/16AC1611138-CAP.fasta.gz genomes/17AC0001410A.fasta.gz ...

or making use of shell expansion:

gambit COMMAND [OPTIONS] genomes/*.fasta.gz

Alternatively, you can use the -l option to provide a text file containing the genome file names or paths, one per line. If the --ldir option is given, paths in this file are interpreted relative to that directory.

So for example, you can create the file genomes.txt containing the following:

16AC1611138-CAP.fasta.gz
17AC0001410A.fasta.gz
17AC0006310.fasta.gz
17AC0006313-1.fasta.gz
19AC0011210.fasta.gz

The command would then be:

gambit COMMAND [OPTIONS] -l genomes.txt --ldir genomes/

This method makes more sense when you have a lot of files to include. Note that the gambit dist command has different names for these options because there are two lists of genomes to specify. See the Command Line Interface page for more complete information.
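The list file does not have to be written by hand. One possible way to generate it from the contents of a directory is sketched below; make_genome_list is a hypothetical helper name, and the .fasta.gz suffix is assumed from the example files.

```shell
# Hypothetical helper (not part of GAMBIT): write the names of all
# *.fasta.gz files in a directory to a list file, one per line.
# $1: directory containing the genome files; $2: list file to write
make_genome_list() {
    (
        # Change directory inside a subshell so the caller's working
        # directory is unaffected and names are written without the prefix.
        cd "$1" || exit 1
        for f in *.fasta.gz; do
            # Skip the literal pattern when no file matches.
            if [ -e "$f" ]; then printf '%s\n' "$f"; fi
        done
    ) > "$2"
}
```

Running make_genome_list genomes/ genomes.txt would produce a file suitable for use with -l genomes.txt --ldir genomes/, since the names are written relative to the genomes/ directory.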

Predicting taxonomy of unknown genomes

The query command compares a set of query genomes against the reference database and attempts to predict their taxonomy. The following runs a query with our five FASTA files and writes the results to out.csv:

gambit query -o out.csv genomes/*.fasta.gz

Contents of out.csv
query	predicted.name	predicted.rank	predicted.ncbi_id	predicted.threshold	closest.distance	closest.description	next.name	next.rank	next.ncbi_id	next.threshold
16AC1611138-CAP	Escherichia coli	species	562	0.0	0.1586	[GCF_000351725.1] Escherichia coli KTE77 (E. coli)
17AC0001410A	Enterococcus faecalis	species	1351	0.4697	0.1264	[GCF_000148005.1] Enterococcus faecalis DAPTO 512 (firmicutes)
17AC0006310	Bacillus cereus	species	1396	0.0	0.1068	[GCF_001619385.1] Bacillus cereus (firmicutes)
17AC0006313-1	Veillonella	genus	29465	0.9438	0.9126	[GCF_000024945.1] Veillonella parvula DSM 2008 (firmicutes)	Veillonella parvula	species	29466	0.6693
19AC0011210					0.9916	[GCF_000169595.1] Ureaplasma urealyticum serovar 9 str. ATCC 33175 (mycoplasmas)	Ureaplasma	genus	2129	0.8897

The predicted columns describe the predicted taxonomic classification of each query genome. closest.description is the database reference genome closest to the query, and closest.distance is the distance between them. The next columns have the same format as the predicted columns but describe the next most specific taxon, for which the classification threshold was not met.

In this example GAMBIT was able to make a species-level prediction for the first three genomes but stopped at the genus level for the fourth and made no prediction for the fifth. This is because GAMBIT attempts to be conservative and err on the side of making a less specific prediction, or no prediction at all, rather than giving false positives. The next columns can give you a clue as to what a more specific classification might be, however.
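To pick out the genomes for which no prediction was made, you can filter the results file on an empty predicted.name column. The sketch below assumes the file is comma-delimited with a header row, that predicted.name is the second column, and that the first two fields contain no quoted commas; list_unpredicted is a hypothetical helper name, not a GAMBIT command.

```shell
# Hypothetical helper (not part of GAMBIT): print the query IDs whose
# predicted.name field (assumed to be column 2) is empty.
list_unpredicted() {
    # NR > 1 skips the header row; -F, splits fields on commas.
    awk -F, 'NR > 1 && $2 == "" { print $1 }' "$1"
}
```

Running list_unpredicted out.csv prints one query ID per line. For anything beyond a quick check, a proper CSV parser is the safer choice, since this simple field splitting does not handle quoting.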

See the Command Line Interface page for a complete description of the output columns. The CSV output format should generally be sufficient, but there is also a JSON-based format that contains more detailed information and may be useful in pipelines. Pass -f json to select it.

Todo

Explain why predicted.threshold is sometimes zero for certain taxa.

Pre-computing k-mer signatures

TODO

Calculating GAMBIT distances

TODO

Creating relatedness trees

TODO