Command Line Interface

Genome assembly files accepted by the CLI must be in FASTA format, optionally compressed with gzip.

Root command group

gambit [OPTIONS] COMMAND [ARGS]...

Some top-level options are set at the root command group, and should be specified before the name of the subcommand to run.

Options

-d, --db DIR: Path to the directory containing reference database files. Required by the query subcommand. As an alternative you can specify the database location with the GAMBIT_DB_PATH environment variable.

Environment variables

GAMBIT_DB_PATH: Alternative to -d for specifying path to database.

Querying the database

“query” command

gambit query [OPTIONS] (-s SIGFILE | -l LISTFILE | GENOMES...)

Predict taxonomy of microbial samples from genome sequences.

The reference database must be specified at the root command group, either with the -d option or through the GAMBIT_DB_PATH environment variable.

Query genomes

Query genomes can be specified using one of the following methods:

Give paths of one or more genome files as positional arguments.
Use the -l option to specify a text file containing paths of the genome files.
Use the -s option to use a signatures file created with the signatures create command.

-l LISTFILE: File containing paths to genomes, one per line.

--ldir DIRECTORY: Parent directory of paths in file given by -l option.

-s, --sigfile FILE: A genome signatures file.
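As a rough sketch of the list-file method, the Python below writes a file with one genome path per line and assembles a matching argument list. The file names, the directories /data/genomes and /path/to/db, and the helper write_list_file are all hypothetical. It also assumes that paths in the list file are interpreted relative to the directory given by --ldir.

```python
from pathlib import Path
import tempfile

def write_list_file(path, genome_paths):
    """Write genome paths to a text file, one per line (hypothetical helper)."""
    Path(path).write_text("".join(f"{p}\n" for p in genome_paths))

# Placeholder working directory and genome file names.
workdir = Path(tempfile.mkdtemp())
list_file = workdir / "genomes.txt"
write_list_file(list_file, ["sample1.fasta", "sample2.fasta.gz"])

# Root option -d comes before the subcommand; -l, --ldir and -o follow it.
argv = [
    "gambit", "-d", "/path/to/db",
    "query",
    "-l", str(list_file),
    "--ldir", "/data/genomes",
    "-o", "results.csv",
]
```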

Additional Options

-o, --output FILE: File to write output to. If omitted will write to stdout.

-f, --outfmt {csv|json|archive}: Results format (see next section).

--progress / --no-progress: Show/don’t show progress meter.

-c, --cores INT: Number of CPU cores to use.

Result Formats

CSV

A .csv file with one row per query. Contains the following columns:

query - Query genome file name (minus extension).
predicted - Predicted taxon.
- predicted.name - Name of taxon.
- predicted.rank - Taxonomic rank (genus, species, etc.).
- predicted.ncbi_id - Numeric ID in NCBI taxonomy database, if any.
- predicted.threshold - Classification threshold.
closest - Reference genome closest to query.
- closest.distance - Distance to closest genome.
- closest.description - Text description.
next - Next most specific taxon for which the classification threshold was not met.
- next.name
- next.rank
- next.ncbi_id
- next.threshold
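The column list above can be read with Python's standard csv module, as in the sketch below. Several things here are assumptions rather than documented behavior: that the header uses the column names exactly as listed, that the description column is spelled closest.description, and that missing values appear as empty cells. The sample rows are made-up placeholders, not real GAMBIT output.

```python
import csv
import io

# Made-up placeholder content; not real GAMBIT results.
sample = (
    "query,predicted.name,predicted.rank,predicted.ncbi_id,predicted.threshold,"
    "closest.distance,closest.description,"
    "next.name,next.rank,next.ncbi_id,next.threshold\n"
    "sampleA,Genus one,genus,111,0.9,0.5,ref genome X,"
    "Genus one species,species,222,0.4\n"
    "sampleB,,,,,0.95,ref genome Y,,,,\n"
)

def summarize(csv_text):
    """Return (query, predicted name or None, closest distance) per row."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["predicted.name"] or None  # empty cell -> no prediction
        rows.append((row["query"], name, float(row["closest.distance"])))
    return rows

summary = summarize(sample)
```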

JSON

A machine-readable format meant to be used in pipelines.

Todo

Document schema

Archive

A more verbose JSON-based format used for testing and development.

Generating and inspecting k-mer signatures

“signatures info” command

gambit signatures info [OPTIONS] FILE

Print information about a GAMBIT signatures file. Defaults to a basic human-readable format.

Options

-j, --json: Print information in JSON format. Includes more information than standard output.

-p, --pretty: Prettify JSON output to make it more human-readable.

-i, --ids: Print IDs of all signatures in file.

“signatures create” command

gambit signatures create [OPTIONS] -o OUTFILE (-l LISTFILE | GENOMES...)

Calculate GAMBIT signatures of a set of genomes and write to a binary file.

Input/output

-l LISTFILE: File containing paths to genomes, one per line.

--ldir DIRECTORY: Parent directory of paths in file given by -l option.

-o, --output FILE: Path to write file to (required).

K-mer parameters

-k INTEGER: Length of k-mers to find (does not include length of prefix). Default is 11.

-p, --prefix STRING: K-mer prefix to match, a non-empty string of DNA nucleotide codes. Default is ATGAC.
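To illustrate how the two parameters relate, here is a simplified Python sketch of prefix-based k-mer search: each occurrence of the prefix is located and the k nucleotides immediately after it are collected. This is not GAMBIT's implementation. The function name find_kmers is hypothetical, and the sketch scans the forward strand only and discards k-mers that are truncated or contain non-ACGT characters, which may differ from what the tool does.

```python
def find_kmers(sequence, prefix="ATGAC", k=11):
    """Collect the k-mers that directly follow each occurrence of prefix.

    Simplified illustration: forward strand only, overlapping matches allowed.
    """
    sequence = sequence.upper()
    found = set()
    start = sequence.find(prefix)
    while start != -1:
        begin = start + len(prefix)
        kmer = sequence[begin:begin + k]
        # Keep only complete k-mers made of unambiguous nucleotides.
        if len(kmer) == k and set(kmer) <= set("ACGT"):
            found.add(kmer)
        start = sequence.find(prefix, start + 1)
    return found

# The second prefix match is too close to the end to yield a full 11-mer.
kmers = find_kmers("GGATGACAAAAACCCCCGTTATGACACG")
```

Note that k counts only the nucleotides after the prefix, so with the defaults each match spans 5 + 11 = 16 bases of the sequence.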

Metadata

-i, --ids FILE: File containing IDs to assign to signatures in file metadata. Should contain one ID per line. If omitted will use file names stripped of extensions.

-m, --meta-json FILE: JSON file containing metadata to attach to file.

Todo

Document metadata schema

Additional Options

--progress / --no-progress: Show/don’t show progress meter.

-c, --cores INT: Number of CPU cores to use.
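A possible workflow that combines the signatures subcommands with query is sketched below as Python argument lists. The file names signatures.gs and results.csv and the directory /path/to/db are placeholders chosen for illustration. Only the options themselves come from this page.

```python
genomes = ["genome1.fasta", "genome2.fasta.gz"]

# 1. Calculate signatures once and store them in a binary file.
create_argv = ["gambit", "signatures", "create", "-o", "signatures.gs", *genomes]

# 2. Inspect the file; for "signatures info", -p means --pretty (not --prefix).
info_argv = ["gambit", "signatures", "info", "-j", "-p", "signatures.gs"]

# 3. Reuse the precomputed signatures as query input with -s.
query_argv = [
    "gambit", "-d", "/path/to/db",
    "query", "-s", "signatures.gs", "-o", "results.csv",
]
```

Each list could be run with subprocess.run(argv, check=True) on a system where gambit is installed.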

Calculating genomic distances

“dist” command

gambit dist [OPTIONS] -o OUTFILE
    (-q GENOME... | --ql LISTFILE | --qs SIGFILE)
    (-r GENOME... | --rl LISTFILE | --rs SIGFILE | --square | --use-db)

Calculate pairwise distances between a set of query genomes and a set of reference genomes. Output is a .csv file. If using --qs along with --rs or --use-db, the k-mer parameters of the query signature file must match the reference parameters.

Query genomes

-q GENOME: Path to a single genome file. May be used multiple times.

--ql LISTFILE: File containing paths of genome files, one per line.

--qdir DIRECTORY: Parent directory of paths in file given by --ql option.

--qs SIGFILE: A genome signatures file.

Reference genomes

-r GENOME: Path to a single genome file. May be used multiple times.

--rl LISTFILE: File containing paths of genome files, one per line.

--rdir DIRECTORY: Parent directory of paths in file given by --rl option.

--rs SIGFILE: A genome signatures file.

-s, --square: Use same genomes as the query.

-d, --use-db: Use all genomes in reference database.

Output

-o FILE: File to write output to. Required.

K-mer parameters

Only allowed if query and reference genomes do not come from precomputed signature files.

-k INTEGER: Length of k-mers to find (does not include length of prefix). Default is 11.

-p, --prefix STRING: K-mer prefix to match, a non-empty string of DNA nucleotide codes. Default is ATGAC.
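As an illustration of how these options combine, the Python sketch below assembles a dist invocation that compares a set of genome files against themselves with --square. The helper dist_argv and the file names are hypothetical. Because both query and reference genomes come from sequence files rather than signature files, the k-mer options are permitted here.

```python
def dist_argv(output, query_genomes, k=None, prefix=None):
    """Build an all-vs-all 'gambit dist' argument list (hypothetical helper)."""
    argv = ["gambit", "dist", "-o", output]
    for genome in query_genomes:
        argv += ["-q", genome]  # -q may be given multiple times
    argv.append("--square")     # references are the same genomes as the queries
    if k is not None:
        argv += ["-k", str(k)]
    if prefix is not None:
        argv += ["-p", prefix]
    return argv

argv = dist_argv("distances.csv", ["a.fasta", "b.fasta"], k=11, prefix="ATGAC")
```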

Additional options

--progress / --no-progress: Show/don’t show progress meter.

-c, --cores INT: Number of CPU cores to use.

Creating relatedness trees

“tree” command

gambit tree [OPTIONS] (-l LISTFILE | -s SIGFILE | GENOMES...)

Estimate a relatedness tree for a set of genomes and output in Newick format.

Input/output

-l LISTFILE: File containing paths of genome files, one per line.

--ldir DIRECTORY: Parent directory of paths in file given by -l option.

-s, --sigfile SIGFILE: A genome signatures file.

-o FILE: File to write output to. If omitted will write to stdout.

Todo

Allow using a distance matrix calculated using gambit dist.

K-mer parameters

Not allowed if the -s/--sigfile option was used.

-k INTEGER: Length of k-mers to find (does not include length of prefix). Default is 11.

-p, --prefix STRING: K-mer prefix to match, a non-empty string of DNA nucleotide codes. Default is ATGAC.

Additional options

--progress / --no-progress: Show/don’t show progress meter.

-c, --cores INT: Number of CPU cores to use.