Command Line Interface

Genome assembly files accepted by the CLI must be in FASTA format, optionally compressed with gzip.

Root command group

gambit [OPTIONS] COMMAND [ARGS]...

Some top-level options are set at the root command group, and should be specified before the name of the subcommand to run.

Options

-d, --db DIR

Path to the directory containing reference database files. Required by the query subcommand. Alternatively, the database location can be given with the GAMBIT_DB_PATH environment variable.

Environment variables

GAMBIT_DB_PATH

Alternative to -d for specifying the path to the database.
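The placement rule for root options can be sketched in Python when driving the CLI from a script. This is only an illustration: the helper names gambit_query_argv and env_with_db and the paths are hypothetical, and nothing here runs gambit itself.

```python
import os

def gambit_query_argv(genomes, db_dir=None):
    """Build an argument list for `gambit query` (hypothetical helper)."""
    argv = ["gambit"]
    if db_dir is not None:
        # Root-group option: it goes before the subcommand name.
        argv += ["-d", db_dir]
    argv.append("query")
    argv += list(genomes)
    return argv

def env_with_db(db_dir, base_env=None):
    """Return a copy of the environment with GAMBIT_DB_PATH set."""
    env = dict(os.environ if base_env is None else base_env)
    env["GAMBIT_DB_PATH"] = db_dir
    return env

# Placeholder paths, for illustration only.
print(gambit_query_argv(["genome1.fasta"], db_dir="/path/to/db"))
```

Either form could then be passed to subprocess.run, for example subprocess.run(argv, env=env_with_db("/path/to/db"), check=True) when relying on the environment variable instead of -d.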

Querying the database

“query” command

gambit query [OPTIONS] (-s SIGFILE | -l LISTFILE | GENOMES...)

Predict taxonomy of microbial samples from genome sequences.

The reference database must be specified at the root command group, using either the -d option or the GAMBIT_DB_PATH environment variable.

Query genomes

Query genomes can be specified using one of the following methods:

  • Give paths of one or more genome files as positional arguments.

  • Use the -l option to specify a text file containing paths of the genome files.

  • Use the -s option to use a signatures file created with the signatures create command.

-l LISTFILE

File containing paths to genomes, one per line.

--ldir DIRECTORY

Parent directory of paths in file given by -l option.
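A list file for -l can be generated with a few lines of Python. The sketch below writes bare file names so that the directory can be supplied separately through --ldir. The helper name write_list_file and the set of file extensions are assumptions made for the example, not part of GAMBIT.

```python
from pathlib import Path
import tempfile

# Hypothetical set of extensions; adjust to match your own files.
FASTA_SUFFIXES = (".fasta", ".fa", ".fasta.gz", ".fa.gz")

def write_list_file(genome_dir, list_path):
    """Write the names of FASTA files in genome_dir to list_path, one per line."""
    names = sorted(
        p.name for p in Path(genome_dir).iterdir()
        if p.is_file() and p.name.endswith(FASTA_SUFFIXES)
    )
    Path(list_path).write_text("".join(name + "\n" for name in names))
    return names

# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    genome_dir = Path(tmp) / "genomes"
    genome_dir.mkdir()
    for name in ("b.fasta", "a.fa.gz", "notes.txt"):
        (genome_dir / name).touch()
    list_path = Path(tmp) / "genomes.txt"
    names = write_list_file(genome_dir, list_path)
    list_text = list_path.read_text()
    # Assumes the database location is supplied via GAMBIT_DB_PATH.
    argv = ["gambit", "query", "-l", str(list_path), "--ldir", str(genome_dir)]

print(names)
```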

-s, --sigfile FILE

A genome signatures file.

Additional options

-o, --output FILE

File to write output to. If omitted, output is written to stdout.

-f, --outfmt {csv|json|archive}

Results format (see next section).

--progress / --no-progress

Show/don’t show progress meter.

-c, --cores INT

Number of CPU cores to use.

Result Formats

CSV

A .csv file with one row per query. Contains the following columns:

  • query - Query genome file name (minus extension).

  • predicted - Predicted taxon.

    • predicted.name - Name of taxon.

    • predicted.rank - Taxonomic rank (genus, species, etc.).

    • predicted.ncbi_id - Numeric ID in NCBI taxonomy database, if any.

    • predicted.threshold - Classification threshold.

  • closest - Reference genome closest to query.

    • closest.distance - Distance to closest genome.

    • closest.description - Text description.

  • next - Next most specific taxon for which the classification threshold was not met.

    • next.name

    • next.rank

    • next.ncbi_id

    • next.threshold
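As a rough illustration of consuming the CSV output, the following Python sketch reads a few of the columns listed above by name. The sample rows are invented for the example and are not real GAMBIT results. Treating an empty cell as "no prediction" is also an assumption, and only a subset of the columns is shown.

```python
import csv
import io

# Invented sample data using a subset of the documented column names.
SAMPLE = """query,predicted.name,predicted.rank,closest.distance,next.name,next.rank
sample1,Escherichia,genus,0.3,Escherichia coli,species
sample2,,,0.9,,
"""

def summarize(csv_text):
    """Return one dict per query row; empty cells become None (assumption)."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "query": row["query"],
            "taxon": row["predicted.name"] or None,
            "rank": row["predicted.rank"] or None,
            "distance": float(row["closest.distance"]),
            "next_taxon": row["next.name"] or None,
        })
    return rows

for entry in summarize(SAMPLE):
    print(entry)
```

Reading columns by name rather than by position keeps the script independent of the column order in the file.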

JSON

A machine-readable format meant to be used in pipelines.

Todo

Document schema

Archive

A more verbose JSON-based format used for testing and development.

Generating and inspecting k-mer signatures

“signatures info” command

gambit signatures info [OPTIONS] FILE

Print information about a GAMBIT signatures file. Defaults to a basic human-readable format.

Options

-j, --json

Print information in JSON format. Includes more information than the default human-readable format.

-p, --pretty

Prettify JSON output to make it more human-readable.

-i, --ids

Print IDs of all signatures in file.

“signatures create” command

gambit signatures create [OPTIONS] -o OUTFILE (-l LISTFILE | GENOMES...)

Calculate GAMBIT signatures of a set of genomes and write to a binary file.

Input/output

-l LISTFILE

File containing paths to genomes, one per line.

--ldir DIRECTORY

Parent directory of paths in file given by -l option.

-o, --output FILE

Path to write file to (required).

K-mer parameters

-k INTEGER

Length of k-mers to find (does not include length of prefix). Default is 11.

-p, --prefix STRING

K-mer prefix to match, a non-empty string of DNA nucleotide codes. Default is ATGAC.
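To make the relationship between -k and --prefix concrete, here is a simplified Python sketch. It looks for each occurrence of the prefix and keeps the k nucleotides that follow it, so the prefix length is not counted in k. This is an illustration only, not GAMBIT's implementation: the function name is hypothetical and it scans only the strand it is given.

```python
def find_prefixed_kmers(seq, prefix="ATGAC", k=11):
    """Collect the k nucleotides following each occurrence of prefix.

    Simplified illustration: scans one strand only and skips k-mers that
    are truncated or contain characters other than A, C, G, T.
    """
    seq = seq.upper()
    found = set()
    start = seq.find(prefix)
    while start != -1:
        begin = start + len(prefix)
        kmer = seq[begin:begin + k]
        if len(kmer) == k and set(kmer) <= set("ACGT"):
            found.add(kmer)
        # Continue from the next position so overlapping matches are seen.
        start = seq.find(prefix, start + 1)
    return found

# The prefix ATGAC is followed by the 11-mer ACGTACGTACG.
print(find_prefixed_kmers("GGATGACACGTACGTACGTT"))
```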

Metadata

-i, --ids FILE

File containing IDs to assign to signatures in file metadata. Should contain one ID per line. If omitted, file names stripped of their extensions are used as IDs.
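One way to keep a genome list file and an IDs file in step is to write both from the same ordered mapping, as in the Python sketch below. It assumes that the IDs are matched to the genomes by line order, which the text above does not state explicitly. The helper name, file names, and IDs are all hypothetical.

```python
from pathlib import Path
import tempfile

def write_lines(path, items):
    """Write items to path, one per line, rejecting empty or multi-line values."""
    items = [str(item) for item in items]
    if any("\n" in item or not item.strip() for item in items):
        raise ValueError("each item must be a non-empty single line")
    Path(path).write_text("".join(item + "\n" for item in items))

# Hypothetical genome files mapped to the IDs they should receive.
genome_ids = {
    "assembly_001.fasta": "isolate-A",
    "assembly_002.fasta": "isolate-B",
}

with tempfile.TemporaryDirectory() as tmp:
    list_path = Path(tmp) / "genomes.txt"
    ids_path = Path(tmp) / "ids.txt"
    # Both files are written in the same order (assumed to be how IDs are matched).
    write_lines(list_path, genome_ids.keys())
    write_lines(ids_path, genome_ids.values())
    list_text = list_path.read_text()
    ids_text = ids_path.read_text()
    argv = ["gambit", "signatures", "create", "-o", "signatures.out",
            "-l", str(list_path), "-i", str(ids_path)]

print(ids_text, end="")
```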

-m, --meta-json FILE

JSON file containing metadata to attach to file.

Todo

Document metadata schema

Additional options

--progress / --no-progress

Show/don’t show progress meter.

-c, --cores INT

Number of CPU cores to use.

Calculating genomic distances

“dist” command

gambit dist [OPTIONS] -o OUTFILE
    (-q GENOME... | --ql LISTFILE | --qs SIGFILE)
    (-r GENOME... | --rl LISTFILE | --rs SIGFILE | --square | --use-db)

Calculate pairwise distances between a set of query genomes and a set of reference genomes. Output is a .csv file. If using --qs along with --rs or --use-db, the k-mer parameters of the query signature file must match the reference parameters.
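The synopsis above requires exactly one source of query genomes and exactly one source of reference genomes. The Python sketch below checks that rule before assembling an argument list. The function name dist_argv and its keyword names are hypothetical, and it only builds the arguments without running gambit.

```python
def dist_argv(output, *, q=(), ql=None, qs=None,
              r=(), rl=None, rs=None, square=False, use_db=False):
    """Build an argument list for `gambit dist` (hypothetical helper)."""
    query_sources = [bool(q), ql is not None, qs is not None]
    ref_sources = [bool(r), rl is not None, rs is not None, square, use_db]
    if sum(query_sources) != 1:
        raise ValueError("give exactly one of q, ql, qs")
    if sum(ref_sources) != 1:
        raise ValueError("give exactly one of r, rl, rs, square, use_db")

    argv = ["gambit", "dist", "-o", output]
    for path in q:
        argv += ["-q", path]       # -q may be repeated, once per genome
    if ql is not None:
        argv += ["--ql", ql]
    if qs is not None:
        argv += ["--qs", qs]
    for path in r:
        argv += ["-r", path]       # -r may be repeated, once per genome
    if rl is not None:
        argv += ["--rl", rl]
    if rs is not None:
        argv += ["--rs", rs]
    if square:
        argv.append("--square")
    if use_db:
        argv.append("--use-db")
    return argv

# Placeholder file names, for illustration only.
print(dist_argv("distances.csv", q=["a.fasta", "b.fasta"], square=True))
```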

Query genomes

-q GENOME

Path to a single genome file. May be used multiple times.

--ql LISTFILE

File containing paths of genome files, one per line.

--qdir DIRECTORY

Parent directory of paths in file given by --ql option.

--qs SIGFILE

A genome signatures file.

Reference genomes

-r GENOME

Path to a single genome file. May be used multiple times.

--rl LISTFILE

File containing paths of genome files, one per line.

--rdir DIRECTORY

Parent directory of paths in file given by --rl option.

--rs SIGFILE

A genome signatures file.

-s, --square

Use same genomes as the query.

-d, --use-db

Use all genomes in reference database.

Output

-o FILE

File to write output to. Required.

K-mer parameters

Only allowed if query and reference genomes do not come from precomputed signature files.

-k INTEGER

Length of k-mers to find (does not include length of prefix). Default is 11.

-p, --prefix STRING

K-mer prefix to match, a non-empty string of DNA nucleotide codes. Default is ATGAC.

Additional options

--progress / --no-progress

Show/don’t show progress meter.

-c, --cores INT

Number of CPU cores to use.

Creating relatedness trees

“tree” command

gambit tree [OPTIONS] (-l LISTFILE | -s SIGFILE | GENOMES...)

Estimate a relatedness tree for a set of genomes and output in Newick format.

Input/output

-l LISTFILE

File containing paths of genome files, one per line.

--ldir DIRECTORY

Parent directory of paths in file given by -l option.

-s, --sigfile SIGFILE

A genome signatures file.

-o FILE

File to write output to. If omitted, output is written to stdout.
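Since the tree is written in Newick format, a quick sanity check on the output is to list its leaf labels. The Python sketch below does this with a regular expression. It assumes unquoted labels without spaces, and the example tree string is invented rather than real GAMBIT output. A dedicated phylogenetics library is a better choice for anything beyond a quick check.

```python
import re

def newick_leaf_names(newick):
    """Return leaf labels from a Newick string (unquoted labels only).

    A leaf label directly follows "(" or ","; labels of internal nodes
    follow ")" and are therefore not matched.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)

# Invented example tree with three leaves.
example = "((genomeA:0.1,genomeB:0.2):0.05,genomeC:0.3);"
print(newick_leaf_names(example))
```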

Todo

Allow using a distance matrix calculated using gambit dist.

K-mer parameters

Not allowed if the -s/--sigfile option was used.

-k INTEGER

Length of k-mers to find (does not include length of prefix). Default is 11.

-p, --prefix STRING

K-mer prefix to match, a non-empty string of DNA nucleotide codes. Default is ATGAC.

Additional options

--progress / --no-progress

Show/don’t show progress meter.

-c, --cores INT

Number of CPU cores to use.