Command Line Interface
Genome assembly files accepted by the CLI must be in FASTA format, optionally compressed with gzip.
Root command group
gambit [OPTIONS] COMMAND [ARGS]...
Some top-level options are set at the root command group, and should be specified before the name of the subcommand to run.
Options
- -d, --db DIR
Path to the directory containing reference database files. Required by the query subcommand. As an alternative, you can specify the database location with the GAMBIT_DB_PATH environment variable.
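Because root-level options must come before the subcommand name, a wrapper script has to assemble its arguments in that order. The following is a minimal Python sketch of both ways to supply the database location. The helper names `gambit_argv` and `gambit_env` and the example paths are hypothetical and not part of GAMBIT.

```python
import os


def gambit_argv(subcommand, sub_args, db=None):
    """Build an argument list with root options placed before the subcommand.

    `subcommand` is a list so that nested commands such as
    ["signatures", "create"] can be expressed.
    """
    argv = ["gambit"]
    if db is not None:
        # -d/--db is a root option, so it goes before the subcommand name.
        argv += ["--db", str(db)]
    return argv + list(subcommand) + list(sub_args)


def gambit_env(db, base_env=None):
    """Alternative: return an environment that sets GAMBIT_DB_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    env["GAMBIT_DB_PATH"] = str(db)
    return env


print(gambit_argv(["query"], ["genome1.fasta"], db="/path/to/db"))
# ['gambit', '--db', '/path/to/db', 'query', 'genome1.fasta']
```

Either result can be handed to `subprocess.run`, for example `subprocess.run(argv, check=True)` or `subprocess.run(argv, env=gambit_env("/path/to/db"), check=True)`.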
Environment variables
Querying the database
“query” command
gambit query [OPTIONS] (-s SIGFILE | -l LISTFILE | GENOMES...)
Predict taxonomy of microbial samples from genome sequences.
The reference database must be specified from the root command group.
Query genomes
Query genomes can be specified using one of the following methods:
- Give paths of one or more genome files as positional arguments.
- Use the -l option to specify a text file containing paths of the genome files.
- Use the -s option to use a signatures file created with the signatures create command.
- -l LISTFILE
File containing paths to genomes, one per line.
- --ldir DIRECTORY
Parent directory of paths in file given by the -l option.
- -s, --sigfile FILE
A genome signatures file.
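The list-file method can be sketched in Python as follows: write one genome path per line, then pass the file with -l and, optionally, a parent directory with --ldir. The helper names and file names are hypothetical, and the sketch assumes the database location is supplied separately (for example through GAMBIT_DB_PATH).

```python
from pathlib import Path
import tempfile


def write_list_file(list_path, genome_paths):
    """Write genome paths to a text file, one per line."""
    Path(list_path).write_text("".join(f"{p}\n" for p in genome_paths))


def query_args_from_list(list_path, parent_dir=None):
    """Arguments for 'gambit query' that read genomes from a list file."""
    args = ["-l", str(list_path)]
    if parent_dir is not None:
        # --ldir gives the parent directory of the paths in the list file.
        args += ["--ldir", str(parent_dir)]
    return args


with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "genomes.txt"
    write_list_file(list_file, ["sample1.fasta", "sample2.fasta.gz"])
    print(list_file.read_text(), end="")
    print(["gambit", "query"] + query_args_from_list(list_file, "/data/genomes"))
```

Keeping the paths in the list file relative and supplying the parent directory with --ldir lets the same list be reused when the data directory moves.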
Additional Options
- -o, --output FILE
File to write output to. If omitted, output is written to stdout.
- -f, --outfmt {csv|json|archive}
Results format (see next section).
- --progress / --no-progress
Show/don’t show progress meter.
- -c, --cores INT
Number of CPU cores to use.
Result Formats
CSV
A .csv file with one row per query. Contains the following columns:
- query: Query genome file name (minus extension).
- predicted: Predicted taxon.
  - predicted.name: Name of taxon.
  - predicted.rank: Taxonomic rank (genus, species, etc.).
  - predicted.ncbi_id: Numeric ID in NCBI taxonomy database, if any.
  - predicted.threshold: Classification threshold.
- closest: Reference genome closest to query.
  - closest.distance: Distance to closest genome.
  - closest.description: Text description.
- next: Next most specific taxon for which the classification threshold was not met.
  - next.name
  - next.rank
  - next.ncbi_id
  - next.threshold
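A results file in this layout can be read with Python's standard csv module. The sketch below uses a made-up sample containing a subset of the documented columns; the values are placeholders, not real GAMBIT output, and treating an empty predicted.name cell as "no prediction" is an assumption of this example.

```python
import csv
import io

# Made-up sample in the documented layout; values are placeholders only.
SAMPLE = (
    "query,predicted.name,predicted.rank,closest.distance,next.name\n"
    "sample1,Genus species,species,0.1,\n"
    "sample2,,,0.9,Genus\n"
)


def summarize(csv_text):
    """Return (query, predicted name or None, closest distance) per row."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: an empty cell means no taxon was predicted.
        name = row["predicted.name"] or None
        rows.append((row["query"], name, float(row["closest.distance"])))
    return rows


for query, name, dist in summarize(SAMPLE):
    print(query, name if name else "(no prediction)", dist)
```

For a real results file, replace `io.StringIO(csv_text)` with an open file handle, for example `open("results.csv", newline="")`.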
JSON
A machine-readable format meant to be used in pipelines.
Todo
Document schema
Archive
A more verbose JSON-based format used for testing and development.
Generating and inspecting k-mer signatures
“signatures info” command
gambit signatures info [OPTIONS] FILE
Print information about a GAMBIT signatures file. Defaults to a basic human-readable format.
Options
- -j, --json
Print information in JSON format. Includes more information than standard output.
- -p, --pretty
Prettify JSON output to make it more human-readable.
- -i, --ids
Print IDs of all signatures in file.
“signatures create” command
gambit signatures create [OPTIONS] -o OUTFILE (-l LISTFILE | GENOMES...)
Calculate GAMBIT signatures of a set of genomes and write to a binary file.
Input/output
- -l LISTFILE
File containing paths to genomes, one per line.
- --ldir DIRECTORY
Parent directory of paths in file given by the -l option.
- -o, --output FILE
Path to write file to (required).
K-mer parameters
- -k INTEGER
Length of k-mers to find (does not include length of prefix). Default is 11.
- -p, --prefix STRING
K-mer prefix to match, a non-empty string of DNA nucleotide codes. Default is ATGAC.
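To make the relationship between the two parameters concrete: with the defaults, each match spans 5 prefix bases plus 11 k-mer bases, 16 in total. The following Python sketch illustrates the idea under simplifying assumptions. It is not GAMBIT's implementation: it assumes the k-mer is the stretch directly following the prefix, assumes the valid nucleotide codes are A, C, G and T, scans only one strand of an upper-case sequence, and ignores the reverse complement. The function name is hypothetical.

```python
def find_prefixed_kmers(seq, prefix="ATGAC", k=11):
    """Return the set of k-length substrings that directly follow `prefix`.

    Simplified illustration only: one strand, upper-case input,
    no handling of ambiguous bases or the reverse complement.
    """
    if not prefix or set(prefix) - set("ACGT"):
        raise ValueError("prefix must be a non-empty string of A, C, G, T")
    found = set()
    start = seq.find(prefix)
    while start != -1:
        begin = start + len(prefix)
        kmer = seq[begin:begin + k]
        if len(kmer) == k:  # skip matches too close to the end of the sequence
            found.add(kmer)
        start = seq.find(prefix, start + 1)
    return found


seq = "GG" + "ATGAC" + "AAAAACCCCCG" + "TT"
print(find_prefixed_kmers(seq))
# {'AAAAACCCCCG'}
```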
Metadata
- -i, --ids FILE
File containing IDs to assign to signatures in file metadata. Should contain one ID per line. If omitted, file names stripped of their extensions are used.
- -m, --meta-json FILE
JSON file containing metadata to attach to file.
Todo
Document metadata schema
Additional Options
- --progress / --no-progress
Show/don’t show progress meter.
- -c, --cores INT
Number of CPU cores to use.
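The options above can be combined into a single invocation. The Python sketch below assembles such a command as an argument list. The helper name `signatures_create_argv` and the file names are hypothetical, and the flags used are only those documented in this section.

```python
def signatures_create_argv(output, genomes, k=None, prefix=None,
                           ids_file=None, cores=None):
    """Assemble a 'gambit signatures create' command as an argument list."""
    # -o/--output is required, so it is always included.
    argv = ["gambit", "signatures", "create", "-o", str(output)]
    if k is not None:
        argv += ["-k", str(k)]
    if prefix is not None:
        argv += ["--prefix", prefix]
    if ids_file is not None:
        argv += ["--ids", str(ids_file)]
    if cores is not None:
        argv += ["--cores", str(cores)]
    # Genome files are given as positional arguments.
    return argv + [str(g) for g in genomes]


print(signatures_create_argv("signatures.out", ["a.fasta", "b.fasta"],
                             k=11, prefix="ATGAC", cores=4))
# ['gambit', 'signatures', 'create', '-o', 'signatures.out', '-k', '11',
#  '--prefix', 'ATGAC', '--cores', '4', 'a.fasta', 'b.fasta']
```

The resulting list can be run with `subprocess.run(argv, check=True)`. Leaving `k` and `prefix` as `None` omits the flags so that the documented defaults apply.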
Calculating genomic distances
“dist” command
gambit dist [OPTIONS] -o OUTFILE
(-q GENOME... | --ql LISTFILE | --qs SIGFILE)
(-r GENOME... | --rl LISTFILE | --rs SIGFILE | --square | --use-db)
Calculate pairwise distances between a set of query genomes and a set of reference genomes.
Output is a .csv file. If using --qs along with --rs or --use-db, the k-mer parameters of the query signature file must match the reference parameters.
Query genomes
- -q GENOME
Path to a single genome file. May be used multiple times.
- --ql LISTFILE
File containing paths of genome files, one per line.
- --qdir DIRECTORY
Parent directory of paths in file given by the --ql option.
- --qs SIGFILE
A genome signatures file.
Reference genomes
- -r GENOME
Path to a single genome file. May be used multiple times.
- --rl LISTFILE
File containing paths of genome files, one per line.
- --rdir DIRECTORY
Parent directory of paths in file given by the --rl option.
- --rs SIGFILE
A genome signatures file.
- -s, --square
Use same genomes as the query.
- -d, --use-db
Use all genomes in reference database.
Output
- -o FILE
File to write output to. Required.
K-mer parameters
Only allowed if query and reference genomes do not come from precomputed signature files.
- -k INTEGER
Length of k-mers to find (does not include length of prefix). Default is 11.
- -p, --prefix STRING
K-mer prefix to match, a non-empty string of DNA nucleotide codes. Default is ATGAC.
Additional options
- --progress / --no-progress
Show/don’t show progress meter.
- -c, --cores INT
Number of CPU cores to use.
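The dist command takes exactly one query source and one reference source, and restricts the k-mer options when signatures are precomputed. The Python sketch below encodes those rules while building an argument list. The helper names and the `(kind, value)` convention are hypothetical. Rejecting -k/--prefix whenever either side uses a signatures file or --use-db is a conservative reading of the restriction above, not confirmed behaviour.

```python
QUERY_FLAGS = {"genomes": "-q", "list": "--ql", "sigfile": "--qs"}
REF_FLAGS = {"genomes": "-r", "list": "--rl", "sigfile": "--rs"}


def _source_args(kind, value, flags, extra=()):
    """Translate one (kind, value) source into command-line arguments."""
    if kind in extra:
        return ["--" + kind]  # value-less flags: --square, --use-db
    if kind == "genomes":
        # -q / -r take one path each and may be repeated.
        args = []
        for path in value:
            args += [flags[kind], str(path)]
        return args
    return [flags[kind], str(value)]


def dist_argv(output, query, reference, k=None, prefix=None):
    """Assemble a 'gambit dist' command as an argument list."""
    precomputed = query[0] == "sigfile" or reference[0] in ("sigfile", "use-db")
    if precomputed and (k is not None or prefix is not None):
        raise ValueError("k-mer parameters cannot be combined with precomputed signatures")
    argv = ["gambit", "dist", "-o", str(output)]
    argv += _source_args(*query, QUERY_FLAGS)
    argv += _source_args(*reference, REF_FLAGS, extra=("square", "use-db"))
    if k is not None:
        argv += ["-k", str(k)]
    if prefix is not None:
        argv += ["--prefix", prefix]
    return argv


print(dist_argv("distances.csv", ("genomes", ["a.fasta", "b.fasta"]), ("square", None)))
# ['gambit', 'dist', '-o', 'distances.csv', '-q', 'a.fasta', '-q', 'b.fasta', '--square']
```

When --use-db is chosen, the database location presumably still has to be supplied through the root --db option or GAMBIT_DB_PATH; this sketch does not add it.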