Taxonomic Classification and Database Queries
gambit.classify
Classify queries based on distance to reference sequences.
- class gambit.classify.ClassifierResult
Bases: object
Result of applying the classifier to a single query genome.
- success
Whether the classification process ran successfully with no fatal errors. If True it is still possible no prediction was made.
- Type:
bool
- predicted_taxon
Taxon predicted by classifier.
- Type:
gambit.db.models.Taxon | None
- primary_match
Match to closest reference genome which produced a predicted taxon equal to or a descendant of predicted_taxon. None if no prediction was made.
- Type:
gambit.classify.GenomeMatch | None
- closest_match
Match to closest reference genome overall. This should almost always be identical to primary_match.
- next_taxon
Next most specific taxon for which the threshold was not met. Currently this is just taken from the ancestry of closest_match.genome.taxon.
- Type:
gambit.db.models.Taxon | None
- __init__(success, predicted_taxon, primary_match, closest_match, next_taxon=NOTHING, warnings=NOTHING, error=None)
Method generated by attrs for class ClassifierResult.
- Parameters:
success (bool)
predicted_taxon (Taxon | None)
primary_match (GenomeMatch | None)
closest_match (GenomeMatch)
next_taxon (Taxon | None)
error (str | None)
- Return type:
None
- class gambit.classify.GenomeMatch
Bases: object
Match between a query and a single reference genome.
This is only used to report the distance from a query to some significant reference genome. It does not imply that this distance was close enough to actually make a taxonomy prediction, or that the prediction was the primary prediction overall.
- genome
Reference genome matched to.
- matched_taxon
Taxon prediction based on this match alone. Will always be genome.taxon or one of its ancestors.
- Type:
gambit.db.models.Taxon | None
- __init__(genome, distance, matched_taxon=NOTHING)
Method generated by attrs for class GenomeMatch.
- Parameters:
genome (AnnotatedGenome)
distance (float)
matched_taxon (Taxon | None)
- Return type:
None
- gambit.classify.classify(ref_genomes, dists, *, strict=False)
Predict the taxonomy of a query genome based on its distances to a set of reference genomes.
- Parameters:
ref_genomes (Sequence[AnnotatedGenome]) – List of reference genomes from database.
dists (ndarray) – Array of distances to each reference genome.
strict (bool) – If True, find all significant matches to reference genomes and attempt to reconcile them if they result in different taxa. If False, just consider the top (closest) match. Defaults to False.
- Return type:
ClassifierResult
- gambit.classify.consensus_taxon(taxa)
Take a set of taxa matching a query and find a single consensus taxon for classification.
If a query matches a given taxon, it is expected that there may be matches to some of that taxon’s ancestors as well. In this case all matched taxa lie in a single lineage and the most specific will be the consensus.
It may also be possible for a query to match multiple taxa which are “inconsistent” with each other in the sense that one is not a descendant of another. In that case the consensus will be the lowest taxon which is either a descendant or ancestor of all taxa in the argument. It’s also possible in pathological cases (depending on reference database design) that the taxa may be within entirely different trees, in which case the consensus will be None.
The second element of the returned tuple is the set of taxa in the argument which are strict descendants of the consensus. This set will contain at least two taxa in the case of such an inconsistency and be empty otherwise.
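The consensus rule described above can be illustrated with a small, self-contained sketch. It does not use GAMBIT's own code: `ToyTaxon` and `toy_consensus` are hypothetical stand-ins, and returning all input taxa as the second element when no consensus exists is an assumption of this sketch, not something the description above specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyTaxon:
    """Hypothetical stand-in for a taxon: a name plus an optional parent."""
    name: str
    parent: Optional["ToyTaxon"] = None

    def lineage(self):
        """Return [self, parent, ..., root]."""
        out, node = [], self
        while node is not None:
            out.append(node)
            node = node.parent
        return out


def toy_consensus(taxa):
    """Return (consensus, strict descendants of the consensus among taxa)."""
    taxa = set(taxa)
    if not taxa:
        return None, set()
    lineages = {t: t.lineage() for t in taxa}

    def compatible(candidate, taxon):
        # candidate is an ancestor of (or equal to) taxon, or a descendant of it
        return candidate in lineages[taxon] or taxon in candidate.lineage()

    candidates = {c for lin in lineages.values() for c in lin}
    valid = [c for c in candidates if all(compatible(c, t) for t in taxa)]
    if not valid:
        # Taxa lie in entirely different trees. Returning every input taxon
        # here is an assumption of this sketch.
        return None, taxa
    # "Lowest" taxon = the one with the longest lineage (deepest in the tree).
    consensus = max(valid, key=lambda c: len(c.lineage()))
    others = {t for t in taxa if t != consensus and consensus in lineages[t]}
    return consensus, others
```

With a genus that has two child species, matching the genus and one species yields that species with an empty second element, while matching both species yields the genus together with both species as the inconsistent set.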
- gambit.classify.find_matches(itr)
Find taxonomy matches given distances from a query to a set of reference genomes.
- gambit.classify.matching_taxon(taxon, d)
Find the first taxon in the lineage for which distance d is within its classification threshold.
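A lineage walk of this kind might look roughly like the following sketch. The `ToyTaxon` class, its `distance_threshold` and `parent` fields, and the use of `<=` for "within the threshold" are all assumptions made for illustration. They are not taken from the GAMBIT source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTaxon:
    """Hypothetical taxon with an optional classification threshold."""
    name: str
    distance_threshold: Optional[float] = None
    parent: Optional["ToyTaxon"] = None


def toy_matching_taxon(taxon, d):
    """Walk from taxon toward the root; return the first taxon whose
    threshold admits distance d, or None if no taxon in the lineage does."""
    node = taxon
    while node is not None:
        # Taxa without a threshold can never be matched directly.
        if node.distance_threshold is not None and d <= node.distance_threshold:
            return node
        node = node.parent
    return None
```

Under these assumptions, a small distance matches the most specific taxon, a larger one falls back to an ancestor with a looser threshold, and a distance beyond every threshold gives no match.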
gambit.query
Run queries against a GAMBIT database to predict taxonomy of genome sequences.
- class gambit.query.QueryParams
Bases: object
Parameters for running a query.
- classify_strict
The strict parameter to gambit.classify.classify(). Defaults to False.
- Type:
bool
- chunksize
Number of reference signatures to process at a time. None means no chunking is performed. Defaults to 1000.
- Type:
- report_closest
Number of closest genomes to report in results. Does not affect classification.
- Type:
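To make the roles of chunksize and report_closest concrete, here is a minimal sketch using plain Python sets in place of k-mer signatures. The Jaccard distance, the helper names, and the data layout are assumptions chosen for illustration. The sketch only aims to show that chunking changes how the work is batched, not the resulting distances, and that the closest-N report is derived from the distances without influencing classification.

```python
import heapq


def jaccard_distance(a, b):
    """Stand-in distance between two sets (an assumption of this sketch)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def chunked_distances(query, refs, chunksize=1000):
    """Compute distances from query to every reference, chunksize at a time.

    chunksize=None processes all references in a single batch.
    """
    step = len(refs) if chunksize is None else chunksize
    step = max(step, 1)  # avoid a zero step when refs is empty
    dists = []
    for start in range(0, len(refs), step):
        chunk = refs[start:start + step]
        dists.extend(jaccard_distance(query, ref) for ref in chunk)
    return dists


def closest_indices(dists, n):
    """Indices of the n smallest distances, closest first."""
    return heapq.nsmallest(n, range(len(dists)), key=dists.__getitem__)
```

A smaller chunk size mainly bounds how much is held in memory per batch. The list of distances returned is the same either way.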
- class gambit.query.QueryResultItem
Bases: object
Result for a single query sequence.
- classifier_result
Result of running classifier.
- report_taxon
Final taxonomy prediction to be reported to the user.
- Type:
gambit.db.models.Taxon | None
- closest_genomes
List of closest reference genomes to query. Length determined by QueryParams.report_closest.
- Type:
list[GenomeMatch]
- file
Path to file containing query genome (optional).
- Type:
pathlib.Path | None
- __init__(label, classifier_result, report_taxon=None, closest_genomes=NOTHING, file=None)
Method generated by attrs for class QueryResultItem.
- Parameters:
label (str)
classifier_result (ClassifierResult)
report_taxon (Taxon | None)
closest_genomes (list[GenomeMatch])
- Return type:
None
- class gambit.query.QueryResults
Bases: object
Results for a set of queries, as well as information on database and parameters used.
- items
Results for each query sequence.
- Type:
list[QueryResultItem]
- params
Parameters used to run query.
- Type:
gambit.query.QueryParams | None
- genomeset
Genome set used.
- Type:
ReferenceGenomeSet | None
- signaturesmeta
Metadata for signatures set used.
- Type:
SignaturesMeta | None
- timestamp
Time query was completed.
- Type:
datetime
- __init__(items, params=None, genomeset=None, signaturesmeta=None, gambit_version='1.1.0', timestamp=NOTHING, extra=NOTHING)
Method generated by attrs for class QueryResults.
- Parameters:
items (list[QueryResultItem])
params (QueryParams | None)
genomeset (ReferenceGenomeSet | None)
signaturesmeta (SignaturesMeta | None)
gambit_version (str)
timestamp (datetime)
- Return type:
None
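As a rough illustration of how the attributes documented above fit together, the following sketch flattens a results object into one dictionary per query. It relies only on attribute access, so it is exercised here with `SimpleNamespace` stand-ins rather than real GAMBIT objects. The `summarize_results` helper is hypothetical, and the `name` attribute on taxa is an assumption not documented in this section.

```python
from types import SimpleNamespace


def summarize_results(results):
    """Build one summary row per query from a QueryResults-like object."""
    rows = []
    for item in results.items:
        cr = item.classifier_result
        taxon = item.report_taxon
        rows.append({
            "label": item.label,
            "success": cr.success,
            # report_taxon may be None even when success is True
            "predicted": None if taxon is None else taxon.name,  # .name is assumed
            "closest_distance": cr.closest_match.distance,
        })
    return rows


# Stand-in data shaped like the documented classes (not real GAMBIT objects).
fake_results = SimpleNamespace(items=[
    SimpleNamespace(
        label="query_1",
        report_taxon=SimpleNamespace(name="taxon_a"),
        classifier_result=SimpleNamespace(
            success=True, closest_match=SimpleNamespace(distance=0.12)),
    ),
    SimpleNamespace(
        label="query_2",
        report_taxon=None,
        classifier_result=SimpleNamespace(
            success=True, closest_match=SimpleNamespace(distance=0.93)),
    ),
])
summary = summarize_results(fake_results)
```

The second stand-in item shows the case noted for success: the classifier ran without fatal errors, yet no prediction was made.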
- gambit.query.get_result_item(db, params, dists, label)
Perform classification and create result item object for single query input.
- Parameters:
db (ReferenceDatabase)
params (QueryParams)
dists (ndarray) – 1D array of distances from query to all reference genomes.
label (str)
- Return type:
QueryResultItem
- gambit.query.query(db, queries, params=None, *, labels=None, progress=None, **kw)
Predict the taxonomy of one or more query genomes using a GAMBIT reference database.
- Parameters:
db (ReferenceDatabase) – Database to query.
queries (Sequence[KmerSignature]) – Sequence of k-mer signatures for query genomes.
params (QueryParams | None) – QueryParams instance defining parameter values. If None, take values from additional keyword arguments or use defaults.
labels (Sequence[str] | None) – Optional list of string labels for each query. Only used for reporting (sets the label attribute of QueryResultItem in the results object); does not affect any other aspect of results.
progress – Report progress for distance matrix calculation and classification. See gambit.util.progress.get_progress() for a description of allowed values.
**kw – Passed to QueryParams.
- Return type:
QueryResults
- gambit.query.query_parse(db, files, params=None, *, labels=None, parse_kw=None, **kw)
Query a database with signatures derived by parsing a set of genome sequence files.
- Parameters:
db (ReferenceDatabase) – Database to query.
files (Sequence[str | PathLike]) – Sequence files containing query genomes.
params (QueryParams | None) –
QueryParamsinstance defining parameter values. If None take values from additional keyword arguments or use defaults.labels (Sequence[str] | None) – Custom labels to use for each file in returned results object. If None use file names.
parse_kw (dict[str, Any] | None) – Keyword parameters to pass to
gambit.sigs.calc.calc_file_signatures().**kw – Additional keyword arguments passed to
query().
- Return type:
QueryResults