Taxonomic Classification and Database Queries
gambit.classify
Classify queries based on distance to reference sequences.
- class gambit.classify.ClassifierResult
Bases:
object
Result of applying the classifier to a single query genome.
- success
Whether the classification process ran successfully with no fatal errors. If True it is still possible no prediction was made.
- Type:
bool
- predicted_taxon
Taxon predicted by classifier.
- Type:
gambit.db.models.Taxon | None
- primary_match
Match to the closest reference genome which produced a predicted taxon equal to, or a descendant of, predicted_taxon. None if no prediction was made.
- Type:
gambit.classify.GenomeMatch | None
- closest_match
Match to the closest reference genome overall. This should almost always be identical to primary_match.
- next_taxon
Next most specific taxon for which the threshold was not met. Currently this is just taken from the ancestry of closest_match.genome.taxon.
- Type:
gambit.db.models.Taxon | None
- warnings
List of non-fatal warning messages to report.
- Type:
List[str]
- error
Message describing a fatal error which occurred, if any.
- Type:
str | None
- __init__(success, predicted_taxon, primary_match, closest_match, next_taxon=_Nothing.NOTHING, warnings=_Nothing.NOTHING, error=None)
Method generated by attrs for class ClassifierResult.
- Parameters:
success (bool) –
predicted_taxon (Taxon | None) –
primary_match (GenomeMatch | None) –
closest_match (GenomeMatch) –
next_taxon (Taxon | None) –
warnings (List[str]) –
error (str | None) –
- Return type:
None
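The attribute semantics above, in particular that success can be True while no prediction was made, can be sketched as a small helper. This is a minimal sketch that only reads the attributes documented above. The function name describe_result is hypothetical, the SimpleNamespace objects are stand-ins for real ClassifierResult instances, and the name attribute on the taxon is an assumption not documented in this section.

```python
from types import SimpleNamespace


def describe_result(result):
    """Summarize a ClassifierResult-like object as a one-line string.

    Only reads the documented attributes. ``predicted_taxon.name`` is an
    assumption made for display purposes.
    """
    if not result.success:
        # A fatal error occurred; ``error`` should describe it.
        return f"failed: {result.error}"

    if result.predicted_taxon is None:
        # The classifier ran without fatal errors but made no prediction.
        summary = "no prediction"
    else:
        summary = f"predicted {result.predicted_taxon.name}"

    if result.warnings:
        summary += f" ({len(result.warnings)} warning(s))"
    return summary


# Stand-in objects for illustration only, not real classifier output.
ok = SimpleNamespace(
    success=True,
    predicted_taxon=SimpleNamespace(name="Escherichia coli"),
    warnings=[],
    error=None,
)
no_call = SimpleNamespace(
    success=True, predicted_taxon=None, warnings=["example warning"], error=None
)
failed = SimpleNamespace(
    success=False, predicted_taxon=None, warnings=[], error="example error"
)

for r in (ok, no_call, failed):
    print(describe_result(r))
```

Checking success before predicted_taxon keeps a failed run distinct from a run that completed without meeting any threshold.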
- class gambit.classify.GenomeMatch
Bases:
object
Match between a query and a single reference genome.
This is only used to report the distance from a query to some significant reference genome. It does not imply that this distance was close enough to actually make a taxonomy prediction, or that the prediction was the primary prediction overall.
- genome
Reference genome matched to.
- distance
Distance between query and reference genome.
- Type:
float
- matched_taxon
Taxon prediction based on this match alone. Will always be genome.taxon or one of its ancestors.
- __init__(genome, distance, matched_taxon=_Nothing.NOTHING)
Method generated by attrs for class GenomeMatch.
- Parameters:
genome (AnnotatedGenome) –
distance (float) –
matched_taxon (Taxon | None) –
- Return type:
None
- gambit.classify.classify(ref_genomes, dists, *, strict=False)
Predict the taxonomy of a query genome based on its distances to a set of reference genomes.
- Parameters:
ref_genomes (Sequence[AnnotatedGenome]) – List of reference genomes from database.
dists (ndarray) – Array of distances to each reference genome.
strict (bool) – If true find all significant matches to reference genomes and attempt to reconcile them if they result in different taxa. If False just consider the top (closest) match. Defaults to False.
- Return type:
ClassifierResult
- gambit.classify.compare_classifier_results(result1, result2)
Compare two ClassifierResult instances for equality.
- Parameters:
result1 (ClassifierResult) –
result2 (ClassifierResult) –
- Return type:
bool
- gambit.classify.compare_genome_matches(match1, match2)
Compare two GenomeMatch instances for equality.
The values of the distance attribute are only checked for approximate equality, to support cases where one instance was loaded from a results archive (saving and loading a float in JSON is lossy).
Also allows one or both values to be None.
- Parameters:
match1 (GenomeMatch | None) –
match2 (GenomeMatch | None) –
- Return type:
bool
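The comparison rule described above (exact equality for most fields, approximate equality for distance, None permitted on either side) can be sketched with math.isclose. This is an illustration, not the library's implementation. The name matches_equal, the tolerance values, and the SimpleNamespace stand-ins are assumptions, and the attribute names follow the __init__ signature of GenomeMatch. The rounding step only simulates a lossy save and load with an assumed precision.

```python
import math
from types import SimpleNamespace


def matches_equal(match1, match2, rel_tol=1e-5, abs_tol=1e-8):
    """Compare two GenomeMatch-like objects, tolerating small distance drift.

    Two None values compare equal; None never equals a real match.
    The tolerances are arbitrary values chosen for this sketch.
    """
    if match1 is None or match2 is None:
        return match1 is None and match2 is None
    return (
        match1.genome == match2.genome
        and match1.matched_taxon == match2.matched_taxon
        and math.isclose(
            match1.distance, match2.distance, rel_tol=rel_tol, abs_tol=abs_tol
        )
    )


original = SimpleNamespace(
    genome="genome-1", matched_taxon="taxon-1", distance=0.123456789
)
# Simulate a lossy round trip by keeping 7 decimal places (assumed precision).
reloaded = SimpleNamespace(
    genome="genome-1",
    matched_taxon="taxon-1",
    distance=round(original.distance, 7),
)

print(original.distance == reloaded.distance)  # exact comparison
print(matches_equal(original, reloaded))       # tolerant comparison
```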
- gambit.classify.consensus_taxon(taxa)
Take a set of taxa matching a query and find a single consensus taxon for classification.
If a query matches a given taxon, it is expected that there may be matches to some of that taxon’s ancestors as well. In this case all matched taxa lie in a single lineage and the most specific will be the consensus.
It may also be possible for a query to match multiple taxa which are “inconsistent” with each other in the sense that one is not a descendant of another. In that case the consensus will be the lowest taxon which is either a descendant or ancestor of all taxa in the argument. It’s also possible in pathological cases (depending on reference database design) that the taxa may be within entirely different trees, in which case the consensus will be None.
The second element of the returned tuple is the set of taxa in the argument which are strict descendants of the consensus. This set will contain at least two taxa in the case of such an inconsistency and be empty otherwise.
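The consensus rule described above can be sketched on a toy taxonomy. This is not the library's implementation. The Taxon class below is a hypothetical stand-in with only a name and a parent, and consensus_sketch and lineage are illustrative names. Returning all input taxa as the second element when no consensus exists is an assumption made to match the statement that the set is non-empty whenever there is an inconsistency.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass(eq=False)  # eq=False keeps identity-based equality and hashing
class Taxon:
    """Hypothetical stand-in for a taxon: a name and an optional parent."""
    name: str
    parent: Optional["Taxon"] = None


def lineage(taxon: Taxon) -> List[Taxon]:
    """Return the taxon followed by its ancestors, most specific first."""
    out = []
    node: Optional[Taxon] = taxon
    while node is not None:
        out.append(node)
        node = node.parent
    return out


def consensus_sketch(taxa: Iterable[Taxon]) -> Tuple[Optional[Taxon], Set[Taxon]]:
    taxa = list(taxa)
    if not taxa:
        return None, set()

    lineages = {t: lineage(t) for t in taxa}

    def compatible(candidate: Taxon, taxon: Taxon) -> bool:
        # True if candidate is the taxon, one of its ancestors,
        # or one of its descendants.
        return candidate in lineages[taxon] or taxon in lineage(candidate)

    # Candidates are the input taxa and all of their ancestors.
    candidates = {node for lin in lineages.values() for node in lin}
    valid = [c for c in candidates if all(compatible(c, t) for t in taxa)]
    if not valid:
        # Taxa lie in entirely different trees (assumed return value).
        return None, set(taxa)

    # Valid candidates all lie on one lineage, so the deepest is the lowest.
    consensus = max(valid, key=lambda c: len(lineage(c)))
    below = {t for t in taxa if t is not consensus and consensus in lineages[t]}
    return consensus, below


family = Taxon("family")
genus = Taxon("genus", family)
sp_a = Taxon("species A", genus)
sp_b = Taxon("species B", genus)
unrelated = Taxon("unrelated root")

print(consensus_sketch([genus, sp_a])[0].name)  # single lineage
print(consensus_sketch([sp_a, sp_b])[0].name)   # inconsistent species
print(consensus_sketch([sp_a, unrelated])[0])   # different trees
```

The quadratic compatibility check is chosen for readability; it is adequate for the handful of taxa a single query typically matches.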
- gambit.classify.find_matches(itr)
Find taxonomy matches given distances from a query to a set of reference genomes.
- Parameters:
itr (Iterable[Tuple[AnnotatedGenome, float]]) – Iterable over (genome, distance) pairs.
- Returns:
Mapping from taxa to indices of genomes matched to them.
- Return type:
Dict[Taxon, List[int]]
- gambit.classify.matching_taxon(taxon, d)
Find the first taxon in the lineage for which distance d is within its classification threshold.
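The lineage walk of matching_taxon and the grouping performed by find_matches can be sketched together. This is a minimal sketch under stated assumptions. The Taxon and Genome classes are hypothetical stand-ins, the distance_threshold and parent attributes are assumed names, "within" is taken to mean less than or equal to, and the threshold values are made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(eq=False)  # identity-based hashing so taxa can be dict keys
class Taxon:
    """Hypothetical taxon with an optional classification threshold."""
    name: str
    distance_threshold: Optional[float] = None
    parent: Optional["Taxon"] = None


@dataclass(eq=False)
class Genome:
    """Hypothetical reference genome assigned to a taxon."""
    key: str
    taxon: Taxon


def matching_taxon_sketch(taxon: Taxon, d: float) -> Optional[Taxon]:
    """Walk from the taxon up through its ancestors and return the first
    one whose threshold is defined and not exceeded by ``d``."""
    node: Optional[Taxon] = taxon
    while node is not None:
        if node.distance_threshold is not None and d <= node.distance_threshold:
            return node
        node = node.parent
    return None


def find_matches_sketch(
    pairs: Iterable[Tuple[Genome, float]]
) -> Dict[Taxon, List[int]]:
    """Map each matched taxon to the indices of the genomes that matched it."""
    matches: Dict[Taxon, List[int]] = {}
    for i, (genome, d) in enumerate(pairs):
        taxon = matching_taxon_sketch(genome.taxon, d)
        if taxon is not None:
            matches.setdefault(taxon, []).append(i)
    return matches


# Made-up thresholds for illustration only.
genus = Taxon("Escherichia", distance_threshold=0.8)
species = Taxon("Escherichia coli", distance_threshold=0.3, parent=genus)
genomes = [Genome("g1", species), Genome("g2", species), Genome("g3", species)]

matches = find_matches_sketch(zip(genomes, [0.1, 0.5, 0.95]))
for taxon, indices in matches.items():
    print(taxon.name, indices)
```

In this example the first genome matches at the species level, the second only at the genus level, and the third matches nothing and is therefore absent from the mapping.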
gambit.query
Run queries against a GAMBIT database to predict taxonomy of genome sequences.
- class gambit.query.QueryInput
Bases:
object
Information on a query genome.
- label
Some unique label for the input, probably the file name.
- Type:
str
- file
Source file (optional).
- Type:
gambit.seq.SequenceFile | None
- __init__(label, file=None)
Method generated by attrs for class QueryInput.
- Parameters:
label (str) –
file (SequenceFile | None) –
- Return type:
None
- classmethod convert(x)
Convenience function to convert flexible argument types into QueryInput.
Accepts a single string label, a SequenceFile (uses the file path as the label), or an existing QueryInput instance (returned unchanged).
- Parameters:
x (QueryInput | SequenceFile | str) –
- Return type:
QueryInput
- class gambit.query.QueryParams
Bases:
object
Parameters for running a query.
- classify_strict
The strict parameter to gambit.classify.classify(). Defaults to False.
- Type:
bool
- chunksize
Number of reference signatures to process at a time. None means no chunking is performed. Defaults to 1000.
- Type:
int
- report_closest
Number of closest genomes to report in results. Does not affect classification. Defaults to 10.
- Type:
int
- __init__(classify_strict=False, chunksize=1000, report_closest=10)
Method generated by attrs for class QueryParams.
- Parameters:
classify_strict (bool) –
chunksize (int) –
report_closest (int) –
- Return type:
None
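The meaning of chunksize can be illustrated by splitting a range of reference indices into consecutive chunks. How GAMBIT applies chunking internally is not specified in this section, so this sketch only shows the semantics of the parameter (None for a single chunk, otherwise at most chunksize references at a time). The function name iter_chunks is hypothetical.

```python
from typing import Iterator, Optional


def iter_chunks(n_refs: int, chunksize: Optional[int]) -> Iterator[range]:
    """Yield consecutive index ranges covering ``range(n_refs)``.

    ``chunksize=None`` yields everything as one chunk (no chunking).
    """
    if chunksize is None:
        yield range(n_refs)
        return
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer or None")
    for start in range(0, n_refs, chunksize):
        yield range(start, min(start + chunksize, n_refs))


# 2500 references with the default chunk size of 1000.
for chunk in iter_chunks(2500, 1000):
    print(chunk.start, chunk.stop)
```

Smaller chunks bound the size of the intermediate results held in memory at once, at the cost of more iterations.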
- class gambit.query.QueryResultItem
Bases:
object
Result for a single query sequence.
- input
Information on input genome.
- Type:
gambit.query.QueryInput
- classifier_result
Result of running classifier.
- report_taxon
Final taxonomy prediction to be reported to the user.
- Type:
gambit.db.models.Taxon | None
- closest_genomes
List of closest reference genomes to query. Length determined by QueryParams.report_closest.
- Type:
List[gambit.classify.GenomeMatch]
- __init__(input, classifier_result, report_taxon=None, closest_genomes=_Nothing.NOTHING)
Method generated by attrs for class QueryResultItem.
- Parameters:
input (QueryInput) –
classifier_result (ClassifierResult) –
report_taxon (Taxon | None) –
closest_genomes (List[GenomeMatch]) –
- Return type:
None
- class gambit.query.QueryResults
Bases:
object
Results for a set of queries, as well as information on database and parameters used.
- items
Results for each query sequence.
- Type:
List[gambit.query.QueryResultItem]
- params
Parameters used to run query.
- Type:
gambit.query.QueryParams | None
- genomeset
Genome set used.
- Type:
ReferenceGenomeSet | None
- signaturesmeta
Metadata for signatures set used.
- Type:
SignaturesMeta | None
- gambit_version
Version of GAMBIT command/library used to generate the results.
- Type:
str
- timestamp
Time query was completed.
- Type:
datetime.datetime
- extra
JSON-able dict containing additional arbitrary metadata.
- Type:
Dict[str, Any]
- __init__(items, params=None, genomeset=None, signaturesmeta=None, gambit_version='1.0.1', timestamp=_Nothing.NOTHING, extra=_Nothing.NOTHING)
Method generated by attrs for class QueryResults.
- Parameters:
items (List[QueryResultItem]) –
params (QueryParams | None) –
genomeset (ReferenceGenomeSet | None) –
signaturesmeta (SignaturesMeta | None) –
gambit_version (str) –
timestamp (datetime) –
extra (Dict[str, Any]) –
- Return type:
None
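A results object can be flattened into simple rows for reporting by walking the attributes documented above. This is a minimal sketch that relies on duck typing. The function name summarize_items is hypothetical, the SimpleNamespace objects stand in for real QueryResults content, and the name attribute on the taxon is an assumption not documented in this section.

```python
from types import SimpleNamespace


def summarize_items(results):
    """Build one dict per query from a QueryResults-like object.

    Reads ``items``, ``input.label``, ``report_taxon`` and
    ``classifier_result.closest_match.distance``.
    """
    rows = []
    for item in results.items:
        closest = item.classifier_result.closest_match
        taxon = item.report_taxon
        rows.append(
            {
                "label": item.input.label,
                # ``name`` is an assumed attribute of the taxon object.
                "taxon": None if taxon is None else taxon.name,
                "closest_distance": None if closest is None else closest.distance,
            }
        )
    return rows


# Stand-in data for illustration only, not real query output.
example = SimpleNamespace(
    items=[
        SimpleNamespace(
            input=SimpleNamespace(label="sample1.fasta"),
            report_taxon=SimpleNamespace(name="Escherichia coli"),
            classifier_result=SimpleNamespace(
                closest_match=SimpleNamespace(distance=0.12)
            ),
        ),
        SimpleNamespace(
            input=SimpleNamespace(label="sample2.fasta"),
            report_taxon=None,
            classifier_result=SimpleNamespace(
                closest_match=SimpleNamespace(distance=0.97)
            ),
        ),
    ]
)

for row in summarize_items(example):
    print(row)
```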
- gambit.query.compare_result_items(item1, item2)
Compare two QueryResultItem instances for equality.
Does not compare the value of the input attributes.
- Parameters:
item1 (QueryResultItem) –
item2 (QueryResultItem) –
- Return type:
bool
- gambit.query.get_result_item(db, params, dists, input)
Perform classification and create result item object for single query input.
- Parameters:
db (ReferenceDatabase) –
params (QueryParams) –
dists (ndarray) – Distances from query to reference genomes.
input (QueryInput) –
- Return type:
QueryResultItem
- gambit.query.query(db, queries, params=None, *, inputs=None, progress=None, **kw)
Predict the taxonomy of one or more query genomes using a GAMBIT reference database.
- Parameters:
db (ReferenceDatabase) – Database to query.
queries (Sequence[KmerSignature]) – Sequence of k-mer signatures for query genomes.
params (QueryParams | None) – QueryParams instance defining parameter values. If None, take values from additional keyword arguments or use defaults.
inputs (Sequence[QueryInput | SequenceFile | str] | None) – Description for each input, converted to gambit.query.result.QueryInput in the results object. Only used for reporting; does not affect any other aspect of the results. Items can be QueryInput, SequenceFile or str.
progress – Report progress for distance matrix calculation and classification. See gambit.util.progress.get_progress() for a description of allowed values.
**kw – Passed to QueryParams.
- Return type:
QueryResults
- gambit.query.query_parse(db, files, params=None, *, file_labels=None, parse_kw=None, **kw)
Query a database with signatures derived by parsing a set of genome sequence files.
- Parameters:
db (ReferenceDatabase) – Database to query.
files (Sequence[SequenceFile]) – Sequence files containing the query genomes.
params (QueryParams | None) – QueryParams instance defining parameter values. If None, take values from additional keyword arguments or use defaults.
file_labels (Sequence[str] | None) – Custom labels to use for each file in the returned results object. If None, use file names.
parse_kw (Dict[str, Any] | None) – Keyword parameters to pass to gambit.sigs.calc.calc_file_signatures().
**kw – Additional keyword arguments passed to query().
- Return type:
QueryResults