Taxonomic Classification and Database Queries

gambit.classify

Classify queries based on distance to reference sequences.

class gambit.classify.ClassifierResult

Bases: object

Result of applying the classifier to a single query genome.

success

Whether the classification process ran successfully with no fatal errors. If True, it is still possible that no prediction was made.

Type:

bool

predicted_taxon

Taxon predicted by classifier.

Type:

Optional[gambit.db.models.Taxon]

primary_match

Match to closest reference genome which produced a predicted taxon equal to or a descendant of predicted_taxon. None if no prediction was made.

Type:

Optional[gambit.classify.GenomeMatch]

closest_match

Match to closest reference genome overall. This should almost always be identical to primary_match.

Type:

gambit.classify.GenomeMatch

next_taxon

Next most specific taxon for which the threshold was not met. Currently this is just taken from the ancestry of closest_match.genome.taxon.

Type:

Optional[gambit.db.models.Taxon]

warnings

List of non-fatal warning messages to report.

Type:

List[str]

error

Message describing a fatal error which occurred, if any.

Type:

Optional[str]

__init__(success, predicted_taxon, primary_match, closest_match, next_taxon=_Nothing.NOTHING, warnings=_Nothing.NOTHING, error=None)

Method generated by attrs for class ClassifierResult.

Parameters:
  • success (bool) –

  • predicted_taxon (Optional[Taxon]) –

  • primary_match (Optional[GenomeMatch]) –

  • closest_match (GenomeMatch) –

  • next_taxon (Optional[Taxon]) –

  • warnings (List[str]) –

  • error (Optional[str]) –

Return type:

None

class gambit.classify.GenomeMatch

Bases: object

Match between a query and a single reference genome.

This is only used to report the distance from a query to some significant reference genome. It does not imply that this distance was close enough to actually make a taxonomy prediction, or that the prediction was the primary prediction overall.

genome

Reference genome matched to.

Type:

gambit.db.models.AnnotatedGenome

distance

Distance between query and reference genome.

Type:

float

matched_taxon

Taxon prediction based on this match alone. Will always be genome.taxon or one of its ancestors.

__init__(genome, distance, matched_taxon=_Nothing.NOTHING)

Method generated by attrs for class GenomeMatch.

Parameters:
  • genome (AnnotatedGenome) –

  • distance (float) –

  • matched_taxon (Optional[Taxon]) –

Return type:

None

next_taxon()

Get next most specific taxon in lineage of genome for which the threshold was not met.

Return type:

Optional[Taxon]

gambit.classify.classify(ref_genomes, dists, *, strict=False)

Predict the taxonomy of a query genome based on its distances to a set of reference genomes.

Parameters:
  • ref_genomes (Sequence[AnnotatedGenome]) – List of reference genomes from database.

  • dists (ndarray) – Array of distances to each reference genome.

  • strict (bool) – If True, find all significant matches to reference genomes and attempt to reconcile them if they result in different taxa. If False, consider only the top (closest) match. Defaults to False.

Return type:

ClassifierResult

gambit.classify.compare_classifier_results(result1, result2)

Compare two ClassifierResult instances for equality.

Parameters:
  • result1 (ClassifierResult) –

  • result2 (ClassifierResult) –

Return type:

bool

gambit.classify.compare_genome_matches(match1, match2)

Compare two GenomeMatch instances for equality.

The values for the distance attribute are only checked for approximate equality, to support instances where one was loaded from a results archive (saving and loading a float in JSON is lossy).

Also allows one or both values to be None.

Parameters:
  • match1 (Optional[GenomeMatch]) –

  • match2 (Optional[GenomeMatch]) –

Return type:

bool
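The approximate comparison described above can be sketched with the standard library alone. This is only an illustration, not GAMBIT's implementation: the `Match` tuple, the `matches_equal` helper, and the tolerance values are all hypothetical stand-ins.

```python
import math
from typing import Optional, Tuple

# Hypothetical stand-in for a GenomeMatch: (reference genome key, distance).
Match = Tuple[str, float]


def matches_equal(m1: Optional[Match], m2: Optional[Match],
                  rel_tol: float = 1e-6, abs_tol: float = 1e-9) -> bool:
    """Compare two matches, tolerating small differences in distance."""
    # If either value is None, they are equal only when both are None.
    if m1 is None or m2 is None:
        return m1 is None and m2 is None
    # Genome identity must agree exactly; distance only approximately.
    return m1[0] == m2[0] and math.isclose(
        m1[1], m2[1], rel_tol=rel_tol, abs_tol=abs_tol
    )


original = ("genome-1", 0.123456789)
# Simulates a distance that was written out with reduced precision.
reloaded = ("genome-1", float("0.12345679"))
```

Exact equality (`original == reloaded`) would fail here, while the tolerant comparison succeeds.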

gambit.classify.consensus_taxon(taxa)

Take a set of taxa matching a query and find a single consensus taxon for classification.

If a query matches a given taxon, it is expected that there may be matches to some of that taxon’s ancestors as well. In this case all matched taxa lie in a single lineage and the most specific will be the consensus.

It may also be possible for a query to match multiple taxa which are “inconsistent” with each other in the sense that one is not a descendant of another. In that case the consensus will be the lowest taxon which is either a descendant or ancestor of all taxa in the argument. It’s also possible in pathological cases (depending on reference database design) that the taxa may be within entirely different trees, in which case the consensus will be None.

The second element of the returned tuple is the set of taxa in the argument which are strict descendants of the consensus. This set will contain at least two taxa in the case of such an inconsistency and be empty otherwise.

Parameters:

taxa (Iterable[Taxon]) –

Returns:

Consensus taxon along with the set of any taxa in the argument which are descended from it.

Return type:

Tuple[Optional[Taxon], Set[Taxon]]
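The consensus rule described above can be sketched as follows. This is one possible way to compute it under stated assumptions, not the library's actual code: the `Taxon` class is a minimal stand-in with only a `parent` link, and `lineage` and `consensus_sketch` are hypothetical names. Returning every input taxon as the second element when the taxa lie in different trees is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass(eq=False)  # eq=False: compare and hash by identity
class Taxon:
    """Minimal stand-in for gambit.db.models.Taxon (hypothetical)."""
    name: str
    parent: Optional["Taxon"] = None


def lineage(taxon: Taxon) -> List[Taxon]:
    """Return the taxon and its ancestors, root first."""
    out = []
    node: Optional[Taxon] = taxon
    while node is not None:
        out.append(node)
        node = node.parent
    out.reverse()
    return out


def consensus_sketch(taxa: Iterable[Taxon]) -> Tuple[Optional[Taxon], Set[Taxon]]:
    lineages = [lineage(t) for t in set(taxa)]
    trunk: List[Taxon] = []
    depth = 0
    while True:
        # Distinct taxa found at this depth among lineages that reach it.
        nodes = {lin[depth] for lin in lineages if len(lin) > depth}
        if len(nodes) != 1:
            break  # either every lineage has ended, or the lineages branch
        trunk.append(next(iter(nodes)))
        depth += 1
    consensus = trunk[-1] if trunk else None
    # Input taxa lying strictly below the consensus.
    crown = {lin[-1] for lin in lineages if len(lin) > len(trunk)}
    return consensus, crown


genus = Taxon("genus")
sp1 = Taxon("species 1", genus)
sp2 = Taxon("species 2", genus)
sub1 = Taxon("subspecies 1", sp1)
unrelated = Taxon("unrelated root")
```

With these stand-ins, `consensus_sketch([genus, sp1, sub1])` yields `sub1` with an empty set, while `consensus_sketch([sp1, sp2])` yields `genus` together with both species.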

gambit.classify.find_matches(itr)

Find taxonomy matches given distances from a query to a set of reference genomes.

Parameters:

itr (Iterable[Tuple[AnnotatedGenome, float]]) – Iterable over (genome, distance) pairs.

Returns:

Mapping from taxa to indices of genomes matched to them.

Return type:

Dict[Taxon, List[int]]
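A rough sketch of how such a mapping could be built is shown below. The `Taxon` and `Genome` classes are simplified stand-ins, the `threshold_distance` field name follows the wording used for matching_taxon() in this reference, and `find_matches_sketch` is a hypothetical name. Skipping genomes that match no taxon is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(eq=False)  # eq=False: hashable by identity, usable as dict key
class Taxon:
    """Simplified stand-in for gambit.db.models.Taxon (hypothetical)."""
    name: str
    threshold_distance: Optional[float] = None
    parent: Optional["Taxon"] = None


@dataclass(eq=False)
class Genome:
    """Simplified stand-in for an annotated reference genome (hypothetical)."""
    key: str
    taxon: Taxon


def find_matches_sketch(
    pairs: Iterable[Tuple[Genome, float]]
) -> Dict[Taxon, List[int]]:
    matches: Dict[Taxon, List[int]] = {}
    for i, (genome, d) in enumerate(pairs):
        # Walk up the lineage until the distance is within a threshold.
        node: Optional[Taxon] = genome.taxon
        while node is not None:
            t = node.threshold_distance
            if t is not None and d <= t:
                matches.setdefault(node, []).append(i)
                break
            node = node.parent
    return matches


# Illustrative thresholds and distances only.
genus = Taxon("genus", threshold_distance=0.8)
sp1 = Taxon("species 1", threshold_distance=0.4, parent=genus)
sp2 = Taxon("species 2", threshold_distance=0.4, parent=genus)
genomes = [Genome("g0", sp1), Genome("g1", sp2), Genome("g2", sp1)]
dists = [0.1, 0.6, 0.95]
result = find_matches_sketch(zip(genomes, dists))
```

Here the first genome matches at the species level, the second only at the genus level, and the third matches nothing and so does not appear in the mapping.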

gambit.classify.matching_taxon(taxon, d)

Find first taxon in lineage for which distance d is within its classification threshold.

Parameters:
  • taxon (Taxon) – Taxon to start searching from.

  • d (float) – Distance value.

Returns:

Most specific taxon in ancestry with threshold_distance >= d.

Return type:

Optional[Taxon]
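The lineage walk described above might look like the following. This is a minimal sketch using a stand-in `Taxon` class rather than the real model. The `threshold_distance` field name is taken from the wording above, and treating a missing threshold as "never matches" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Taxon:
    """Stand-in for gambit.db.models.Taxon (hypothetical, simplified)."""
    name: str
    threshold_distance: Optional[float] = None  # None: no threshold defined
    parent: Optional["Taxon"] = None


def matching_taxon_sketch(taxon: Taxon, d: float) -> Optional[Taxon]:
    """Return the most specific taxon in the ancestry with threshold >= d."""
    node: Optional[Taxon] = taxon
    while node is not None:
        t = node.threshold_distance
        if t is not None and d <= t:
            return node
        node = node.parent  # move one step toward the root
    return None  # no taxon in the lineage is close enough


# Illustrative thresholds only.
genus = Taxon("genus", threshold_distance=0.8)
species = Taxon("species", threshold_distance=0.4, parent=genus)
```

A distance of 0.3 would match `species`, 0.6 would fall back to `genus`, and 0.9 would give no match.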

gambit.query

Run queries against a GAMBIT database to predict taxonomy of genome sequences.

class gambit.query.QueryInput

Bases: object

Information on a query genome.

label

Some unique label for the input, probably the file name.

Type:

str

file

Source file (optional).

Type:

Optional[gambit.seq.SequenceFile]

__init__(label, file=None)

Method generated by attrs for class QueryInput.

Parameters:
  • label (str) –

  • file (Optional[SequenceFile]) –

Return type:

None

classmethod convert(x)

Convenience function to convert flexible argument types into QueryInput.

Accepts single string label, SequenceFile (uses file path for label), or existing QueryInput instance (returned unchanged).

Parameters:

x (Union[QueryInput, SequenceFile, str]) –

Return type:

QueryInput
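The conversion rules above can be illustrated with simplified stand-in classes. Neither class below is the real GAMBIT type: the `path` attribute on `SequenceFile` and the `TypeError` raised for unsupported arguments are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class SequenceFile:
    """Stand-in for gambit.seq.SequenceFile (hypothetical, path only)."""
    path: str


@dataclass
class QueryInput:
    """Stand-in mirroring the documented label/file attributes."""
    label: str
    file: Optional[SequenceFile] = None

    @classmethod
    def convert(cls, x: Union["QueryInput", SequenceFile, str]) -> "QueryInput":
        if isinstance(x, cls):
            return x  # existing instance is returned unchanged
        if isinstance(x, SequenceFile):
            return cls(str(x.path), x)  # file path doubles as the label
        if isinstance(x, str):
            return cls(x)  # bare label, no source file
        raise TypeError(f"Cannot convert {type(x).__name__} to QueryInput")


existing = QueryInput("sample-a")
seqfile = SequenceFile("genomes/q1.fasta")
```

All three accepted argument types then normalize to the same class, so downstream code only has to handle `QueryInput`.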

class gambit.query.QueryParams

Bases: object

Parameters for running a query.

classify_strict

strict parameter to gambit.classify.classify(). Defaults to False.

Type:

bool

chunksize

Number of reference signatures to process at a time. None means no chunking is performed. Defaults to 1000.

Type:

int

report_closest

Number of closest genomes to report in results. Does not affect classification. Defaults to 10.

Type:

int

__init__(classify_strict=False, chunksize=1000, report_closest=10)

Method generated by attrs for class QueryParams.

Parameters:
  • classify_strict (bool) –

  • chunksize (int) –

  • report_closest (int) –

Return type:

None

class gambit.query.QueryResultItem

Bases: object

Result for a single query sequence.

input

Information on input genome.

Type:

gambit.query.QueryInput

classifier_result

Result of running classifier.

Type:

gambit.classify.ClassifierResult

report_taxon

Final taxonomy prediction to be reported to the user.

Type:

Optional[gambit.db.models.Taxon]

closest_genomes

List of closest reference genomes to query. Length determined by QueryParams.report_closest.

Type:

List[gambit.classify.GenomeMatch]

__init__(input, classifier_result, report_taxon=None, closest_genomes=_Nothing.NOTHING)

Method generated by attrs for class QueryResultItem.

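Parameters:
  • input (QueryInput) –

  • classifier_result (ClassifierResult) –

  • report_taxon (Optional[Taxon]) –

  • closest_genomes (List[GenomeMatch]) –

Return type: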

None

class gambit.query.QueryResults

Bases: object

Results for a set of queries, as well as information on database and parameters used.

items

Results for each query sequence.

Type:

List[gambit.query.QueryResultItem]

params

Parameters used to run query.

Type:

Optional[gambit.query.QueryParams]

genomeset

Genome set used.

Type:

Optional[gambit.db.models.ReferenceGenomeSet]

signaturesmeta

Metadata for signatures set used.

Type:

Optional[gambit.sigs.base.SignaturesMeta]

gambit_version

Version of GAMBIT command/library used to generate the results.

Type:

str

timestamp

Time query was completed.

Type:

datetime.datetime

extra

JSON-able dict containing additional arbitrary metadata.

Type:

Dict[str, Any]

__init__(items, params=None, genomeset=None, signaturesmeta=None, gambit_version='1.0.0', timestamp=_Nothing.NOTHING, extra=_Nothing.NOTHING)

Method generated by attrs for class QueryResults.

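Parameters:
  • items (List[QueryResultItem]) –

  • params (Optional[QueryParams]) –

  • genomeset (Optional[ReferenceGenomeSet]) –

  • signaturesmeta (Optional[SignaturesMeta]) –

  • gambit_version (str) –

  • timestamp (datetime) –

  • extra (Dict[str, Any]) –

Return type: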

None
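As an illustration of how the attributes documented above fit together, the sketch below walks a results object and produces one status line per query. It relies only on the attribute names listed in this reference (`items`, `input.label`, `classifier_result.success`, `classifier_result.error`, `report_taxon`). The `summarize` helper is hypothetical, and the demo uses `SimpleNamespace` stand-ins, with plain strings in place of Taxon objects, instead of real GAMBIT results.

```python
from types import SimpleNamespace
from typing import List, Tuple


def summarize(results) -> List[Tuple[str, str]]:
    """Return (label, status) pairs for each item in a results object."""
    rows = []
    for item in results.items:
        cr = item.classifier_result
        if not cr.success:
            status = f"error: {cr.error}"
        elif item.report_taxon is None:
            # success can be True even when no prediction was made
            status = "no prediction"
        else:
            status = f"predicted: {item.report_taxon}"
        rows.append((item.input.label, status))
    return rows


def _item(label, success, taxon=None, error=None):
    # Builds a stand-in shaped like QueryResultItem (illustration only).
    return SimpleNamespace(
        input=SimpleNamespace(label=label),
        classifier_result=SimpleNamespace(success=success, error=error),
        report_taxon=taxon,
    )


demo = SimpleNamespace(items=[
    _item("q1.fasta", True, taxon="taxon A"),
    _item("q2.fasta", True),
    _item("q3.fasta", False, error="something went wrong"),
])
rows = summarize(demo)
```

Checking `success` before `report_taxon` keeps failed classifications distinct from successful runs that simply made no prediction.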

gambit.query.compare_result_items(item1, item2)

Compare two QueryResultItem instances for equality.

Does not compare the value of the input attributes.

Parameters:
  • item1 (QueryResultItem) –

  • item2 (QueryResultItem) –

Return type:

bool

gambit.query.get_result_item(db, params, dists, input)

Perform classification and create result item object for single query input.

Parameters:
  • db (ReferenceDatabase) –

  • params (QueryParams) –

  • dists (ndarray) –

  • input (QueryInput) –

Return type:

QueryResultItem

gambit.query.query(db, queries, params=None, *, inputs=None, progress=None, **kw)

Predict the taxonomy of one or more query genomes using a GAMBIT reference database.

Parameters:
  • db (ReferenceDatabase) – Database to query.

  • queries (Sequence[KmerSignature]) – Sequence of k-mer signatures for query genomes.

  • params (Optional[QueryParams]) – QueryParams instance defining parameter values. If None take values from additional keyword arguments or use defaults.

  • inputs (Optional[Sequence[Union[QueryInput, SequenceFile, str]]]) – Description for each input, converted to gambit.query.QueryInput in the results object. Only used for reporting; does not affect any other aspect of the results. Items can be QueryInput, SequenceFile or str.

  • progress – Report progress for distance matrix calculation and classification. See gambit.util.progress.get_progress() for description of allowed values.

  • **kw – Passed to QueryParams.

Return type:

QueryResults

gambit.query.query_parse(db, files, params=None, *, file_labels=None, parse_kw=None, **kw)

Query a database with signatures derived by parsing a set of genome sequence files.

Parameters:
  • db (ReferenceDatabase) – Database to query.

  • files (Sequence[SequenceFile]) – Sequence files containing query genomes.

  • params (Optional[QueryParams]) – QueryParams instance defining parameter values. If None take values from additional keyword arguments or use defaults.

  • file_labels (Optional[Sequence[str]]) – Custom labels to use for each file in returned results object. If None use file names.

  • parse_kw (Optional[Dict[str, Any]]) – Keyword parameters to pass to gambit.sigs.calc.calc_file_signatures().

  • **kw – Additional keyword arguments passed to query().

Return type:

QueryResults