Database
gambit.db
gambit.db.refdb
- exception gambit.db.refdb.DatabaseLoadError
Bases:
Exception
Raised when there is a problem loading a database.
- directory
Directory we’re attempting to load from.
- Type:
pathlib.Path | None
- genomes_file
- Type:
pathlib.Path | None
- signatures_file
- Type:
pathlib.Path | None
- __init__(msg, directory=None, genomes_file=None, signatures_file=None)
- class gambit.db.refdb.ReferenceDatabase
Bases:
object
Object containing reference genomes, their k-mer signatures, and associated data.
This is all that is needed at runtime to run queries.
- genomeset
Genome set containing reference genomes.
- genomes
List of reference genomes.
- Type:
Sequence[gambit.db.models.AnnotatedGenome]
- signatures
K-mer signatures for each genome. A subtype of ReferenceSignatures, so it contains metadata on the signatures as well as the signatures themselves. The type may represent signatures stored on disk (e.g. HDF5Signatures) instead of in memory. It may contain additional signatures that do not correspond to any genome in genomes.
- sig_indices
Index of the signature in signatures corresponding to each genome in genomes. In sorted order to improve performance when iterating over them (improves locality if in memory and avoids seeking if in a file).
- Type:
Sequence[int]
- session
The SQLAlchemy session that genomeset and the elements of genomes belong to. It is important to keep a reference to this: having references to the ORM objects themselves is not enough to keep the session from being garbage collected.
- Parameters:
genomeset (gambit.db.models.ReferenceGenomeSet)
signatures (gambit.sigs.base.ReferenceSignatures)
- __init__(genomeset, signatures)
- Parameters:
genomeset (ReferenceGenomeSet)
signatures (ReferenceSignatures)
- classmethod load(genomes_file, signatures_file)
Load complete database given paths to SQLite genomes database file and HDF5 signatures file.
- Parameters:
- Return type:
- classmethod load_from_dir(path)
Load complete database given directory containing SQLite genomes database file and HDF5 signatures file.
See locate_files() for how these files are located within the directory.
- Raises:
RuntimeError – If files cannot be located in directory.
- Parameters:
path (FilePath)
- Return type:
- classmethod locate_files(path)
Locate an SQLite genome database file and HDF5 signatures file in a directory.
Files are located by extension: .gdb or .db for the SQLite file and .gs or .h5 for the signatures file. Does not look in subdirectories.
- Parameters:
path (FilePath) – Path to directory to look within.
- Returns:
Paths to genomes database file and signatures file.
- Return type:
- Raises:
FileNotFoundError – If the path does not exist.
NotADirectoryError – If the given path does not point to a directory.
DatabaseLoadError – If files could not be located or if multiple files with the same extension exist in the directory.
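The lookup rules described for locate_files() can be sketched with the standard library alone. This is an illustration, not gambit's implementation: find_single_file and locate_files_sketch are hypothetical names, LookupError stands in for DatabaseLoadError, and treating a directory that holds both a .gdb and a .db file as an error is an assumption.

```python
from pathlib import Path

# Extensions named in the documentation above.
GENOMES_EXTS = ('.gdb', '.db')
SIGNATURES_EXTS = ('.gs', '.h5')


def find_single_file(directory: Path, extensions: tuple) -> Path:
    """Return the one file directly inside ``directory`` with a matching suffix.

    Hypothetical helper, not part of gambit. Raises LookupError where the
    real method is documented to raise DatabaseLoadError.
    """
    if not directory.exists():
        raise FileNotFoundError(directory)
    if not directory.is_dir():
        raise NotADirectoryError(directory)

    # iterdir() is not recursive, so subdirectories are never searched.
    matches = sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix in extensions
    )
    if len(matches) != 1:
        raise LookupError(
            f'expected exactly one file with extension in {extensions}, '
            f'found {len(matches)}'
        )
    return matches[0]


def locate_files_sketch(path) -> tuple:
    """Return (genomes_file, signatures_file) found directly inside ``path``."""
    directory = Path(path)
    return (
        find_single_file(directory, GENOMES_EXTS),
        find_single_file(directory, SIGNATURES_EXTS),
    )
```

Checking existence and directory-ness first keeps the three documented failure modes distinct from one another.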
- gambit.db.refdb.genomes_by_id(genomeset, id_attr, ids, strict=True)
Match a ReferenceGenomeSet’s genomes to a set of ID values.
This is primarily used to match genomes to signatures based on the ID values stored in a signature file. It is expected that the signature file may contain signatures for more genomes than are present in the genome set; see also genomes_by_id_subset() for that condition.
- Parameters:
genomeset (ReferenceGenomeSet)
id_attr (str | InstrumentedAttribute) – ID attribute of gambit.db.models.Genome to use for lookup. Can be given as the attribute itself (e.g. Genome.refseq_acc) or just the name ('refseq_acc'). See ID_ATTRS for the set of allowed values.
ids (Sequence) – Sequence of ID values (strings or integers, matching type of attribute).
strict (bool) – Raise an exception if a matching genome cannot be found for any ID value.
- Returns:
List of genomes of same length as ids. If strict=False and a genome cannot be found for a given ID, the list will contain None at the corresponding position.
- Return type:
list[Optional[AnnotatedGenome]]
- Raises:
KeyError – If strict=True and any ID value cannot be found.
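The documented behaviour of genomes_by_id() (one output slot per ID, None for misses when strict=False, KeyError when strict=True) can be illustrated without a database. The sketch below is an assumption-laden stand-in: FakeGenome and match_by_id are hypothetical names, and the real function works on ORM objects from a ReferenceGenomeSet rather than a plain list.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class FakeGenome:
    """Stand-in for AnnotatedGenome; only the ID attributes matter here."""
    key: str
    refseq_acc: Optional[str] = None


def match_by_id(genomes: Sequence[FakeGenome], id_attr: str,
                ids: Sequence, strict: bool = True) -> list:
    """Return one genome (or None) per entry of ``ids``, in the same order."""
    # Index the genomes once; skip genomes with no value for this attribute.
    by_id = {
        getattr(g, id_attr): g
        for g in genomes
        if getattr(g, id_attr) is not None
    }
    matched = []
    for id_value in ids:
        genome = by_id.get(id_value)
        if genome is None and strict:
            raise KeyError(id_value)
        matched.append(genome)
    return matched


# Made-up data: two genomes, three IDs as they might appear in a signature file.
genomes = [FakeGenome('A', 'GCF_1'), FakeGenome('B', 'GCF_2')]
loose = match_by_id(genomes, 'key', ['B', 'X', 'A'], strict=False)
```

Here loose has the same length as the ID list, with None in the slot for the unknown ID 'X'.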
- gambit.db.refdb.genomes_by_id_subset(genomeset, id_attr, ids)
Match a ReferenceGenomeSet’s genomes to a set of ID values, allowing missing genomes.
This calls genomes_by_id() with strict=False and filters any None values from the output. The filtered list is returned along with the indices of all values in ids which were not filtered out. The indices can be used to load only those signatures which have a matched genome from a signature file.
Note that it is not checked that every genome in genomeset is matched by an ID. Check the size of the returned lists for this.
- Parameters:
genomeset (ReferenceGenomeSet)
id_attr (str | InstrumentedAttribute) – ID attribute of gambit.db.models.Genome to use for lookup. Can be given as the attribute itself (e.g. Genome.refseq_acc) or just the name ('refseq_acc'). See ID_ATTRS for the set of allowed values.
ids (Sequence) – Sequence of ID values (strings or integers, matching type of attribute).
- Return type:
tuple[list[AnnotatedGenome], list[int]]
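The filtering step and the use of the returned indices can be sketched as follows. All names and data here are made up for illustration (drop_unmatched, the ID strings, the placeholder signatures); plain strings stand in for genomes and signatures.

```python
from typing import Optional, Sequence


def drop_unmatched(matches: Sequence[Optional[object]]) -> tuple:
    """Return the non-None items and the positions they came from."""
    kept = []
    indices = []
    for i, item in enumerate(matches):
        if item is not None:
            kept.append(item)
            indices.append(i)
    return kept, indices


# IDs stored in a (made-up) signature file, in file order.
signature_ids = ['g1', 'g2', 'g3', 'g4']
signatures = ['sig-g1', 'sig-g2', 'sig-g3', 'sig-g4']

# Genomes present in the (made-up) genome set, keyed by ID.
genomes_in_set = {'g2': 'genome 2', 'g4': 'genome 4'}

# Equivalent of a non-strict lookup: None where no genome matches.
matches = [genomes_in_set.get(i) for i in signature_ids]
genomes, sig_indices = drop_unmatched(matches)

# Load only the signatures that have a matching genome.
selected = [signatures[i] for i in sig_indices]

# The function does not check that every genome was matched; compare sizes.
all_matched = len(genomes) == len(genomes_in_set)
```

Because the indices come from a single forward pass over the IDs, they are already in ascending order, which suits sequential reads from a signature file.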
- gambit.db.refdb.load_genomeset(db_file)
Get the only gambit.db.models.ReferenceGenomeSet from a genomes database file.
- Parameters:
db_file (FilePath)
- Return type:
tuple[Session, ReferenceGenomeSet]
gambit.db.models
SQLAlchemy models for storing reference genomes and taxonomy information.
- class gambit.db.models.AnnotatedGenome
Bases:
Base
A genome with additional annotations as part of a genome set.
This object serves to attach a genome to a ReferenceGenomeSet, and to assign a taxonomy classification to that genome. Hybrid attributes mirroring the attributes of the connected genome effectively make this behave as an extended Genome object.
- genome_id
Integer column, part of composite primary key. ID of the Genome the annotations are for.
- Type:
- genome_set_id
Integer column, part of composite primary key. ID of the ReferenceGenomeSet the annotations are under.
- Type:
- organism
String column. Single string describing the organism. May be “Genus species [strain]” but could contain more specific information. Intended to be human-readable and shouldn’t have any semantic meaning for the application (in contrast to the taxon relationship).
- Type:
- genome_set
Many-to-one relationship to ReferenceGenomeSet.
- Type:
- taxon
Many-to-one relationship to Taxon. The primary taxon this genome is classified as under the associated ReferenceGenomeSet. Should be the most specific and “regular” (ideally defined on NCBI) taxon this genome belongs to.
- Type:
- __init__(**kwargs)
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
- class gambit.db.models.Genome
Bases:
Base
Base model for a reference genome that can be compared to a query.
Corresponds to a single assembly (one or more contigs, but at least partially assembled) from what should be a single sequencing run. The same organism or strain may have several genome entries for it. Typically this will correspond directly to a record in GenBank (assembly database).
The data on this model should primarily pertain to the sample and sequencing run itself. It would be updated if, for example, a better assembly was produced from the original raw data. More advanced interpretation, such as taxonomy assignments, belongs on an attached AnnotatedGenome object.
- key
String column (unique). Unique “external id” used to reference the genome from outside the SQL database, e.g. from a file containing K-mer signatures.
- Type:
- description
String column (optional). Short one-line description. Recommended to be unique but this is not enforced.
- Type:
Optional[str]
- ncbi_db
String column (optional). If the genome corresponds to a record downloaded from an NCBI database this column should be the database name (e.g. 'assembly') and ncbi_id should be the entry’s UID. Unique along with ncbi_id.
- Type:
Optional[str]
- genbank_acc
String column (optional, unique). GenBank accession number for this genome, if any.
- Type:
Optional[str]
- refseq_acc
String column (optional, unique). RefSeq accession number for this genome, if any.
- Type:
Optional[str]
- annotations
One-to-many relationship to AnnotatedGenome.
- Type:
Collection[AnnotatedGenome]
- __init__(**kwargs)
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
- ID_ATTRS = ('key', 'genbank_acc', 'refseq_acc', 'ncbi_id')
Attributes which serve as unique IDs.
- class gambit.db.models.ReferenceGenomeSet
Bases:
Base
A collection of reference genomes along with additional annotations and data. A full GAMBIT database which can be used for queries consists of a genome set plus a set of k-mer signatures for those genomes (stored separately).
Membership of Genomes in the set is determined by the presence of an associated AnnotatedGenome object, which also holds additional annotation data for the genome. The genome set also includes a set of associated Taxon entries, which form a taxonomy tree under which all its genomes are categorized.
This schema technically allows for multiple genome sets within the same database (which can share Genomes but with different annotations), but the GAMBIT application generally expects that genome sets are stored in their own SQLite files.
- key
String column. An “external id” used to uniquely identify this genome set. Unique along with version.
- Type:
- version
Optional version string. An updated version of a previous genome set should have the same key with a later version number. Should be in the format defined by PEP 440.
- Type:
- genomes
Many-to-many relationship with AnnotatedGenome, annotated versions of genomes in this set.
- Type:
Collection[AnnotatedGenome]
- base_genomes
Unannotated Genomes in this set. Association proxy to the genome relationship of members of genomes.
- Type:
Collection[Genome]
- taxa
One-to-many relationship to Taxon. The taxa that form the classification system for this genome set.
- Type:
Collection[Taxon]
- __init__(**kwargs)
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
- root_taxa()
Query for root taxa belonging to the set.
- Return type:
sqlalchemy.orm.query.Query
- class gambit.db.models.Taxon
Bases:
Base
A taxon used for classifying genomes.
Taxa are specific to a ReferenceGenomeSet and form a tree/forest structure through the parent and children relationships.
- name
String column. Human-readable name for the taxon, typically the standard scientific name.
- Type:
- rank
String column (optional). Taxonomic rank, if any. Species, genus, family, etc.
- Type:
Optional[str]
- distance_threshold
Float column (optional). Query genomes within this distance of one of the taxon’s reference genomes will be classified as that taxon. If NULL, the taxon is used only to establish the tree structure and is not used directly in classification.
- Type:
Optional[float]
- report
Boolean column. Whether to report this taxon directly as a match when producing a human-readable query result. Some custom taxa might need to be “hidden” from the user, in which case the value should be false. The application should then ascend the taxon’s lineage and choose the first ancestor where this field is true. Defaults to true.
- Type:
- genome_set_id
Integer column. ID of the ReferenceGenomeSet the taxon belongs to.
- Type:
- ncbi_id
Integer column (optional). ID of the entry in the NCBI taxonomy database this taxon corresponds to, if any.
- Type:
Optional[int]
- parent
Many-to-one relationship with Taxon, the parent of this taxon (if any).
- Type:
Optional[Taxon]
- genome_set
Many-to-one relationship to ReferenceGenomeSet.
- Type:
- genomes
One-to-many relationship with AnnotatedGenome, genomes which are assigned to this taxon.
- Type:
Collection[AnnotatedGenome]
- __init__(**kwargs)
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
- ancestor_of_rank(rank)
Get the taxon’s ancestor with the given rank, if it exists.
- ancestors(incself=False)
Iterate through the taxon’s ancestors from bottom to top.
- classmethod common_ancestors(taxa)
Get list of common ancestors of a set of taxa.
- descendants(postorder=False)
Iterate through all of the taxon’s descendants.
This is the same as traverse() except the taxon itself is not included.
- has_genome(genome)
Check whether the given genome is assigned to this taxon or any of its descendants.
- Parameters:
genome (AnnotatedGenome)
- Return type:
- classmethod lca(taxa)
Find the Least Common Ancestor of a set of taxa.
Returns None if taxa is empty or its members do not all lie in the same tree.
- leaves()
Iterate through all leaves in the taxon’s subtree.
For leaf taxa this will just yield the taxon itself.
- lineage(ranks=None)
Get a list of this taxon’s ancestors.
With an argument, gets ancestors with the given ranks. Without, gets a sorted list of the taxon’s ancestors from top to bottom (including itself).
- print_tree(f=None, *, indent=' ', sort_key=None)
Print the taxon’s subtree for debugging.
- Parameters:
f (Callable[[Taxon], str] | None) – A function which takes a taxon and returns the string representation to print for it. Defaults to short_repr().
indent (str) – String used to indent each level of descendants.
sort_key (Callable[[Taxon], Any] | None) – A function which takes a taxon and returns a sort key, to determine what order a taxon’s children are printed in. Defaults to the taxon’s name.
- root()
Get the root taxon of this taxon’s tree.
The set of taxa in a ReferenceGenomeSet will generally form a forest instead of a single tree, so there can be multiple root taxa.
Returns self if the taxon has no parent.
- Return type:
- short_repr()
Get a short string representation of the Taxon for logging and warning/error messages.
- subtree_genomes()
Iterate through all genomes assigned to this taxon or its descendants.
- Return type:
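Several of the Taxon behaviours described above (walking ancestors from bottom to top, finding a least common ancestor in a forest, and ascending the lineage to the first taxon whose report flag is true) can be illustrated on a toy tree. This is a sketch only: Node, lca and reportable are hypothetical stand-ins rather than gambit code, and counting a taxon as its own ancestor when computing the LCA is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator, Optional, Sequence


@dataclass(eq=False)
class Node:
    """Toy stand-in for Taxon: a name, a parent link and a report flag."""
    name: str
    parent: Optional[Node] = None
    report: bool = True

    def ancestors(self, incself: bool = False) -> Iterator[Node]:
        """Yield ancestors from bottom to top, optionally starting with self."""
        node = self if incself else self.parent
        while node is not None:
            yield node
            node = node.parent


def lca(nodes: Sequence[Node]) -> Optional[Node]:
    """Least common ancestor, or None if empty or not all in one tree."""
    if not nodes:
        return None
    # Root-to-node path for each input (assumes a node is its own ancestor).
    paths = [list(n.ancestors(incself=True))[::-1] for n in nodes]
    common = None
    # Walk down from the roots while every path agrees on the node.
    for level in zip(*paths):
        if any(other is not level[0] for other in level[1:]):
            break
        common = level[0]
    return common


def reportable(node: Node) -> Optional[Node]:
    """First taxon in the lineage, starting at ``node``, with report=True."""
    for candidate in node.ancestors(incself=True):
        if candidate.report:
            return candidate
    return None


# A small forest with two roots and one "hidden" custom taxon.
family = Node('Enterobacteriaceae')
genus = Node('Escherichia', parent=family)
coli = Node('Escherichia coli', parent=genus)
hidden = Node('E. coli subgroup 1', parent=coli, report=False)
fergusonii = Node('Escherichia fergusonii', parent=genus)
other_root = Node('Bacillaceae')
```

With this tree, lca([hidden, fergusonii]) is the genus node, lca([hidden, other_root]) is None because the two taxa lie in different trees, and reportable(hidden) is the E. coli node.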
- gambit.db.models.only_genomeset(session)
Get the only ReferenceGenomeSet in a database.
The format which is used to distribute GAMBIT databases and is expected by the CLI is an SQLite file containing a single genome set.
- Parameters:
session (Session) – ORM session connected to database.
- Raises:
RuntimeError – If the database does not contain exactly one genome set.
- Return type:
gambit.db.sqla
Custom types and other utilities for SQLAlchemy.
- class gambit.db.sqla.JsonString
Bases:
TypeDecorator
SQLA column type for JSON data which is stored in the database as a standard string column.
Data is automatically serialized/deserialized when saved/loaded. Important: mutation tracking is not enabled for this type. If the value is a list or dict and you modify it in place, these changes will not be detected. Instead, re-assign the attribute.
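The consequence of missing mutation tracking can be shown with a toy model that uses no SQLAlchemy at all. The Record class below is hypothetical. It only mimics the general idea that change detection is triggered by attribute assignment, so in-place edits to a list or dict go unnoticed until the attribute is re-assigned.

```python
import json


class Record:
    """Toy object that notices assignment to ``extra`` but not in-place edits."""

    def __init__(self, extra):
        self._extra = extra
        self._dirty = False
        # What the pretend database currently holds, as a JSON string.
        self.stored = json.dumps(extra)

    @property
    def extra(self):
        return self._extra

    @extra.setter
    def extra(self, value):
        self._extra = value
        self._dirty = True  # only assignment marks the value as changed

    def flush(self):
        """Write the value back only if an assignment was seen."""
        if self._dirty:
            self.stored = json.dumps(self._extra)
            self._dirty = False


rec = Record({'tags': ['a']})

# In-place modification: the in-memory value changes, the stored string does not.
rec.extra['tags'].append('b')
rec.flush()
after_mutation = json.loads(rec.stored)

# Re-assigning the attribute (here with a copy) is what gets the change saved.
rec.extra = dict(rec.extra)
rec.flush()
after_reassign = json.loads(rec.stored)
```

After the in-place append the stored value still has a single tag; only after the re-assignment does it contain both.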
- class gambit.db.sqla.ReadOnlySession
Bases:
Session
Session class that doesn’t allow flushing/committing.