Database
gambit.db
gambit.db.refdb
- class gambit.db.refdb.ReferenceDatabase
Bases:
object
Object containing reference genomes, their k-mer signatures, and associated data.
This is all that is needed at runtime to run queries.
- genomeset
Genome set containing reference genomes.
- genomes
List of reference genomes.
- Type:
Sequence[gambit.db.models.AnnotatedGenome]
- signatures
K-mer signatures for each genome. A subtype of ReferenceSignatures, so contains metadata on signatures as well as the signatures themselves. Type may represent signatures stored on disk (e.g. HDF5Signatures) instead of in memory. OK to contain additional signatures not corresponding to any genome in genomes.
- sig_indices
Index of the signature in signatures corresponding to each genome in genomes. In sorted order to improve performance when iterating over them (improves locality if in memory and avoids seeking if in a file).
- Type:
Sequence[int]
- session
The SQLAlchemy session that genomeset and the elements of genomes belong to. It is important to keep a reference to this: references to the ORM objects themselves are not enough to keep the session from being garbage collected.
- Parameters:
genomeset (gambit.db.models.ReferenceGenomeSet) –
signatures (gambit.sigs.base.ReferenceSignatures) –
- __init__(genomeset, signatures)
- Parameters:
genomeset (ReferenceGenomeSet) –
signatures (ReferenceSignatures) –
- classmethod load(genomes_file, signatures_file)
Load complete database given paths to SQLite genomes database file and HDF5 signatures file.
- Parameters:
genomes_file (Union[str, PathLike]) –
signatures_file (Union[str, PathLike]) –
- Return type:
- classmethod load_from_dir(path)
Load complete database given directory containing SQLite genomes database file and HDF5 signatures file.
See locate_db_files() for how these files are located within the directory.
- Raises:
RuntimeError – If files cannot be located in directory.
- Parameters:
path (Union[str, PathLike]) –
- Return type:
- classmethod locate_files(path)
Locate an SQLite genome database file and HDF5 signatures file in a directory.
Files are located by extension: .gdb or .db for the SQLite file and .gs or .h5 for the signatures file. Does not look in subdirectories.
- Parameters:
path (Union[str, PathLike]) – Path to directory to look within.
- Return type:
Paths to genomes database file and signatures file.
- Raises:
FileNotFoundError – If files could not be located or if multiple files with the same extension exist in the directory.
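The lookup rules described for locate_files() can be illustrated with a minimal standard-library sketch. This is not GAMBIT's implementation: the names find_single_file and locate_files_sketch are hypothetical, and treating any two candidate files for the same role as an error is an assumption made here for simplicity.

```python
from pathlib import Path
from typing import Sequence, Tuple


def find_single_file(directory: Path, extensions: Sequence[str]) -> Path:
    """Return the one file directly inside `directory` with a matching suffix."""
    # Only direct children are inspected; subdirectories are not searched.
    matches = sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix in extensions
    )
    wanted = ' or '.join(extensions)
    if not matches:
        raise FileNotFoundError(f'No {wanted} file found in {directory}')
    if len(matches) > 1:
        # Assumption: any ambiguity is reported as an error.
        raise FileNotFoundError(f'Multiple {wanted} files found in {directory}')
    return matches[0]


def locate_files_sketch(path) -> Tuple[Path, Path]:
    """Hypothetical stand-in: return (genomes file, signatures file)."""
    directory = Path(path)
    genomes_file = find_single_file(directory, ('.gdb', '.db'))
    signatures_file = find_single_file(directory, ('.gs', '.h5'))
    return genomes_file, signatures_file
```

With the real library, the two returned paths would presumably be passed on to ReferenceDatabase.load(genomes_file, signatures_file).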
- gambit.db.refdb.genomes_by_id(genomeset, id_attr, ids, strict=True)
Match a ReferenceGenomeSet’s genomes to a set of ID values.
This is primarily used to match genomes to signatures based on the ID values stored in a signature file. The signature file may contain signatures for more genomes than are present in the genome set; see genomes_by_id_subset() for that case.
- Parameters:
genomeset (ReferenceGenomeSet) –
id_attr (Union[str, InstrumentedAttribute]) – ID attribute of gambit.db.models.Genome to use for lookup. Can be given as the attribute itself (e.g. Genome.refseq_acc) or just the name ('refseq_acc'). See GENOME_IDS for the set of allowed values.
ids (Sequence) – Sequence of ID values (strings or integers, matching type of attribute).
strict (bool) – Raise an exception if a matching genome cannot be found for any ID value.
- Returns:
List of genomes of the same length as ids. If strict=False and a genome cannot be found for a given ID, the list will contain None at the corresponding position.
- Return type:
List[Optional[AnnotatedGenome]]
- Raises:
KeyError – If
strict=True
and any ID value cannot be found.
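The matching behaviour described above can be sketched without a database. The GenomeStub class and genomes_by_id_sketch function below are hypothetical stand-ins (the real function operates on a ReferenceGenomeSet and ORM attributes); only the strict/non-strict contract is taken from the documentation.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class GenomeStub:
    """Hypothetical stand-in for a genome carrying ID attributes."""
    key: str
    refseq_acc: Optional[str] = None


def genomes_by_id_sketch(
    genomes: Sequence[GenomeStub],
    id_attr: str,
    ids: Sequence,
    strict: bool = True,
) -> List[Optional[GenomeStub]]:
    # Build the ID -> genome lookup once, then resolve each requested ID.
    by_id = {getattr(g, id_attr): g for g in genomes}
    matched: List[Optional[GenomeStub]] = []
    for id_value in ids:
        genome = by_id.get(id_value)
        if genome is None and strict:
            raise KeyError(id_value)
        # In non-strict mode a missing genome leaves None at this position.
        matched.append(genome)
    return matched
```

The output always has the same length and order as ids, which is what lets positions in the result line up with positions in a signature file.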
- gambit.db.refdb.genomes_by_id_subset(genomeset, id_attr, ids)
Match a ReferenceGenomeSet’s genomes to a set of ID values, allowing missing genomes.
This calls genomes_by_id() with strict=False and filters any None values from the output. The filtered list is returned along with the indices of all values in ids which were not filtered out. The indices can be used to load only those signatures which have a matched genome from a signature file.
Note that it is not checked that every genome in genomeset is matched by an ID. Check the size of the returned lists for this.
- Parameters:
genomeset (ReferenceGenomeSet) –
id_attr (Union[str, InstrumentedAttribute]) – ID attribute of gambit.db.models.Genome to use for lookup. Can be given as the attribute itself (e.g. Genome.refseq_acc) or just the name ('refseq_acc'). See GENOME_IDS for the set of allowed values.
ids (Sequence) – Sequence of ID values (strings or integers, matching type of attribute).
- Return type:
Tuple[List[AnnotatedGenome], List[int]]
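The filtering step and the use of the returned indices can be sketched as follows. The function name filter_matches and the string placeholders for genomes and signatures are hypothetical; the sketch only shows how the two returned lists relate to each other.

```python
from typing import List, Optional, Sequence, Tuple


def filter_matches(matches: Sequence[Optional[str]]) -> Tuple[List[str], List[int]]:
    """Drop None entries and report which input positions were kept."""
    genomes: List[str] = []
    indices: List[int] = []
    for i, genome in enumerate(matches):
        if genome is not None:
            genomes.append(genome)
            indices.append(i)
    return genomes, indices


# Result of a non-strict lookup: the second ID had no matching genome.
matches = ['genome_A', None, 'genome_C']
genomes, indices = filter_matches(matches)

# The indices select only those signatures that have a matched genome.
all_signatures = ['sig_0', 'sig_1', 'sig_2']
selected = [all_signatures[i] for i in indices]
```

Because the indices are produced in increasing order, they can be used to read signatures from a file sequentially.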
- gambit.db.refdb.load_genomeset(db_file)
Get the only gambit.db.models.ReferenceGenomeSet from a genomes database file.
- Parameters:
db_file (Union[str, PathLike]) –
- Return type:
Tuple[Session, ReferenceGenomeSet]
gambit.db.models
SQLAlchemy models for storing reference genomes and taxonomy information.
- class gambit.db.models.AnnotatedGenome
Bases:
Base
A genome with additional annotations as part of a genome set.
This object serves to attach a genome to a ReferenceGenomeSet, and to assign a taxonomy classification to that genome. Hybrid attributes mirroring the attributes of the connected genome effectively make this behave as an extended Genome object.
- genome_id
Integer column, part of composite primary key. ID of the Genome the annotations are for.
- Type:
int
- genome_set_id
Integer column, part of composite primary key. ID of the ReferenceGenomeSet the annotations are under.
- Type:
int
- organism
String column. Single string describing the organism. May be “Genus species [strain]” but could contain more specific information. Intended to be human-readable and shouldn’t have any semantic meaning for the application (in contrast to the taxa relationship).
- Type:
str
- genome_set
Many-to-one relationship to ReferenceGenomeSet.
- Type:
.ReferenceGenomeSet
- taxon
Many-to-one relationship to Taxon. The primary taxon this genome is classified as under the associated ReferenceGenomeSet. Should be the most specific and “regular” (ideally defined on NCBI) taxon this genome belongs to.
- Type:
.Taxon
- __init__(**kwargs)
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
- class gambit.db.models.Genome
Bases:
Base
Base model for a reference genome that can be compared to a query.
Corresponds to a single assembly (one or more contigs, but at least partially assembled) from what should be a single sequencing run. The same organism or strain may have several genome entries for it. Typically this will correspond directly to a record in Genbank (assembly database).
The data on this model should primarily pertain to the sample and sequencing run itself. It would be updated if, for example, a better assembly was produced from the original raw data; more advanced interpretation such as taxonomy assignments belongs on an attached AnnotatedGenome object.
- id
Integer column (primary key).
- Type:
int
- key
String column (unique). Unique “external id” used to reference the genome from outside the SQL database, e.g. from a file containing K-mer signatures.
- Type:
str
- description
String column (optional). Short one-line description. Recommended to be unique but this is not enforced.
- Type:
Optional[str]
- ncbi_db
String column (optional). If the genome corresponds to a record downloaded from an NCBI database this column should be the database name (e.g. 'assembly') and ncbi_id should be the entry’s UID. Unique along with ncbi_id.
- Type:
Optional[str]
- ncbi_id
Integer column (optional). UID of the entry in the NCBI database named by ncbi_db; see ncbi_db.
- Type:
Optional[int]
- genbank_acc
String column (optional, unique). GenBank accession number for this genome, if any.
- Type:
Optional[str]
- refseq_acc
String column (optional, unique). RefSeq accession number for this genome, if any.
- Type:
Optional[str]
- extra
JSON column (optional). Additional arbitrary metadata.
- Type:
Optional[dict]
- annotations
One-to-many relationship to AnnotatedGenome.
- Type:
Collection[.AnnotatedGenome]
- __init__(**kwargs)
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
- ID_ATTRS = ('key', 'genbank_acc', 'refseq_acc', 'ncbi_id')
Attributes which serve as unique IDs.
- class gambit.db.models.ReferenceGenomeSet
Bases:
Base
A collection of reference genomes along with additional annotations and data. A full GAMBIT database which can be used for queries consists of a genome set plus a set of k-mer signatures for those genomes (stored separately).
Membership of Genome objects in the set is determined by the presence of an associated AnnotatedGenome object, which also holds additional annotation data for the genome. The genome set also includes a set of associated Taxon entries, which form a taxonomy tree under which all its genomes are categorized.
This schema technically allows for multiple genome sets within the same database (which can share Genome objects but with different annotations), but the GAMBIT application generally expects that genome sets are stored in their own SQLite files.
- id
Integer primary key.
- Type:
int
- key
String column. An “external id” used to uniquely identify this genome set. Unique along with version.
- Type:
str
- version
Optional version string. An updated version of a previous genome set should have the same key with a later version number. Should be in the format defined by PEP 440.
- Type:
str
- name
String column. Unique name.
- Type:
str
- description
Text column. Optional description.
- Type:
Optional[str]
- extra
JSON column. Additional arbitrary data.
- Type:
Optional[dict]
- genomes
Many-to-many relationship with AnnotatedGenome, annotated versions of genomes in this set.
- Type:
Collection[.AnnotatedGenome]
- base_genomes
Unannotated Genome objects in this set. Association proxy to the genome relationship of members of genomes.
- Type:
Collection[.Genome]
- taxa
One-to-many relationship to Taxon. The taxa that form the classification system for this genome set.
- Type:
Collection[.Taxon]
- __init__(**kwargs)
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
- root_taxa()
Query for root taxa belonging to the set.
- Return type:
sqlalchemy.orm.query.Query
- class gambit.db.models.Taxon
Bases:
Base
A taxon used for classifying genomes.
Taxa are specific to a ReferenceGenomeSet and form a tree/forest structure through the parent and children relationships.
- id
Integer column (primary key).
- Type:
int
- key
String column (unique). An “external id” used to uniquely identify this taxon.
- Type:
str
- name
String column. Human-readable name for the taxon, typically the standard scientific name.
- Type:
str
- rank
String column (optional). Taxonomic rank, if any. Species, genus, family, etc.
- Type:
Optional[str]
- description
String column (optional). Optional description of taxon.
- Type:
Optional[str]
- distance_threshold
Float column (optional). Query genomes within this distance of one of the taxon’s reference genomes will be classified as that taxon. If NULL the taxon is just used to establish the tree structure and is not used directly in classification.
- Type:
Optional[float]
- report
Boolean column. Whether to report this taxon directly as a match when producing a human-readable query result. Some custom taxa might need to be “hidden” from the user, in which case the value should be false. The application should then ascend the taxon’s lineage and choose the first ancestor where this field is true. Defaults to true.
- Type:
bool
- extra
JSON column (optional). Additional arbitrary data.
- Type:
Optional[dict]
- genome_set_id
Integer column. ID of the ReferenceGenomeSet the taxon belongs to.
- Type:
int
- parent_id
Integer column. ID of Taxon that is the direct parent of this one.
- Type:
Optional[int]
- ncbi_id
Integer column (optional). ID of the entry in the NCBI taxonomy database this taxon corresponds to, if any.
- Type:
Optional[int]
- parent
Many-to-one relationship with Taxon, the parent of this taxon (if any).
- Type:
Optional[.Taxon]
- genome_set
Many-to-one relationship to ReferenceGenomeSet.
- Type:
.ReferenceGenomeSet
- genomes
One-to-many relationship with AnnotatedGenome, genomes which are assigned to this taxon.
- Type:
Collection[.AnnotatedGenome]
- __init__(**kwargs)
A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
- ancestor_of_rank(rank)
Get the taxon’s ancestor with the given rank, if it exists.
- Parameters:
rank (str) –
- Return type:
Optional[Taxon]
- ancestors(incself=False)
Iterate through the taxon’s ancestors from bottom to top.
- Parameters:
incself (bool) – If True start with self, otherwise start with parent.
- Return type:
Iterable[Taxon]
- classmethod common_ancestors(taxa)
Get list of common ancestors of a set of taxa.
- depth()
The number of ancestors the taxon has.
- Return type:
int
- descendants(postorder=False)
Iterate through all of the taxon’s descendants.
This is the same as traverse() except the taxon itself is not included.
- Parameters:
postorder (bool) – Iterate in postorder (parents after children) instead of the default preorder (parents before children).
- Return type:
Iterable[Taxon]
- has_genome(genome)
Check whether the given genome is assigned to this taxon or any of its descendants.
- Parameters:
genome (AnnotatedGenome) –
- Return type:
bool
- isleaf()
Check if the taxon is a leaf (has no children).
- Return type:
bool
- isroot()
Check if the taxon is a root (has no parent).
- Return type:
bool
- classmethod lca(taxa)
Find the Least Common Ancestor of a set of taxa.
Returns None if taxa is empty or its members do not all lie in the same tree.
- leaves()
Iterate through all leaves in the taxon’s subtree.
For leaf taxa this will just yield the taxon itself.
- Return type:
Iterable[Taxon]
- lineage(ranks=None)
Get a list of this taxon’s ancestors.
With an argument, gets ancestors with the given ranks. Without, gets a sorted list of the taxon’s ancestors from top to bottom (including itself).
- Parameters:
ranks (Optional[Iterable[str]]) –
- Return type:
List[Optional[Taxon]]
- print_tree(f=None, *, indent=' ', sort_key=None)
Print the taxon’s subtree for debugging.
- Parameters:
f (Optional[Callable[[Taxon], str]]) – A function which takes a taxon and returns the string representation to print for it. Defaults to short_repr().
indent (str) – String used to indent each level of descendants.
sort_key (Optional[Callable[[Taxon], Any]]) – A function which takes a taxon and returns a sort key, to determine what order a taxon’s children are printed in. Defaults to the taxon’s name.
- root()
Get the root taxon of this taxon’s tree.
The set of taxa in a ReferenceGenomeSet will generally form a forest instead of a single tree, so there can be multiple root taxa.
Returns self if the taxon has no parent.
- Return type:
- short_repr()
Get a short string representation of the Taxon for logging and warning/error messages.
- subtree_genomes()
Iterate through all genomes assigned to this taxon or its descendants.
- Return type:
Iterable[AnnotatedGenome]
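The tree operations documented for Taxon can be illustrated with a small in-memory sketch. TaxonSketch is a hypothetical stand-in, not the SQLAlchemy model: it keeps only name, rank, report, parent and children, and its lca() and first_reported() methods are one plausible reading of the descriptions of lca() and the report column above.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass(eq=False)  # compare by identity, like distinct database rows
class TaxonSketch:
    name: str
    rank: Optional[str] = None
    report: bool = True
    parent: Optional['TaxonSketch'] = None
    children: List['TaxonSketch'] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Keep the parent's child list in sync with the parent link.
        if self.parent is not None:
            self.parent.children.append(self)

    def ancestors(self, incself: bool = False) -> Iterator['TaxonSketch']:
        """Iterate from bottom to top, optionally starting with self."""
        taxon = self if incself else self.parent
        while taxon is not None:
            yield taxon
            taxon = taxon.parent

    def lineage(self) -> List['TaxonSketch']:
        """Ancestors from top to bottom, including self."""
        return list(self.ancestors(incself=True))[::-1]

    @classmethod
    def lca(cls, taxa: Iterable['TaxonSketch']) -> Optional['TaxonSketch']:
        """Deepest taxon shared by every lineage, or None."""
        lineages = [t.lineage() for t in taxa]
        common = None
        if lineages:
            # Walk all lineages in parallel from the root downwards.
            for level in zip(*lineages):
                if any(t is not level[0] for t in level):
                    break
                common = level[0]
        return common

    def first_reported(self) -> Optional['TaxonSketch']:
        """Ascend the lineage to the first taxon whose report flag is true."""
        for taxon in self.ancestors(incself=True):
            if taxon.report:
                return taxon
        return None
```

For example, with a genus, two species under it, and a hidden (report=False) subgroup under one species, lca() of the subgroup and the other species yields the genus, while first_reported() on the subgroup yields its parent species.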
- gambit.db.models.only_genomeset(session)
Get the only ReferenceGenomeSet in a database.
The format which is used to distribute GAMBIT databases and is expected by the CLI is an SQLite file containing a single genome set.
- Parameters:
- Parameters:
session (Session) – ORM session connected to database.
- Raises:
RuntimeError – If the database does not contain exactly one genome set.
- Return type:
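The "exactly one genome set" check can be sketched with the standard-library sqlite3 module. This is an illustration only: the real function takes an ORM session, and the table name genome_sets and its columns are assumptions made for this example, not GAMBIT's actual schema.

```python
import sqlite3


def only_genomeset_sketch(conn: sqlite3.Connection) -> tuple:
    """Return the single row of the (hypothetical) genome_sets table."""
    # Fetching at most two rows is enough to tell 0, 1 and "more than 1" apart.
    rows = conn.execute('SELECT id, key, version FROM genome_sets LIMIT 2').fetchall()
    if len(rows) != 1:
        raise RuntimeError('Database does not contain exactly one genome set')
    return rows[0]
```

An empty table and a table with two or more rows both raise RuntimeError, matching the documented behaviour.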
gambit.db.sqla
Custom types and other utilities for SQLAlchemy.
- class gambit.db.sqla.JsonString
Bases:
TypeDecorator
SQLA column type for JSON data which is stored in the database as a standard string column.
Data is automatically serialized/unserialized when saved/loaded. Important: mutation tracking is not enabled for this type. If the value is a list or dict and you modify it in place these changes will not be detected. Instead, re-assign the attribute.
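Why in-place modification goes unnoticed can be shown with a rough analogy. TrackedRecord below is a hypothetical class, not SQLAlchemy's machinery: it marks itself dirty only when an attribute is assigned, which mirrors the behaviour described above for columns without mutation tracking.

```python
class TrackedRecord:
    """Hypothetical stand-in for an ORM object: only assignment marks it dirty."""

    def __init__(self, extra: dict) -> None:
        # Bypass our own __setattr__ so construction does not count as a change.
        object.__setattr__(self, 'extra', extra)
        object.__setattr__(self, 'dirty', False)

    def __setattr__(self, name: str, value) -> None:
        object.__setattr__(self, name, value)
        object.__setattr__(self, 'dirty', True)


record = TrackedRecord({'source': 'ncbi'})

# In-place modification: the dict changes, but no attribute was assigned.
record.extra['curated'] = True
in_place_detected = record.dirty

# Re-assignment with a new dict: the change is detected.
record.extra = {**record.extra, 'reviewed': True}
reassign_detected = record.dirty
```

The second pattern (build a new value and assign it to the attribute) is the one the note above recommends.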
- class gambit.db.sqla.ReadOnlySession
Bases:
Session
Session class that doesn’t allow flushing/committing.
gambit.db.migrate
Perform genome database migrations with Alembic.
This package contains all Alembic configuration and data files. Revision files are located in ./alembic/versions.
Note on Alembic configuration: normal usage of Alembic seems to involve getting the database URL from alembic.ini. Since this application has no fixed location for the database, we can’t use this method. Instead we follow the “Sharing a Connection with a Series of Migration Commands and Environments” recipe in Alembic’s documentation, where the connectable object is generated programmatically and then attached to the Alembic configuration object’s attributes dict. The run_migrations_offline and run_migrations_online functions in alembic/env.py are modified from the versions generated by alembic init to get their connectable object from this dict instead of creating it based on the contents of alembic.ini. Note that this means we can’t run (online) migrations from the standard alembic CLI command, which gets its connection information only from alembic.ini.
The way to use this setup is instead to create an alembic.config.Config instance with get_alembic_config() and use the functions in alembic.command.
- gambit.db.migrate.current_head()
Get the current head revision number.
- Return type:
str
- gambit.db.migrate.current_revision(connectable)
Get the current revision number of a genome database.
- Parameters:
connectable (Connectable) –
- Return type:
str
- gambit.db.migrate.get_alembic_config(connectable=None, **kwargs)
Get an alembic config object to perform migrations.
- Parameters:
connectable (Optional[Connectable]) – SQLAlchemy connectable specifying database connection info (optional). Assigned to the 'connectable' key of alembic.config.Config.attributes.
**kwargs – Keyword arguments to pass to alembic.config.Config.__init__().
- Return type:
Alembic config object.
- gambit.db.migrate.init_db(connectable)
Initialize the genome database schema by creating all tables and stamping with the latest Alembic revision.
Expects a fresh database that does not already contain any tables for the gambit.db.models models and has not had any migrations run on it yet.
- Parameters:
connectable (Connectable) – SQLAlchemy connectable specifying database connection info.
- Raises:
RuntimeError – If the database is already stamped with an Alembic revision.
sqlalchemy.exc.OperationalError – If any of the database tables to be created already exist.
- gambit.db.migrate.is_current_revision(connectable)
Check if the current revision of a genome database is the most recent (head) revision.
- Parameters:
connectable (Connectable) –
- gambit.db.migrate.upgrade(connectable, revision='head', tag=None, **kwargs)
Run the alembic upgrade command.
See alembic.command.upgrade() for more information on how this works.
- Parameters:
connectable (Connectable) – SQLAlchemy connectable specifying genome database connection info.
revision (str) – Revision to upgrade to. Passed to alembic.command.upgrade().
tag – Passed to alembic.command.upgrade().
**kwargs – Passed to get_alembic_config().