Database

gambit.db

gambit.db.refdb

class gambit.db.refdb.ReferenceDatabase

Bases: object

Object containing reference genomes, their k-mer signatures, and associated data.

This is all that is needed at runtime to run queries.

genomeset

Genome set containing reference genomes.

Type:

gambit.db.models.ReferenceGenomeSet

genomes

List of reference genomes.

Type:

Sequence[gambit.db.models.AnnotatedGenome]

signatures

K-mer signatures for each genome. This is a subtype of ReferenceSignatures, so it contains metadata on the signatures as well as the signatures themselves. The type may represent signatures stored on disk (e.g. HDF5Signatures) instead of in memory. It may contain additional signatures that do not correspond to any genome in genomes.

Type:

gambit.sigs.base.ReferenceSignatures

sig_indices

Index of the signature in signatures corresponding to each genome in genomes. Kept in sorted order to improve performance when iterating over them (better locality if in memory, no seeking if in a file).

Type:

Sequence[int]

session

The SQLAlchemy session that genomeset and the elements of genomes belong to. It is important to keep a reference to this: holding references to the ORM objects themselves is not enough to keep the session from being garbage collected.

__init__(genomeset, signatures)

classmethod load(genomes_file, signatures_file)

Load complete database given paths to SQLite genomes database file and HDF5 signatures file.

Parameters:
  • genomes_file (Union[str, PathLike]) –

  • signatures_file (Union[str, PathLike]) –

Return type:

ReferenceDatabase

classmethod load_from_dir(path)

Load complete database given directory containing SQLite genomes database file and HDF5 signatures file.

See locate_files() for how these files are located within the directory.

Raises:

RuntimeError – If files cannot be located in directory.

Parameters:

path (Union[str, PathLike]) –

Return type:

ReferenceDatabase

classmethod locate_files(path)

Locate an SQLite genome database file and HDF5 signatures file in a directory.

Files are located by extension, .gdb or .db for SQLite file and .gs or .h5 for signatures file. Does not look in subdirectories.

Parameters:

path (Union[str, PathLike]) – Path to directory to look within.

Return type:

Paths to genomes database file and signatures file.

Raises:

FileNotFoundError – If files could not be located or if multiple files with the same extension exist in the directory.
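The lookup rule described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not GAMBIT's implementation: the names locate_files_sketch and find_single_file are hypothetical, and the sketch treats any two candidate files of the same kind as ambiguous, which may be stricter than the real method.

```python
from pathlib import Path
from typing import Tuple

# Extensions taken from the description above.
GENOMES_EXTENSIONS = ('.gdb', '.db')
SIGNATURES_EXTENSIONS = ('.gs', '.h5')


def find_single_file(directory: Path, extensions: Tuple[str, ...], label: str) -> Path:
    """Return the one regular file in ``directory`` whose suffix is in ``extensions``."""
    # iterdir() lists only direct children, so subdirectories are never searched.
    matches = sorted(p for p in directory.iterdir() if p.is_file() and p.suffix in extensions)
    if not matches:
        raise FileNotFoundError(f'No {label} file found in {directory}')
    if len(matches) > 1:
        raise FileNotFoundError(f'Multiple {label} files found in {directory}')
    return matches[0]


def locate_files_sketch(path) -> Tuple[Path, Path]:
    """Hypothetical stand-in for ReferenceDatabase.locate_files()."""
    directory = Path(path)
    genomes_file = find_single_file(directory, GENOMES_EXTENSIONS, 'genomes database')
    signatures_file = find_single_file(directory, SIGNATURES_EXTENSIONS, 'signatures')
    return genomes_file, signatures_file
```

Raising FileNotFoundError for the ambiguous case as well as the missing case mirrors the Raises entry above, so callers only need to handle one exception type.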

gambit.db.refdb.genomes_by_id(genomeset, id_attr, ids, strict=True)

Match a ReferenceGenomeSet’s genomes to a set of ID values.

This is primarily used to match genomes to signatures based on the ID values stored in a signature file. The signature file may contain signatures for more genomes than are present in the genome set; see genomes_by_id_subset() for that case.

Parameters:
  • genomeset (ReferenceGenomeSet) –

  • id_attr (Union[str, InstrumentedAttribute]) – ID attribute of gambit.db.models.Genome to use for lookup. Can be given as the attribute itself (e.g. Genome.refseq_acc) or just its name ('refseq_acc'). See GENOME_IDS for the set of allowed values.

  • ids (Sequence) – Sequence of ID values (strings or integers, matching type of attribute).

  • strict (bool) – Raise an exception if a matching genome cannot be found for any ID value.

Returns:

List of genomes of same length as ids. If strict=False and a genome cannot be found for a given ID the list will contain None at the corresponding position.

Return type:

List[Optional[AnnotatedGenome]]

Raises:

KeyError – If strict=True and any ID value cannot be found.
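The matching behavior described above can be illustrated without a database. The following is a minimal sketch, not the library's code: GenomeStub and genomes_by_id_sketch are hypothetical stand-ins, the accession strings are placeholders, and the real function works on ORM objects rather than dataclasses.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Sequence


@dataclass
class GenomeStub:
    """Hypothetical stand-in for AnnotatedGenome; only ID attributes are modeled."""
    key: str
    refseq_acc: Optional[str] = None


def genomes_by_id_sketch(genomes: Sequence[GenomeStub], id_attr: str, ids: Sequence[Any],
                         strict: bool = True) -> List[Optional[GenomeStub]]:
    # Build a lookup table from ID value to genome once, then resolve each ID in order.
    by_id = {getattr(g, id_attr): g for g in genomes}
    matched = []
    for id_ in ids:
        genome = by_id.get(id_)
        if genome is None and strict:
            raise KeyError(id_)
        matched.append(genome)
    return matched


genomes = [GenomeStub('A', 'acc-1'), GenomeStub('B', 'acc-2')]  # placeholder IDs
result = genomes_by_id_sketch(genomes, 'key', ['B', 'X', 'A'], strict=False)
print([g.key if g else None for g in result])  # ['B', None, 'A']
```

The output has the same length and order as ids, which is what allows it to be lined up position by position with the signatures in a file.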

gambit.db.refdb.genomes_by_id_subset(genomeset, id_attr, ids)

Match a ReferenceGenomeSet’s genomes to a set of ID values, allowing missing genomes.

This calls genomes_by_id() with strict=False and filters any None values from the output. The filtered list is returned along with the indices of all values in ids which were not filtered out. The indices can be used to load only those signatures which have a matched genome from a signature file.

Note that it is not checked that every genome in genomeset is matched by an ID. Check the size of the returned lists for this.

Parameters:
  • genomeset (ReferenceGenomeSet) –

  • id_attr (Union[str, InstrumentedAttribute]) – ID attribute of gambit.db.models.Genome to use for lookup. Can be given as the attribute itself (e.g. Genome.refseq_acc) or just its name ('refseq_acc'). See GENOME_IDS for the set of allowed values.

  • ids (Sequence) – Sequence of ID values (strings or integers, matching type of attribute).

Return type:

Tuple[List[AnnotatedGenome], List[int]]
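The filtering step described above (drop None values, keep the positions of the survivors) can be sketched as follows. The helper name drop_unmatched is hypothetical, and plain strings stand in for genome objects.

```python
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar('T')


def drop_unmatched(matches: Sequence[Optional[T]]) -> Tuple[List[T], List[int]]:
    """Filter out None entries and report the positions of the entries that were kept."""
    kept = []
    indices = []
    for i, item in enumerate(matches):
        if item is not None:
            kept.append(item)
            indices.append(i)
    return kept, indices


# Pretend result of genomes_by_id(..., strict=False) for a signature file with five IDs,
# two of which have no genome in the genome set.
matches = ['genome-a', None, 'genome-c', 'genome-d', None]
genomes, sig_indices = drop_unmatched(matches)
print(genomes)      # ['genome-a', 'genome-c', 'genome-d']
print(sig_indices)  # [0, 2, 3]
```

Because the indices are produced in increasing order, they can be used directly to read the matching signatures from a file in a single forward pass.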

gambit.db.refdb.load_genomeset(db_file)

Get the only gambit.db.models.ReferenceGenomeSet from a genomes database file.

Parameters:

db_file (Union[str, PathLike]) –

Return type:

Tuple[Session, ReferenceGenomeSet]

gambit.db.models

SQLAlchemy models for storing reference genomes and taxonomy information.

class gambit.db.models.AnnotatedGenome

Bases: Base

A genome with additional annotations as part of a genome set.

This object serves to attach a genome to a ReferenceGenomeSet, and to assign a taxonomy classification to that genome. Hybrid attributes mirroring the attributes of the connected genome effectively make this behave as an extended Genome object.

genome_id

Integer column, part of composite primary key. ID of Genome the annotations are for.

Type:

int

genome_set_id

Integer column, part of composite primary key. ID of the ReferenceGenomeSet the annotations are under.

Type:

int

organism

String column. Single string describing the organism. May be “Genus species [strain]” but could contain more specific information. Intended to be human-readable and shouldn’t have any semantic meaning for the application (in contrast to the taxa relationship).

Type:

str

taxon_id

Integer column. ID of the Taxon this genome is classified as.

Type:

int

genome

Many-to-one relationship to Genome.

Type:

.Genome

genome_set

Many-to-one relationship to ReferenceGenomeSet.

Type:

.ReferenceGenomeSet

taxon

Many-to-one relationship to Taxon. The primary taxon this genome is classified as under the associated ReferenceGenomeSet. Should be the most specific and “regular” (ideally defined on NCBI) taxon this genome belongs to.

Type:

.Taxon

key

Hybrid property connected to attribute on genome.

Type:

str

description

Hybrid property connected to attribute on genome.

Type:

Optional[str]

ncbi_db

Hybrid property connected to attribute on genome.

Type:

Optional[str]

ncbi_id

Hybrid property connected to attribute on genome.

Type:

Optional[int]

genbank_acc

Hybrid property connected to attribute on genome.

Type:

Optional[str]

refseq_acc

Hybrid property connected to attribute on genome.

Type:

Optional[str]

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

class gambit.db.models.Genome

Bases: Base

Base model for a reference genome that can be compared to a query.

Corresponds to a single assembly (one or more contigs, but at least partially assembled) from what should be a single sequencing run. The same organism or strain may have several genome entries for it. Typically this will correspond directly to a record in Genbank (assembly database).

The data on this model should primarily pertain to the sample and sequencing run itself. It would be updated if, for example, a better assembly was produced from the original raw data. More advanced interpretation, such as taxonomy assignments, belongs on an attached AnnotatedGenome object.

id

Integer column (primary key).

Type:

int

key

String column (unique). Unique “external id” used to reference the genome from outside the SQL database, e.g. from a file containing K-mer signatures.

Type:

str

description

String column (optional). Short one-line description. Recommended to be unique but this is not enforced.

Type:

Optional[str]

ncbi_db

String column (optional). If the genome corresponds to a record downloaded from an NCBI database this column should be the database name (e.g. 'assembly') and ncbi_id should be the entry’s UID. Unique along with ncbi_id.

Type:

Optional[str]

ncbi_id

Integer column (optional). See ncbi_db.

Type:

Optional[int]

genbank_acc

String column (optional, unique). GenBank accession number for this genome, if any.

Type:

Optional[str]

refseq_acc

String column (optional, unique). RefSeq accession number for this genome, if any.

Type:

Optional[str]

extra

JSON column (optional). Additional arbitrary metadata.

Type:

Optional[dict]

annotations

One-to-many relationship to AnnotatedGenome.

Type:

Collection[.AnnotatedGenome]

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

ID_ATTRS = ('key', 'genbank_acc', 'refseq_acc', 'ncbi_id')

Attributes which serve as unique IDs.

class gambit.db.models.ReferenceGenomeSet

Bases: Base

A collection of reference genomes along with additional annotations and data. A full GAMBIT database which can be used for queries consists of a genome set plus a set of k-mer signatures for those genomes (stored separately).

Membership of Genomes in the set is determined by the presence of an associated AnnotatedGenome object, which also holds additional annotation data for the genome. The genome set also includes a set of associated Taxon entries, which form a taxonomy tree under which all its genomes are categorized.

This schema technically allows for multiple genome sets within the same database (which can share Genomes but with different annotations), but the GAMBIT application generally expects that genome sets are stored in their own SQLite files.

id

Integer primary key.

Type:

int

key

String column. An “external id” used to uniquely identify this genome set. Unique along with version.

Type:

str

version

Optional version string. An updated version of a previous genome set should have the same key with a later version number. Should be in the format defined by PEP 440.

Type:

str

name

String column. Unique name.

Type:

str

description

Text column. Optional description.

Type:

Optional[str]

extra

JSON column. Additional arbitrary data.

Type:

Optional[dict]

genomes

One-to-many relationship with AnnotatedGenome, annotated versions of genomes in this set.

Type:

Collection[.AnnotatedGenome]

base_genomes

Unannotated Genomes in this set. Association proxy to the genome relationship of members of genomes.

Type:

Collection[.Genome]

taxa

One-to-many relationship to Taxon. The taxa that form the classification system for this genome set.

Type:

Collection[.Taxon]

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

root_taxa()

Query for root taxa belonging to the set.

Return type:

sqlalchemy.orm.query.Query

class gambit.db.models.Taxon

Bases: Base

A taxon used for classifying genomes.

Taxa are specific to a ReferenceGenomeSet and form a tree/forest structure through the parent and children relationships.

id

Integer column (primary key).

Type:

int

key

String column (unique). An “external id” used to uniquely identify this taxon.

Type:

str

name

String column. Human-readable name for the taxon, typically the standard scientific name.

Type:

str

rank

String column (optional). Taxonomic rank, if any. Species, genus, family, etc.

Type:

Optional[str]

description

String column (optional). Optional description of taxon.

Type:

Optional[str]

distance_threshold

Float column (optional). Query genomes within this distance of one of the taxon’s reference genomes will be classified as that taxon. If NULL, the taxon is only used to establish the tree structure and is not used directly in classification.

Type:

Optional[float]

report

Boolean column. Whether to report this taxon directly as a match when producing a human-readable query result. Some custom taxa might need to be “hidden” from the user, in which case the value should be false. The application should then ascend the taxon’s lineage and choose the first ancestor where this field is true. Defaults to true.

Type:

bool

extra

JSON column (optional). Additional arbitrary data.

Type:

Optional[dict]

genome_set_id

Integer column. ID of ReferenceGenomeSet the taxon belongs to.

Type:

int

parent_id

Integer column. ID of Taxon that is the direct parent of this one.

Type:

Optional[int]

ncbi_id

Integer column (optional). ID of the entry in the NCBI taxonomy database this taxon corresponds to, if any.

Type:

Optional[int]

parent

Many-to-one relationship with Taxon, the parent of this taxon (if any).

Type:

Optional[.Taxon]

children

One-to-many relationship with Taxon, the children of this taxon.

Type:

Collection[.Taxon]

genome_set

Many-to-one relationship to ReferenceGenomeSet.

Type:

.ReferenceGenomeSet

genomes

One-to-many relationship with AnnotatedGenome, genomes which are assigned to this taxon.

Type:

Collection[.AnnotatedGenome]

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

ancestor_of_rank(rank)

Get the taxon’s ancestor with the given rank, if it exists.

Parameters:

rank (str) –

Return type:

Optional[Taxon]

ancestors(incself=False)

Iterate through the taxon’s ancestors from bottom to top.

Parameters:

incself (bool) – If True start with self, otherwise start with parent.

Return type:

Iterable[Taxon]

classmethod common_ancestors(taxa)

Get list of common ancestors of a set of taxa.

Returns:

Common ancestors from top to bottom (same order as lineage()). Will be empty if the taxa share no common ancestor.

Return type:

List[.Taxon]

Parameters:

taxa (Iterable[Taxon]) –

depth()

The number of ancestors the taxon has.

Return type:

int

descendants(postorder=False)

Iterate through all of the taxon’s descendants.

This is the same as traverse() except the taxon itself is not included.

Parameters:

postorder (bool) – Iterate in postorder (parents after children) instead of the default preorder (parents before children).

Return type:

Iterable[Taxon]

has_genome(genome)

Check whether the given genome is assigned to this taxon or any of its descendants.

Parameters:

genome (AnnotatedGenome) –

Return type:

bool

isleaf()

Check if the taxon is a leaf (has no children).

Return type:

bool

isroot()

Check if the taxon is a root (has no parent).

Return type:

bool

classmethod lca(taxa)

Find the Least Common Ancestor of a set of taxa.

Returns None if taxa is empty or its members do not all lie in the same tree.

Parameters:

taxa (Iterable[Taxon]) –

Return type:

Optional[Taxon]
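One common way to compute common ancestors and the least common ancestor is to compare root-to-node lineages position by position. The sketch below shows that idea on a hypothetical TaxonStub class with only a name and a parent link; it is not taken from GAMBIT's source, and it assumes a lineage includes the taxon itself.

```python
from typing import Iterable, List, Optional


class TaxonStub:
    """Hypothetical stand-in for Taxon; only name and parent are modeled."""

    def __init__(self, name: str, parent: Optional['TaxonStub'] = None):
        self.name = name
        self.parent = parent

    def lineage(self) -> List['TaxonStub']:
        # Collect self and ancestors bottom-up, then reverse to get top-to-bottom order.
        chain = []
        taxon = self
        while taxon is not None:
            chain.append(taxon)
            taxon = taxon.parent
        return chain[::-1]


def common_ancestors(taxa: Iterable[TaxonStub]) -> List[TaxonStub]:
    lineages = [t.lineage() for t in taxa]
    if not lineages:
        return []
    common = []
    # zip() stops at the shortest lineage; compare level by level from the root down.
    for level in zip(*lineages):
        if any(t is not level[0] for t in level):
            break
        common.append(level[0])
    return common


def lca(taxa: Iterable[TaxonStub]) -> Optional[TaxonStub]:
    # The least common ancestor is the deepest shared entry, if there is one.
    common = common_ancestors(taxa)
    return common[-1] if common else None


root = TaxonStub('root')
a = TaxonStub('a', root)
b = TaxonStub('b', root)
a1 = TaxonStub('a1', a)
other = TaxonStub('other')  # root of a separate tree

print(lca([a1, b]).name)  # root
print(lca([a1, a]).name)  # a
print(lca([a1, other]))   # None
print(lca([]))            # None
```

Taxa from different trees diverge at the very first level, so the shared prefix is empty and the result is None, matching the behavior documented for lca().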

leaves()

Iterate through all leaves in the taxon’s subtree.

For leaf taxa this will just yield the taxon itself.

Return type:

Iterable[Taxon]

lineage(ranks=None)

Get a list of this taxon’s ancestors.

With an argument, gets ancestors with the given ranks. Without, gets a sorted list of the taxon’s ancestors from top to bottom (including itself).

Parameters:

ranks (Optional[Iterable[str]]) –

Return type:

List[Optional[Taxon]]
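The two calling modes of lineage() can be sketched on a stand-in class. TaxonStub is hypothetical, and the handling of a rank with no matching ancestor (returning None in that position) is an assumption inferred from the Optional in the return type, not confirmed behavior.

```python
from typing import Iterable, List, Optional


class TaxonStub:
    """Hypothetical stand-in for Taxon; only name, rank and parent are modeled."""

    def __init__(self, name: str, rank: Optional[str] = None,
                 parent: Optional['TaxonStub'] = None):
        self.name = name
        self.rank = rank
        self.parent = parent

    def lineage(self, ranks: Optional[Iterable[str]] = None) -> List[Optional['TaxonStub']]:
        # Follow parent links upward, then reverse so the root comes first.
        chain = []
        taxon = self
        while taxon is not None:
            chain.append(taxon)
            taxon = taxon.parent
        chain.reverse()
        if ranks is None:
            return chain
        # Assumption: a requested rank with no matching ancestor yields None.
        by_rank = {t.rank: t for t in chain if t.rank is not None}
        return [by_rank.get(rank) for rank in ranks]


family = TaxonStub('Enterobacteriaceae', 'family')
genus = TaxonStub('Escherichia', 'genus', family)
species = TaxonStub('Escherichia coli', 'species', genus)

print([t.name for t in species.lineage()])
# ['Enterobacteriaceae', 'Escherichia', 'Escherichia coli']
print([t.name if t else None for t in species.lineage(['species', 'genus', 'order'])])
# ['Escherichia coli', 'Escherichia', None]
```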

print_tree(f=None, *, indent='  ', sort_key=None)

Print the taxon’s subtree for debugging.

Parameters:
  • f (Optional[Callable[[Taxon], str]]) – A function which takes a taxon and returns the string representation to print for it. Defaults to short_repr().

  • indent (str) – String used to indent each level of descendants.

  • sort_key (Optional[Callable[[Taxon], Any]]) – A function which takes a taxon and returns a sort key, to determine what order a taxon’s children are printed in. Defaults to the taxon’s name.

root()

Get the root taxon of this taxon’s tree.

The set of taxa in a ReferenceGenomeSet will generally form a forest instead of a single tree, so there can be multiple root taxa.

Returns self if the taxon has no parent.

Return type:

Taxon

short_repr()

Get a short string representation of the Taxon for logging and warning/error messages.

subtree_genomes()

Iterate through all genomes assigned to this taxon or its descendants.

Return type:

Iterable[AnnotatedGenome]

traverse(postorder=False)

Iterate through all nodes in this taxon’s subtree.

Parameters:

postorder (bool) – Iterate in postorder (parents after children) instead of the default preorder (parents before children).

Return type:

Iterable[Taxon]
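The difference between preorder and postorder, and how descendants() and leaves() relate to traverse(), can be shown on a small stand-in tree. TaxonStub and its add() helper are hypothetical; the real methods operate on ORM relationships.

```python
from typing import Iterator, List


class TaxonStub:
    """Hypothetical stand-in for Taxon; only name and children are modeled."""

    def __init__(self, name: str):
        self.name = name
        self.children: List['TaxonStub'] = []

    def add(self, name: str) -> 'TaxonStub':
        child = TaxonStub(name)
        self.children.append(child)
        return child

    def traverse(self, postorder: bool = False) -> Iterator['TaxonStub']:
        # Preorder yields a parent before its children, postorder yields it after.
        if not postorder:
            yield self
        for child in self.children:
            yield from child.traverse(postorder)
        if postorder:
            yield self

    def descendants(self, postorder: bool = False) -> Iterator['TaxonStub']:
        # Same walk as traverse(), minus the starting taxon.
        return (t for t in self.traverse(postorder) if t is not self)

    def leaves(self) -> Iterator['TaxonStub']:
        return (t for t in self.traverse() if not t.children)


genus = TaxonStub('genus')
sp1 = genus.add('sp1')
sp1.add('subsp')
genus.add('sp2')

print([t.name for t in genus.traverse()])                # ['genus', 'sp1', 'subsp', 'sp2']
print([t.name for t in genus.traverse(postorder=True)])  # ['subsp', 'sp1', 'sp2', 'genus']
print([t.name for t in genus.descendants()])             # ['sp1', 'subsp', 'sp2']
print([t.name for t in genus.leaves()])                  # ['subsp', 'sp2']
```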

gambit.db.models.only_genomeset(session)

Get the only ReferenceGenomeSet in a database.

The format used to distribute GAMBIT databases, and expected by the CLI, is an SQLite file containing a single genome set.

Parameters:

session (Session) – ORM session connected to database.

Raises:

RuntimeError – If the database does not contain a single genome set.

Return type:

ReferenceGenomeSet

gambit.db.models.reportable_taxon(taxon)

Find the first reportable taxon in a lineage.

Parameters:

taxon (Optional[Taxon]) – Taxon to start looking from. None values are passed through.

Returns:

Most specific taxon in ancestry with report=True, or None if none found.

Return type:

Optional[Taxon]
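The rule described above (climb the lineage until a taxon with report=True is found, passing None through) can be sketched as follows. TaxonStub and reportable_taxon_sketch are hypothetical names, and the taxon names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaxonStub:
    """Hypothetical stand-in for Taxon; only name, report and parent are modeled."""
    name: str
    report: bool = True
    parent: Optional['TaxonStub'] = None


def reportable_taxon_sketch(taxon: Optional[TaxonStub]) -> Optional[TaxonStub]:
    # Start at the taxon itself and climb until a taxon with report=True is found.
    # A None input, or running off the top of the tree, returns None.
    while taxon is not None and not taxon.report:
        taxon = taxon.parent
    return taxon


species = TaxonStub('Some species')
hidden = TaxonStub('Some species subgroup 1', report=False, parent=species)

print(reportable_taxon_sketch(hidden).name)   # Some species
print(reportable_taxon_sketch(species).name)  # Some species
print(reportable_taxon_sketch(None))          # None
```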

gambit.db.sqla

Custom types and other utilities for SQLAlchemy.

class gambit.db.sqla.JsonString

Bases: TypeDecorator

SQLA column type for JSON data which is stored in the database as a standard string column.

Data is automatically serialized/deserialized when saved/loaded. Important: mutation tracking is not enabled for this type. If the value is a list or dict and you modify it in place, the change will not be detected. Instead, re-assign the attribute.
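The caveat about in-place modification can be illustrated without SQLAlchemy. The sketch below uses a hypothetical RecordStub class that, like an ORM object without mutation tracking, only notices changes made through attribute assignment. It is a simplified model of the behavior, not how SQLAlchemy is implemented.

```python
class RecordStub:
    """Hypothetical stand-in for an ORM object; records which attributes were assigned to."""

    def __init__(self, extra):
        # Bypass __setattr__ so that initial values do not count as changes.
        object.__setattr__(self, 'dirty', set())
        object.__setattr__(self, 'extra', extra)

    def __setattr__(self, name, value):
        # Change detection hooks into attribute assignment only.
        self.dirty.add(name)
        object.__setattr__(self, name, value)


record = RecordStub({'source': 'ncbi'})

# In-place mutation: the dict changes, but no assignment happens, so nothing is flagged.
record.extra['checked'] = True
print(record.dirty)  # set()

# Re-assignment with a new dict: the attribute is flagged as changed.
record.extra = {**record.extra, 'reviewed': True}
print(record.dirty)  # {'extra'}
```

With a real model, the second pattern (build a new dict or list and assign it back to the attribute) is the one the note above recommends.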

class gambit.db.sqla.ReadOnlySession

Bases: Session

Session class that doesn’t allow flushing/committing.

gambit.db.migrate

Perform genome database migrations with Alembic.

This package contains all Alembic configuration and data files. Revision files are located in ./alembic/versions.

Note on Alembic configuration: normal usage of Alembic involves getting the database URL from alembic.ini. Since this application has no fixed location for the database, that method cannot be used. Instead this package follows the “Sharing a Connection with a Series of Migration Commands and Environments” recipe in Alembic’s documentation, where the connectable object is generated programmatically and then attached to the Alembic configuration object’s attributes dict. The run_migrations_offline and run_migrations_online functions in alembic/env.py are modified from the versions generated by alembic init to get their connectable object from this dict instead of creating it based on the contents of alembic.ini. Note that this means (online) migrations cannot be run from the standard alembic CLI command, which gets its connection information only from alembic.ini.

The way to use this setup is instead to create an alembic.config.Config instance with get_alembic_config() and use the functions in alembic.command.

gambit.db.migrate.current_head()

Get the current head revision number.

Return type:

str

gambit.db.migrate.current_revision(connectable)

Get the current revision number of a genome database.

Parameters:

connectable (Connectable) –

Return type:

str

gambit.db.migrate.get_alembic_config(connectable=None, **kwargs)

Get an alembic config object to perform migrations.

Parameters:
  • connectable (Optional[Connectable]) – SQLAlchemy connectable specifying database connection info (optional). Assigned to 'connectable' key of alembic.config.Config.attributes.

  • **kwargs – Keyword arguments to pass to alembic.config.Config.__init__().

Return type:

Alembic config object.

gambit.db.migrate.init_db(connectable)

Initialize the genome database schema by creating all tables and stamping with the latest Alembic revision.

Expects a fresh database that does not already contain any tables for the gambit.db.models models and has not had any migrations run on it yet.

Parameters:

connectable (Connectable) – SQLAlchemy connectable specifying database connection info.

Raises:
  • RuntimeError – If the database is already stamped with an Alembic revision.

  • sqlalchemy.exc.OperationalError – If any of the database tables to be created already exist.

gambit.db.migrate.is_current_revision(connectable)

Check if the current revision of a genome database is the most recent (head) revision.

Parameters:

connectable (Connectable) –

gambit.db.migrate.upgrade(connectable, revision='head', tag=None, **kwargs)

Run the alembic upgrade command.

See alembic.command.upgrade() for more information on how this works.

Parameters:
  • connectable (Connectable) – SQLAlchemy connectable specifying genome database connection info.

  • revision (str) – Revision to upgrade to. Passed to alembic.command.upgrade().

  • tag – Passed to alembic.command.upgrade().

  • **kwargs – Passed to get_alembic_config().