Database

gambit.db

gambit.db.refdb

class gambit.db.refdb.ReferenceDatabase

Bases: object

Object containing reference genomes, their k-mer signatures, and associated data.

This is all that is needed at runtime to run queries.

genomeset

Genome set containing reference genomes.

Type:: gambit.db.models.ReferenceGenomeSet

genomes

List of reference genomes.

Type:: Sequence[gambit.db.models.AnnotatedGenome]

signatures

K-mer signatures for each genome. This is a subtype of ReferenceSignatures, so it contains metadata on the signatures as well as the signatures themselves. The type may represent signatures stored on disk (e.g. HDF5Signatures) instead of in memory. It may contain additional signatures that do not correspond to any genome in genomes.

Type:: gambit.sigs.base.ReferenceSignatures

sig_indices

Index in signatures of the signature corresponding to each genome in genomes. Kept in sorted order to improve performance when iterating over them (improves locality if in memory and avoids seeking if in a file).

Type:: Sequence[int]

session: The SQLAlchemy session genomeset and the elements of genomes belong to. It is important to keep a reference to this, just having references to the ORM objects themselves is not enough to keep the session from being garbage collected.

Parameters:

genomeset (gambit.db.models.ReferenceGenomeSet) –
signatures (gambit.sigs.base.ReferenceSignatures) –

__init__(genomeset, signatures)

Parameters:

genomeset (ReferenceGenomeSet) –
signatures (ReferenceSignatures) –

classmethod load(genomes_file, signatures_file)

Load complete database given paths to SQLite genomes database file and HDF5 signatures file.

Parameters:

genomes_file (Union[str, PathLike]) –
signatures_file (Union[str, PathLike]) –

Return type:

ReferenceDatabase

classmethod load_from_dir(path)

Load complete database given directory containing SQLite genomes database file and HDF5 signatures file.

See locate_files() for how these files are located within the directory.

Raises:: RuntimeError – If files cannot be located in directory.
Parameters:: path (Union[str, PathLike]) –
Return type:: ReferenceDatabase

classmethod locate_files(path)

Locate an SQLite genome database file and HDF5 signatures file in a directory.

Files are located by extension, .gdb or .db for SQLite file and .gs or .h5 for signatures file. Does not look in subdirectories.

Parameters:: path (Union[str, PathLike]) – Path to directory to look within.
Returns:: Paths to the genomes database file and the signatures file.
Raises:: FileNotFoundError – If files could not be located or if multiple files with the same extension exist in the directory.
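
The lookup rules described above can be illustrated with a short standalone sketch. It does not use or reproduce the library's code: locate_files_sketch and _find_single are hypothetical names, and treating more than one candidate of either kind as an error is an assumption about how ambiguity is handled.

```python
from pathlib import Path
from typing import Sequence, Tuple

# Extensions taken from the description above.
GENOMES_EXTENSIONS = (".gdb", ".db")
SIGNATURES_EXTENSIONS = (".gs", ".h5")


def _find_single(directory: Path, extensions: Sequence[str], kind: str) -> Path:
    """Return the single file directly inside `directory` with one of `extensions`."""
    # iterdir() lists only direct children, so subdirectories are not searched.
    matches = sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix in extensions
    )
    if not matches:
        raise FileNotFoundError(f"No {kind} file found in {directory}")
    if len(matches) > 1:
        # Assumption: any ambiguity is reported rather than resolved silently.
        raise FileNotFoundError(f"Multiple {kind} files found in {directory}")
    return matches[0]


def locate_files_sketch(path) -> Tuple[Path, Path]:
    """Hypothetical stand-in: return (genomes file, signatures file) found in `path`."""
    directory = Path(path)
    genomes = _find_single(directory, GENOMES_EXTENSIONS, "genomes database")
    signatures = _find_single(directory, SIGNATURES_EXTENSIONS, "signatures")
    return genomes, signatures
```

Raising on ambiguity instead of picking the first match keeps a directory with a stale copy of the database from being loaded by accident.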

gambit.db.refdb.genomes_by_id(genomeset, id_attr, ids, strict=True)

Match a ReferenceGenomeSet’s genomes to a set of ID values.

This is primarily used to match genomes to signatures based on the ID values stored in a signature file. The signature file may contain signatures for more genomes than are present in the genome set. See genomes_by_id_subset() for that case.

Parameters:

genomeset (ReferenceGenomeSet) –
id_attr (Union[str, InstrumentedAttribute]) – ID attribute of gambit.db.models.Genome to use for lookup. Can be given as the attribute itself (e.g. Genome.refseq_acc) or just its name ('refseq_acc'). See GENOME_IDS for the set of allowed values.
ids (Sequence) – Sequence of ID values (strings or integers, matching type of attribute).
strict (bool) – Raise an exception if a matching genome cannot be found for any ID value.

Returns:

List of genomes of same length as ids. If strict=False and a genome cannot be found for a given ID the list will contain None at the corresponding position.

Return type:

List[Optional[AnnotatedGenome]]

Raises:

KeyError – If strict=True and any ID value cannot be found.
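
The strict and non-strict behavior can be sketched with plain Python objects in place of the ORM models. GenomeRecord and genomes_by_id_sketch are hypothetical names introduced for illustration; the real function takes a ReferenceGenomeSet and resolves the attribute through SQLAlchemy, which this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class GenomeRecord:
    """Hypothetical stand-in for a genome with two ID attributes."""
    key: str
    refseq_acc: str


def genomes_by_id_sketch(
    genomes: Sequence[GenomeRecord],
    id_attr: str,
    ids: Sequence,
    strict: bool = True,
) -> List[Optional[GenomeRecord]]:
    """Return one entry per ID, in the same order as `ids`."""
    # Build the lookup table once so each ID is resolved in constant time.
    by_id = {getattr(g, id_attr): g for g in genomes}
    result: List[Optional[GenomeRecord]] = []
    for id_value in ids:
        genome = by_id.get(id_value)
        if genome is None and strict:
            raise KeyError(id_value)
        # With strict=False an unmatched ID leaves None at its position.
        result.append(genome)
    return result
```

Keeping the output the same length as ids means position i in the result always refers to signature i in the file the IDs came from.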

gambit.db.refdb.genomes_by_id_subset(genomeset, id_attr, ids)

Match a ReferenceGenomeSet’s genomes to a set of ID values, allowing missing genomes.

This calls genomes_by_id() with strict=False and filters any None values from the output. The filtered list is returned along with the indices of all values in ids which were not filtered out. The indices can be used to load only those signatures which have a matched genome from a signature file.

Note that this does not check that every genome in genomeset is matched by an ID. To verify this, compare the length of the returned lists to the number of genomes in genomeset.

Parameters:

genomeset (ReferenceGenomeSet) –
id_attr (Union[str, InstrumentedAttribute]) – ID attribute of gambit.db.models.Genome to use for lookup. Can be given as the attribute itself (e.g. Genome.refseq_acc) or just its name ('refseq_acc'). See GENOME_IDS for the set of allowed values.
ids (Sequence) – Sequence of ID values (strings or integers, matching type of attribute).

Return type:

Tuple[List[AnnotatedGenome], List[int]]
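
A minimal sketch of the filtering step, assuming the genomes have already been indexed by ID in a plain dictionary. match_subset_sketch is a hypothetical name and the string values stand in for AnnotatedGenome objects.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def match_subset_sketch(
    genomes_by_id: Dict[Hashable, str],
    ids: Sequence[Hashable],
) -> Tuple[List[str], List[int]]:
    """Return the matched genomes and the positions in `ids` they came from."""
    matched: List[str] = []
    indices: List[int] = []
    for index, id_value in enumerate(ids):
        genome = genomes_by_id.get(id_value)
        if genome is not None:
            matched.append(genome)
            # Indices are appended in increasing order because `ids` is
            # scanned from start to end.
            indices.append(index)
    return matched, indices


# Illustrative data: the signature file lists three IDs, the genome set has two.
genome_set = {"GCF_1": "genome one", "GCF_3": "genome three"}
signature_ids = ["GCF_1", "GCF_2", "GCF_3"]
matched, indices = match_subset_sketch(genome_set, signature_ids)

# The completeness check mentioned above is left to the caller.
all_genomes_matched = len(matched) == len(genome_set)
```

The two returned lists are parallel: matched[i] is the genome for the signature at position indices[i] in the file.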

gambit.db.refdb.load_genomeset(db_file)

Get the only gambit.db.models.ReferenceGenomeSet from a genomes database file.

Parameters:: db_file (Union[str, PathLike]) –
Return type:: Tuple[Session, ReferenceGenomeSet]

gambit.db.models

SQLAlchemy models for storing reference genomes and taxonomy information.

class gambit.db.models.AnnotatedGenome

Bases: Base

A genome with additional annotations as part of a genome set.

This object serves to attach a genome to a ReferenceGenomeSet, and to assign a taxonomy classification to that genome. Hybrid attributes mirroring the attributes of the connected genome effectively make this behave as an extended Genome object.
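
The idea of mirroring a genome's attributes on the annotating object can be illustrated without SQLAlchemy. The sketch below uses ordinary read-only properties rather than hybrid attributes, so it shows only the instance-level forwarding and not the query-side behavior. The class names and the description field are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GenomeSketch:
    """Stand-in for Genome; the field names are illustrative."""
    key: str
    description: str


@dataclass
class AnnotatedGenomeSketch:
    """Stand-in for AnnotatedGenome: a genome plus per-set annotations."""
    genome: GenomeSketch
    organism: str

    # Each property forwards to the wrapped genome, so callers can read
    # genome attributes directly from the annotated object.
    @property
    def key(self) -> str:
        return self.genome.key

    @property
    def description(self) -> str:
        return self.genome.description
```

With this arrangement, code written against the genome's attributes can accept the annotated object unchanged, while annotations such as organism stay attached to the genome set rather than to the genome itself.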

genome_id

Integer column, part of composite primary key. ID of Genome the annotations are for.

Type:: int

genome_set_id

Integer column, part of composite primary key. ID of the ReferenceGenomeSet the annotations are under.

Type:: int

organism

String column. Single string describing the organism. May be “Genus species [strain]” but could contain more specific information. Intended to be human-readable and shouldn’t have any semantic meaning for the application (in contrast to the taxon relationship).

Type:: str

taxon_id

Integer column. ID of the Taxon this genome is classified as.

Type:: int

genome

Many-to-one relationship to Genome.

Type:: .Genome

genome_set

Many-to-one relationship to ReferenceGenomeSet.

Type:: .ReferenceGenomeSet

taxon

Many-to-one relationship to Taxon. The primary taxon this genome is classified as under the associated ReferenceGenomeSet. Should be the most specific and “regular” (ideally defined on NCBI) taxon this genome belongs to.

Type:: .Taxon

key