K-mer signatures
gambit.seq
Generic code for working with sequence data.
Note that all code in this package operates on DNA sequences as sequences of bytes containing ascii-encoded nucleotide codes.
- gambit.seq.NUCLEOTIDES
bytes corresponding to the four DNA nucleotides: the ASCII-encoded upper-case letters ACGT. The order, while arbitrary, matters because it defines how unique indices are assigned to k-mer sequences.
- class gambit.seq.DNASeq
Type alias for DNA sequence types accepted for k-mer search / signature calculation (str, bytes, bytearray, or Bio.Seq.Seq).
- gambit.seq.parse_seqs(path, format='fasta', compression='auto', **kwargs)
Open a sequence file and lazily parse its contents.
This is essentially a wrapper over BioPython’s Bio.SeqIO.parse() function that transparently handles compressed files.
Returns an iterator over the sequence data in the file. The file is parsed lazily, and so must be kept open. The returned iterator is of type gambit.util.io.ClosingIterator, so it will close the file stream automatically when it finishes. It may also be used as a context manager that closes the stream on exit. You may also close the stream explicitly using the iterator’s close method.
- Parameters:
format (str) – String describing the file format as interpreted by Bio.SeqIO.parse().
compression (str) – String describing the compression method of the file, e.g. 'gzip'. none means no compression. Default is to determine compression automatically (can only detect gzip or none). See gambit.util.io.open_compressed().
kwargs – Keyword arguments to gambit.util.io.open_compressed().
- Returns:
Iterator yielding Bio.SeqIO.SeqRecord instances for each sequence in the file.
- Return type:
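The closing behaviour described for the returned iterator can be illustrated with a minimal sketch. The class ClosingIter and the helper lazy_lines below are hypothetical stand-ins written for this example; they are not gambit's ClosingIterator, and an in-memory text stream is used in place of a sequence file.

```python
import io


class ClosingIter:
    """Hypothetical stand-in for a closing iterator (not gambit's class).

    Wraps an iterator that reads lazily from ``stream`` and closes the
    stream when the iterator is exhausted, on context exit, or on close().
    """

    def __init__(self, iterable, stream):
        self._it = iter(iterable)
        self._stream = stream

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._it)
        except StopIteration:
            self.close()  # Exhausted: release the underlying stream.
            raise

    def close(self):
        self._stream.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()


def lazy_lines(text):
    """Lazily yield stripped lines from an in-memory stream (illustration only)."""
    stream = io.StringIO(text)
    return ClosingIter((line.strip() for line in stream), stream)


records = lazy_lines(">seq1\nACGT\n")
first = next(records)                    # Parsing is lazy, stream still open.
still_open = not records._stream.closed
rest = list(records)                     # Exhausting the iterator closes it.
closed_after = records._stream.closed

with lazy_lines("a\nb\n") as partial:
    head = next(partial)                 # Leaving the block closes the stream.
```

Using the iterator as a context manager is the safer pattern when the loop may exit early, since the stream is then closed even if the iterator is never exhausted.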
- gambit.seq.seq_to_bytes(seq)
Convert generic DNA sequence to byte string representation.
This is for passing sequence data to Cython functions.
- gambit.seq.validate_dna_seq_bytes(seq)
Check that a sequence contains only valid nucleotide codes (upper case).
- Parameters:
seq (bytes) – ASCII-encoded nucleotide sequence.
- Raises:
ValueError – If the sequence contains an invalid nucleotide.
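A minimal sketch of this kind of check, assuming validation means that every byte is one of the four upper-case codes. The function names are hypothetical and the code is not gambit's implementation.

```python
NUCLEOTIDES = b"ACGT"  # The four codes documented for gambit.seq.NUCLEOTIDES.


def validate_dna_bytes_sketch(seq: bytes) -> None:
    """Raise ValueError if ``seq`` has any byte other than upper-case A, C, G, T."""
    allowed = set(NUCLEOTIDES)
    for pos, byte in enumerate(seq):
        if byte not in allowed:
            raise ValueError(f"Invalid nucleotide {bytes([byte])!r} at position {pos}")


def is_valid(seq: bytes) -> bool:
    """Convenience wrapper that turns the exception into a boolean."""
    try:
        validate_dna_bytes_sketch(seq)
    except ValueError:
        return False
    return True
```

Note that lower-case input is rejected here, matching the "upper case" requirement stated above.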
gambit.kmers
Core functions for searching for and working with k-mers.
- class gambit.kmers.KmerMatch
Bases:
object
Represents the location of a k-mer prefix found within a DNA sequence.
- kmerspec
K-mer spec used for search.
- Type:
- __init__(kmerspec, seq, pos, reverse)
Method generated by attrs for class KmerMatch.
- kmer_index()
Get index of matched k-mer.
- Raises:
ValueError – If the k-mer contains invalid nucleotides.
- Return type:
- class gambit.kmers.KmerSpec
Bases:
Jsonable
Specifications for a k-mer search operation.
- prefix
Constant prefix of k-mers to search for, upper-case nucleotide codes as ASCII-encoded bytes.
- Type:
- idx_len
Maximum value (plus one) of the integer needed to index one of the found k-mers. Also the number of possible k-mers fitting the spec. Equal to 4 ** k.
- index_dtype
Smallest unsigned integer dtype that can store k-mer indices.
- Type:
- gambit.kmers.find_kmers(kmerspec, seq)
Locate k-mers with the given prefix in a DNA sequence.
Searches the sequence in both the forward and reverse-complement directions. The sequence may contain invalid characters (anything other than the four nucleotide codes), which will simply not be matched.
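The two-strand search can be sketched as follows. This is an illustration under stated assumptions, not gambit's implementation: it assumes the k-mer is the k nucleotides immediately following the prefix (consistent with idx_len being 4 ** k), it drops matches whose k-mer contains a non-nucleotide byte, and it reports reverse-strand positions relative to the reverse complement. All names are hypothetical.

```python
COMPLEMENT = bytes.maketrans(b"ACGT", b"TGCA")
VALID = set(b"ACGT")


def revcomp(seq: bytes) -> bytes:
    """Reverse complement; bytes other than ACGT are left unchanged."""
    return seq.translate(COMPLEMENT)[::-1]


def find_kmers_sketch(prefix: bytes, k: int, seq: bytes):
    """Yield (kmer, position, is_reverse) for each prefix match on either strand."""
    seq = seq.upper()
    for strand, is_reverse in ((seq, False), (revcomp(seq), True)):
        start = strand.find(prefix)
        while start != -1:
            begin = start + len(prefix)
            kmer = strand[begin:begin + k]
            # Skip matches too close to the end or containing non-ACGT bytes.
            if len(kmer) == k and set(kmer) <= VALID:
                yield kmer, start, is_reverse
            start = strand.find(prefix, start + 1)


matches = list(find_kmers_sketch(b"ATG", 3, b"ATGACGTTTGGGCAT"))
```

In this example the forward strand contributes the k-mer following the leading ATG, and the trailing CAT contributes a second match once the sequence is reverse complemented.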
- gambit.kmers.index_dtype(k)
Get the smallest unsigned integer dtype that can store k-mer indices for the given k.
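The relationship between k and the required index width follows from the 4 ** k possible k-mers noted for idx_len. The sketch below works in plain bit widths rather than Numpy dtypes, and assumes the candidate widths are the standard 8, 16, 32 and 64 bits; the function names are hypothetical.

```python
def nkmers_sketch(k: int) -> int:
    """Number of distinct k-mers over a four-letter alphabet."""
    return 4 ** k


def index_bits_sketch(k: int) -> int:
    """Smallest standard unsigned width holding indices 0 .. 4**k - 1."""
    max_index = nkmers_sketch(k) - 1
    for bits in (8, 16, 32, 64):
        if max_index < 2 ** bits:
            return bits
    raise ValueError(f"k={k} needs more than 64 bits per index")
```

Each nucleotide contributes two bits, so the width steps up after k = 4, 8 and 16, and no 64-bit index exists beyond k = 32.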
- gambit.kmers.kmer_to_index(kmer)
Convert a k-mer to its integer index.
- Raises:
ValueError – If an invalid nucleotide code is encountered.
- Parameters:
kmer (DNASeq)
- Return type:
- gambit.kmers.kmer_to_index_rc(kmer)
Get the integer index of a k-mer’s reverse complement.
- Raises:
ValueError – If an invalid nucleotide code is encountered.
- Parameters:
kmer (DNASeq)
- Return type:
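One way to picture the two index functions is to read a k-mer as a base-4 number whose digits are given by the order of NUCLEOTIDES. The sketch below assumes the first nucleotide is the most significant digit; that ordering is an assumption of this example, and the functions are hypothetical rather than gambit's own.

```python
NUCLEOTIDES = b"ACGT"
COMPLEMENT = bytes.maketrans(b"ACGT", b"TGCA")


def kmer_to_index_sketch(kmer: bytes) -> int:
    """Read the k-mer as a base-4 number (first nucleotide most significant)."""
    index = 0
    for byte in kmer:
        digit = NUCLEOTIDES.find(bytes([byte]))
        if digit < 0:
            raise ValueError(f"Invalid nucleotide code: {bytes([byte])!r}")
        index = index * 4 + digit
    return index


def kmer_to_index_rc_sketch(kmer: bytes) -> int:
    """Index of the reverse complement, built explicitly for clarity."""
    return kmer_to_index_sketch(kmer.translate(COMPLEMENT)[::-1])
```

Under this scheme the all-A k-mer maps to 0 and the all-T k-mer maps to 4 ** k - 1, so every valid k-mer receives a distinct index in that range.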
- gambit.kmers.nkmers(k)
Get the number of possible distinct k-mers for a given value of k.
- gambit.kmers.DEFAULT_KMERSPEC = KmerSpec(11, 'ATGAC')
Default settings for k-mer search
gambit.sigs
Calculate and store collections of k-mer signatures.
gambit.sigs.base
- class gambit.sigs.base.AbstractSignatureArray
Bases:
Sequence[KmerSignature]
Abstract base class for types which behave as a (non-mutable) sequence of k-mer signatures (k-mer sets in sparse coordinate format).
The signature data itself may already be present in memory or may be loaded lazily from the file system when the object is indexed.
Elements should be Numpy arrays with integer data type. Implementations should support numpy-style advanced indexing, see gambit.util.indexing.AdvancedIndexingMixin. Slicing and advanced indexing should return another instance of AbstractSignatureArray.
- kmerspec
K-mer spec used to calculate signatures.
- Type:
gambit.kmers.KmerSpec | None
- dtype
Numpy data type of signatures.
- Type:
- __eq__(other)
Compare two AbstractSignatureArray instances for equality.
Two instances are considered equal if they are equivalent as sequences (see sigarray_eq()) and have the same kmerspec.
- sizeof(index)
Get the size/length of the signature at the given index.
It should be the case that sigarray.sizeof(i) == len(sigarray[i]).
- exception gambit.sigs.base.SignaturesFileError
Bases:
Exception
Indicates an error attempting to open a signatures file.
- class gambit.sigs.base.AnnotatedSignatures
Bases:
ReferenceSignatures
Wrapper around a signature array which adds ids and meta attributes.
- __init__(signatures, ids=None, meta=None)
- Parameters:
signatures (AbstractSignatureArray) – Signature array to wrap.
ids (Sequence | None) – Unique IDs for signatures. Defaults to consecutive integers starting from zero.
meta (SignaturesMeta | None) – Additional metadata describing signatures.
- class gambit.sigs.base.ConcatenatedSignatureArray
Bases:
AdvancedIndexingMixin, AbstractSignatureArray
Base class for signature arrays which store signatures in a single data array.
- values
K-mer signatures concatenated into single numpy-like array.
- bounds
Numpy-like array storing indices bounding each individual k-mer signature in values. The ith signature is at values[bounds[i]:bounds[i + 1]].
- sizeof(index)
Get the size/length of the signature at the given index.
It should be the case that sigarray.sizeof(i) == len(sigarray[i]).
- Parameters:
index – Index of signature in array.
- sizes()
Get the sizes of all signatures in the array.
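The values/bounds layout can be sketched with plain Python lists. The class below is a toy illustration, not gambit's implementation: it uses lists instead of Numpy-like arrays and handles only non-negative integer indices.

```python
from itertools import accumulate


class ConcatenatedSketch:
    """Toy illustration of values/bounds storage using plain lists."""

    def __init__(self, signatures):
        sizes = [len(sig) for sig in signatures]
        # bounds has one more entry than there are signatures and starts at 0.
        self.bounds = [0, *accumulate(sizes)]
        self.values = [x for sig in signatures for x in sig]

    def __len__(self):
        return len(self.bounds) - 1

    def __getitem__(self, i):
        return self.values[self.bounds[i]:self.bounds[i + 1]]

    def sizeof(self, i):
        # The size follows from the bounds alone, without touching values.
        return self.bounds[i + 1] - self.bounds[i]

    def sizes(self):
        return [self.sizeof(i) for i in range(len(self))]


sigs = ConcatenatedSketch([[1, 5, 9], [], [2, 3]])
```

Because sizes are differences of adjacent bounds, they can be computed without reading any signature data, which is convenient when the values array is loaded lazily.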
- class gambit.sigs.base.KmerSignature
Type for k-mer signatures (k-mer sets in sparse coordinate format)
alias of
ndarray
- class gambit.sigs.base.ReferenceSignatures
Bases:
AbstractSignatureArray
Base class for an array of reference genome signatures plus metadata.
This contains the extra data needed for the signatures to be used for running queries.
- ids
Array of unique string or integer IDs for each signature. Length should be equal to the length of the ReferenceSignatures object.
- Type:
Sequence
- meta
Other metadata describing signatures.
- class gambit.sigs.base.SignatureArray
Bases:
ConcatenatedSignatureArray
Stores a collection of k-mer signatures in a single contiguous Numpy array.
This format enables the calculation of many Jaccard scores in parallel, see gambit.metric.jaccarddist_array().
Numpy-style indexing with an array of integers or bools is supported and will return another SignatureArray. If indexed with a contiguous slice, the values of the returned array will be a view of the original instead of a copy.
- values
K-mer signatures concatenated into single Numpy array.
- Type:
- bounds
Array storing indices bounding each individual k-mer signature in values. The ith signature is at values[bounds[i]:bounds[i + 1]].
- Type:
- __init__(signatures, kmerspec=None, dtype=None)
- Parameters:
signatures (Sequence[KmerSignature]) – Sequence of k-mer signatures.
kmerspec (KmerSpec | None) – K-mer spec used to calculate signatures. If None, it is taken from signatures if that is an AbstractSignatureArray instance.
dtype (dtype | None) – Numpy dtype of the values array. If None, the dtype of the first element of signatures is used.
- classmethod from_arrays(values, bounds, kmerspec)
Create directly from values and bounds arrays.
- Parameters:
- Return type:
- classmethod uninitialized(lengths, kmerspec, dtype=None)
Create with an uninitialized values array.
- Parameters:
- Return type:
- class gambit.sigs.base.SignatureList
Bases:
AdvancedIndexingMixin, AbstractSignatureArray, MutableSequence[KmerSignature]
Stores a collection of k-mer signatures in a standard Python list.
Compared to SignatureArray, this is less efficient for calculating Jaccard scores, but it supports mutation and does not have to copy signatures to a new array on creation.
- __init__(signatures, kmerspec=None, dtype=None)
- Parameters:
signatures (Iterable[KmerSignature]) – Iterable of k-mer signatures.
kmerspec (KmerSpec | None) – K-mer spec used to calculate signatures. If None, it is taken from signatures if that is an AbstractSignatureArray instance.
dtype (dtype | None) – Numpy dtype of signatures. If None, the dtype of the first element of signatures is used.
- insert(i, sig)
S.insert(index, value) – insert value before index
- Parameters:
i (int)
sig (KmerSignature)
- class gambit.sigs.base.SignaturesMeta
Bases:
object
Metadata describing a set of k-mer signatures.
All attributes are optional.
- id_attr
Name of the Genome attribute the IDs correspond to (see ID_ATTRS). Optional, but the signature set cannot be used as a reference for queries without it.
- Type:
str | None
- extra
Extra arbitrary metadata. Should be a dict or other mapping which can be converted to JSON.
- Type:
Mapping[str, Any]
- __init__(*, id=None, name=None, version=None, id_attr=None, description=None, extra=NOTHING)
Method generated by attrs for class SignaturesMeta.
- gambit.sigs.base.dump_signatures(path, signatures, format='hdf5', **kw)
Write k-mer signatures and associated metadata to a file.
- Parameters:
path (FilePath) – File to write to.
signatures (AbstractSignatureArray) – Array of signatures to store.
format (str) – Format to use. Currently the only valid value is ‘hdf5’.
**kw – Additional keyword arguments depending on format.
- gambit.sigs.base.load_signatures(path, **kw)
Load signatures from file.
Currently the only format used to store signatures is the one in gambit.sigs.hdf5, but there may be more in the future. The format should be determined automatically.
- Parameters:
path (FilePath) – File to open.
**kw – Additional keyword arguments to h5py.File().
- Return type:
- gambit.sigs.base.sigarray_eq(a1, a2)
Check two sequences of sparse k-mer signatures for equality.
Unlike AbstractSignatureArray.__eq__(), this works on any sequence type containing signatures and does not use the AbstractSignatureArray.kmerspec attribute.
- Parameters:
a1 (Sequence[KmerSignature])
a2 (Sequence[KmerSignature])
- Return type:
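Equality "as sequences" can be read as equal length plus element-wise equal signatures. The following is a minimal sketch of that reading using plain Python sequences in place of Numpy arrays; the function name is hypothetical and the exact comparison gambit performs may differ.

```python
def sigarray_eq_sketch(a1, a2) -> bool:
    """True if both sequences have the same length and pairwise-equal signatures."""
    if len(a1) != len(a2):
        return False
    # Compare contents, not container types, so lists and tuples can match.
    return all(list(s1) == list(s2) for s1, s2 in zip(a1, a2))
```

No k-mer spec is consulted, which mirrors the distinction drawn above between this function and AbstractSignatureArray.__eq__().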
gambit.sigs.calc
Calculate k-mer signatures from sequence data.
- class gambit.sigs.calc.ArrayAccumulator
Bases:
KmerAccumulator
K-mer accumulator implemented as a dense boolean array.
This is pretty efficient for smaller values of k, but time and space requirements increase exponentially with larger values.
- clear()
This is slow (creates N new iterators!) but effective.
- signature()
Get signature for accumulated k-mers.
- Return type:
- class gambit.sigs.calc.KmerAccumulator
Bases:
MutableSet[int]
Base class for data structures which track k-mers as they are found in sequences.
Implements the MutableSet interface for k-mer indices. Indices are added via the add or add_kmer() methods; when finished, a sparse k-mer signature can be obtained from signature().
- add_kmer(kmer)
Add a k-mer by its sequence rather than its index.
Argument may contain invalid (non-nucleotide) bytes, in which case it is ignored.
- Parameters:
kmer (bytes)
- abstract signature()
Get signature for accumulated k-mers.
- Return type:
- class gambit.sigs.calc.SetAccumulator
Bases:
KmerAccumulator
Accumulator which uses the builtin Python set class.
This has more overhead than the array version for smaller values of k but behaves much better asymptotically.
- clear()
This is slow (creates N new iterators!) but effective.
- signature()
Get signature for accumulated k-mers.
- Return type:
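The trade-off between the dense and set-based accumulators can be sketched as below. These classes are hypothetical simplifications: they implement only add and signature rather than the full MutableSet interface, and they return sorted Python lists instead of Numpy arrays.

```python
class ArrayAccumulatorSketch:
    """Dense variant: one flag per possible k-mer index (4 ** k flags)."""

    def __init__(self, k: int):
        self.flags = bytearray(4 ** k)  # Allocated up front, grows as 4 ** k.

    def add(self, index: int) -> None:
        self.flags[index] = 1

    def signature(self) -> list:
        # Scanning the flags in order yields the indices already sorted.
        return [i for i, flag in enumerate(self.flags) if flag]


class SetAccumulatorSketch:
    """Sparse variant: memory grows with the number of k-mers actually seen."""

    def __init__(self, k: int):
        self.indices = set()

    def add(self, index: int) -> None:
        self.indices.add(index)

    def signature(self) -> list:
        return sorted(self.indices)  # Sorting cost is paid once at the end.


dense = ArrayAccumulatorSketch(3)
sparse = SetAccumulatorSketch(3)
for idx in (40, 7, 40, 0):  # Duplicates are absorbed by both variants.
    dense.add(idx)
    sparse.add(idx)
```

Both variants produce the same sorted signature; they differ only in how memory and time scale with k versus the number of distinct k-mers found.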
- gambit.sigs.calc.accumulate_kmers(accumulator, kmerspec, seq)
Find k-mer matches in sequence and add their indices to an accumulator.
- Parameters:
accumulator (KmerAccumulator)
kmerspec (KmerSpec)
seq (DNASeq)
- gambit.sigs.calc.calc_file_signature(kspec, seqfile, *, accumulator=None)
Open a sequence file on disk and calculate its k-mer signature.
- Parameters:
kspec (KmerSpec) – Spec for k-mer search.
accumulator (KmerAccumulator | None) – TODO
- Returns:
K-mer signature in sparse coordinate format (dtype will match dense_to_sparse()).
- Return type:
See also
- gambit.sigs.calc.calc_file_signatures(kspec, files, progress=None, concurrency='processes', max_workers=None, executor=None)
Parse and calculate k-mer signatures for multiple sequence files.
- Parameters:
kspec (KmerSpec) – Spec for k-mer search.
files – Files to read.
progress – Display a progress meter. See gambit.util.progress.get_progress() for allowed values.
concurrency (str | None) – Process files concurrently. "processes" for process-based (default), "threads" for threads-based, None for no concurrency.
max_workers (int | None) – Number of worker threads/processes to use if concurrency is not None.
executor (Executor | None) – Instance of concurrent.futures.Executor to use for concurrency. Overrides the concurrency and max_workers arguments.
- Return type:
See also
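How the concurrency, max_workers and executor arguments might interact can be sketched with the standard library alone. The helper map_files_sketch is hypothetical and stands in for the per-file work with an arbitrary function; it is not how gambit dispatches its workers.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def map_files_sketch(func, files, concurrency="processes", max_workers=None, executor=None):
    """Apply ``func`` to each file, returning results in input order.

    A caller-supplied ``executor`` takes precedence over ``concurrency`` and
    ``max_workers`` and is left open for the caller to shut down.
    """
    files = list(files)
    if executor is not None:
        return list(executor.map(func, files))
    if concurrency is None:
        return [func(f) for f in files]
    if concurrency == "threads":
        pool_cls = ThreadPoolExecutor
    elif concurrency == "processes":
        pool_cls = ProcessPoolExecutor
    else:
        raise ValueError(f"Unknown concurrency mode: {concurrency!r}")
    with pool_cls(max_workers=max_workers) as pool:
        return list(pool.map(func, files))


names = ["a.fa", "bb.fa", "ccc.fa"]
serial = map_files_sketch(len, names, concurrency=None)
threaded = map_files_sketch(len, names, concurrency="threads", max_workers=2)
```

With process-based concurrency the function and its arguments must be picklable, which is one reason a thread pool or a pre-configured executor can be the more convenient choice in interactive sessions.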
- gambit.sigs.calc.calc_signature(kmerspec, seqs, *, accumulator=None)
Calculate the k-mer signature of a DNA sequence or set of sequences.
Searches sequences in both the forward and reverse-complement directions. Sequences may contain invalid characters (anything other than the four nucleotide codes), which will simply not be matched.
- Parameters:
kmerspec (KmerSpec) – K-mer spec to use for search.
seqs (DNASeq | Iterable[DNASeq]) – Sequence or sequences to search within. Lowercase characters are OK.
accumulator (KmerAccumulator | None) – TODO
- Returns:
K-mer signature in sparse coordinate format. Data type will be kmerspec.index_dtype.
- Return type:
See also
- gambit.sigs.calc.default_accumulator(k)
Get a default k-mer accumulator instance for the given value of k.
Returns an ArrayAccumulator for k <= 11 and a SetAccumulator for k > 11.
- Parameters:
k (int)
- Return type:
- gambit.sigs.calc.dense_to_sparse(vec)
Convert k-mer set from dense bit vector to sparse coordinate representation.
- Parameters:
vec (Sequence[bool]) – Boolean vector indicating which k-mers are present.
- Returns:
Sorted array of coordinates of k-mers present in vector. Data type will be numpy.intp.
- Return type:
See also
- gambit.sigs.calc.sparse_to_dense(k_or_kspec, coords)
Convert k-mer set from sparse coordinate representation back to dense bit vector.
- Parameters:
k_or_kspec (int | KmerSpec) – Value of k or a KmerSpec instance.
coords (KmerSignature) – Sparse coordinate array.
- Returns:
Dense k-mer bit vector.
- Return type:
See also
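The two representations and the conversion between them can be sketched with plain Python lists. The functions below are hypothetical illustrations that take k directly and return lists rather than Numpy arrays.

```python
def dense_to_sparse_sketch(vec):
    """Sorted list of positions at which ``vec`` is truthy."""
    return [i for i, present in enumerate(vec) if present]


def sparse_to_dense_sketch(k: int, coords):
    """Boolean list of length 4 ** k with True at each coordinate."""
    dense = [False] * (4 ** k)
    for index in coords:
        dense[index] = True
    return dense


coords = [0, 5, 15]
dense_vec = sparse_to_dense_sketch(2, coords)
round_trip = dense_to_sparse_sketch(dense_vec)
```

The dense form always has 4 ** k entries regardless of how many k-mers are present, which is why the sparse coordinate form is the one used for stored signatures.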
gambit.sigs.hdf5
Store k-mer signature sets in HDF5 format.
- class gambit.sigs.hdf5.HDF5Signatures
Bases:
ConcatenatedSignatureArray, ReferenceSignatures
Stores a set of k-mer signatures and associated metadata in an HDF5 group.
Inherits from gambit.sigs.base.AbstractSignatureArray, so behaves as a sequence of k-mer signatures supporting Numpy-style advanced indexing.
Behaves as a context manager which yields itself on enter and closes the underlying HDF5 file object on exit. The __bool__() method can be used to check whether the file is currently open and valid.
- group
HDF5 group object data is read from.
- Type:
h5py._hl.group.Group
- Parameters:
group (h5py._hl.group.Group) – Open, readable h5py.Group or h5py.File object.
- __bool__()
Check whether the underlying HDF5 file object is open.
- __init__(group)
- Parameters:
group (Group)
- close()
Close the underlying HDF5 file.
- classmethod create(group, signatures, *, compression=None, compression_opts=None)
Store k-mer signatures and associated metadata in an HDF5 group.
- Parameters:
group (Group) – HDF5 group to store data in.
signatures (AbstractSignatureArray) – Array of signatures to store. If an instance of gambit.sigs.base.ReferenceSignatures, its metadata will be stored as well; otherwise default/empty values will be used.
compression (str | None) – Compression type for values array. One of ['gzip', 'lzf', 'szip']. See the section on compression filters in h5py’s documentation.
compression_opts – Sets compression level (0-9) for gzip compression, no effect for other types.
- Return type:
- gambit.sigs.hdf5.dump_signatures_hdf5(path, signatures, **kw)
Write k-mer signatures and associated metadata to an HDF5 file.
- Parameters:
path (FilePath) – File to write to.
signatures (AbstractSignatureArray) – Array of signatures to store.
**kw – Additional keyword arguments to HDF5Signatures.create().
- gambit.sigs.hdf5.empty_to_none(value)
Convert h5py.Empty instances to None, passing other types through.
- gambit.sigs.hdf5.load_signatures_hdf5(path, **kw)
Open HDF5 signature file.
- Parameters:
path (FilePath) – File to open.
**kw – Additional keyword arguments to h5py.File().
- Return type:
- gambit.sigs.hdf5.none_to_empty(value, dtype)
Convert None values to h5py.Empty, passing other types through.
- Parameters:
dtype (dtype)
- gambit.sigs.hdf5.read_metadata(group)
Read signature set metadata from HDF5 group attributes.
- Parameters:
group (Group)
- Return type:
- gambit.sigs.hdf5.write_metadata(group, meta)
Write signature set metadata to HDF5 group attributes.
- Parameters:
group (Group)
meta (SignaturesMeta)
- gambit.sigs.hdf5.CURRENT_FMT_VERSION = 1
Current version of the data format. Integer which should be incremented each time the format changes.
- gambit.sigs.hdf5.FMT_VERSION_ATTR = 'gambit_signatures_version'
Name of HDF5 group attribute which both stores the format version and also identifies the group as containing signature data.