K-mer signatures

gambit.seq

Generic code for working with sequence data.

Note that all code in this package operates on DNA sequences as sequences of bytes containing ASCII-encoded nucleotide codes.

gambit.seq.NUCLEOTIDES

Bytes corresponding to the four DNA nucleotides: the ASCII-encoded upper-case letters ACGT. The order, while arbitrary, is significant because it defines how unique indices are assigned to k-mer sequences.

gambit.seq.revcomp(seq: bytes) → bytes

Get the reverse complement of a nucleotide sequence.

Parameters:

seq (bytes) – ASCII-encoded nucleotide sequence. Case does not matter.

Returns:

Reverse complement sequence. All characters in the input which are not valid nucleotide codes will appear unchanged in the corresponding reverse position.

Return type:

bytes
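
The behaviour described above can be illustrated with a minimal pure-Python sketch. This is not gambit's implementation: the name revcomp_sketch is hypothetical, and preserving the case of the input is an assumption (the documentation only states that case does not matter).

```python
# Translation table for the four nucleotide codes in both cases.
# Bytes that are not in the table (e.g. b"N") map to themselves.
_COMPLEMENT = bytes.maketrans(b"ACGTacgt", b"TGCAtgca")


def revcomp_sketch(seq: bytes) -> bytes:
    """Return the reverse complement of an ASCII-encoded sequence."""
    # Complement every byte, then reverse the order of the bytes.
    return seq.translate(_COMPLEMENT)[::-1]
```

With this sketch, an invalid code such as N stays unchanged and ends up at the mirrored position of the output.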

class gambit.seq.SequenceFile

Bases: PathLike

A reference to a DNA sequence file stored in the file system.

Contains all the information needed to read and parse the file. Implements the os.PathLike interface, so it can be substituted for a str or pathlib.Path in most function arguments that take a file path to open.

Parameters:
  • path (Union[os.PathLike, str]) – Value of path attribute. May be string or path-like object.

  • format (str) – Value of format attribute.

  • compression (Optional[str]) – Value of compression attribute.

path

Path to the file.

Type:

pathlib.Path

format

String describing the file format as interpreted by Bio.SeqIO.parse(), e.g. 'fasta'.

Type:

str

compression

String describing compression method of the file, e.g. 'gzip'. None means no compression. See gambit.util.io.open_compressed().

Type:

Optional[str]

__init__(path, format, compression=None)

Method generated by attrs for class SequenceFile.

Parameters:
  • format (str) –

  • compression (Optional[str]) –

Return type:

None

absolute()

Make a copy of the instance with an absolute path.

Return type:

SequenceFile

classmethod from_paths(paths, format, compression=None)

Create many instances at once from a collection of paths and a single format and compression type.

Parameters:
  • paths (Iterable[Union[str, PathLike]]) – Collection of paths as strings or path-like objects.

  • format (str) – Sequence file format of files.

  • compression (Optional[str]) – Compression method of files.

Return type:

List[SequenceFile]

open(mode='r', **kwargs)

Open a stream to the file, with compression/decompression applied transparently.

Parameters:
  • mode (str) – Same as the equivalent argument to the built-in open(). Some modes may not be supported by all compression types.

  • **kwargs – Additional text-mode-specific keyword arguments to pass to the opener. Equivalent to the following arguments of the built-in open(): encoding, errors, and newline. May not be supported by all compression types.

Returns:

Stream to file in given mode.

Return type:

IO

parse(**kwargs)

Open the file and lazily parse its contents.

Returns iterator over sequence data in file. File is parsed lazily, and so must be kept open. The returned iterator is of type gambit.util.io.ClosingIterator so it will close the file stream automatically when it finishes. It may also be used as a context manager that closes the stream on exit. You may also close the stream explicitly using the iterator’s close method.

Parameters:

**kwargs – Keyword arguments to open().

Returns:

Iterator yielding Bio.SeqRecord.SeqRecord instances for each sequence in the file.

Return type:

gambit.util.io.ClosingIterator
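
The closing behaviour described for parse() can be sketched as follows. This is a generic illustration of the pattern, not the code of gambit.util.io.ClosingIterator; the class name ClosingIteratorSketch and the use of an in-memory io.StringIO stream in place of a real file are assumptions.

```python
import io


class ClosingIteratorSketch:
    """Iterator that closes an associated stream once it is exhausted."""

    def __init__(self, iterable, closable):
        self._iterator = iter(iterable)
        self._closable = closable

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._iterator)
        except StopIteration:
            # Nothing left to read: release the stream automatically.
            self.close()
            raise

    def close(self):
        """Close the underlying stream explicitly."""
        self._closable.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        # Leaving the "with" block closes the stream, even if
        # iteration stopped early.
        self.close()


# Stand-in for an opened sequence file.
stream = io.StringIO(">seq1\nACGT\n")
lines = list(ClosingIteratorSketch((line.strip() for line in stream), stream))
```

After the list() call has consumed the iterator, the stream is closed without any explicit call to close().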

gambit.seq.seq_to_bytes(seq)

Convert generic DNA sequence to byte string representation.

This is for passing sequence data to Cython functions.

Parameters:

seq (Union[str, bytes, bytearray, Seq]) –

Return type:

Union[bytes, bytearray]

gambit.seq.validate_dna_seq_bytes(seq)

Check that a sequence contains only valid nucleotide codes (upper case).

Parameters:

seq (bytes) – ASCII-encoded nucleotide sequence.

Raises:

ValueError – If the sequence contains an invalid nucleotide.
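
A minimal sketch of such a check, assuming that only the four upper-case codes in gambit.seq.NUCLEOTIDES are accepted. The function name and the error message are hypothetical.

```python
# Byte values of the four valid upper-case nucleotide codes.
VALID_NUCLEOTIDES = frozenset(b"ACGT")


def validate_dna_seq_bytes_sketch(seq: bytes) -> None:
    """Raise ValueError at the first byte that is not A, C, G or T."""
    for position, byte in enumerate(seq):
        if byte not in VALID_NUCLEOTIDES:
            raise ValueError(
                f"Invalid nucleotide {bytes([byte])!r} at position {position}"
            )
```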

gambit.seq.DNASeq

Union of DNA sequence types accepted for k-mer search / signature calculation.

alias of Union[str, bytes, bytearray, Seq]

gambit.seq.DNASeqBytes

Sequence types accepted directly by native (Cython) code.

alias of Union[bytes, bytearray]

gambit.kmers

Core functions for searching for and working with k-mers.

gambit.kmers.index_to_kmer(index: int, kmer: int) → bytes

Convert k-mer index to sequence.

class gambit.kmers.KmerMatch

Bases: object

Represents a single k-mer match found by a search within a sequence.

kmerspec

K-mer spec used for search.

Type:

gambit.kmers.KmerSpec

seq

The sequence searched within.

Type:

Union[str, bytes, bytearray, Bio.Seq.Seq]

pos

Index of first nucleotide of prefix in seq.

Type:

int

reverse

Whether the match is on the reverse strand.

Type:

bool

__init__(kmerspec, seq, pos, reverse)

Method generated by attrs for class KmerMatch.

Parameters:
  • kmerspec (KmerSpec) –

  • seq (Union[str, bytes, bytearray, Seq]) –

  • pos (int) –

  • reverse (bool) –

Return type:

None

full_indices()

Index range for prefix plus k-mer in sequence.

Return type:

slice

kmer()

Get matched k-mer sequence.

Return type:

bytes

kmer_index()

Get index of matched k-mer.

Raises:

ValueError – If the k-mer contains invalid nucleotides.

Return type:

int

kmer_indices()

Index range for k-mer in sequence (without prefix).

Return type:

slice

class gambit.kmers.KmerSpec

Bases: Jsonable

Specifications for a k-mer search operation.

k

Number of nucleotides in k-mer after prefix.

Type:

int

prefix

Constant prefix of k-mers to search for, upper-case nucleotide codes as ASCII-encoded bytes.

Type:

bytes

prefix_str

Prefix as string.

Type:

str

prefix_len

Number of nucleotides in prefix.

Type:

int

total_len

Sum of prefix_len and k.

Type:

int

idx_len

Maximum value (plus one) of integer needed to index one of the found k-mers. Also the number of possible k-mers fitting the spec. Equal to 4 ** k.

index_dtype

Smallest unsigned integer dtype that can store k-mer indices.

Type:

numpy.dtype

__init__(k, prefix)

Parameters:
  • k (int) – Value of k attribute.

  • prefix (Union[str, bytes, bytearray, Seq]) – Value of prefix attribute. Will be converted to bytes.

gambit.kmers.find_kmers(kmerspec, seq)

Locate k-mers with the given prefix in a DNA sequence.

Searches the sequence in both the forward direction and backwards (as the reverse complement). The sequence may contain invalid characters (not one of the four nucleotide codes), which will simply not be matched.

Parameters:
  • kmerspec (KmerSpec) – K-mer spec to use for search.

  • seq (Union[str, bytes, bytearray, Seq]) – Sequence to search within. Lowercase characters are OK and will be matched as uppercase.

Returns:

Iterator of KmerMatch objects.

Return type:

Iterator[KmerMatch]
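
The search strategy can be sketched in pure Python. This is only an illustration under stated assumptions, not gambit's implementation: the function name is hypothetical, it takes the prefix and k directly instead of a KmerSpec, and it yields plain (kmer, reverse) tuples rather than KmerMatch objects.

```python
# Complement table for upper-case codes; other bytes are left unchanged.
_COMPLEMENT = bytes.maketrans(b"ACGT", b"TGCA")


def find_kmers_sketch(prefix: bytes, k: int, seq: bytes):
    """Yield (kmer, reverse) for each prefix occurrence on either strand."""
    forward = seq.upper()  # lower-case input is matched as upper case
    reverse = forward.translate(_COMPLEMENT)[::-1]
    for strand, is_reverse in ((forward, False), (reverse, True)):
        start = strand.find(prefix)
        while start != -1:
            kmer_start = start + len(prefix)
            kmer = strand[kmer_start:kmer_start + k]
            # Skip occurrences too close to the end to hold a full k-mer.
            if len(kmer) == k:
                yield kmer, is_reverse
            # Continue one position later so overlapping hits are found.
            start = strand.find(prefix, start + 1)
```

Searching the reverse complement with the same prefix is what allows a single pass of the same logic to cover both strands.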

gambit.kmers.index_dtype(k)

Get the smallest unsigned integer dtype that can store k-mer indices for the given k.

Parameters:

k (int) –

Return type:

dtype
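
Because there are 4 ** k possible k-mers, the largest index is 4 ** k - 1, which needs 2 * k bits. The sketch below returns the name of the smallest standard unsigned integer type that fits; the function name is hypothetical, and raising ValueError for k above 32 is an assumption about how an out-of-range value might be handled.

```python
def index_dtype_name_sketch(k: int) -> str:
    """Name of the smallest unsigned type holding indices 0 .. 4**k - 1."""
    max_index = 4 ** k - 1
    for bits in (8, 16, 32, 64):
        if max_index < 2 ** bits:
            return f"uint{bits}"
    raise ValueError(f"k={k} is too large for a 64-bit index")
```

The returned name can be passed to numpy.dtype() to obtain the corresponding dtype object.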

gambit.kmers.kmer_to_index(kmer)

Convert a k-mer to its integer index.

Raises:

ValueError – If an invalid nucleotide code is encountered.

Parameters:

kmer (Union[str, bytes, bytearray, Seq]) –

Return type:

int
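
One way to assign such indices is to read the k-mer as a base-4 number whose digits follow the order of gambit.seq.NUCLEOTIDES. The sketch below assumes that the first nucleotide is the most significant digit and that only upper-case input is accepted; neither detail is confirmed by this documentation, and both function names are hypothetical.

```python
NUCLEOTIDES = b"ACGT"  # digit values: A=0, C=1, G=2, T=3


def kmer_to_index_sketch(kmer: bytes) -> int:
    """Interpret a k-mer as a base-4 integer."""
    index = 0
    for byte in kmer:
        digit = NUCLEOTIDES.find(bytes([byte]))
        if digit < 0:
            raise ValueError(f"Invalid nucleotide code: {bytes([byte])!r}")
        index = index * 4 + digit
    return index


def index_to_kmer_sketch(index: int, k: int) -> bytes:
    """Inverse of kmer_to_index_sketch for a k-mer of length k."""
    digits = []
    for _ in range(k):
        # Peel off the least significant base-4 digit each round.
        index, digit = divmod(index, 4)
        digits.append(NUCLEOTIDES[digit])
    return bytes(reversed(digits))
```

Under these assumptions every k-mer of length k maps to a distinct integer in the range 0 to 4 ** k - 1, which is why the order of NUCLEOTIDES matters.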

gambit.kmers.kmer_to_index_rc(kmer)

Get the integer index of a k-mer’s reverse complement.

Raises:

ValueError – If an invalid nucleotide code is encountered.

Parameters:

kmer (Union[str, bytes, bytearray, Seq]) –

Return type:

int

gambit.kmers.nkmers(k)

Get the number of possible distinct k-mers for a given value of k.

Parameters:

k (int) –

Return type:

int

gambit.kmers.DEFAULT_KMERSPEC = KmerSpec(11, 'ATGAC')

Default settings for k-mer search

gambit.sigs

Calculate and store collections of k-mer signatures.

gambit.sigs.base

class gambit.sigs.base.AbstractSignatureArray

Bases: Sequence[KmerSignature]

Abstract base class for types which behave as a (non-mutable) sequence of k-mer signatures (k-mer sets in sparse coordinate format).

The signature data itself may already be present in memory or may be loaded lazily from the file system when the object is indexed.

Elements should be Numpy arrays with integer data type. Should implement numpy-style advanced indexing, see gambit.util.indexing.AdvancedIndexingMixin. Slicing and advanced indexing should return another instance of AbstractSignatureArray.

kmerspec

K-mer spec used to calculate signatures.

Type:

Optional[gambit.kmers.KmerSpec]

dtype

Numpy data type of signatures.

Type:

numpy.dtype

__eq__(other)

Compare two AbstractSignatureArray instances for equality.

Two instances are considered equal if they are equivalent as sequences (see sigarray_eq()) and have the same kmerspec.

sizeof(index)

Get the size/length of the signature at the given index.

Should be the case that

sigarray.sizeof(i) == len(sigarray[i])

Parameters:

index (int) – Index of signature in array.

Return type:

int

sizes()

Get the sizes of all signatures in the array.

Return type:

Sequence[int]

class gambit.sigs.base.AnnotatedSignatures

Bases: ReferenceSignatures

Wrapper around a signature array which adds id and meta attributes.

__init__(signatures, ids=None, meta=None)

Parameters:
  • signatures (AbstractSignatureArray) – Signature array to wrap.

  • ids (Optional[Sequence]) – Unique IDs for signatures. Defaults to consecutive integers starting from zero.

  • meta (Optional[SignaturesMeta]) – Additional metadata describing signatures.

class gambit.sigs.base.ConcatenatedSignatureArray

Bases: AdvancedIndexingMixin, AbstractSignatureArray

Base class for signature arrays which store signatures in a single data array.

values

K-mer signatures concatenated into single numpy-like array.

bounds

Numpy-like array storing indices bounding each individual k-mer signature in values. The ith signature is at values[bounds[i]:bounds[i + 1]].

sizeof(index)

Get the size/length of the signature at the given index.

Should be the case that

sigarray.sizeof(i) == len(sigarray[i])

Parameters:

index – Index of signature in array.

sizes()

Get the sizes of all signatures in the array.
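
The values/bounds layout can be illustrated with plain Python lists standing in for the Numpy-like arrays. The variable and function names below are hypothetical and the signatures are made-up example data.

```python
from itertools import accumulate

# Four example signatures, one of them empty.
signatures = [[0, 5, 9], [2], [], [1, 3]]

# All signatures concatenated end to end.
values = [index for sig in signatures for index in sig]

# Running total of lengths; signature i occupies bounds[i]:bounds[i + 1].
bounds = [0, *accumulate(len(sig) for sig in signatures)]


def get_signature(i: int) -> list:
    """Slice the ith signature back out of the shared values list."""
    return values[bounds[i]:bounds[i + 1]]


def sizeof(i: int) -> int:
    """Length of the ith signature, computed from bounds alone."""
    return bounds[i + 1] - bounds[i]
```

Note that bounds has one more element than there are signatures, and that sizes can be computed without touching values at all.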

class gambit.sigs.base.ReferenceSignatures

Bases: AbstractSignatureArray

Base class for an array of reference genome signatures plus metadata.

This contains the extra data needed for the signatures to be used for running queries.

ids

Array of unique string or integer IDs for each signature. Length should be equal to length of ReferenceSignatures object.

Type:

Sequence

meta

Other metadata describing signatures.

Type:

gambit.sigs.base.SignaturesMeta

class gambit.sigs.base.SignatureArray

Bases: ConcatenatedSignatureArray

Stores a collection of k-mer signatures in a single contiguous Numpy array.

This format enables the calculation of many Jaccard scores in parallel, see gambit.metric.jaccarddist_array().

Numpy-style indexing with an array of integers or bools is supported and will return another SignatureArray. If indexed with a contiguous slice the values of the returned array will be a view of the original instead of a copy.
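
For reference, the Jaccard distance between two k-mer sets A and B is 1 - |A ∩ B| / |A ∪ B|. The sketch below computes it for two signatures given as sorted sequences of unique indices. It is not the code of gambit.metric.jaccarddist_array(), which is not documented in this section; the function name is hypothetical, and returning 0.0 for two empty signatures is an arbitrary choice.

```python
def jaccard_distance_sketch(sig_a, sig_b) -> float:
    """Jaccard distance between two sorted sequences of unique indices."""
    i = j = shared = 0
    # Walk both sorted sequences in step, counting common indices.
    while i < len(sig_a) and j < len(sig_b):
        if sig_a[i] == sig_b[j]:
            shared += 1
            i += 1
            j += 1
        elif sig_a[i] < sig_b[j]:
            i += 1
        else:
            j += 1
    union = len(sig_a) + len(sig_b) - shared
    if union == 0:
        return 0.0  # both signatures empty (arbitrary convention)
    return 1.0 - shared / union
```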

values

K-mer signatures concatenated into single Numpy array.

Type:

numpy.ndarray

bounds

Array storing indices bounding each individual k-mer signature in values. The ith signature is at values[bounds[i]:bounds[i + 1]].

Type:

numpy.ndarray

__init__(signatures, kmerspec=None, dtype=None)

Parameters:
  • signatures (Sequence[KmerSignature]) – Sequence of k-mer signatures.

  • kmerspec (Optional[KmerSpec]) – K-mer spec used to calculate signatures. If None will take from signatures if it is an AbstractSignatureArray instance.

  • dtype (Optional[dtype]) – Numpy dtype of values array. If None will use dtype of first element of signatures.

classmethod from_arrays(values, bounds, kmerspec)

Create directly from values and bounds arrays.

Parameters:
  • values (ndarray) –

  • bounds (ndarray) –

  • kmerspec (Optional[KmerSpec]) –

Return type:

SignatureArray

classmethod uninitialized(lengths, kmerspec, dtype=None)

Create with an uninitialized values array.

Parameters:
  • lengths (Sequence[int]) – Sequence of lengths for each sub-array/signature.

  • kmerspec (Optional[KmerSpec]) –

  • dtype (Optional[dtype]) – Numpy dtype of shared coordinates array.

Return type:

SignatureArray

class gambit.sigs.base.SignatureList

Bases: AdvancedIndexingMixin, AbstractSignatureArray, MutableSequence[KmerSignature]

Stores a collection of k-mer signatures in a standard Python list.

Compared to SignatureArray this isn’t as efficient to calculate Jaccard scores with, but supports mutation and won’t have to copy signatures to a new array on creation.

__init__(signatures, kmerspec=None, dtype=None)

Parameters:
  • signatures (Iterable[KmerSignature]) – Iterable of k-mer signatures.

  • kmerspec (Optional[KmerSpec]) – K-mer spec used to calculate signatures. If None will take from signatures if it is an AbstractSignatureArray instance.

  • dtype (Optional[dtype]) – Numpy dtype of signatures. If None will use dtype of first element of signatures.

insert(i, sig)

S.insert(index, value) – insert value before index

Parameters:
  • i (int) –

  • sig (KmerSignature) –

class gambit.sigs.base.SignaturesMeta

Bases: object

Metadata describing a set of k-mer signatures.

All attributes are optional.

id

Any kind of string ID that can be used to uniquely identify the signature set.

Type:

Optional[str]

version

Version string (ideally PEP 440-compliant).

Type:

Optional[str]

name

Short human-readable name.

Type:

Optional[str]

id_attr

Name of Genome attribute the IDs correspond to (see gambit.db.models.GENOME_ID_ATTRS). Optional, but signature set cannot be used as a reference for queries without it.

Type:

Optional[str]

description

Human-readable description.

Type:

Optional[str]

extra

Extra arbitrary metadata. Should be a dict or other mapping which can be converted to JSON.

Type:

Mapping[str, Any]

__init__(*, id=None, name=None, version=None, id_attr=None, description=None, extra=_Nothing.NOTHING)

Method generated by attrs for class SignaturesMeta.

Parameters:
  • id (Optional[str]) –

  • name (Optional[str]) –

  • version (Optional[str]) –

  • id_attr (Optional[str]) –

  • description (Optional[str]) –

  • extra (Mapping[str, Any]) –

Return type:

None

gambit.sigs.base.dump_signatures(path, signatures, format='hdf5', **kw)

Write k-mer signatures and associated metadata to a file.

Parameters:
  • path (Union[str, PathLike]) – File to write to.

  • signatures (AbstractSignatureArray) – Array of signatures to store.

  • format (str) – Format to use. Currently the only valid value is ‘hdf5’.

  • **kw – Additional keyword arguments depending on format.

gambit.sigs.base.load_signatures(path, **kw)

Load signatures from file.

Currently the only format used to store signatures is the one in gambit.sigs.hdf5, but there may be more in the future. The format should be determined automatically.

Parameters:
  • path (Union[str, PathLike]) – File to open.

  • **kw – Additional keyword arguments to h5py.File().

Return type:

AbstractSignatureArray

gambit.sigs.base.sigarray_eq(a1, a2)

Check two sequences of sparse k-mer signatures for equality.

Unlike AbstractSignatureArray.__eq__() this works on any sequence type containing signatures and does not use the AbstractSignatureArray.kmerspec attribute.

Parameters:
  • a1 (Sequence[KmerSignature]) –

  • a2 (Sequence[KmerSignature]) –

Return type:

bool
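
A minimal sketch of such a comparison, assuming that two sequences are equal when they have the same length and their signatures match element by element in order. The function name is hypothetical.

```python
def sigarray_eq_sketch(a1, a2) -> bool:
    """Compare two sequences of signatures element by element."""
    if len(a1) != len(a2):
        return False
    # Converting to lists lets arrays and plain lists be compared alike.
    return all(list(s1) == list(s2) for s1, s2 in zip(a1, a2))
```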

gambit.sigs.base.KmerSignature

Type for k-mer signatures (k-mer sets in sparse coordinate format)

alias of ndarray

gambit.sigs.calc

Calculate k-mer signatures from sequence data.

class gambit.sigs.calc.ArrayAccumulator

Bases: KmerAccumulator

K-mer accumulator implemented as a dense boolean array.

This is pretty efficient for smaller values of k, but time and space requirements increase exponentially with larger values.

__init__(k)

Parameters:

k (int) –

add(i)

Add an element.

Parameters:

i (int) –

clear()

Remove all elements. The docstring inherited from MutableSet.clear() reads: "This is slow (creates N new iterators!) but effective."

discard(i)

Remove an element. Do not raise an exception if absent.

Parameters:

i (int) –

signature()

Get signature for accumulated k-mers.

Return type:

KmerSignature

class gambit.sigs.calc.KmerAccumulator

Bases: MutableSet[int]

Base class for data structures which track k-mers as they are found in sequences.

Implements the MutableSet interface for k-mer indices. Indices are added via the add() or add_kmer() methods; when finished, a sparse k-mer signature can be obtained from signature().

add_kmer(kmer)

Add a k-mer by its sequence rather than its index.

Argument may contain invalid (non-nucleotide) bytes, in which case it is ignored.

Parameters:

kmer (bytes) –

abstract signature()

Get signature for accumulated k-mers.

Return type:

KmerSignature

class gambit.sigs.calc.SetAccumulator

Bases: KmerAccumulator

Accumulator which uses the builtin Python set class.

This has more overhead than the array version for smaller values of k but behaves much better asymptotically.

__init__(k)

Parameters:

k (int) –

add(index)

Add an element.

Parameters:

index (int) –

clear()

Remove all elements. The docstring inherited from MutableSet.clear() reads: "This is slow (creates N new iterators!) but effective."

discard(index)

Remove an element. Do not raise an exception if absent.

Parameters:

index (int) –

signature()

Get signature for accumulated k-mers.

Return type:

KmerSignature

gambit.sigs.calc.accumulate_kmers(accumulator, kmerspec, seq)

Find k-mer matches in sequence and add their indices to an accumulator.

gambit.sigs.calc.calc_file_signature(kspec, seqfile, *, accumulator=None)

Open a sequence file on disk and calculate its k-mer signature.

This works identically to calc_signature_parse() but takes a SequenceFile as input instead of a data stream.

Returns:

K-mer signature in sparse coordinate format (dtype will match gambit.sigs.convert.dense_to_sparse()).

Return type:

numpy.ndarray

gambit.sigs.calc.calc_file_signatures(kspec, files, progress=None, concurrency='processes', max_workers=None, executor=None)

Parse and calculate k-mer signatures for multiple sequence files.

Parameters:
  • kspec (KmerSpec) – Spec for k-mer search.

  • files (Sequence[SequenceFile]) – Files to read.

  • progress – Display a progress meter. See gambit.util.progress.get_progress() for allowed values.

  • concurrency (Optional[str]) – Process files concurrently. "processes" for process-based (default), "threads" for thread-based, None for no concurrency.

  • max_workers (Optional[int]) – Number of worker threads/processes to use if concurrency is not None.

  • executor (Optional[Executor]) – Instance of concurrent.futures.Executor to use for concurrency. Overrides the concurrency and max_workers arguments.

Return type:

SignatureList
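
How the concurrency, max_workers, and executor arguments might interact can be sketched with the standard concurrent.futures module. This is an illustration of the documented options, not gambit's code: the function name map_files_sketch is hypothetical, and it applies an arbitrary function to arbitrary items instead of calculating signatures from files.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def map_files_sketch(func, items, concurrency="processes",
                     max_workers=None, executor=None) -> list:
    """Apply func to every item, optionally in parallel, keeping order."""
    if executor is not None:
        # An explicit executor overrides concurrency and max_workers.
        return list(executor.map(func, items))
    if concurrency is None:
        return [func(item) for item in items]
    pool_types = {"processes": ProcessPoolExecutor,
                  "threads": ThreadPoolExecutor}
    if concurrency not in pool_types:
        raise ValueError(f"Unknown concurrency mode: {concurrency!r}")
    with pool_types[concurrency](max_workers=max_workers) as pool:
        # Executor.map() yields results in the order of the inputs.
        return list(pool.map(func, items))
```

With process-based concurrency, func and the items must be picklable; thread-based concurrency has no such requirement but only helps when the work releases the global interpreter lock.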

gambit.sigs.calc.calc_signature(kmerspec, seqs, *, accumulator=None)

Calculate the k-mer signature of a DNA sequence or set of sequences.

Searches sequences in both the forward direction and backwards (as the reverse complement). Sequences may contain invalid characters (not one of the four nucleotide codes), which will simply not be matched.

Parameters:
  • kmerspec (KmerSpec) – K-mer spec to use for search.

  • seqs (Union[str, bytes, bytearray, Seq, Iterable[Union[str, bytes, bytearray, Seq]]]) – Sequence or sequences to search within. Lowercase characters are OK.

  • accumulator (Optional[KmerAccumulator]) – TODO

Returns:

K-mer signature in sparse coordinate format. Data type will be kmerspec.index_dtype.

Return type:

numpy.ndarray
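
The overall calculation can be sketched end to end in pure Python. The sketch makes several assumptions that this documentation does not confirm: k-mers are indexed as base-4 numbers in ACGT order with the first nucleotide most significant, k-mers containing an invalid character are skipped, and the result is a sorted list rather than a Numpy array. The function name is hypothetical and it takes the prefix and k directly instead of a KmerSpec.

```python
NUCLEOTIDES = b"ACGT"
_COMPLEMENT = bytes.maketrans(b"ACGT", b"TGCA")


def calc_signature_sketch(prefix: bytes, k: int, seqs) -> list:
    """Sorted unique indices of the k-mers following prefix on both strands."""
    found = set()
    for seq in seqs:
        forward = seq.upper()
        reverse = forward.translate(_COMPLEMENT)[::-1]
        for strand in (forward, reverse):
            start = strand.find(prefix)
            while start != -1:
                kmer_start = start + len(prefix)
                kmer = strand[kmer_start:kmer_start + k]
                # Keep only full-length k-mers made of valid codes.
                if len(kmer) == k and all(b in NUCLEOTIDES for b in kmer):
                    index = 0
                    for b in kmer:
                        index = index * 4 + NUCLEOTIDES.index(b)
                    found.add(index)
                start = strand.find(prefix, start + 1)
    return sorted(found)
```

Collecting indices in a set means a k-mer that occurs many times contributes only once, so the signature records presence rather than abundance.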

gambit.sigs.calc.default_accumulator(k)

Get a default k-mer accumulator instance for the given value of k.

Returns an ArrayAccumulator for k <= 11 and a SetAccumulator for k > 11.

Parameters:

k (int) –

Return type:

KmerAccumulator
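
The trade-off between the two accumulator types can be sketched as follows. These classes are simplified stand-ins, not gambit's implementations: the names are hypothetical, they implement only add() and signature() rather than the full MutableSet interface, and signatures are returned as sorted lists.

```python
class ArrayAccumulatorSketch:
    """Dense accumulator: one flag per possible k-mer (4 ** k flags)."""

    def __init__(self, k: int):
        self.flags = bytearray(4 ** k)  # memory grows exponentially with k

    def add(self, index: int) -> None:
        self.flags[index] = 1

    def signature(self) -> list:
        # Scanning the flags in order yields sorted indices for free.
        return [i for i, flag in enumerate(self.flags) if flag]


class SetAccumulatorSketch:
    """Sparse accumulator: stores only the indices actually seen."""

    def __init__(self, k: int):
        self.indices = set()

    def add(self, index: int) -> None:
        self.indices.add(index)

    def signature(self) -> list:
        return sorted(self.indices)


def default_accumulator_sketch(k: int):
    """Choose the dense version for k <= 11 and the set version otherwise."""
    if k <= 11:
        return ArrayAccumulatorSketch(k)
    return SetAccumulatorSketch(k)
```

Both variants produce the same signature for the same additions; they differ only in how memory and time scale with k.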

gambit.sigs.convert

Convert signatures between representations or from one KmerSpec to another.

gambit.sigs.convert.can_convert(from_kspec, to_kspec)

Check if signatures from one KmerSpec can be converted to another.

Conversion is possible if to_kspec.prefix is equal to or starts with from_kspec.prefix and to_kspec.total_len <= from_kspec.total_len.

Return type:

bool
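
The condition stated above can be written out directly. The sketch takes prefixes and k values as separate arguments instead of KmerSpec instances, and the function name is hypothetical.

```python
def can_convert_sketch(from_prefix: bytes, from_k: int,
                       to_prefix: bytes, to_k: int) -> bool:
    """Apply the prefix and total-length conditions for conversion."""
    from_total_len = len(from_prefix) + from_k
    to_total_len = len(to_prefix) + to_k
    # The target prefix must extend (or equal) the source prefix, and
    # the target match must not be longer than the source match.
    return to_prefix.startswith(from_prefix) and to_total_len <= from_total_len
```

For example, a spec with prefix ATGAC and k = 11 can be converted to one with prefix ATGACA and k = 8, but not to one with prefix ATGAC and k = 12.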

gambit.sigs.convert.check_can_convert(from_kspec, to_kspec)

Check that signatures can be converted from one KmerSpec to another or raise an error with an informative message.

Raises:

ValueError – If conversion is not possible.

gambit.sigs.convert.convert_dense(from_kspec, to_kspec, vec)

Convert a k-mer signature in dense format from one KmerSpec to another.

In the ideal case, if vec is the result of calc_signature(from_kspec, seq, sparse=False) the output of this function should be identical to calc_signature(to_kspec, seq, sparse=False). In reality this may not hold if any potential matches of from_kspec in seq are discarded due to an invalid nucleotide which is not included in the corresponding to_kspec match.

Return type:

ndarray

gambit.sigs.convert.convert_sparse(from_kspec, to_kspec, sig)

Convert a k-mer signature in sparse format from one KmerSpec to another.

In the ideal case, if sig is the result of calc_signature(from_kspec, seq) the output of this function should be identical to calc_signature(to_kspec, seq). In reality this may not hold if any potential matches of from_kspec in seq are discarded due to an invalid nucleotide which is not included in the corresponding to_kspec match.

Return type:

KmerSignature

gambit.sigs.convert.dense_to_sparse(vec)

Convert k-mer set from dense bit vector to sparse coordinate representation.

Parameters:

vec (Sequence[bool]) – Boolean vector indicating which k-mers are present.

Returns:

Sorted array of coordinates of k-mers present in vector. Data type will be numpy.intp.

Return type:

numpy.ndarray

See also

sparse_to_dense

gambit.sigs.convert.sparse_to_dense(k_or_kspec, coords)

Convert k-mer set from sparse coordinate representation back to dense bit vector.

Parameters:
  • k_or_kspec (Union[int, KmerSpec]) – Value of k or a KmerSpec instance.

  • coords (KmerSignature) – Sparse coordinate array.

Returns:

Dense k-mer bit vector.

Return type:

numpy.ndarray

See also

dense_to_sparse
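
The relationship between the two representations can be illustrated with plain Python lists in place of Numpy arrays. The function names are hypothetical, and the sketch accepts only an integer k rather than a KmerSpec.

```python
def dense_to_sparse_sketch(vec) -> list:
    """Sorted positions of the True entries in a boolean vector."""
    return [index for index, present in enumerate(vec) if present]


def sparse_to_dense_sketch(k: int, coords) -> list:
    """Boolean vector of length 4 ** k with True at each coordinate."""
    vec = [False] * (4 ** k)
    for index in coords:
        vec[index] = True
    return vec
```

The two functions are inverses of each other as long as the coordinates are sorted and unique.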

gambit.sigs.hdf5

Store k-mer signature sets in HDF5 format.

class gambit.sigs.hdf5.HDF5Signatures

Bases: ConcatenatedSignatureArray, ReferenceSignatures

Stores a set of k-mer signatures and associated metadata in an HDF5 group.

Inherits from gambit.sigs.base.AbstractSignatureArray, so behaves as a sequence of k-mer signatures supporting Numpy-style advanced indexing.

Behaves as a context manager which yields itself on enter and closes the underlying HDF5 file object on exit. The __bool__() method can be used to check whether the file is currently open and valid.

group

HDF5 group object data is read from.

Type:

h5py._hl.group.Group

Parameters:

group (h5py._hl.group.Group) – Open, readable h5py.Group or h5py.File object.

__bool__()

Check whether the underlying HDF5 file object is open.

__init__(group)

Parameters:

group (Group) –

close()

Close the underlying HDF5 file.

classmethod create(group, signatures, *, compression=None, compression_opts=None)

Store k-mer signatures and associated metadata in an HDF5 group.

Parameters:
  • group (Group) – HDF5 group to store data in.

  • signatures (AbstractSignatureArray) – Array of signatures to store. If an instance of gambit.sigs.base.ReferenceSignatures its metadata will be stored as well, otherwise default/empty values will be used.

  • compression (Optional[str]) – Compression type for values array. One of ['gzip', 'lzf', 'szip']. See the section on compression filters in h5py’s documentation.

  • compression_opts – Sets compression level (0-9) for gzip compression, no effect for other types.

Return type:

HDF5Signatures

gambit.sigs.hdf5.dump_signatures_hdf5(path, signatures, **kw)

Write k-mer signatures and associated metadata to an HDF5 file.

gambit.sigs.hdf5.empty_to_none(value)

Convert h5py.Empty instances to None, passing other types through.

gambit.sigs.hdf5.load_signatures_hdf5(path, **kw)

Open HDF5 signature file.

Parameters:
  • path (Union[str, PathLike]) – File to open.

  • **kw – Additional keyword arguments to h5py.File().

Return type:

HDF5Signatures

gambit.sigs.hdf5.none_to_empty(value, dtype)

Convert None values to h5py.Empty, passing other types through.

Parameters:

dtype (dtype) –

gambit.sigs.hdf5.read_metadata(group)

Read signature set metadata from HDF5 group attributes.

Parameters:

group (Group) –

Return type:

SignaturesMeta

gambit.sigs.hdf5.write_metadata(group, meta)

Write signature set metadata to HDF5 group attributes.

gambit.sigs.hdf5.CURRENT_FMT_VERSION = 1

Current version of the data format. Integer which should be incremented each time the format changes.

gambit.sigs.hdf5.FMT_VERSION_ATTR = 'gambit_signatures_version'

Name of HDF5 group attribute which both stores the format version and also identifies the group as containing signature data.