Distance Metric
gambit.metric
Calculate the Jaccard index/distance between sets.
- gambit.metric.jaccard(coords1, coords2)
Compute the Jaccard index between two k-mer sets in sparse coordinate format.
Arguments are Numpy arrays containing k-mer indices in sorted order. Data types must be 16, 32, or 64-bit signed or unsigned integers, but do not need to match.
This is by far the most efficient way to calculate the metric (this is a native function) and should be used wherever possible.
- Parameters:
coords1 (numpy.ndarray) – K-mer set in sparse coordinate format.
coords2 (numpy.ndarray) – K-mer set in sparse coordinate format.
- Returns:
Jaccard index between the two sets, a real number between 0 and 1.
- Return type:
numpy.float32
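To make the sparse coordinate format concrete, the following is a minimal NumPy sketch of the same calculation. It is not the native implementation; the name `jaccard_sketch` is hypothetical, and the handling of two empty sets is an arbitrary choice that may differ from the library's behavior.

```python
import numpy as np

def jaccard_sketch(coords1, coords2):
    """Jaccard index of two sorted, duplicate-free arrays of k-mer indices.

    Illustrative only: a hypothetical helper, not part of gambit.
    """
    # Size of the intersection; inputs are assumed to contain no duplicates.
    n_common = np.intersect1d(coords1, coords2, assume_unique=True).size
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    n_union = coords1.size + coords2.size - n_common
    if n_union == 0:
        # Assumption: treat two empty sets as index 0 (arbitrary convention).
        return np.float32(0.0)
    return np.float32(n_common / n_union)

# The two arrays may use different integer dtypes.
a = np.array([1, 3, 5, 7], dtype=np.uint16)
b = np.array([3, 4, 5, 6, 7, 8], dtype=np.int64)
index = jaccard_sketch(a, b)  # 3 shared indices out of 7 distinct ones
```

The corresponding Jaccard distance is `1 - index`.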
- gambit.metric.jaccarddist(coords1, coords2)
Compute the Jaccard distance between two k-mer sets in sparse coordinate format.
The Jaccard distance is equal to one minus the Jaccard index.
Arguments are Numpy arrays containing k-mer indices in sorted order. Data types must be 16, 32, or 64-bit signed or unsigned integers, but do not need to match.
This is by far the most efficient way to calculate the metric (this is a native function) and should be used wherever possible.
- Parameters:
coords1 (numpy.ndarray) – K-mer set in sparse coordinate format.
coords2 (numpy.ndarray) – K-mer set in sparse coordinate format.
- Returns:
Jaccard distance between the two sets, a real number between 0 and 1.
- Return type:
numpy.float32
- gambit.metric.jaccard_bits(bits1, bits2)
Calculate the Jaccard index between two sets represented as bit arrays (“dense” format for k-mer sets).
- Parameters:
bits1 (ndarray) –
bits2 (ndarray) –
- Return type:
float
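As an illustration of the dense format, the sketch below assumes a bit array is a boolean vector with one position per possible k-mer index, set to True where that k-mer is present. The names `to_dense` and `jaccard_bits_sketch` are hypothetical and the code is not the library's implementation.

```python
import numpy as np

def to_dense(coords, size):
    """Convert sorted k-mer indices to a boolean vector of length `size` (hypothetical helper)."""
    bits = np.zeros(size, dtype=bool)
    bits[coords] = True
    return bits

def jaccard_bits_sketch(bits1, bits2):
    """Jaccard index of two sets given as equal-length boolean arrays."""
    bits1 = np.asarray(bits1, dtype=bool)
    bits2 = np.asarray(bits2, dtype=bool)
    n_common = np.count_nonzero(bits1 & bits2)  # positions set in both
    n_union = np.count_nonzero(bits1 | bits2)   # positions set in either
    # Assumption: two empty sets give 0.0 (arbitrary convention).
    return n_common / n_union if n_union else 0.0

bits_a = to_dense(np.array([0, 1, 3]), size=5)  # [T, T, F, T, F]
bits_b = to_dense(np.array([1, 2, 3]), size=5)  # [F, T, T, T, F]
result = jaccard_bits_sketch(bits_a, bits_b)    # 2 shared of 4 total
```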
- gambit.metric.jaccard_generic(set1, set2)
Get the Jaccard index of two arbitrary sets.
This is primarily a slow, pure-Python alternative to jaccard() for use in testing, but it can also serve as a generic way to calculate the Jaccard index that works with any collection or element type.
- Parameters:
set1 (Iterable) –
set2 (Iterable) –
- Return type:
float
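A pure-Python calculation of this kind can be sketched with built-in sets. The function name `jaccard_generic_sketch` is hypothetical, and the value returned for two empty inputs is an assumption rather than documented behavior.

```python
def jaccard_generic_sketch(set1, set2):
    """Jaccard index of two iterables of hashable elements (illustrative only)."""
    s1, s2 = set(set1), set(set2)
    n_union = len(s1 | s2)
    # Assumption: two empty collections give 0.0 (arbitrary convention).
    return len(s1 & s2) / n_union if n_union else 0.0

# Works with any collection type and any hashable element type,
# here a list and a tuple of k-mer strings.
value = jaccard_generic_sketch(["AAC", "ACG", "CGT"], ("ACG", "CGT", "GTT", "TTA"))
```

Here two of the five distinct k-mers are shared, so `value` is 2/5.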
- gambit.metric.jaccarddist_array(query, refs, out=None)
Calculate Jaccard distances between a query k-mer signature and a list of reference signatures.
For enhanced performance, refs should be an instance of gambit.sigs.base.SignatureArray. This allows use of optimized Cython code that runs in parallel over all signatures in refs. In that case, because of Cython limitations, refs.bounds.dtype must be np.intp, which is usually a 64-bit signed integer. If it is not, it will be converted automatically.
- Parameters:
query (KmerSignature) – Query k-mer signature in sparse coordinate format (sorted array of k-mer indices).
refs (Sequence[KmerSignature]) – List of reference signatures.
out (ndarray) – Optional pre-allocated array to write results to. Should be the same length as refs, with dtype np.float32.
- Returns:
Jaccard distance for query against each element of refs.
- Return type:
numpy.ndarray
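The semantics of the query-versus-references calculation, including the optional pre-allocated output array, can be sketched as a simple loop. This is a serial NumPy illustration with a hypothetical name (`jaccarddist_array_sketch`), not the parallel Cython code path described above.

```python
import numpy as np

def jaccarddist_array_sketch(query, refs, out=None):
    """Jaccard distance from `query` to each signature in `refs` (illustrative only)."""
    if out is None:
        out = np.empty(len(refs), dtype=np.float32)
    for i, ref in enumerate(refs):
        n_common = np.intersect1d(query, ref, assume_unique=True).size
        n_union = query.size + ref.size - n_common
        # Assumption: an empty union is treated as index 0, i.e. distance 1.
        index = n_common / n_union if n_union else 0.0
        out[i] = 1.0 - index
    return out

query = np.array([1, 2, 3, 4])
refs = [np.array([1, 2, 3, 4]), np.array([3, 4, 5, 6]), np.array([7, 8])]

dists = jaccarddist_array_sketch(query, refs)

# Reusing a pre-allocated float32 buffer avoids a new allocation per call.
buf = np.empty(len(refs), dtype=np.float32)
same = jaccarddist_array_sketch(query, refs, out=buf)
```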
- gambit.metric.jaccarddist_matrix(queries, refs, ref_indices=None, out=None, chunksize=None, progress=None)
Calculate a Jaccard distance matrix between a list of query signatures and a list of reference signatures.
This function improves querying performance when the reference signatures are stored in a file (e.g. using gambit.sigs.hdf5.HDF5Signatures) by loading them in chunks (via the chunksize parameter) instead of all at once.
Performance is greatly improved if refs is a type that yields instances of SignatureArray when indexed with a slice object (SignatureArray or HDF5Signatures); see jaccarddist_array(). There is no such dependence on the type of queries, which can be a simple list.
- Parameters:
queries (Sequence[KmerSignature]) – Query signatures in sparse coordinate format.
refs (Sequence[KmerSignature]) – Reference signatures in sparse coordinate format.
ref_indices (Sequence[int] | None) – Optional, indices of refs to use.
out (ndarray | None) – (Optional) pre-allocated array to write output to.
chunksize (int | None) – Divide refs into chunks of this size.
progress – Display a progress meter of the number of elements of the output array calculated so far. See gambit.util.progress.get_progress() for a description of allowed values.
- Returns:
Matrix of distances between query signatures in rows and reference signatures in columns.
- Return type:
np.ndarray
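The chunked traversal of the references can be illustrated as follows. This is a sketch under stated assumptions: the names `_jaccard_distance` and `jaccarddist_matrix_sketch` are hypothetical, the float32 output dtype is assumed, and plain lists stand in for a file-backed signature collection.

```python
import numpy as np

def _jaccard_distance(a, b):
    """Jaccard distance between two sorted, duplicate-free index arrays."""
    n_common = np.intersect1d(a, b, assume_unique=True).size
    n_union = a.size + b.size - n_common
    # Assumption: an empty union is treated as distance 1.
    return 1.0 - (n_common / n_union if n_union else 0.0)

def jaccarddist_matrix_sketch(queries, refs, chunksize=None):
    """Distance matrix with queries in rows and references in columns."""
    nq, nr = len(queries), len(refs)
    out = np.empty((nq, nr), dtype=np.float32)
    step = chunksize or max(nr, 1)
    for start in range(0, nr, step):
        # Only this slice of the references needs to be in memory at once.
        chunk = refs[start:start + step]
        for j, ref in enumerate(chunk, start):
            for i, q in enumerate(queries):
                out[i, j] = _jaccard_distance(q, ref)
    return out

queries = [np.array([1, 2, 3]), np.array([2, 3, 4])]
refs = [np.array([1, 2, 3]), np.array([3, 4]), np.array([9])]

chunked = jaccarddist_matrix_sketch(queries, refs, chunksize=2)
unchunked = jaccarddist_matrix_sketch(queries, refs)
```

The chunk size changes only how many references are held at a time, not the result.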
- gambit.metric.jaccarddist_pairwise(sigs, indices=None, flat=False, out=None, progress=None)
Calculate all pairwise Jaccard distances for a list of signatures.
This should be roughly twice as fast as calling jaccarddist_flat() with the same array for the first and second arguments, because each pairwise distance is computed once instead of twice.
For optimal performance the type of sigs is subject to the same requirements as jaccarddist_array() and jaccarddist_matrix().
- Parameters:
sigs (Sequence[KmerSignature]) – List of signatures in sparse coordinate format.
indices (Sequence[int] | None) – Optional, indices of sigs to use.
flat (bool) – If True the output is a non-redundant flat (1D) array with exactly one element per pair of signatures. This format can be converted to/from the equivalent full distance matrix with scipy.spatial.distance.squareform().
out (ndarray | None) – (Optional) pre-allocated array to write output to.
progress – Display a progress meter of the number of elements of the output array calculated so far. See gambit.util.progress.get_progress() for a description of allowed values.
- Returns:
Pairwise distances in matrix (if flat=False) or condensed (flat=True) format.
- Return type:
np.ndarray
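The relationship between the matrix and condensed outputs can be sketched with NumPy alone. The names below are hypothetical and the float32 dtype is an assumption. The condensed ordering shown (upper triangle, row by row) is the one used by scipy.spatial.distance.squareform().

```python
import numpy as np

def _jaccard_distance(a, b):
    """Jaccard distance between two sorted, duplicate-free index arrays."""
    n_common = np.intersect1d(a, b, assume_unique=True).size
    n_union = a.size + b.size - n_common
    # Assumption: an empty union is treated as distance 1.
    return 1.0 - (n_common / n_union if n_union else 0.0)

def jaccarddist_pairwise_sketch(sigs, flat=False):
    """All pairwise distances, each computed once and mirrored across the diagonal."""
    n = len(sigs)
    full = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(i + 1, n):
            full[i, j] = full[j, i] = _jaccard_distance(sigs[i], sigs[j])
    if flat:
        # Upper triangle without the diagonal, in row-major order:
        # (0,1), (0,2), ..., (1,2), ...
        return full[np.triu_indices(n, k=1)]
    return full

sigs = [np.array([1, 2, 3]), np.array([2, 3, 4]), np.array([3, 4])]
matrix = jaccarddist_pairwise_sketch(sigs)
condensed = jaccarddist_pairwise_sketch(sigs, flat=True)
```

For three signatures the condensed array has three elements, one per unordered pair.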
- gambit.metric.num_pairs(n)
Get the number of distinct (unordered) pairs of n objects.
- Parameters:
n (int) –
- Return type:
int
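The number of distinct unordered pairs of n objects is n(n - 1)/2, which is also the length of a condensed pairwise distance array over n signatures. A one-line sketch, using the hypothetical name `num_pairs_sketch`:

```python
import math

def num_pairs_sketch(n):
    """Number of distinct unordered pairs of n objects: n choose 2."""
    return n * (n - 1) // 2

# Agrees with the binomial coefficient from the standard library.
pairs_of_four = num_pairs_sketch(4)
binomial = math.comb(4, 2)
```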