Distance Metric
gambit.metric
Calculate the Jaccard index/distance between sets.
- gambit.metric.jaccard(coords1, coords2)
Compute the Jaccard index between two k-mer sets in sparse coordinate format.
Arguments are NumPy arrays containing k-mer indices in sorted order. Data types must be 16, 32, or 64-bit signed or unsigned integers, but do not need to match.
This native function is by far the most efficient way to calculate the metric and should be used wherever possible.
- Parameters:
- Returns:
Jaccard index between the two sets, a real number between 0 and 1.
- Return type:
numpy.float32
See also
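To illustrate why the sparse coordinate format calls for sorted input, here is a minimal pure-Python sketch that computes the Jaccard index in a single merge pass over two sorted index arrays. It is not the library's native implementation. The name jaccard_sorted and the choice to return 0.0 when both sets are empty are assumptions made for this example.

```python
import numpy as np


def jaccard_sorted(coords1, coords2):
    """Jaccard index of two sorted arrays of unique k-mer indices (illustrative sketch)."""
    i = j = shared = 0
    n1, n2 = len(coords1), len(coords2)
    # Walk both arrays in step; sorted order means each element is visited once.
    while i < n1 and j < n2:
        a, b = int(coords1[i]), int(coords2[j])
        if a == b:
            shared += 1
            i += 1
            j += 1
        elif a < b:
            i += 1
        else:
            j += 1
    union = n1 + n2 - shared
    # Assumption: two empty sets give 0.0; the source does not describe this case.
    return np.float32(shared / union) if union else np.float32(0.0)


# The two arrays may use different integer dtypes.
a = np.array([1, 4, 7, 9], dtype=np.uint16)
b = np.array([4, 5, 9, 12, 20], dtype=np.int64)
print(jaccard_sorted(a, b))  # 2 shared indices out of 7 distinct ones
```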
- gambit.metric.jaccard_bits(bits1, bits2)
Calculate the Jaccard index between two sets represented as bit arrays (“dense” format for k-mer sets).
See also
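The dense representation can be sketched as follows. This example assumes each bit array is an equal-length boolean NumPy array with one position per possible k-mer index; the source does not specify the exact bit array type, and jaccard_dense is a hypothetical name used only here.

```python
import numpy as np


def jaccard_dense(bits1, bits2):
    """Jaccard index of two equal-length boolean arrays (illustrative sketch)."""
    bits1 = np.asarray(bits1, dtype=bool)
    bits2 = np.asarray(bits2, dtype=bool)
    shared = np.count_nonzero(bits1 & bits2)  # k-mers present in both sets
    union = np.count_nonzero(bits1 | bits2)   # k-mers present in either set
    # Assumption: two empty sets give 0.0.
    return shared / union if union else 0.0


# Dense equivalents of the sparse sets {1, 4, 7} and {4, 7, 8} over 10 possible k-mers.
x = np.zeros(10, dtype=bool)
y = np.zeros(10, dtype=bool)
x[[1, 4, 7]] = True
y[[4, 7, 8]] = True
print(jaccard_dense(x, y))  # 2 shared / 4 total = 0.5
```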
- gambit.metric.jaccard_generic(set1, set2)
Get the Jaccard index of two arbitrary sets.
This is primarily a slow, pure-Python alternative to jaccard() for use in testing, but it also serves as a generic way to calculate the Jaccard index that works with any collection or element type.
See also
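A generic calculation of this kind might look like the sketch below, which uses only built-in sets and therefore accepts any iterable of hashable elements. The name jaccard_sets is hypothetical, and returning 0.0 for two empty inputs is an assumption rather than documented behavior.

```python
def jaccard_sets(items1, items2):
    """Jaccard index of two arbitrary collections of hashable items (illustrative sketch)."""
    set1, set2 = set(items1), set(items2)
    shared = len(set1 & set2)
    union = len(set1) + len(set2) - shared
    # Assumption: two empty sets give 0.0.
    return shared / union if union else 0.0


# Works with strings, tuples, or any other hashable element type.
kmers_a = ["ACG", "CGT", "GTA"]
kmers_b = ("CGT", "GTA", "TAC", "ACC")
print(jaccard_sets(kmers_a, kmers_b))  # 2 shared / 5 total
```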
- gambit.metric.jaccarddist(coords1, coords2)
Compute the Jaccard distance between two k-mer sets in sparse coordinate format.
The Jaccard distance is equal to one minus the Jaccard index.
Arguments are NumPy arrays containing k-mer indices in sorted order. Data types must be 16, 32, or 64-bit signed or unsigned integers, but do not need to match.
This native function is by far the most efficient way to calculate the metric and should be used wherever possible.
- Parameters:
- Returns:
Jaccard distance between the two sets, a real number between 0 and 1.
- Return type:
numpy.float32
See also
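The relation "distance equals one minus index" can be checked with a short NumPy sketch. It relies on np.intersect1d rather than the library's native code, and the function name jaccarddist_sketch is hypothetical. Treating two empty sets as having index 0.0 (distance 1.0) is an assumption.

```python
import numpy as np


def jaccarddist_sketch(coords1, coords2):
    """Jaccard distance of two arrays of unique k-mer indices (illustrative sketch)."""
    # assume_unique is safe because each array holds distinct indices.
    shared = np.intersect1d(coords1, coords2, assume_unique=True).size
    union = len(coords1) + len(coords2) - shared
    index = shared / union if union else 0.0  # assumption for the empty case
    return np.float32(1.0 - index)


p = np.array([2, 3, 5, 8], dtype=np.int32)
q = np.array([3, 5, 13], dtype=np.uint16)
print(jaccarddist_sketch(p, q))  # 1 - 2/5
```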
- gambit.metric.jaccarddist_array(query, refs, out=None)
Calculate Jaccard distances between a query k-mer signature and a list of reference signatures.
For enhanced performance, refs should be an instance of gambit.sigs.base.SignatureArray. This allows use of optimized Cython code that runs in parallel over all signatures in refs. In that case, because of Cython limitations, refs.bounds.dtype must be np.intp, which is usually a 64-bit signed integer. If it is not, it will be converted automatically.
- Parameters:
query (KmerSignature) – Query k-mer signature in sparse coordinate format (sorted array of k-mer indices).
refs (Sequence[KmerSignature]) – List of reference signatures.
out (ndarray | None) – Optional pre-allocated array to write results to. Should be the same length as refs with dtype np.float32.
- Returns:
Jaccard distance for query against each element of refs.
- Return type:
See also
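The role of the out parameter can be seen in the following sketch, a plain serial loop over a list of references. It does not reproduce the parallel Cython path described above, and the names _dist and jaccarddist_array_sketch are hypothetical.

```python
import numpy as np


def _dist(a, b):
    """Jaccard distance between two arrays of unique indices (helper for this sketch)."""
    shared = np.intersect1d(a, b, assume_unique=True).size
    union = len(a) + len(b) - shared
    return 1.0 - (shared / union if union else 0.0)


def jaccarddist_array_sketch(query, refs, out=None):
    """Distances from one query to each reference, optionally written into `out`."""
    if out is None:
        out = np.empty(len(refs), dtype=np.float32)
    elif out.shape != (len(refs),):
        raise ValueError("out must have one element per reference signature")
    for i, ref in enumerate(refs):
        out[i] = _dist(query, ref)
    return out


query = np.array([1, 2, 3, 4])
refs = [np.array([1, 2, 3, 4]), np.array([3, 4, 5, 6]), np.array([7, 8])]

# Reusing a pre-allocated float32 buffer avoids a new allocation per call.
buf = np.empty(len(refs), dtype=np.float32)
dists = jaccarddist_array_sketch(query, refs, out=buf)
print(dists)
```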
- gambit.metric.jaccarddist_matrix(queries, refs, ref_indices=None, out=None, chunksize=None, progress=None)
Calculate a Jaccard distance matrix between a list of query signatures and a list of reference signatures.
This function improves querying performance when the reference signatures are stored in a file (e.g. using gambit.sigs.hdf5.HDF5Signatures) by loading them in chunks (via the chunksize parameter) instead of all in one go.
Performance is greatly improved if refs is a type that yields instances of SignatureArray when indexed with a slice object (SignatureArray or HDF5Signatures), see jaccarddist_array(). There is no such dependence on the type of queries, which can be a simple list.
- Parameters:
queries (Sequence[KmerSignature]) – Query signatures in sparse coordinate format.
refs (Sequence[KmerSignature]) – Reference signatures in sparse coordinate format.
ref_indices (Sequence[int] | None) – Optional, indices of refs to use.
out (ndarray | None) – (Optional) pre-allocated array to write output to.
chunksize (int | None) – Divide refs into chunks of this size.
progress – Display a progress meter of the number of elements of the output array calculated so far. See gambit.util.progress.get_progress() for a description of allowed values.
- Returns:
Matrix of distances between query signatures in rows and reference signatures in columns.
- Return type:
See also
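The chunking idea can be sketched as below: references are sliced in blocks of chunksize, and each block fills the corresponding columns of the output matrix. This is a serial illustration over in-memory lists, not the library's implementation; the names _dist and jaccarddist_matrix_sketch are hypothetical, and the ref_indices, out, and progress parameters are omitted for brevity.

```python
import numpy as np


def _dist(a, b):
    """Jaccard distance between two arrays of unique indices (helper for this sketch)."""
    shared = np.intersect1d(a, b, assume_unique=True).size
    union = len(a) + len(b) - shared
    return 1.0 - (shared / union if union else 0.0)


def jaccarddist_matrix_sketch(queries, refs, chunksize=None):
    """Distance matrix with queries in rows and references in columns."""
    nq, nr = len(queries), len(refs)
    out = np.empty((nq, nr), dtype=np.float32)
    step = chunksize or max(nr, 1)  # no chunksize: process everything at once
    for start in range(0, nr, step):
        # With a file-backed store, this slice is where one chunk would be read.
        chunk = refs[start:start + step]
        for j, ref in enumerate(chunk, start):
            for i, query in enumerate(queries):
                out[i, j] = _dist(query, ref)
    return out


queries = [np.array([1, 2, 3]), np.array([10, 11])]
refs = [np.array([1, 2, 3]), np.array([2, 3, 4]), np.array([10, 11, 12, 13])]
dmat = jaccarddist_matrix_sketch(queries, refs, chunksize=2)
print(dmat)
```

The chunk size changes only how many references are held in memory at a time, not the resulting matrix.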
- gambit.metric.jaccarddist_pairwise(sigs, indices=None, flat=False, out=None, progress=None)
Calculate all pairwise Jaccard distances for a list of signatures.
This should be roughly twice as fast as calling jaccarddist_matrix() with the same array for the first and second arguments, because each pairwise distance is computed once instead of twice.
For optimal performance the type of sigs is subject to the same requirements as jaccarddist_array() and jaccarddist_matrix().
- Parameters:
sigs (Sequence[KmerSignature]) – List of signatures in sparse coordinate format.
indices (Sequence[int] | None) – Optional, indices of sigs to use.
flat (bool) – If True the output is a non-redundant flat (1D) array with exactly one element per pair of signatures. This format can be converted to/from the equivalent full distance matrix with scipy.spatial.distance.squareform().
out (ndarray | None) – (Optional) pre-allocated array to write output to.
progress – Display a progress meter of the number of elements of the output array calculated so far. See gambit.util.progress.get_progress() for a description of allowed values.
- Returns:
Pairwise distances in matrix (if flat=False) or condensed (flat=True) format.
- Return type:
See also
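The difference between the two output formats, and the reason each pair needs to be computed only once, can be sketched with NumPy alone. The condensed array below lists the upper triangle row by row, which is the ordering used by SciPy's condensed distance format. The names _dist and jaccarddist_pairwise_sketch are hypothetical, and the indices, out, and progress parameters are left out.

```python
import numpy as np


def _dist(a, b):
    """Jaccard distance between two arrays of unique indices (helper for this sketch)."""
    shared = np.intersect1d(a, b, assume_unique=True).size
    union = len(a) + len(b) - shared
    return 1.0 - (shared / union if union else 0.0)


def jaccarddist_pairwise_sketch(sigs, flat=False):
    """All pairwise distances, as a condensed 1D array or a full square matrix."""
    n = len(sigs)
    rows, cols = np.triu_indices(n, k=1)  # each unordered pair (i < j) exactly once
    condensed = np.array(
        [_dist(sigs[i], sigs[j]) for i, j in zip(rows, cols)], dtype=np.float32
    )
    if flat:
        return condensed
    full = np.zeros((n, n), dtype=np.float32)  # diagonal stays 0
    full[rows, cols] = condensed
    full[cols, rows] = condensed  # mirror into the lower triangle
    return full


sigs = [np.array([1, 2, 3]), np.array([2, 3, 4]), np.array([7, 8])]
flat_d = jaccarddist_pairwise_sketch(sigs, flat=True)
full_d = jaccarddist_pairwise_sketch(sigs)
print(flat_d)  # n * (n - 1) / 2 = 3 values
print(full_d)
```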