Distance Metric

gambit.metric

Calculate the Jaccard index/distance between sets.

gambit.metric.jaccard(coords1, coords2)

Compute the Jaccard index between two k-mer sets in sparse coordinate format.

Arguments are NumPy arrays containing k-mer indices in sorted order. Data types must be 16-, 32-, or 64-bit signed or unsigned integers, but the two arrays do not need to share a data type.

This is a native function and by far the most efficient way to calculate the metric. Use it wherever possible.

Parameters:
  • coords1 (numpy.ndarray) – K-mer set in sparse coordinate format.

  • coords2 (numpy.ndarray) – K-mer set in sparse coordinate format.

Returns:

Jaccard index between the two sets, a real number between 0 and 1.

Return type:

numpy.float32

See also

jaccarddist
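
To illustrate what the sorted sparse coordinate format makes possible, the sketch below computes the same quantity in pure Python with a single two-pointer pass over both inputs. `jaccard_sorted` is a hypothetical helper written for this example, not part of gambit; plain lists stand in for NumPy arrays, and the value returned for two empty sets is an arbitrary choice made here.

```python
def jaccard_sorted(coords1, coords2):
    """Jaccard index of two strictly increasing sequences of k-mer indices.

    Hypothetical reference helper, not the library's native implementation.
    """
    i = j = shared = 0
    n1, n2 = len(coords1), len(coords2)
    # Walk both sorted sequences in step, counting indices present in both.
    while i < n1 and j < n2:
        a, b = coords1[i], coords2[j]
        if a == b:
            shared += 1
            i += 1
            j += 1
        elif a < b:
            i += 1
        else:
            j += 1
    union = n1 + n2 - shared
    # Assumption of this sketch: two empty sets get index 0.0.
    return shared / union if union else 0.0


a = [2, 5, 9, 14]
b = [5, 9, 20]
index = jaccard_sorted(a, b)  # 2 shared indices out of 5 distinct ones
```

Because both inputs are sorted, the intersection is found in time linear in the combined length, without building a hash set.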

gambit.metric.jaccarddist(coords1, coords2)

Compute the Jaccard distance between two k-mer sets in sparse coordinate format.

The Jaccard distance is equal to one minus the Jaccard index.

Arguments are NumPy arrays containing k-mer indices in sorted order. Data types must be 16-, 32-, or 64-bit signed or unsigned integers, but the two arrays do not need to share a data type.

This is a native function and by far the most efficient way to calculate the metric. Use it wherever possible.

Parameters:
  • coords1 (numpy.ndarray) – K-mer set in sparse coordinate format.

  • coords2 (numpy.ndarray) – K-mer set in sparse coordinate format.

Returns:

Jaccard distance between the two sets, a real number between 0 and 1.

Return type:

numpy.float32

See also

jaccard

gambit.metric.jaccard_bits(bits1, bits2)

Calculate the Jaccard index between two sets represented as bit arrays (“dense” format for k-mer sets).

See also

jaccard

Parameters:
  • bits1 (ndarray) – First set as a bit array.

  • bits2 (ndarray) – Second set as a bit array.

Return type:

float
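
The dense format can be pictured as one boolean per possible k-mer index. The sketch below uses plain Python lists of booleans instead of NumPy arrays; `to_dense` and `jaccard_bits_sketch` are hypothetical names introduced for this illustration, and the handling of two empty sets is an assumption of the sketch.

```python
def to_dense(coords, size):
    """Convert k-mer indices to a dense list of booleans (hypothetical helper)."""
    bits = [False] * size
    for c in coords:
        bits[c] = True
    return bits


def jaccard_bits_sketch(bits1, bits2):
    """Jaccard index of two equal-length boolean sequences (illustration only)."""
    if len(bits1) != len(bits2):
        raise ValueError("bit arrays must have the same length")
    # Positions set in both arrays form the intersection,
    # positions set in either array form the union.
    both = sum(1 for x, y in zip(bits1, bits2) if x and y)
    either = sum(1 for x, y in zip(bits1, bits2) if x or y)
    # Assumption of this sketch: two empty sets get index 0.0.
    return both / either if either else 0.0


dense1 = to_dense([0, 3, 4], size=8)
dense2 = to_dense([3, 4, 7], size=8)
index = jaccard_bits_sketch(dense1, dense2)  # 2 shared positions, 4 in the union
```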

gambit.metric.jaccard_generic(set1, set2)

Get the Jaccard index of two arbitrary sets.

This is primarily a slow, pure-Python alternative to jaccard() intended for testing. It can also serve as a generic way to calculate the Jaccard index that works with any collection or element type.

See also

jaccard, jaccard_bits

Parameters:
  • set1 (Iterable) – First set, as any iterable of elements.

  • set2 (Iterable) – Second set, as any iterable of elements.

Return type:

float
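
A generic calculation of this kind can be sketched with Python's built-in set type. `jaccard_sets` is a hypothetical stand-in written for this example rather than the library's code, and it assumes the elements are hashable.

```python
def jaccard_sets(set1, set2):
    """Jaccard index of two iterables of hashable items (hypothetical helper)."""
    s1, s2 = set(set1), set(set2)
    union = len(s1 | s2)
    # Assumption of this sketch: two empty sets get index 0.0.
    return len(s1 & s2) / union if union else 0.0


kmers1 = ["ACG", "CGT", "GTA"]
kmers2 = ("CGT", "GTA", "TAC", "ACC")
index = jaccard_sets(kmers1, kmers2)  # 2 shared items out of 5 distinct ones
distance = 1 - index                  # the corresponding Jaccard distance
```

The inputs here are a list and a tuple of strings, which shows that neither the container type nor the element type matters for this approach.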

gambit.metric.jaccarddist_array(query, refs, out=None)

Calculate Jaccard distances between a query k-mer signature and a list of reference signatures.

For best performance, refs should be an instance of gambit.sigs.base.SignatureArray. This allows the use of optimized Cython code that runs in parallel over all signatures in refs. In that case, because of Cython limitations, refs.bounds.dtype must be np.intp, which is usually a 64-bit signed integer. If it is not, it is converted automatically.

Parameters:
  • query (KmerSignature) – Query k-mer signature in sparse coordinate format (sorted array of k-mer indices).

  • refs (Sequence[KmerSignature]) – List of reference signatures.

  • out (Optional[ndarray]) – Optional pre-allocated array to write results to. Should be the same length as refs with dtype np.float32.

Returns:

Jaccard distance for query against each element of refs.

Return type:

numpy.ndarray
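
The calling pattern, including the optional pre-allocated output, can be sketched in pure Python as follows. `jaccarddist_array_sketch` is a hypothetical, unoptimized stand-in; it uses lists instead of NumPy arrays, and both its treatment of empty signatures and the fact that it returns the `out` object itself are assumptions of the sketch.

```python
def jaccarddist_array_sketch(query, refs, out=None):
    """Distance from `query` to each signature in `refs` (hypothetical, unoptimized)."""
    if out is None:
        out = [0.0] * len(refs)
    elif len(out) != len(refs):
        raise ValueError("out must have the same length as refs")
    q = set(query)
    for i, ref in enumerate(refs):
        r = set(ref)
        union = len(q | r)
        # Assumption of this sketch: two empty signatures get distance 1.0.
        out[i] = 1.0 - len(q & r) / union if union else 1.0
    return out


query = [1, 4, 6]
refs = [[1, 4, 6], [4, 6, 8, 9], [2, 3]]
buffer = [0.0] * len(refs)  # reusable output buffer, one slot per reference
dists = jaccarddist_array_sketch(query, refs, out=buffer)
```

Reusing one buffer across many queries avoids allocating a new result for every call.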

gambit.metric.jaccarddist_matrix(queries, refs, ref_indices=None, out=None, chunksize=None, progress=None)

Calculate a Jaccard distance matrix between a list of query signatures and a list of reference signatures.

This function improves querying performance when the reference signatures are stored in a file (e.g. using gambit.sigs.hdf5.HDF5Signatures) by loading them in chunks (via the chunksize parameter) instead of all in one go.

Performance is greatly improved if refs is a type that yields instances of SignatureArray when indexed with a slice object (SignatureArray or HDF5Signatures); see jaccarddist_array(). There is no such dependence on the type of queries, which can be a simple list.

Parameters:
  • queries (Sequence[KmerSignature]) – Query signatures in sparse coordinate format.

  • refs (Sequence[KmerSignature]) – Reference signatures in sparse coordinate format.

  • ref_indices (Optional[Sequence[int]]) – Optional indices of refs to use.

  • out (Optional[ndarray]) – Optional pre-allocated array to write output to.

  • chunksize (Optional[int]) – Divide refs into chunks of this size.

  • progress – Display a progress meter of the number of elements of the output array calculated so far. See gambit.util.progress.get_progress() for a description of allowed values.

Returns:

Matrix of distances between query signatures in rows and reference signatures in columns.

Return type:

numpy.ndarray
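
The chunked traversal described above can be sketched as follows. `jaccarddist_matrix_sketch` and `jaccard_distance` are hypothetical helpers for this illustration; nested lists replace NumPy arrays, and the `ref_indices`, `out`, and `progress` parameters are left out for brevity.

```python
def jaccard_distance(sig1, sig2):
    """Jaccard distance of two signatures given as iterables of k-mer indices."""
    s1, s2 = set(sig1), set(sig2)
    union = len(s1 | s2)
    # Assumption of this sketch: two empty signatures get distance 1.0.
    return 1.0 - len(s1 & s2) / union if union else 1.0


def jaccarddist_matrix_sketch(queries, refs, chunksize=None):
    """Rows are queries, columns are references; refs are read one slice at a time."""
    nrefs = len(refs)
    if chunksize is None:
        chunksize = max(nrefs, 1)
    out = [[0.0] * nrefs for _ in queries]
    for start in range(0, nrefs, chunksize):
        # Only this slice of the references has to be held in memory at once.
        chunk = refs[start:start + chunksize]
        for row, query in zip(out, queries):
            for offset, ref in enumerate(chunk):
                row[start + offset] = jaccard_distance(query, ref)
    return out


queries = [[1, 2, 3], [7, 8]]
refs = [[1, 2, 3], [2, 3, 4], [8, 9], [5]]
matrix = jaccarddist_matrix_sketch(queries, refs, chunksize=3)
```

The chunk size changes only how the references are traversed, not the resulting matrix.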

gambit.metric.jaccarddist_pairwise(sigs, indices=None, flat=False, out=None, progress=None)

Calculate all pairwise Jaccard distances for a list of signatures.

This should be roughly twice as fast as calling jaccarddist_matrix() with the same array for the first and second arguments, because each pairwise distance is computed once instead of twice.

For optimal performance the type of sigs is subject to the same requirements as jaccarddist_array() and jaccarddist_matrix().

Parameters:
  • sigs (Sequence[KmerSignature]) – List of signatures in sparse coordinate format.

  • indices (Optional[Sequence[int]]) – Optional indices of sigs to use.

  • flat (bool) – If True the output is a non-redundant flat (1D) array with exactly one element per pair of signatures. This format can be converted to/from the equivalent full distance matrix with scipy.spatial.distance.squareform().

  • out (Optional[ndarray]) – Optional pre-allocated array to write output to.

  • progress – Display a progress meter of the number of elements of the output array calculated so far. See gambit.util.progress.get_progress() for a description of allowed values.

Returns:

Pairwise distances in matrix (if flat=False) or condensed (flat=True) format.

Return type:

numpy.ndarray
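
The relationship between the condensed and full matrix formats can be sketched in pure Python. The helpers below are hypothetical and use lists rather than NumPy arrays; the pair ordering shown, (0, 1), (0, 2), ..., (1, 2), ..., is the condensed convention of scipy.spatial.distance.squareform().

```python
from itertools import combinations


def jaccard_distance(sig1, sig2):
    """Jaccard distance of two signatures given as iterables of k-mer indices."""
    s1, s2 = set(sig1), set(sig2)
    union = len(s1 | s2)
    # Assumption of this sketch: two empty signatures get distance 1.0.
    return 1.0 - len(s1 & s2) / union if union else 1.0


def pairwise_condensed(sigs):
    """One distance per unordered pair, each computed exactly once."""
    pairs = combinations(range(len(sigs)), 2)
    return [jaccard_distance(sigs[i], sigs[j]) for i, j in pairs]


def condensed_to_square(flat, n):
    """Expand a condensed list into a symmetric n x n matrix with a zero diagonal."""
    square = [[0.0] * n for _ in range(n)]
    for value, (i, j) in zip(flat, combinations(range(n), 2)):
        square[i][j] = square[j][i] = value
    return square


sigs = [[1, 2], [2, 3], [1, 2, 3]]
flat = pairwise_condensed(sigs)                 # pairs (0, 1), (0, 2), (1, 2)
square = condensed_to_square(flat, len(sigs))   # full symmetric matrix
```

The condensed form stores each distance once, which is where the saving over a full query-by-reference matrix comes from.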

gambit.metric.num_pairs(n)

Get the number of distinct (unordered) pairs of n objects.

Parameters:

n (int) – Number of objects.

Return type:

int
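
The number of unordered pairs of n objects is n(n - 1)/2, which is also the length of a condensed pairwise distance array for n signatures. A minimal sketch of that formula, using the hypothetical name `num_pairs_sketch`:

```python
def num_pairs_sketch(n):
    """Number of distinct unordered pairs of n objects: n * (n - 1) / 2."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # n * (n - 1) is always even, so integer division is exact.
    return n * (n - 1) // 2


counts = [num_pairs_sketch(n) for n in range(6)]  # n = 0, 1, 2, 3, 4, 5
```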