Affinity

class openTSNE.affinity.Affinities(verbose=False)

Compute the affinities between samples.

t-SNE takes an affinity matrix \(P\) as input and uses nothing else from the data. This means we can use t-SNE for any data where we are able to express interactions between samples with an affinity matrix.

P

The \(N \times N\) affinity matrix expressing interactions between \(N\) initial data samples.

Type: array_like

verbose

Type: bool

to_new(data, return_distances=False)

Compute the affinities of new samples to the initial samples.

This is necessary for embedding new data points into an existing embedding.

Parameters:
  • data (np.ndarray) – The data points to be added to the existing embedding.
  • return_distances (bool) – If True, also return the indices of the nearest neighbors and their corresponding distances.
Returns:

  • P (array_like) – An \(N \times M\) affinity matrix expressing interactions between \(N\) new data points and the initial \(M\) data samples.
  • indices (np.ndarray) – Returned if return_distances=True. The indices of the \(k\) nearest neighbors in the existing embedding for every new data point.
  • distances (np.ndarray) – Returned if return_distances=True. The distances to the \(k\) nearest neighbors in the existing embedding for every new data point.
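
The indices and distances returned alongside the affinities can be pictured with a small brute-force search. This is only an illustrative sketch using the Python standard library: the helper name nearest_neighbors and the sample points are assumptions, and the library itself relies on the nearest neighbor methods described in the class documentation rather than on this loop.

```python
import math

def nearest_neighbors(new_points, initial_points, k):
    """Return (indices, distances) of the k nearest initial points per new point."""
    all_indices, all_distances = [], []
    for x in new_points:
        # Euclidean distance from the new point to every initial sample
        dists = [math.dist(x, y) for y in initial_points]
        # Positions of the k smallest distances, closest first
        order = sorted(range(len(dists)), key=dists.__getitem__)[:k]
        all_indices.append(order)
        all_distances.append([dists[j] for j in order])
    return all_indices, all_distances

# Hypothetical data: M = 4 initial samples, N = 2 new points
initial = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
new = [(0.1, 0.0), (4.0, 5.0)]
indices, distances = nearest_neighbors(new, initial, k=2)
```

Each row of indices refers to positions in the initial data, which is what allows a new point to be placed relative to its neighbors in the existing embedding.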

class openTSNE.affinity.PerplexityBasedNN(data, perplexity=30, method='auto', metric='euclidean', metric_params=None, symmetrize=True, n_jobs=1, random_state=None, verbose=False)

Compute affinities using nearest neighbors.

Please see the Parameter guide for more information.

Parameters:
  • data (np.ndarray) – The data matrix.
  • perplexity (float) – Perplexity can be thought of as a continuous analogue of the number of nearest neighbors \(k\) for which t-SNE will attempt to preserve distances.
  • method (str) – Specifies the nearest neighbor method to use. Can be exact, annoy, pynndescent, approx, or auto (default). approx uses Annoy if the input data matrix is not a sparse object and if Annoy supports the given metric. Otherwise it uses Pynndescent. auto uses exact nearest neighbors for N<1000 and the same heuristic as approx for N>=1000.
  • metric (Union[str, Callable]) – The metric to be used to compute affinities between points in the original space.
  • metric_params (dict) – Additional keyword arguments for the metric function.
  • symmetrize (bool) – Whether to symmetrize the affinity matrix. Standard t-SNE symmetrizes the interactions, but symmetrization is not performed when embedding new data.
  • n_jobs (int) – The number of threads to use while running t-SNE. This follows the scikit-learn convention, -1 meaning all processors, -2 meaning all but one, etc.
  • random_state (Union[int, RandomState]) – If the value is an int, random_state is the seed used by the random number generator. If the value is a RandomState instance, then it will be used as the random number generator. If the value is None, the random number generator is the RandomState instance used by np.random.
  • verbose (bool) –
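
To make the idea of perplexity as a continuous number of neighbors concrete, the sketch below tunes a Gaussian kernel for a single point until the perplexity of its affinity distribution matches a target. It follows the usual t-SNE recipe (perplexity is 2 raised to the Shannon entropy, and the kernel precision is found by binary search), written with the standard library only. The function names, the tolerance, and the example distances are assumptions, not the library's implementation.

```python
import math

def perplexity_of(probs):
    """Perplexity is 2 raised to the Shannon entropy (in bits)."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 ** entropy

def conditional_affinities(distances, perplexity, tol=1e-5, max_iter=200):
    """Binary-search the Gaussian precision so the row matches the perplexity."""
    d2 = [d * d for d in distances]
    d2_min = min(d2)  # shifting by the minimum avoids underflow to all zeros
    lo, hi = 0.0, math.inf
    beta = 1.0  # beta = 1 / (2 * sigma ** 2)
    probs = []
    for _ in range(max_iter):
        weights = [math.exp(-beta * (v - d2_min)) for v in d2]
        total = sum(weights)
        probs = [w / total for w in weights]
        diff = perplexity_of(probs) - perplexity
        if abs(diff) < tol:
            break
        if diff > 0:
            # Distribution is too flat: increase the precision
            lo = beta
            beta = beta * 2 if hi == math.inf else (lo + hi) / 2
        else:
            # Distribution is too peaked: decrease the precision
            hi = beta
            beta = (lo + hi) / 2
    return probs

# Hypothetical distances from one point to its five nearest neighbors
row = conditional_affinities([1.0, 2.0, 3.0, 4.0, 5.0], perplexity=3.0)
```

A uniform distribution over \(k\) neighbors has perplexity exactly \(k\), which is why the value reads as an effective neighborhood size.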
check_perplexity(perplexity)
set_perplexity(new_perplexity)

Change the perplexity of the affinity matrix.

Note that we only allow lowering the perplexity or restoring it to its original value. This restriction exists because setting a higher perplexity value requires recomputing all the nearest neighbors, which can take a long time. To avoid potential confusion as to why execution time is slow, this is not allowed. If you would like to increase the perplexity above the initial value, simply create a new instance.

Parameters:
  • new_perplexity (float) – The new perplexity.
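
The restriction described above can be sketched with a small stand-in class. PerplexityHolder is hypothetical and tracks only the perplexity value. Raising an exception is an assumption about how "not allowed" is enforced, and no affinities are recomputed here.

```python
class PerplexityHolder:
    """Hypothetical stand-in that models only the perplexity rule."""

    def __init__(self, perplexity):
        # Neighbors would be computed once, for this initial value
        self.original_perplexity = perplexity
        self.perplexity = perplexity

    def set_perplexity(self, new_perplexity):
        if new_perplexity > self.original_perplexity:
            # A higher value would need a fresh nearest neighbor search
            raise ValueError(
                "Cannot raise perplexity above the initial value; "
                "create a new instance instead."
            )
        self.perplexity = new_perplexity

holder = PerplexityHolder(perplexity=30)
holder.set_perplexity(10)   # lowering is allowed
holder.set_perplexity(30)   # restoring the original value is allowed
try:
    holder.set_perplexity(50)
    raised = False
except ValueError:
    raised = True           # going above the initial value is rejected
```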
to_new(data, perplexity=None, return_distances=False)

Compute the affinities of new samples to the initial samples.

This is necessary for embedding new data points into an existing embedding.

Please see the Parameter guide for more information.

Parameters:
  • data (np.ndarray) – The data points to be added to the existing embedding.
  • perplexity (float) – Perplexity can be thought of as a continuous analogue of the number of nearest neighbors \(k\) for which t-SNE will attempt to preserve distances.
  • return_distances (bool) – If True, also return the indices of the nearest neighbors and their corresponding distances.
Returns:

  • P (array_like) – An \(N \times M\) affinity matrix expressing interactions between \(N\) new data points and the initial \(M\) data samples.
  • indices (np.ndarray) – Returned if return_distances=True. The indices of the \(k\) nearest neighbors in the existing embedding for every new data point.
  • distances (np.ndarray) – Returned if return_distances=True. The distances to the \(k\) nearest neighbors in the existing embedding for every new data point.

class openTSNE.affinity.MultiscaleMixture(data, perplexities, method='auto', metric='euclidean', metric_params=None, symmetrize=True, n_jobs=1, random_state=None, verbose=False)

Calculate affinities using a Gaussian mixture kernel.

Instead of using a single perplexity to compute the affinities between data points, we can use a multiscale Gaussian kernel. This allows us to incorporate long-range interactions.

Please see the Parameter guide for more information.

Parameters:
  • data (np.ndarray) – The data matrix.
  • perplexities (List[float]) – A list of perplexity values, which will be used in the multiscale Gaussian kernel. Perplexity can be thought of as a continuous analogue of the number of nearest neighbors \(k\) for which t-SNE will attempt to preserve distances.
  • method (str) – Specifies the nearest neighbor method to use. Can be exact, annoy, pynndescent, approx, or auto (default). approx uses Annoy if the input data matrix is not a sparse object and if Annoy supports the given metric. Otherwise it uses Pynndescent. auto uses exact nearest neighbors for N<1000 and the same heuristic as approx for N>=1000.
  • metric (Union[str, Callable]) – The metric to be used to compute affinities between points in the original space.
  • metric_params (dict) – Additional keyword arguments for the metric function.
  • symmetrize (bool) – Whether to symmetrize the affinity matrix. Standard t-SNE symmetrizes the interactions, but symmetrization is not performed when embedding new data.
  • n_jobs (int) – The number of threads to use while running t-SNE. This follows the scikit-learn convention, -1 meaning all processors, -2 meaning all but one, etc.
  • random_state (Union[int, RandomState]) – If the value is an int, random_state is the seed used by the random number generator. If the value is a RandomState instance, then it will be used as the random number generator. If the value is None, the random number generator is the RandomState instance used by np.random.
  • verbose (bool) –
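
The selection rule given for the method parameter can be written out as a small decision function. This is a sketch of the documented heuristic only. The function name resolve_method and the annoy_metrics argument are hypothetical, and the example metric set is a placeholder rather than the actual list of metrics Annoy supports.

```python
def resolve_method(method, n_samples, is_sparse, metric, annoy_metrics):
    """Map 'auto' and 'approx' to a concrete nearest neighbor method."""
    if method == "auto":
        # Exact search for small data, approximate search otherwise
        method = "exact" if n_samples < 1000 else "approx"
    if method == "approx":
        # Annoy needs dense input and a metric it supports
        if not is_sparse and metric in annoy_metrics:
            return "annoy"
        return "pynndescent"
    return method

# Placeholder set for illustration; not the library's real list
supported = {"euclidean", "manhattan"}
small = resolve_method("auto", 500, False, "euclidean", supported)
large_dense = resolve_method("auto", 5000, False, "euclidean", supported)
large_sparse = resolve_method("auto", 5000, True, "euclidean", supported)
```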

check_perplexities(perplexities: Iterable[float]) → Iterable[float]

Check and correct/truncate perplexities.

If a perplexity is too large, it is corrected to the largest allowed value. It is then inserted into the list of perplexities only if that value doesn’t already exist in the list.
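
As a rough sketch of the correction described above, the helper below clamps oversized perplexities and skips a clamped value that is already present. The name truncate_perplexities and the max_perplexity argument are hypothetical. How the library derives the largest allowed value from the data is not stated here, so it is passed in explicitly.

```python
def truncate_perplexities(perplexities, max_perplexity):
    """Clamp perplexities that are too large, skipping repeated clamped values."""
    checked = []
    for perplexity in perplexities:
        if perplexity > max_perplexity:
            # Too large: use the largest allowed value, but only once
            if max_perplexity not in checked:
                checked.append(max_perplexity)
        else:
            checked.append(perplexity)
    return checked

# Hypothetical limit of 333 for illustration
result = truncate_perplexities([30, 500, 1000], max_perplexity=333)
```

Here both 500 and 1000 exceed the limit, so the result keeps 30 and a single entry of 333.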

set_perplexities(new_perplexities)

Change the perplexities of the affinity matrix.

Note that we only allow lowering the perplexities or restoring them to their original maximum value. This restriction exists because setting a higher perplexity value requires recomputing all the nearest neighbors, which can take a long time. To avoid potential confusion as to why execution time is slow, this is not allowed. If you would like to increase the perplexity above the initial value, simply create a new instance.

Parameters:
  • new_perplexities (List[float]) – The new list of perplexities.
to_new(data, perplexities=None, return_distances=False)

Compute the affinities of new samples to the initial samples.

This is necessary for embedding new data points into an existing embedding.

Please see the Parameter guide for more information.

Parameters:
  • data (np.ndarray) – The data points to be added to the existing embedding.
  • perplexities (List[float]) – A list of perplexity values, which will be used in the multiscale Gaussian kernel. Perplexity can be thought of as a continuous analogue of the number of nearest neighbors \(k\) for which t-SNE will attempt to preserve distances.
  • return_distances (bool) – If True, also return the indices of the nearest neighbors and their corresponding distances.
Returns:

  • P (array_like) – An \(N \times M\) affinity matrix expressing interactions between \(N\) new data points and the initial \(M\) data samples.
  • indices (np.ndarray) – Returned if return_distances=True. The indices of the \(k\) nearest neighbors in the existing embedding for every new data point.
  • distances (np.ndarray) – Returned if return_distances=True. The distances to the \(k\) nearest neighbors in the existing embedding for every new data point.

class openTSNE.affinity.Multiscale(data, perplexities, method='auto', metric='euclidean', metric_params=None, symmetrize=True, n_jobs=1, random_state=None, verbose=False)

Calculate affinities using averaged Gaussian perplexities.

In contrast to MultiscaleMixture, which uses a Gaussian mixture kernel, here we first compute single-scale Gaussian kernels, convert them to probability distributions, and then average them across scales.
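
The difference between the two multiscale approaches can be illustrated for a single point. In this sketch the bandwidths are given directly rather than derived from perplexities, and the mixture variant (summing unnormalized kernels and normalizing once) is an assumed form of a Gaussian mixture kernel. Neither function is the library's implementation.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def gaussian(distances, sigma):
    """Unnormalized Gaussian kernel values for one bandwidth."""
    return [math.exp(-d * d / (2 * sigma * sigma)) for d in distances]

def averaged_multiscale(distances, sigmas):
    """Normalize each single-scale kernel, then average across scales."""
    rows = [normalize(gaussian(distances, s)) for s in sigmas]
    return [sum(col) / len(sigmas) for col in zip(*rows)]

def mixture_multiscale(distances, sigmas):
    """Sum the unnormalized kernels, then normalize once (assumed form)."""
    rows = [gaussian(distances, s) for s in sigmas]
    return normalize([sum(col) for col in zip(*rows)])

# Hypothetical distances from one point to three neighbors, two bandwidths
distances = [0.5, 1.0, 3.0]
averaged = averaged_multiscale(distances, sigmas=[0.5, 3.0])
mixture = mixture_multiscale(distances, sigmas=[0.5, 3.0])
narrow_only = normalize(gaussian(distances, 0.5))
```

Both results are valid probability distributions, but they weight the scales differently: averaging gives each scale an equal share, whereas in the summed form a wide kernel contributes more total mass.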

Please see the Parameter guide for more information.

Parameters:
  • data (np.ndarray) – The data matrix.
  • perplexities (List[float]) – A list of perplexity values, which will be used in the multiscale Gaussian kernel. Perplexity can be thought of as a continuous analogue of the number of nearest neighbors \(k\) for which t-SNE will attempt to preserve distances.
  • method (str) – Specifies the nearest neighbor method to use. Can be exact, annoy, pynndescent, approx, or auto (default). approx uses Annoy if the input data matrix is not a sparse object and if Annoy supports the given metric. Otherwise it uses Pynndescent. auto uses exact nearest neighbors for N<1000 and the same heuristic as approx for N>=1000.
  • metric (Union[str, Callable]) – The metric to be used to compute affinities between points in the original space.
  • metric_params (dict) – Additional keyword arguments for the metric function.
  • symmetrize (bool) – Whether to symmetrize the affinity matrix. Standard t-SNE symmetrizes the interactions, but symmetrization is not performed when embedding new data.
  • n_jobs (int) – The number of threads to use while running t-SNE. This follows the scikit-learn convention, -1 meaning all processors, -2 meaning all but one, etc.
  • random_state (Union[int, RandomState]) – If the value is an int, random_state is the seed used by the random number generator. If the value is a RandomState instance, then it will be used as the random number generator. If the value is None, the random number generator is the RandomState instance used by np.random.
  • verbose (bool) –
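
The n_jobs convention mentioned in the parameter list can be spelled out as follows. This sketch mirrors the rule used by scikit-learn and joblib, where a negative value n maps to cpu_count + 1 + n workers. The helper name resolve_n_jobs is hypothetical and the explicit cpu_count argument exists only to make the example reproducible.

```python
import os

def resolve_n_jobs(n_jobs, cpu_count=None):
    """Translate an n_jobs value into a concrete number of threads."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all processors, -2 -> all but one, and so on
        return max(cpu_count + 1 + n_jobs, 1)
    return n_jobs

# On a hypothetical 8-core machine
all_cores = resolve_n_jobs(-1, cpu_count=8)
all_but_one = resolve_n_jobs(-2, cpu_count=8)
```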

class openTSNE.affinity.FixedSigmaNN(data, sigma, k=30, method='auto', metric='euclidean', metric_params=None, symmetrize=True, n_jobs=1, random_state=None, verbose=False)

Compute affinities using nearest neighbors and a fixed bandwidth for the Gaussians in the ambient space.

Using a fixed Gaussian bandwidth can enable us to find smaller clusters of data points than we could with the bandwidths determined automatically from perplexity. Note, however, that choosing a suitable bandwidth mostly requires trial and error.

Parameters:
  • data (np.ndarray) – The data matrix.
  • sigma (float) – The bandwidth to use for the Gaussian kernels in the ambient space.
  • k (int) – The number of nearest neighbors to consider for each kernel.
  • method (str) – Specifies the nearest neighbor method to use. Can be exact, annoy, pynndescent, approx, or auto (default). approx uses Annoy if the input data matrix is not a sparse object and if Annoy supports the given metric. Otherwise it uses Pynndescent. auto uses exact nearest neighbors for N<1000 and the same heuristic as approx for N>=1000.
  • metric (Union[str, Callable]) – The metric to be used to compute affinities between points in the original space.
  • metric_params (dict) – Additional keyword arguments for the metric function.
  • symmetrize (bool) – Whether to symmetrize the affinity matrix. Standard t-SNE symmetrizes the interactions, but symmetrization is not performed when embedding new data.
  • n_jobs (int) – The number of threads to use while running t-SNE. This follows the scikit-learn convention, -1 meaning all processors, -2 meaning all but one, etc.
  • random_state (Union[int, RandomState]) – If the value is an int, random_state is the seed used by the random number generator. If the value is a RandomState instance, then it will be used as the random number generator. If the value is None, the random number generator is the RandomState instance used by np.random.
  • verbose (bool) –
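
For reference, the symmetrization that standard t-SNE applies to conditional affinities, \(p_{ij} = (p_{j|i} + p_{i|j}) / 2N\), can be sketched on a small dense matrix. The function name and the example matrix are assumptions. The library works with its own matrix representation, so this only illustrates the formula.

```python
def symmetrize(conditional):
    """Turn row-normalized conditional affinities into symmetric joint ones."""
    n = len(conditional)
    return [
        [(conditional[i][j] + conditional[j][i]) / (2 * n) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical conditional affinities: each row sums to one, zero diagonal
conditional = [
    [0.0, 0.7, 0.3],
    [0.5, 0.0, 0.5],
    [0.2, 0.8, 0.0],
]
joint = symmetrize(conditional)
```

After symmetrization the entries of the whole matrix sum to one and entry (i, j) equals entry (j, i). When embedding new data the matrix is rectangular, which is consistent with symmetrization being skipped in that case.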

to_new(data, k=None, sigma=None, return_distances=False)

Compute the affinities of new samples to the initial samples.

This is necessary for embedding new data points into an existing embedding.

Parameters:
  • data (np.ndarray) – The data points to be added to the existing embedding.
  • k (int) – The number of nearest neighbors to consider for each kernel.
  • sigma (float) – The bandwidth to use for the Gaussian kernels in the ambient space.
  • return_distances (bool) – If True, also return the indices of the nearest neighbors and their corresponding distances.
Returns:

  • P (array_like) – An \(N \times M\) affinity matrix expressing interactions between \(N\) new data points and the initial \(M\) data samples.
  • indices (np.ndarray) – Returned if return_distances=True. The indices of the \(k\) nearest neighbors in the existing embedding for every new data point.
  • distances (np.ndarray) – Returned if return_distances=True. The distances to the \(k\) nearest neighbors in the existing embedding for every new data point.
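
A minimal sketch of how an \(N \times M\) matrix of fixed-bandwidth affinities could be formed for new points is shown below. It assumes Euclidean distances, a brute-force neighbor search, and row normalization. The function name fixed_sigma_rows and the data are hypothetical, and the dense list-of-lists output is used only for readability.

```python
import math

def fixed_sigma_rows(new_points, initial_points, sigma, k):
    """Build one affinity row per new point over its k nearest initial samples."""
    rows = []
    for x in new_points:
        dists = [math.dist(x, y) for y in initial_points]
        neighbors = sorted(range(len(dists)), key=dists.__getitem__)[:k]
        # Same Gaussian bandwidth for every point
        weights = {
            j: math.exp(-dists[j] ** 2 / (2 * sigma ** 2)) for j in neighbors
        }
        # Assumes at least one weight is nonzero (no complete underflow)
        total = sum(weights.values())
        rows.append(
            [weights.get(j, 0.0) / total for j in range(len(initial_points))]
        )
    return rows

# Hypothetical data: M = 4 initial samples, N = 2 new points
initial = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
new = [(0.1, 0.0), (4.0, 5.0)]
P_new = fixed_sigma_rows(new, initial, sigma=1.0, k=2)
```

Each row sums to one and is nonzero only at the k nearest initial samples. With a fixed sigma, a new point far from all initial samples spreads its weight very differently from one inside a dense region, which is part of why choosing the bandwidth takes experimentation.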