Affinity

class openTSNE.affinity.Affinities(verbose=False)[source]

Compute the affinities between samples.

t-SNE takes as input an affinity matrix \(P\) and uses nothing else from the data. This means we can apply t-SNE to any data for which the interactions between samples can be expressed as an affinity matrix.

P

The \(N \times N\) affinity matrix expressing interactions between \(N\) initial data samples.

Type: array_like

verbose

Type: bool

property n_samples

to_new(data, return_distances=False)[source]

Compute the affinities of new samples to the initial samples.

This is necessary for embedding new data points into an existing embedding.

Parameters:

data (np.ndarray) – The data points to be added to the existing embedding.
return_distances (bool) – If needed, the function can return the indices of the nearest neighbors and their corresponding distances.

Returns:

P (array_like) – An \(N \times M\) affinity matrix expressing interactions between \(N\) new data points and the initial \(M\) data samples.
indices (np.ndarray) – Returned if return_distances=True. The indices of the \(k\) nearest neighbors in the existing embedding for every new data point.
distances (np.ndarray) – Returned if return_distances=True. The distances to the \(k\) nearest neighbors in the existing embedding for every new data point.

class openTSNE.affinity.FixedSigmaNN(data=None, sigma=None, k=30, method='auto', metric='euclidean', metric_params=None, symmetrize=True, n_jobs=1, random_state=None, verbose=False, knn_kwargs=None, knn_index=None)[source]

Compute affinities using nearest neighbors and a fixed bandwidth for the Gaussians in the ambient space.

Using a fixed Gaussian bandwidth can reveal smaller clusters of data points than the bandwidths that are determined automatically from the perplexity. Note, however, that choosing a suitable bandwidth mostly requires trial and error.
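As a rough illustration of the idea, the following NumPy sketch computes Gaussian affinities over each point's k nearest neighbors with one shared bandwidth. It is not the library's implementation: the helper name fixed_sigma_affinities is hypothetical, distances are computed exactly, and the symmetrization and normalization steps are assumptions about how such a matrix is typically prepared for t-SNE.

```python
import numpy as np

def fixed_sigma_affinities(data, sigma, k):
    """Gaussian affinities with one shared bandwidth over each point's k nearest neighbors."""
    n = data.shape[0]
    # Exact pairwise Euclidean distances (fine for a small illustration)
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest neighbors of every point
    neighbors = np.argsort(dist, axis=1)[:, :k]
    rows = np.arange(n)[:, None]

    # Conditional probabilities p_{j|i}, using the same sigma for every point
    kernel = np.exp(-dist[rows, neighbors] ** 2 / (2 * sigma ** 2))
    P = np.zeros((n, n))
    P[rows, neighbors] = kernel / kernel.sum(axis=1, keepdims=True)

    # Symmetrize, then normalize so the whole matrix sums to 1
    P = (P + P.T) / 2
    return P / P.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
P = fixed_sigma_affinities(X, sigma=1.0, k=10)
```

A smaller sigma concentrates each row on the very closest neighbors, which is what makes small clusters easier to separate.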

Parameters:

data (np.ndarray) – The data matrix.
sigma (float) – The bandwidth to use for the Gaussian kernels in the ambient space.
k (int) – The number of nearest neighbors to consider for each kernel.
method (str) – Specifies the nearest neighbor method to use. Can be exact, annoy, pynndescent, hnsw, approx, or auto (default). approx uses Annoy if the input data matrix is not a sparse object and if Annoy supports the given metric. Otherwise, it uses Pynndescent. auto uses exact nearest neighbors for N<1000 and the same heuristic as approx for N>=1000.
metric (Union[str, Callable]) – The metric to be used to compute affinities between points in the original space.
metric_params (dict) – Additional keyword arguments for the metric function.
symmetrize (bool) – Symmetrize the affinity matrix. During standard t-SNE optimization, the affinities are symmetrized. However, when embedding new data points into existing embeddings, symmetrization is not performed.
n_jobs (int) – The number of threads to use while running t-SNE. This follows the scikit-learn convention, -1 meaning all processors, -2 meaning all but one, etc.
random_state (Union[int, RandomState]) – If the value is an int, random_state is the seed used by the random number generator. If the value is a RandomState instance, then it will be used as the random number generator. If the value is None, the random number generator is the RandomState instance used by np.random.
verbose (bool)
knn_kwargs (Optional[None, dict]) – Optional keyword arguments that will be passed to the knn_index.
knn_index (Optional[nearest_neighbors.KNNIndex]) – Optionally, a precomputed openTSNE.nearest_neighbors.KNNIndex object can be specified. This option will ignore any KNN-related parameters. When knn_index is specified, data must be set to None.
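The rule described for the method parameter can be written down as a small decision function. This is only a sketch of the documented heuristic: choose_knn_method and its arguments are hypothetical names, and ANNOY_METRICS is an illustrative placeholder rather than the set of metrics that Annoy actually supports.

```python
def choose_knn_method(method, n_samples, is_sparse, metric, annoy_metrics):
    """Resolve 'auto' and 'approx' to a concrete nearest neighbor method."""
    if method == "auto":
        # Small data sets use exact search; larger ones follow the 'approx' rule
        if n_samples < 1000:
            return "exact"
        method = "approx"
    if method == "approx":
        # Annoy only for dense input and a metric it supports
        if not is_sparse and metric in annoy_metrics:
            return "annoy"
        return "pynndescent"
    # 'exact', 'annoy', 'pynndescent' and 'hnsw' are used as given
    return method

# Illustrative placeholder only; the real list lives in the library
ANNOY_METRICS = {"euclidean", "manhattan", "cosine"}

print(choose_knn_method("auto", 500, False, "euclidean", ANNOY_METRICS))
print(choose_knn_method("auto", 5000, True, "euclidean", ANNOY_METRICS))
```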

to_new(data, k=None, sigma=None, return_distances=False)[source]

Compute the affinities of new samples to the initial samples.

This is necessary for embedding new data points into an existing embedding.

Parameters:

data (np.ndarray) – The data points to be added to the existing embedding.
k (int) – The number of nearest neighbors to consider for each kernel.
sigma (float) – The bandwidth to use for the Gaussian kernels in the ambient space.
return_distances (bool) – If needed, the function can return the indices of the nearest neighbors and their corresponding distances.

Returns:

P (array_like) – An \(N \times M\) affinity matrix expressing interactions between \(N\) new data points and the initial \(M\) data samples.
indices (np.ndarray) – Returned if return_distances=True. The indices of the \(k\) nearest neighbors in the existing embedding for every new data point.
distances (np.ndarray) – Returned if return_distances=True. The distances to the \(k\) nearest neighbors in the existing embedding for every new data point.
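To make the shapes of the return values concrete, here is a minimal NumPy sketch of computing affinities from new points to a set of reference points with a fixed bandwidth. It assumes exact Euclidean distances and row-wise normalization without symmetrization; affinities_to_new is a hypothetical helper, not the method's actual code.

```python
import numpy as np

def affinities_to_new(reference, new_points, sigma, k):
    """Affinities from each new point to its k nearest reference points."""
    n_new, n_ref = new_points.shape[0], reference.shape[0]

    # Distances from every new point to every reference point
    diff = new_points[:, None, :] - reference[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # k nearest reference points and their distances, nearest first
    indices = np.argsort(dist, axis=1)[:, :k]
    distances = np.take_along_axis(dist, indices, axis=1)

    # Each row is a probability distribution over the reference points;
    # no symmetrization is applied because the matrix is not square
    kernel = np.exp(-distances ** 2 / (2 * sigma ** 2))
    P = np.zeros((n_new, n_ref))
    np.put_along_axis(P, indices, kernel / kernel.sum(axis=1, keepdims=True), axis=1)
    return P, indices, distances

rng = np.random.default_rng(0)
reference = rng.normal(size=(100, 4))
new_points = rng.normal(size=(7, 4))
P, indices, distances = affinities_to_new(reference, new_points, sigma=1.0, k=5)
print(P.shape, indices.shape, distances.shape)
```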

class openTSNE.affinity.Multiscale(data=None, perplexities=None, method='auto', metric='euclidean', metric_params=None, symmetrize=True, n_jobs=1, random_state=None, verbose=False, knn_kwargs=None, knn_index=None)[source]

Calculate affinities using averaged Gaussian perplexities.

In contrast to MultiscaleMixture, which uses a Gaussian mixture kernel, here, we first compute single scale Gaussian kernels, convert them to probability distributions, then average them out between scales.

Please see the Parameter guide for more information.

Parameters:

data (np.ndarray) – The data matrix.
perplexities (List[float]) – A list of perplexity values, which will be used in the multiscale Gaussian kernel. Perplexity can be thought of as a continuous analogue of the number \(k\) of nearest neighbors for which t-SNE will attempt to preserve distances.
method (str) – Specifies the nearest neighbor method to use. Can be exact, annoy, pynndescent, hnsw, approx, or auto (default). approx uses Annoy if the input data matrix is not a sparse object and if Annoy supports the given metric. Otherwise, it uses Pynndescent. auto uses exact nearest neighbors for N<1000 and the same heuristic as approx for N>=1000.
metric (Union[str, Callable]) – The metric to be used to compute affinities between points in the original space.
metric_params (dict) – Additional keyword arguments for the metric function.
symmetrize (bool) – Symmetrize the affinity matrix. During standard t-SNE optimization, the affinities are symmetrized. However, when embedding new data points into existing embeddings, symmetrization is not performed.
n_jobs (int) – The number of threads to use while running t-SNE. This follows the scikit-learn convention, -1 meaning all processors, -2 meaning all but one, etc.
random_state (Union[int, RandomState]) – If the value is an int, random_state is the seed used by the random number generator. If the value is a RandomState instance, then it will be used as the random number generator. If the value is None, the random number generator is the RandomState instance used by np.random.
verbose (bool)
knn_kwargs (Optional[None, dict]) – Optional keyword arguments that will be passed to the knn_index.
knn_index (Optional[nearest_neighbors.KNNIndex]) – Optionally, a precomputed openTSNE.nearest_neighbors.KNNIndex object can be specified. This option will ignore any KNN-related parameters. When knn_index is specified, data must be set to None.
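The difference between averaging per-scale distributions and using a mixture kernel can be seen in a few lines of NumPy. The sketch below is a simplification under stated assumptions: it takes the Gaussian bandwidths as given (one shared sigma per scale instead of per-point bandwidths found from the perplexities), and the function names and the squared distances are made up for illustration.

```python
import numpy as np

def gaussian_kernel(sq_dists, sigma):
    return np.exp(-sq_dists / (2 * sigma ** 2))

def normalize_rows(a):
    return a / a.sum(axis=1, keepdims=True)

def averaged_multiscale(sq_dists, sigmas):
    """Turn each scale into a distribution first, then average the distributions."""
    per_scale = [normalize_rows(gaussian_kernel(sq_dists, s)) for s in sigmas]
    return np.mean(per_scale, axis=0)

def mixture_multiscale(sq_dists, sigmas):
    """Sum the kernels of all scales first, then normalize the mixture once."""
    mixture = sum(gaussian_kernel(sq_dists, s) for s in sigmas)
    return normalize_rows(mixture)

# Squared distances from 3 points to their 4 nearest neighbors (made-up values)
sq_dists = np.array([
    [0.1, 0.5, 2.0, 9.0],
    [0.2, 0.3, 1.0, 4.0],
    [0.5, 1.5, 2.5, 6.0],
])
sigmas = [0.5, 3.0]  # a narrow and a wide scale

P_avg = averaged_multiscale(sq_dists, sigmas)
P_mix = mixture_multiscale(sq_dists, sigmas)
print(P_avg[0])
print(P_mix[0])
```

Both results are valid row-wise distributions, but they weight the scales differently: in the mixture, the wide kernel contributes more total mass and therefore dominates, whereas averaging gives every scale the same weight.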

class openTSNE.affinity.MultiscaleMixture(data=None, perplexities=None, method='auto', metric='euclidean', metric_params=None, symmetrize=True, n_jobs=1, random_state=None, verbose=False, knn_kwargs=None, knn_index=None)[source]

Calculate affinities using a Gaussian mixture kernel.

Instead of using a single perplexity to compute the affinities between data points, we can use a multiscale Gaussian kernel instead. This allows us to incorporate long range interactions.

Please see the Parameter guide for more information.

Parameters:

data (np.ndarray) – The data matrix.
perplexities (List[float]) – A list of perplexity values, which will be used in the multiscale Gaussian kernel. Perplexity can be thought of as a continuous analogue of the number \(k\) of nearest neighbors for which t-SNE will attempt to preserve distances.
method (str) – Specifies the nearest neighbor method to use. Can be exact, annoy, pynndescent, hnsw, approx, or auto (default). approx uses Annoy if the input data matrix is not a sparse object and if Annoy supports the given metric. Otherwise, it uses Pynndescent. auto uses exact nearest neighbors for N<1000 and the same heuristic as approx for N>=1000.
metric (Union[str, Callable]) – The metric to be used to compute affinities between points in the original space.
metric_params (dict) – Additional keyword arguments for the metric function.
symmetrize (bool) – Symmetrize the affinity matrix. During standard t-SNE optimization, the affinities are symmetrized. However, when embedding new data points into existing embeddings, symmetrization is not performed.
n_jobs (int) – The number of threads to use while running t-SNE. This follows the scikit-learn convention, -1 meaning all processors, -2 meaning all but one, etc.
random_state (Union[int, RandomState]) – If the value is an int, random_state is the seed used by the random number generator. If the value is a RandomState instance, then it will be used as the random number generator. If the value is None, the random number generator is the RandomState instance used by np.random.
verbose (bool)
knn_kwargs (Optional[None, dict]) – Optional keyword arguments that will be passed to the knn_index.
knn_index (Optional[nearest_neighbors.KNNIndex]) – Optionally, a precomputed openTSNE.nearest_neighbors.KNNIndex object can be specified. This option will ignore any KNN-related parameters. When knn_index is specified, data must be set to None.

check_perplexities(perplexities, n_samples)[source]

Check and correct/truncate perplexities.

If a perplexity is too large, it is corrected to the largest allowed value. It is then inserted into the list of perplexities only if that value doesn’t already exist in the list.
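The correction described above might look roughly like the following. This is a sketch, not the method itself: truncate_perplexities is a hypothetical helper, and max_allowed is passed in explicitly here, whereas the real method derives the limit from n_samples. Sorting the input first is a choice made in this sketch so that in-range values are seen before corrected ones.

```python
def truncate_perplexities(perplexities, max_allowed):
    """Clip perplexities above max_allowed and skip duplicates created by clipping."""
    usable = []
    for perplexity in sorted(perplexities):
        if perplexity > max_allowed:
            # Too large: correct to the largest allowed value,
            # but only keep it if that value is not in the list yet
            if max_allowed not in usable:
                usable.append(max_allowed)
        else:
            usable.append(perplexity)
    return usable

print(truncate_perplexities([50, 500, 1000], max_allowed=100))
```

Here both 500 and 1000 are corrected to 100, and the corrected value is kept only once.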

set_perplexities(new_perplexities)[source]

Change the perplexities of the affinity matrix.

Note that we only allow lowering the perplexities or restoring them to their original maximum value. This restriction exists because setting a higher perplexity value requires recomputing all the nearest neighbors, which can take a long time. To avoid potential confusion as to why execution time is slow, this is not allowed. If you would like to increase the perplexity above the initial value, simply create a new instance.

Parameters:

new_perplexities (List[float]) – The new list of perplexities.

to_new(data, perplexities=None, return_distances=False)[source]

Compute the affinities of new samples to the initial samples.

This is necessary for embedding new data points into an existing embedding.

Please see the Parameter guide for more information.

Parameters:

data (np.ndarray) – The data points to be added to the existing embedding.
perplexities (List[float]) – A list of perplexity values, which will be used in the multiscale Gaussian kernel. Perplexity can be thought of as a continuous analogue of the number \(k\) of nearest neighbors for which t-SNE will attempt to preserve distances.
return_distances (bool) – If needed, the function can return the indices of the nearest neighbors and their corresponding distances.

Returns:

P (array_like) – An \(N \times M\) affinity matrix expressing interactions between \(N\) new data points and the initial \(M\) data samples.
indices (np.ndarray) – Returned if return_distances=True. The indices of the \(k\) nearest neighbors in the existing embedding for every new data point.
distances (np.ndarray) – Returned if return_distances=True. The distances to the \(k\) nearest neighbors in the existing embedding for every new data point.

class openTSNE.affinity.PerplexityBasedNN(data=None, perplexity=30, method='auto', metric='euclidean', metric_params=None, symmetrize=True, n_jobs=1, random_state=None, verbose=False, k_neighbors='auto', knn_kwargs=None, knn_index=None)[source]

Compute standard, Gaussian affinities using nearest neighbors.

Please see the Parameter guide for more information.

Parameters:

data (np.ndarray) – The data matrix.
perplexity (float) – Perplexity can be thought of as a continuous analogue of the number \(k\) of nearest neighbors for which t-SNE will attempt to preserve distances.
method (str) – Specifies the nearest neighbor method to use. Can be exact, annoy, pynndescent, hnsw, approx, or auto (default). approx uses Annoy if the input data matrix is not a sparse object and if Annoy supports the given metric. Otherwise, it uses Pynndescent. auto uses exact nearest neighbors for N<1000 and the same heuristic as approx for N>=1000.
metric (Union[str, Callable]) – The metric to be used to compute affinities between points in the original space.
metric_params (dict) – Additional keyword arguments for the metric function.
symmetrize (bool) – Symmetrize the affinity matrix. During standard t-SNE optimization, the affinities are symmetrized. However, when embedding new data points into existing embeddings, symmetrization is not performed.
n_jobs (int) – The number of threads to use while running t-SNE. This follows the scikit-learn convention, -1 meaning all processors, -2 meaning all but one, etc.
random_state (Union[int, RandomState]) – If the value is an int, random_state is the seed used by the random number generator. If the value is a RandomState instance, then it will be used as the random number generator. If the value is None, the random number generator is the RandomState instance used by np.random.
verbose (bool)
k_neighbors (int or auto) – The number of neighbors to use in the kNN graph. If auto (default), it is set to three times the perplexity.
knn_kwargs (Optional[None, dict]) – Optional keyword arguments that will be passed to the knn_index.
knn_index (Optional[nearest_neighbors.KNNIndex]) – Optionally, a precomputed openTSNE.nearest_neighbors.KNNIndex object can be specified. This option will ignore any KNN-related parameters. When knn_index is specified, data must be set to None.
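For intuition about how a perplexity translates into Gaussian bandwidths, the sketch below runs the usual per-point binary search over the kernel precision until each row's perplexity matches the target. It is a plain NumPy illustration under assumptions, not the library's implementation: the function names are hypothetical, distances are exact, and only the rule of using three times the perplexity as the number of neighbors is taken from the k_neighbors description.

```python
import numpy as np

def row_perplexity(P):
    """Perplexity exp(H) of every row, treating 0 * log(0) as 0."""
    log_p = np.log(np.where(P > 0, P, 1.0))
    return np.exp(-(P * log_p).sum(axis=1))

def perplexity_affinities(knn_sq_dists, perplexity, n_iter=200, tol=1e-10):
    """Find one Gaussian precision per point so each row has the target perplexity."""
    target_entropy = np.log(perplexity)
    P = np.zeros_like(knn_sq_dists, dtype=float)
    for i, d in enumerate(knn_sq_dists):
        d = d - d.min()                    # shift for numerical stability
        lo, hi, beta = 0.0, np.inf, 1.0    # beta = 1 / (2 * sigma**2)
        for _ in range(n_iter):
            p = np.exp(-beta * d)
            p /= p.sum()
            entropy = -(p * np.log(np.where(p > 0, p, 1.0))).sum()
            if abs(entropy - target_entropy) < tol:
                break
            if entropy > target_entropy:   # too flat: narrow the kernel
                lo = beta
                beta = beta * 2 if np.isinf(hi) else (lo + hi) / 2
            else:                          # too peaked: widen the kernel
                hi = beta
                beta = (lo + hi) / 2
        P[i] = p
    return P

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
perplexity = 5
k_neighbors = 3 * perplexity  # the documented "auto" rule

# Squared distances to the k nearest neighbors of every point
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
np.fill_diagonal(sq, np.inf)
knn_sq_dists = np.sort(sq, axis=1)[:, :k_neighbors]

P = perplexity_affinities(knn_sq_dists, perplexity)
print(row_perplexity(P)[:5])
```

The rows of P are conditional distributions over the neighbors; in standard t-SNE these would then be symmetrized and normalized.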

static check_perplexity(perplexity, k_neighbors)[source]

set_perplexity(new_perplexity)[source]

Change the perplexity of the affinity matrix.

Note that we only allow setting the perplexity to a value not larger than the number of neighbors used for the original perplexity. This restriction exists because setting a higher perplexity value requires recomputing all the nearest neighbors, which can take a long time. To avoid potential confusion as to why execution time is slow, this is not allowed. If you would like to increase the perplexity above that value, simply create a new instance.

Parameters:

new_perplexity (float) – The new perplexity.

to_new(data, perplexity=None, return_distances=False, k_neighbors='auto')[source]

Compute the affinities of new samples to the initial samples.

This is necessary for embedding new data points into an existing embedding.

Please see the Parameter guide for more information.

Parameters:

data (np.ndarray) – The data points to be added to the existing embedding.
perplexity (float) – Perplexity can be thought of as a continuous analogue of the number \(k\) of nearest neighbors for which t-SNE will attempt to preserve distances.
return_distances (bool) – If needed, the function can return the indices of the nearest neighbors and their corresponding distances.
k_neighbors (int or auto) – The number of neighbors to query the kNN graph for. If auto (default), it is set to three times the perplexity.

Returns:

P (array_like) – An \(N \times M\) affinity matrix expressing interactions between \(N\) new data points and the initial \(M\) data samples.
indices (np.ndarray) – Returned if return_distances=True. The indices of the \(k\) nearest neighbors in the existing embedding for every new data point.
distances (np.ndarray) – Returned if return_distances=True. The distances to the \(k\) nearest neighbors in the existing embedding for every new data point.

class openTSNE.affinity.PrecomputedAffinities(affinities, normalize=True)[source]

Use a precomputed affinity matrix.

Parameters:

affinities (scipy.sparse.csr_matrix, np.ndarray) – An N x N matrix containing the affinities.
normalize (bool) – Normalize the affinity matrix to sum to 1. Default is True.
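What the normalize option amounts to can be shown with a dense matrix. This is a minimal sketch assuming a NumPy array as input; normalize_affinities is a hypothetical helper, and sparse input is not handled here.

```python
import numpy as np

def normalize_affinities(affinities):
    """Scale a non-negative affinity matrix so that all entries sum to 1."""
    affinities = np.asarray(affinities, dtype=float)
    return affinities / affinities.sum()

# A small symmetric matrix of made-up pairwise affinities
A = np.array([
    [0, 2, 1],
    [2, 0, 3],
    [1, 3, 0],
])
P = normalize_affinities(A)
print(P.sum())
```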

to_new(data, return_distances=False)[source]

Compute the affinities of new samples to the initial samples.

This is necessary for embedding new data points into an existing embedding.

Parameters:

data (np.ndarray) – The data points to be added to the existing embedding.
return_distances (bool) – If needed, the function can return the indices of the nearest neighbors and their corresponding distances.

Returns:

P (array_like) – An \(N \times M\) affinity matrix expressing interactions between \(N\) new data points and the initial \(M\) data samples.
indices (np.ndarray) – Returned if return_distances=True. The indices of the \(k\) nearest neighbors in the existing embedding for every new data point.
distances (np.ndarray) – Returned if return_distances=True. The distances to the \(k\) nearest neighbors in the existing embedding for every new data point.

class openTSNE.affinity.Uniform(data=None, k_neighbors=30, method='auto', metric='euclidean', metric_params=None, symmetrize=True, n_jobs=1, random_state=None, verbose=False, knn_kwargs=None, knn_index=None)[source]

Compute affinities using nearest neighbors and a uniform kernel in the ambient space.

Parameters:

data (np.ndarray) – The data matrix.
k_neighbors (int)
method (str) – Specifies the nearest neighbor method to use. Can be exact, annoy, pynndescent, hnsw, approx, or auto (default). approx uses Annoy if the input data matrix is not a sparse object and if Annoy supports the given metric. Otherwise, it uses Pynndescent. auto uses exact nearest neighbors for N<1000 and the same heuristic as approx for N>=1000.
metric (Union[str, Callable]) – The metric to be used to compute affinities between points in the original space.
metric_params (dict) – Additional keyword arguments for the metric function.
symmetrize (Union[str, bool]) – Symmetrize the affinity matrix. During standard t-SNE optimization, the affinities are symmetrized. However, when embedding new data points into existing embeddings, symmetrization is not performed. The uniform affinity supports max and mean symmetrization, as well as no symmetrization via none. The max symmetrization yields a binary affinity matrix in which all non-zero elements (corresponding to edges of the kNN graph) are equal. The mean symmetrization computes (A + A.T) / 2, resulting in an affinity matrix with two possible non-zero values. Applying no symmetrization results in a non-symmetric affinity matrix. We default to mean symmetrization, but the default will change to max in future versions.
n_jobs (int) – The number of threads to use while running t-SNE. This follows the scikit-learn convention, -1 meaning all processors, -2 meaning all but one, etc.
random_state (Union[int, RandomState]) – If the value is an int, random_state is the seed used by the random number generator. If the value is a RandomState instance, then it will be used as the random number generator. If the value is None, the random number generator is the RandomState instance used by np.random.
verbose (bool)
knn_kwargs (Optional[None, dict]) – Optional keyword arguments that will be passed to the knn_index.
knn_index (Optional[nearest_neighbors.KNNIndex]) – Optionally, a precomputed openTSNE.nearest_neighbors.KNNIndex object can be specified. This option will ignore any KNN-related parameters. When knn_index is specified, data must be set to None.
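The effect of the symmetrize options on a uniform kNN graph can be checked on a small example. The sketch below assumes exact Euclidean neighbors and a final normalization so that the matrix sums to 1; the helper names are hypothetical and this is not the class's actual code.

```python
import numpy as np

def uniform_knn_graph(data, k):
    """Adjacency matrix with a 1 for each of a point's k nearest neighbors."""
    n = data.shape[0]
    sq = ((data[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(sq, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(sq, axis=1)[:, :k]
    A = np.zeros((n, n))
    np.put_along_axis(A, neighbors, 1.0, axis=1)
    return A

def symmetrize(A, mode):
    """Apply 'max', 'mean' or 'none' symmetrization, then normalize to sum 1."""
    if mode == "max":
        S = np.maximum(A, A.T)   # an edge in either direction counts fully
    elif mode == "mean":
        S = (A + A.T) / 2        # one-directional edges get half the weight
    elif mode == "none":
        S = A.copy()
    else:
        raise ValueError(f"Unknown symmetrization mode: {mode}")
    return S / S.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
A = uniform_knn_graph(X, k=3)

P_max = symmetrize(A, "max")
P_mean = symmetrize(A, "mean")
P_none = symmetrize(A, "none")

print(len(np.unique(P_max[P_max > 0])))    # number of distinct non-zero values
print(len(np.unique(P_mean[P_mean > 0])))
```

With mean symmetrization, mutual neighbors end up with twice the affinity of pairs connected in only one direction, which is where the two distinct non-zero values come from.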

to_new(data, k_neighbors=None, return_distances=False)[source]

Compute the affinities of new samples to the initial samples.

This is necessary for embedding new data points into an existing embedding.

Parameters:

data (np.ndarray) – The data points to be added to the existing embedding.
k_neighbors (int) – The number of nearest neighbors to consider.
return_distances (bool) – If needed, the function can return the indices of the nearest neighbors and their corresponding distances.

Returns:

P (array_like) – An \(N \times M\) affinity matrix expressing interactions between \(N\) new data points and the initial \(M\) data samples.
indices (np.ndarray) – Returned if return_distances=True. The indices of the \(k\) nearest neighbors in the existing embedding for every new data point.
distances (np.ndarray) – Returned if return_distances=True. The distances to the \(k\) nearest neighbors in the existing embedding for every new data point.