Affinity

class openTSNE.affinity.Affinities(verbose=False)
Compute the affinities between samples.
t-SNE takes as input an affinity matrix \(P\) and uses nothing else from the data. This means we can apply t-SNE to any data for which we can express interactions between samples as an affinity matrix.
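As a minimal sketch of what "any data" can mean here, the following builds an affinity matrix for a handful of strings from a hand-picked similarity (`difflib.SequenceMatcher`). The helper name `affinities_from_similarity` and the normalization steps (row-normalize, symmetrize, rescale to sum to one) are assumptions that follow the usual t-SNE construction; this is not code from openTSNE.

```python
from difflib import SequenceMatcher


def affinities_from_similarity(items, similarity):
    """Build a symmetric affinity matrix that sums to 1 (illustrative helper).

    Assumes every item has a positive similarity to at least one other item.
    """
    n = len(items)
    # Pairwise similarities; a sample has no affinity to itself.
    S = [[0.0 if i == j else similarity(items[i], items[j]) for j in range(n)]
         for i in range(n)]
    # Row-normalize so each sample's affinities form a distribution.
    cond = [[s / sum(row) for s in row] for row in S]
    # Symmetrize and rescale so that the whole matrix sums to 1.
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]


words = ["kitten", "sitting", "mitten", "banana"]
P = affinities_from_similarity(
    words, lambda a, b: SequenceMatcher(None, a, b).ratio()
)
```

The result is an \(N \times N\) matrix with a zero diagonal, which is the kind of object the `P` attribute below describes.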

P
The \(N \times N\) affinity matrix expressing interactions between the \(N\) initial data samples.
Type: array_like

verbose
Type: bool

n_samples

to_new(data, return_distances=False)
Compute the affinities of new samples to the initial samples.
This is necessary for embedding new data points into an existing embedding.
Parameters:  data (np.ndarray) – The data points to be added to the existing embedding.
 return_distances (bool) – If True, also return the indices of the nearest neighbors and their corresponding distances.
Returns:  P (array_like) – An \(N \times M\) affinity matrix expressing interactions between the \(N\) new data points and the \(M\) initial data samples.
 indices (np.ndarray) – Returned if return_distances=True. The indices of the \(k\) nearest neighbors in the existing embedding for every new data point.
 distances (np.ndarray) – Returned if return_distances=True. The distances to the \(k\) nearest neighbors in the existing embedding for every new data point.


class openTSNE.affinity.PerplexityBasedNN(data=None, perplexity=30, method='auto', metric='euclidean', metric_params=None, symmetrize=True, n_jobs=1, random_state=None, verbose=False, k_neighbors='auto', knn_index=None)
Compute affinities using nearest neighbors.
Please see the Parameter guide for more information.
Parameters:  data (np.ndarray) – The data matrix.
 perplexity (float) – Perplexity can be thought of as the continuous \(k\) number of nearest neighbors, for which t-SNE will attempt to preserve distances.
 method (str) – Specifies the nearest neighbor method to use. Can be exact, annoy, pynndescent, hnsw, approx, or auto (default). approx uses Annoy if the input data matrix is not a sparse object and if Annoy supports the given metric. Otherwise, it uses Pynndescent. auto uses exact nearest neighbors for N<1000 and the same heuristic as approx for N>=1000.
 metric (Union[str, Callable]) – The metric to be used to compute affinities between points in the original space.
 metric_params (dict) – Additional keyword arguments for the metric function.
 symmetrize (bool) – Symmetrize affinity matrix. Standard t-SNE symmetrizes the interactions, but when embedding new data, symmetrization is not performed.
 n_jobs (int) – The number of threads to use while running t-SNE. This follows the scikit-learn convention, -1 meaning all processors, -2 meaning all but one, etc.
 random_state (Union[int, RandomState]) – If the value is an int, random_state is the seed used by the random number generator. If the value is a RandomState instance, then it will be used as the random number generator. If the value is None, the random number generator is the RandomState instance used by np.random.
 verbose (bool)
 k_neighbors (int or auto) – The number of neighbors to use in the kNN graph. If auto (default), it is set to three times the perplexity.
 knn_index (Optional[nearest_neighbors.KNNIndex]) – Optionally, a precomputed openTSNE.nearest_neighbors.KNNIndex object can be specified. If given, any KNN-related parameters are ignored. When knn_index is specified, data must be set to None.
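To make the link between perplexity and "number of neighbors" concrete, here is a rough sketch of how a Gaussian bandwidth can be calibrated for a single point so that its neighbor distribution reaches a target perplexity (taken here as 2 raised to the Shannon entropy in bits). The function `calibrate_row`, its `tol` and `max_iter` arguments, and the sample distances are all illustrative assumptions; openTSNE's own implementation may differ.

```python
import math


def calibrate_row(sq_distances, perplexity, tol=1e-5, max_iter=200):
    """Gaussian affinities of one point to its neighbors, with the bandwidth
    chosen by bisection so that the perplexity matches the target."""
    lo, hi = 0.0, float("inf")
    beta = 1.0  # precision, beta = 1 / (2 * sigma ** 2)
    d_min = min(sq_distances)  # shift exponents for numerical stability
    for _ in range(max_iter):
        weights = [math.exp(-beta * (d - d_min)) for d in sq_distances]
        total = sum(weights)
        probs = [w / total for w in weights]
        entropy = -sum(p * math.log2(p) for p in probs if p > 0)
        diff = entropy - math.log2(perplexity)
        if abs(diff) < tol:
            break
        if diff > 0:  # distribution too flat: narrow the kernel
            lo = beta
            beta = beta * 2 if hi == float("inf") else (lo + hi) / 2
        else:  # distribution too peaked: widen the kernel
            hi = beta
            beta = (lo + hi) / 2
    return probs


perplexity = 5
k_neighbors = 3 * perplexity  # the documented "auto" choice
# Hypothetical squared distances from one point to its k nearest neighbors.
sq_distances = [0.1 * (i + 1) ** 2 for i in range(k_neighbors)]
probs = calibrate_row(sq_distances, perplexity)
achieved = 2 ** -sum(p * math.log2(p) for p in probs if p > 0)
```

Here `achieved` ends up very close to the requested perplexity of 5, even though 15 neighbors were supplied: the distribution behaves as if roughly five neighbors carried the weight.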

set_perplexity(new_perplexity)
Change the perplexity of the affinity matrix.
Note that we only allow setting the perplexity to a value not larger than the number of neighbors used for the original perplexity. This restriction exists because setting a higher perplexity value requires recomputing all the nearest neighbors, which can take a long time. To avoid potential confusion as to why execution time is slow, this is not allowed. If you would like to increase the perplexity above that value, simply create a new instance.
Parameters: new_perplexity (float) – The new perplexity.
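The restriction can be pictured with a small check like the one below. It assumes that a perplexity needs three times as many neighbors (the documented `k_neighbors='auto'` rule) and that the neighbors found at construction time are all that is available; the function names and the choice of `ValueError` are hypothetical, not openTSNE's.

```python
def neighbors_needed(perplexity):
    # Assumption: the k_neighbors="auto" rule, three times the perplexity.
    return int(3 * perplexity)


def check_new_perplexity(new_perplexity, n_stored_neighbors):
    """Return the neighbor count to reuse, or raise if more would be needed."""
    needed = neighbors_needed(new_perplexity)
    if needed > n_stored_neighbors:
        raise ValueError(
            "perplexity %.2f needs %d neighbors, but only %d were computed; "
            "create a new instance instead"
            % (new_perplexity, needed, n_stored_neighbors)
        )
    return needed


stored = neighbors_needed(30)  # neighbors kept for the original perplexity
reused = check_new_perplexity(10, stored)  # lowering the perplexity is fine
```

Under these assumptions, lowering the perplexity simply reuses a prefix of the stored neighbors, which is why it is cheap.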

to_new(data, perplexity=None, return_distances=False, k_neighbors='auto')
Compute the affinities of new samples to the initial samples.
This is necessary for embedding new data points into an existing embedding.
Please see the Parameter guide for more information.
Parameters:  data (np.ndarray) – The data points to be added to the existing embedding.
 perplexity (float) – Perplexity can be thought of as the continuous \(k\) number of nearest neighbors, for which t-SNE will attempt to preserve distances.
 return_distances (bool) – If True, also return the indices of the nearest neighbors and their corresponding distances.
 k_neighbors (int or auto) – The number of neighbors to query the kNN graph for. If auto (default), it is set to three times the perplexity.
Returns:  P (array_like) – An \(N \times M\) affinity matrix expressing interactions between the \(N\) new data points and the \(M\) initial data samples.
 indices (np.ndarray) – Returned if return_distances=True. The indices of the \(k\) nearest neighbors in the existing embedding for every new data point.
 distances (np.ndarray) – Returned if return_distances=True. The distances to the \(k\) nearest neighbors in the existing embedding for every new data point.

class openTSNE.affinity.MultiscaleMixture(data=None, perplexities=None, method='auto', metric='euclidean', metric_params=None, symmetrize=True, n_jobs=1, random_state=None, verbose=False, knn_index=None)
Calculate affinities using a Gaussian mixture kernel.
Instead of using a single perplexity to compute the affinities between data points, we can use a multiscale Gaussian kernel. This allows us to incorporate long-range interactions.
Please see the Parameter guide for more information.
Parameters:  data (np.ndarray) – The data matrix.
 perplexities (List[float]) – A list of perplexity values, which will be used in the multiscale Gaussian kernel. Perplexity can be thought of as the continuous \(k\) number of nearest neighbors, for which t-SNE will attempt to preserve distances.
 method (str) – Specifies the nearest neighbor method to use. Can be exact, annoy, pynndescent, hnsw, approx, or auto (default). approx uses Annoy if the input data matrix is not a sparse object and if Annoy supports the given metric. Otherwise, it uses Pynndescent. auto uses exact nearest neighbors for N<1000 and the same heuristic as approx for N>=1000.
 metric (Union[str, Callable]) – The metric to be used to compute affinities between points in the original space.
 metric_params (dict) – Additional keyword arguments for the metric function.
 symmetrize (bool) – Symmetrize affinity matrix. Standard t-SNE symmetrizes the interactions, but when embedding new data, symmetrization is not performed.
 n_jobs (int) – The number of threads to use while running t-SNE. This follows the scikit-learn convention, -1 meaning all processors, -2 meaning all but one, etc.
 random_state (Union[int, RandomState]) – If the value is an int, random_state is the seed used by the random number generator. If the value is a RandomState instance, then it will be used as the random number generator. If the value is None, the random number generator is the RandomState instance used by np.random.
 verbose (bool)
 knn_index (Optional[nearest_neighbors.KNNIndex]) – Optionally, a precomputed openTSNE.nearest_neighbors.KNNIndex object can be specified. If given, any KNN-related parameters are ignored. When knn_index is specified, data must be set to None.
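The idea of a Gaussian mixture kernel can be sketched for a single point as follows: Gaussians of several bandwidths are summed over the same neighbor distances, and the sum is normalized once. For brevity the bandwidths are passed in directly rather than derived from perplexities; `mixture_affinities`, the `sigmas` argument and the distances are illustrative assumptions, not the library's code.

```python
import math


def mixture_affinities(sq_distances, sigmas):
    """Affinities of one point to its neighbors under a sum of Gaussians."""
    weights = [
        sum(math.exp(-d / (2 * s ** 2)) for s in sigmas) for d in sq_distances
    ]
    total = sum(weights)
    return [w / total for w in weights]


sq_distances = [0.5, 1.0, 4.0, 9.0, 25.0]  # hypothetical squared distances
narrow = mixture_affinities(sq_distances, sigmas=[0.5])
mixed = mixture_affinities(sq_distances, sigmas=[0.5, 5.0])
```

With only the narrow kernel the farthest neighbor receives essentially no weight, while adding a wide component gives it a noticeable share. This is the long-range interaction the description refers to.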

check_perplexities(perplexities, n_samples)
Check and correct/truncate perplexities.
If a perplexity is too large, it is corrected to the largest allowed value. It is then inserted into the list of perplexities only if that value doesn’t already exist in the list.
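A sketch of that truncation logic is shown below. The upper bound `(n_samples - 1) / 3` is an assumption derived from the "three times the perplexity" neighbor rule, and `truncate_perplexities` is a hypothetical name; the actual bound used by the library may differ.

```python
def truncate_perplexities(perplexities, n_samples):
    """Clip perplexities that are too large and avoid duplicate clipped values."""
    max_perplexity = (n_samples - 1) / 3  # assumed largest allowed value
    usable = []
    for perplexity in sorted(perplexities):
        if perplexity > max_perplexity:
            # Insert the clipped value only if it is not already present.
            if max_perplexity not in usable:
                usable.append(max_perplexity)
        else:
            usable.append(perplexity)
    return usable


result = truncate_perplexities([10, 50, 500, 1000], n_samples=301)
```

With 301 samples, both 500 and 1000 exceed the assumed limit of 100, so they collapse into a single entry.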

set_perplexities(new_perplexities)
Change the perplexities of the affinity matrix.
Note that we only allow lowering the perplexities or restoring them to their original maximum value. This restriction exists because setting a higher perplexity value requires recomputing all the nearest neighbors, which can take a long time. To avoid potential confusion as to why execution time is slow, this is not allowed. If you would like to increase the perplexity above the initial value, simply create a new instance.
Parameters: new_perplexities (List[float]) – The new list of perplexities.

to_new(data, perplexities=None, return_distances=False)
Compute the affinities of new samples to the initial samples.
This is necessary for embedding new data points into an existing embedding.
Please see the Parameter guide for more information.
Parameters:  data (np.ndarray) – The data points to be added to the existing embedding.
 perplexities (List[float]) – A list of perplexity values, which will be used in the multiscale Gaussian kernel. Perplexity can be thought of as the continuous \(k\) number of nearest neighbors, for which t-SNE will attempt to preserve distances.
 return_distances (bool) – If True, also return the indices of the nearest neighbors and their corresponding distances.
Returns:  P (array_like) – An \(N \times M\) affinity matrix expressing interactions between the \(N\) new data points and the \(M\) initial data samples.
 indices (np.ndarray) – Returned if return_distances=True. The indices of the \(k\) nearest neighbors in the existing embedding for every new data point.
 distances (np.ndarray) – Returned if return_distances=True. The distances to the \(k\) nearest neighbors in the existing embedding for every new data point.

class openTSNE.affinity.Multiscale(data=None, perplexities=None, method='auto', metric='euclidean', metric_params=None, symmetrize=True, n_jobs=1, random_state=None, verbose=False, knn_index=None)
Calculate affinities using averaged Gaussian perplexities.
In contrast to MultiscaleMixture, which uses a Gaussian mixture kernel, here we first compute single-scale Gaussian kernels, convert them to probability distributions, then average them across scales.
Please see the Parameter guide for more information.
Parameters:  data (np.ndarray) – The data matrix.
 perplexities (List[float]) – A list of perplexity values, which will be used in the multiscale Gaussian kernel. Perplexity can be thought of as the continuous \(k\) number of nearest neighbors, for which t-SNE will attempt to preserve distances.
 method (str) – Specifies the nearest neighbor method to use. Can be exact, annoy, pynndescent, hnsw, approx, or auto (default). approx uses Annoy if the input data matrix is not a sparse object and if Annoy supports the given metric. Otherwise, it uses Pynndescent. auto uses exact nearest neighbors for N<1000 and the same heuristic as approx for N>=1000.
 metric (Union[str, Callable]) – The metric to be used to compute affinities between points in the original space.
 metric_params (dict) – Additional keyword arguments for the metric function.
 symmetrize (bool) – Symmetrize affinity matrix. Standard t-SNE symmetrizes the interactions, but when embedding new data, symmetrization is not performed.
 n_jobs (int) – The number of threads to use while running t-SNE. This follows the scikit-learn convention, -1 meaning all processors, -2 meaning all but one, etc.
 random_state (Union[int, RandomState]) – If the value is an int, random_state is the seed used by the random number generator. If the value is a RandomState instance, then it will be used as the random number generator. If the value is None, the random number generator is the RandomState instance used by np.random.
 verbose (bool)
 knn_index (Optional[nearest_neighbors.KNNIndex]) – Optionally, a precomputed openTSNE.nearest_neighbors.KNNIndex object can be specified. If given, any KNN-related parameters are ignored. When knn_index is specified, data must be set to None.
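The difference between averaging per-scale distributions and using a single mixture kernel can be seen in a small sketch for one point. The bandwidths are supplied directly instead of being calibrated from perplexities, and all function names and numbers are illustrative assumptions rather than openTSNE internals.

```python
import math


def gaussian_distribution(sq_distances, sigma):
    """Normalized single-scale Gaussian affinities for one point."""
    weights = [math.exp(-d / (2 * sigma ** 2)) for d in sq_distances]
    total = sum(weights)
    return [w / total for w in weights]


def averaged_affinities(sq_distances, sigmas):
    """Normalize each scale first, then average the distributions."""
    per_scale = [gaussian_distribution(sq_distances, s) for s in sigmas]
    return [sum(column) / len(sigmas) for column in zip(*per_scale)]


def mixture_affinities(sq_distances, sigmas):
    """Sum the unnormalized kernels first, then normalize once."""
    weights = [
        sum(math.exp(-d / (2 * s ** 2)) for s in sigmas) for d in sq_distances
    ]
    total = sum(weights)
    return [w / total for w in weights]


sq_distances = [0.5, 1.0, 4.0, 9.0, 25.0]  # hypothetical squared distances
sigmas = [0.5, 5.0]
averaged = averaged_affinities(sq_distances, sigmas)
mixed = mixture_affinities(sq_distances, sigmas)
```

Because each scale is normalized before averaging, the narrow kernel keeps half of the total mass, so the nearest neighbor receives more weight in `averaged` than in `mixed`, where the wide kernel dominates the sum.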

class openTSNE.affinity.FixedSigmaNN(data=None, sigma=None, k=30, method='auto', metric='euclidean', metric_params=None, symmetrize=True, n_jobs=1, random_state=None, verbose=False, knn_index=None)
Compute affinities using nearest neighbors and a fixed bandwidth for the Gaussians in the ambient space.
A fixed Gaussian bandwidth can reveal smaller clusters of data points than the bandwidths determined automatically from the perplexity. Note, however, that choosing a good bandwidth is mostly a matter of trial and error.
Parameters:  data (np.ndarray) – The data matrix.
 sigma (float) – The bandwidth to use for the Gaussian kernels in the ambient space.
 k (int) – The number of nearest neighbors to consider for each kernel.
 method (str) – Specifies the nearest neighbor method to use. Can be exact, annoy, pynndescent, hnsw, approx, or auto (default). approx uses Annoy if the input data matrix is not a sparse object and if Annoy supports the given metric. Otherwise, it uses Pynndescent. auto uses exact nearest neighbors for N<1000 and the same heuristic as approx for N>=1000.
 metric (Union[str, Callable]) – The metric to be used to compute affinities between points in the original space.
 metric_params (dict) – Additional keyword arguments for the metric function.
 symmetrize (bool) – Symmetrize affinity matrix. Standard t-SNE symmetrizes the interactions, but when embedding new data, symmetrization is not performed.
 n_jobs (int) – The number of threads to use while running t-SNE. This follows the scikit-learn convention, -1 meaning all processors, -2 meaning all but one, etc.
 random_state (Union[int, RandomState]) – If the value is an int, random_state is the seed used by the random number generator. If the value is a RandomState instance, then it will be used as the random number generator. If the value is None, the random number generator is the RandomState instance used by np.random.
 verbose (bool)
 knn_index (Optional[nearest_neighbors.KNNIndex]) – Optionally, a precomputed openTSNE.nearest_neighbors.KNNIndex object can be specified. If given, any KNN-related parameters are ignored. When knn_index is specified, data must be set to None.
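As a rough illustration of what a fixed bandwidth means, the sketch below computes row-normalized Gaussian affinities over each point's `k` nearest neighbors with one shared `sigma`. It uses brute-force Euclidean distances, omits symmetrization, and assumes `sigma` is large enough that the weights do not underflow; the function name and data are hypothetical, not the library's implementation.

```python
import math


def fixed_sigma_affinities(points, sigma, k):
    """Row-normalized Gaussian affinities over each point's k nearest neighbors."""
    n = len(points)
    P = [[0.0] * n for _ in range(n)]
    for i, p in enumerate(points):
        # Squared Euclidean distances to every other point, nearest first.
        nearest = sorted(
            (math.dist(p, q) ** 2, j) for j, q in enumerate(points) if j != i
        )[:k]
        # The same bandwidth is used for every point.
        weights = [math.exp(-d / (2 * sigma ** 2)) for d, _ in nearest]
        total = sum(weights)
        for (_, j), w in zip(nearest, weights):
            P[i][j] = w / total
    return P


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]  # two small groups
P = fixed_sigma_affinities(points, sigma=1.0, k=2)
```

Unlike the perplexity-based approach, nothing adapts the kernel width to local density here, which is why `sigma` usually has to be tuned by hand.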

to_new(data, k=None, sigma=None, return_distances=False)
Compute the affinities of new samples to the initial samples.
This is necessary for embedding new data points into an existing embedding.
Parameters:  data (np.ndarray) – The data points to be added to the existing embedding.
 k (int) – The number of nearest neighbors to consider for each kernel.
 sigma (float) – The bandwidth to use for the Gaussian kernels in the ambient space.
 return_distances (bool) – If True, also return the indices of the nearest neighbors and their corresponding distances.
Returns:  P (array_like) – An \(N \times M\) affinity matrix expressing interactions between the \(N\) new data points and the \(M\) initial data samples.
 indices (np.ndarray) – Returned if return_distances=True. The indices of the \(k\) nearest neighbors in the existing embedding for every new data point.
 distances (np.ndarray) – Returned if return_distances=True. The distances to the \(k\) nearest neighbors in the existing embedding for every new data point.
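The shape of the result and the effect of `return_distances` can be sketched as follows: each of the \(N\) new points gets one row over the \(M\) reference points, with nonzero entries only at its `k` nearest reference neighbors. The function `affinities_to_new`, the brute-force neighbor search and the tuple return order are assumptions made for illustration.

```python
import math


def affinities_to_new(new_points, reference, sigma, k, return_distances=False):
    """N x M row-normalized affinities from new points to reference points."""
    P, indices, distances = [], [], []
    for p in new_points:
        # k nearest reference points as (distance, index) pairs.
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(reference)
        )[:k]
        weights = [math.exp(-d ** 2 / (2 * sigma ** 2)) for d, _ in nearest]
        total = sum(weights)
        row = [0.0] * len(reference)
        for (_, j), w in zip(nearest, weights):
            row[j] = w / total
        P.append(row)
        indices.append([j for _, j in nearest])
        distances.append([d for d, _ in nearest])
    if return_distances:
        return P, indices, distances
    return P


reference = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
new_points = [(0.1, 0.1), (5, 5.4)]
P, indices, distances = affinities_to_new(
    new_points, reference, sigma=1.0, k=2, return_distances=True
)
```

The matrix is rectangular, so it cannot be symmetrized; each row is simply a distribution over the existing samples.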