sklearn

class openTSNE.sklearn.TSNE(n_components=2, perplexity=30, learning_rate='auto', early_exaggeration_iter=250, early_exaggeration='auto', n_iter=500, exaggeration=None, dof=1, theta=0.5, n_interpolation_points=3, min_num_intervals=50, ints_in_interval=1, initialization='pca', metric='euclidean', metric_params=None, initial_momentum=0.8, final_momentum=0.8, max_grad_norm=None, max_step_norm=5, n_jobs=1, neighbors='auto', negative_gradient_method='auto', callbacks=None, callbacks_every_iters=50, random_state=None, verbose=False)[source]

t-Distributed Stochastic Neighbor Embedding.

Please see the Parameter guide for more information.
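A minimal usage sketch (the random data and the parameter values shown here are purely illustrative):

    import numpy as np
    from openTSNE.sklearn import TSNE

    # Illustrative data: 1,000 points in 50 dimensions.
    rng = np.random.RandomState(42)
    X = rng.normal(size=(1000, 50))

    tsne = TSNE(
        n_components=2,        # embed into 2-D for visualization
        perplexity=30,         # effective number of neighbors to preserve
        learning_rate="auto",  # N / exaggeration, per Belkina et al. (2019)
        initialization="pca",  # informative starting positions
        n_jobs=-1,             # use all available processors
        random_state=42,       # reproducible results
    )
    embedding = tsne.fit_transform(X)
    print(embedding.shape)  # (1000, 2)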

Parameters:
  • n_components (int) – The dimension of the embedding space. This defaults to 2 for easy visualization, but 1 is sometimes used for t-SNE heatmaps. Please note that t-SNE is not designed to embed into higher dimensions: the acceleration schemes break down there and are not fully implemented.

  • perplexity (float) – Perplexity can be thought of as the continuous analogue of \(k\), the number of nearest neighbors for which t-SNE will attempt to preserve distances.

  • learning_rate (Union[str, float]) – The learning rate for t-SNE optimization. When learning_rate="auto", the learning rate is set to N / exaggeration, as recommended in Belkina et al. (2019), Nature Communications. Note that this results in a different learning rate during the early exaggeration phase and afterwards. This should not be used when adding samples to an existing embedding, where the learning rate often needs to be much lower for the optimization to converge.

  • early_exaggeration_iter (int) – The number of iterations to run in the early exaggeration phase.

  • early_exaggeration (Union[str, float]) – The exaggeration factor to use during the early exaggeration phase. Typical values range from 4 to 32. When early_exaggeration="auto", the early exaggeration factor defaults to 12, unless the desired subsequent exaggeration is higher, i.e. early_exaggeration = max(12, exaggeration).

  • n_iter (int) – The number of iterations to run in the normal optimization regime.

  • exaggeration (float) – The exaggeration factor to use during the normal optimization phase. This can be used to form more densely packed clusters and is useful for large data sets.

  • dof (float) – Degrees of freedom as described in Kobak et al. “Heavy-tailed kernels reveal a finer cluster structure in t-SNE visualisations”, 2019.

  • theta (float) – Only used when negative_gradient_method="bh" or its other aliases. This is the trade-off parameter between the speed and accuracy of the tree approximation method. Typical values range from 0.2 to 0.8. The value 0 indicates that no approximation is to be made, producing exact results at the cost of a longer runtime.

  • n_interpolation_points (int) – Only used when negative_gradient_method="fft" or its other aliases. The number of interpolation points to use within each grid cell for interpolation-based t-SNE. It is highly recommended to leave this value at the default of 3.

  • min_num_intervals (int) – Only used when negative_gradient_method="fft" or its other aliases. The minimum number of grid cells to use, regardless of the ints_in_interval parameter. Higher values provide more accurate gradient estimations.

  • ints_in_interval (float) – Only used when negative_gradient_method="fft" or its other aliases. Indicates how large a grid cell should be, e.g. a value of 3 indicates a grid side length of 3. Lower values provide more accurate gradient estimations.

  • initialization (Union[np.ndarray, str]) – The initial point positions to be used in the embedding space. Can be a precomputed numpy array, pca, spectral or random. Please note that when passing in precomputed positions, it is highly recommended that the point positions have small variance (std(Y) < 0.0001); otherwise you may get poor embeddings.

  • metric (Union[str, Callable]) – The metric to be used to compute affinities between points in the original space.

  • metric_params (dict) – Additional keyword arguments for the metric function.

  • initial_momentum (float) – The momentum to use during the early exaggeration phase.

  • final_momentum (float) – The momentum to use during the normal optimization phase.

  • max_grad_norm (float) – Maximum gradient norm. If the norm exceeds this value, it will be clipped. This is most beneficial when adding points into an existing embedding and the new points overlap with the reference points, leading to large gradients. This can make points “shoot off” from the embedding, causing the interpolation method to compute a very large grid, and leads to worse results.

  • max_step_norm (float) – Maximum update norm. If the norm exceeds this value, it will be clipped. This prevents points from “shooting off” from the embedding.

  • n_jobs (int) – The number of threads to use while running t-SNE. This follows the scikit-learn convention, -1 meaning all processors, -2 meaning all but one, etc.

  • neighbors (str) – Specifies the nearest neighbor method to use. Can be exact, annoy, pynndescent, hnsw, approx, or auto (default). approx uses Annoy if the input data matrix is not a sparse object and if Annoy supports the given metric. Otherwise it uses Pynndescent. auto uses exact nearest neighbors for N<1000 and the same heuristic as approx for N>=1000.

  • negative_gradient_method (str) – Specifies the negative gradient approximation method to use. For smaller data sets, the Barnes-Hut approximation is appropriate and can be set using one of the following aliases: bh, BH or barnes-hut. For larger data sets, the FFT-accelerated interpolation method is more appropriate and can be set using one of the following aliases: fft, FFT or interpolation. Alternatively, you can use auto to approximately select the faster method.

  • callbacks (Union[Callable, List[Callable]]) – Callbacks, which will be run every callbacks_every_iters iterations (see the callback sketch after this parameter list).

  • callbacks_every_iters (int) – How many iterations should pass between each time the callbacks are invoked.

  • random_state (Union[int, RandomState]) – If the value is an int, random_state is the seed used by the random number generator. If the value is a RandomState instance, then it will be used as the random number generator. If the value is None, the random number generator is the RandomState instance used by np.random.

  • verbose (bool) – If True, print progress messages during optimization.
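A minimal callback sketch, as referenced in the callbacks entry above. The three-argument (iteration, error, embedding) signature and the convention that a truthy return value stops optimization follow openTSNE's callback protocol; treat them as assumptions to verify against your installed version.

    from openTSNE.sklearn import TSNE

    def log_progress(iteration, error, embedding):
        # `error` is the current KL divergence; `embedding` is the current layout.
        print(f"iteration {iteration}: KL divergence = {error:.4f}")
        # Returning True here would stop the optimization early (assumed convention).

    # The callback is invoked every `callbacks_every_iters` iterations.
    tsne = TSNE(callbacks=log_progress, callbacks_every_iters=50)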

fit(X, y=None)[source]

Fit X into an embedded space.

Parameters:
  • X (np.ndarray) – The data matrix to be embedded.

  • y (ignored) –
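A short sketch of the fit() workflow. The returned estimator and the embedding_ attribute used below follow the scikit-learn convention and are assumptions not documented in the signature above; fit_transform() is the documented way to obtain the embedding directly.

    import numpy as np
    from openTSNE.sklearn import TSNE

    X = np.random.RandomState(0).normal(size=(500, 20))

    tsne = TSNE(random_state=0).fit(X)  # fit() is assumed to return the fitted estimator
    # The learned embedding is assumed to be stored as `embedding_`
    # (scikit-learn convention); prefer fit_transform() to get it directly.
    print(np.asarray(tsne.embedding_).shape)  # (500, 2)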

fit_transform(X, y=None)[source]

Fit X into an embedded space and return that transformed output.

Parameters:
  • X (np.ndarray) – The data matrix to be embedded.

  • y (ignored) –

Returns:

Embedding of the training data in low-dimensional space.

Return type:

np.ndarray

transform(X, *args, **kwargs)[source]

Apply dimensionality reduction to X.

See openTSNE.TSNEEmbedding.transform() for additional parameters.

Parameters:

X (np.ndarray) – The data matrix to be embedded.

Returns:

Embedding of the given data in low-dimensional space.

Return type:

np.ndarray
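A sketch of embedding new points into an existing embedding; the data is illustrative, and any additional keyword arguments are forwarded to openTSNE.TSNEEmbedding.transform().

    import numpy as np
    from openTSNE.sklearn import TSNE

    rng = np.random.RandomState(0)
    X_train = rng.normal(size=(800, 30))
    X_new = rng.normal(size=(100, 30))

    tsne = TSNE(random_state=0)
    train_embedding = tsne.fit_transform(X_train)  # fit on the reference data
    new_embedding = tsne.transform(X_new)          # place new points in the same space

    print(train_embedding.shape, new_embedding.shape)  # (800, 2) (100, 2)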