Embedding large data sets ========================= Embedding large data sets typically requires more care. Using various tricks described in *preserving_global_structure* can become quite slow to run. Instead, we can take a smaller, manageable sample of our data set, obtain a good visualization of that. Then, we can add the remaining points to the embedding and use that as our initialization. Remember that the initialization largely affects the structure of the embedding. This way, our initialization provides the global structure for the embedding, and the subsequent optimization can focus on preserving local strucutre. .. code:: ipython3 import gzip import pickle import numpy as np import openTSNE from examples import utils import matplotlib.pyplot as plt %matplotlib inline Load data --------- The preprocessed data set can be downloaded from http://file.biolab.si/opentsne/benchmark/10x_mouse_zheng.pkl.gz. .. code:: ipython3 %%time with gzip.open("data/10x_mouse_zheng.pkl.gz", "rb") as f: data = pickle.load(f) x = data["pca_50"] y = data["CellType1"] .. parsed-literal:: CPU times: user 5.88 s, sys: 2.54 s, total: 8.42 s Wall time: 8.42 s .. code:: ipython3 print("Data set contains %d samples with %d features" % x.shape) .. parsed-literal:: Data set contains 1306127 samples with 50 features .. code:: ipython3 def plot(x, y, **kwargs): fig, ax = plt.subplots(ncols=2, figsize=(16, 8)) alpha = kwargs.pop("alpha", 0.1) utils.plot( x, np.zeros_like(y), ax=ax[0], colors={0: "k"}, alpha=alpha, draw_legend=False, **kwargs, ) utils.plot( x, y, ax=ax[1], colors=utils.MOUSE_10X_COLORS, alpha=alpha, draw_legend=False, **kwargs, ) .. code:: ipython3 def rotate(degrees): phi = degrees * np.pi / 180 return np.array([ [np.cos(phi), -np.sin(phi)], [np.sin(phi), np.cos(phi)], ]) .. code:: ipython3 plot(x, y) .. image:: output_7_0.png We’ll also precompute the full affinities, since we’ll be needing it in several places throughout the notebook, and can take a long time to run. .. code:: ipython3 %%time aff50 = openTSNE.affinity.PerplexityBasedNN( x, perplexity=50, n_jobs=32, random_state=0, ) .. parsed-literal:: CPU times: user 52min 31s, sys: 1min 21s, total: 53min 53s Wall time: 8min 2s .. code:: ipython3 %%time aff500 = openTSNE.affinity.PerplexityBasedNN( x, perplexity=500, n_jobs=32, random_state=0, ) .. parsed-literal:: CPU times: user 5h 36min 15s, sys: 8min 53s, total: 5h 45min 8s Wall time: 39min 40s Standard t-SNE -------------- First, let’s see what standard t-SNE does. .. code:: ipython3 # Because we're given the data representation as the top 50 principal components # we can just use the top 2 components as the initilization. There is no sense in # calculating PCA on a PCA representation init = openTSNE.initialization.rescale(x[:, :2]) .. code:: ipython3 %%time embedding_standard = openTSNE.TSNE( n_jobs=32, verbose=True, ).fit(affinities=aff50, initialization=init) .. parsed-literal:: -------------------------------------------------------------------------------- TSNE(early_exaggeration=12, n_jobs=32, verbose=True) -------------------------------------------------------------------------------- ===> Running optimization with exaggeration=12.00, lr=108843.92 for 250 iterations... Iteration 50, KL divergence 8.0820, 50 iterations in 49.5503 sec Iteration 100, KL divergence 7.8989, 50 iterations in 52.9387 sec Iteration 150, KL divergence 7.8314, 50 iterations in 52.7589 sec Iteration 200, KL divergence 7.8026, 50 iterations in 53.2414 sec Iteration 250, KL divergence 7.7870, 50 iterations in 52.9956 sec --> Time elapsed: 261.49 seconds ===> Running optimization with exaggeration=1.00, lr=1306127.00 for 500 iterations... Iteration 50, KL divergence 5.7583, 50 iterations in 59.4852 sec Iteration 100, KL divergence 5.4506, 50 iterations in 60.9192 sec Iteration 150, KL divergence 5.2984, 50 iterations in 63.3280 sec Iteration 200, KL divergence 5.2018, 50 iterations in 65.2476 sec Iteration 250, KL divergence 5.1327, 50 iterations in 64.6718 sec Iteration 300, KL divergence 5.0798, 50 iterations in 66.1166 sec Iteration 350, KL divergence 5.0372, 50 iterations in 74.1704 sec Iteration 400, KL divergence 5.0017, 50 iterations in 78.6565 sec Iteration 450, KL divergence 4.9715, 50 iterations in 82.1395 sec Iteration 500, KL divergence 4.9456, 50 iterations in 92.5772 sec --> Time elapsed: 707.32 seconds CPU times: user 3h 41min 29s, sys: 10min 8s, total: 3h 51min 37s Wall time: 16min 14s .. code:: ipython3 plot(embedding_standard, y) .. image:: output_14_0.png This doesn’t look too great. The cluster separation is quite poor and the visualization is visually not very appealing. Using larger exaggeration ------------------------- Exaggeration can be used in order to get better separation between clusters. Let’s see if that helps. .. code:: ipython3 %%time embedding_exag = openTSNE.TSNE( exaggeration=4, n_jobs=32, verbose=True, ).fit(affinities=aff50, initialization=init) .. parsed-literal:: -------------------------------------------------------------------------------- TSNE(early_exaggeration=12, exaggeration=4, n_jobs=32, verbose=True) -------------------------------------------------------------------------------- ===> Running optimization with exaggeration=12.00, lr=108843.92 for 250 iterations... Iteration 50, KL divergence 8.0820, 50 iterations in 51.7397 sec Iteration 100, KL divergence 7.8994, 50 iterations in 54.2033 sec Iteration 150, KL divergence 7.8317, 50 iterations in 54.2117 sec Iteration 200, KL divergence 7.8020, 50 iterations in 55.0897 sec Iteration 50, KL divergence 6.9432, 50 iterations in 56.7796 sec Iteration 100, KL divergence 6.8014, 50 iterations in 57.5679 sec Iteration 150, KL divergence 6.7409, 50 iterations in 56.7147 sec Iteration 200, KL divergence 6.7068, 50 iterations in 56.3419 sec Iteration 250, KL divergence 6.6856, 50 iterations in 56.2568 sec Iteration 300, KL divergence 6.6712, 50 iterations in 56.5754 sec Iteration 350, KL divergence 6.6609, 50 iterations in 56.0285 sec Iteration 400, KL divergence 6.6529, 50 iterations in 55.7784 sec Iteration 450, KL divergence 6.6468, 50 iterations in 55.4943 sec Iteration 500, KL divergence 6.6425, 50 iterations in 55.6567 sec --> Time elapsed: 563.20 seconds CPU times: user 3h 51min 45s, sys: 9min 50s, total: 4h 1min 35s Wall time: 13min 57s .. code:: ipython3 plot(embedding_exag, y) .. image:: output_18_0.png The separation has improved quite a bit, but many clusters are still intertwined with others. Using a larger perplexity ------------------------- .. code:: ipython3 %%time embedding_aff500 = openTSNE.TSNE( n_jobs=32, verbose=True, ).fit(affinities=aff500, initialization=init) .. parsed-literal:: -------------------------------------------------------------------------------- TSNE(early_exaggeration=12, n_jobs=32, verbose=True) -------------------------------------------------------------------------------- ===> Running optimization with exaggeration=12.00, lr=108843.92 for 250 iterations... Iteration 50, KL divergence 5.9688, 50 iterations in 216.9035 sec Iteration 100, KL divergence 5.9218, 50 iterations in 218.9351 sec Iteration 150, KL divergence 5.8993, 50 iterations in 218.5032 sec Iteration 200, KL divergence 5.8900, 50 iterations in 223.6785 sec Iteration 250, KL divergence 5.8854, 50 iterations in 224.3249 sec --> Time elapsed: 1102.35 seconds ===> Running optimization with exaggeration=1.00, lr=1306127.00 for 500 iterations... Iteration 50, KL divergence 3.8664, 50 iterations in 237.2816 sec Iteration 100, KL divergence 3.6473, 50 iterations in 226.6633 sec Iteration 150, KL divergence 3.5520, 50 iterations in 235.5178 sec Iteration 200, KL divergence 3.4979, 50 iterations in 231.3676 sec Iteration 250, KL divergence 3.4621, 50 iterations in 240.6245 sec Iteration 300, KL divergence 3.4373, 50 iterations in 255.3066 sec Iteration 350, KL divergence 3.4183, 50 iterations in 241.8503 sec Iteration 400, KL divergence 3.4035, 50 iterations in 248.7666 sec Iteration 450, KL divergence 3.3922, 50 iterations in 251.1434 sec Iteration 500, KL divergence 3.3823, 50 iterations in 272.1862 sec --> Time elapsed: 2440.72 seconds CPU times: user 1d 2h 32min 21s, sys: 10min 19s, total: 1d 2h 42min 40s Wall time: 59min 32s .. code:: ipython3 plot(embedding_aff500, y) .. image:: output_22_0.png … with higher exaggeration -------------------------- .. code:: ipython3 %%time embedding_aff500_exag4 = openTSNE.TSNE( exaggeration=4, n_jobs=32, verbose=True, ).fit(affinities=aff500, initialization=init) .. parsed-literal:: -------------------------------------------------------------------------------- TSNE(early_exaggeration=12, exaggeration=4, n_jobs=32, verbose=True) -------------------------------------------------------------------------------- ===> Running optimization with exaggeration=12.00, lr=108843.92 for 250 iterations... Iteration 50, KL divergence 5.9688, 50 iterations in 215.2427 sec Iteration 100, KL divergence 5.9218, 50 iterations in 221.4924 sec Iteration 150, KL divergence 5.8993, 50 iterations in 218.4626 sec Iteration 200, KL divergence 5.8901, 50 iterations in 216.9501 sec Iteration 250, KL divergence 5.8855, 50 iterations in 217.3419 sec --> Time elapsed: 1089.49 seconds ===> Running optimization with exaggeration=4.00, lr=326531.75 for 500 iterations... Iteration 50, KL divergence 5.0544, 50 iterations in 216.4466 sec Iteration 100, KL divergence 4.9857, 50 iterations in 214.6171 sec Iteration 150, KL divergence 4.9611, 50 iterations in 214.4493 sec Iteration 200, KL divergence 4.9495, 50 iterations in 213.1030 sec Iteration 250, KL divergence 4.9430, 50 iterations in 214.2775 sec Iteration 300, KL divergence 4.9384, 50 iterations in 214.3649 sec Iteration 350, KL divergence 4.9351, 50 iterations in 213.4982 sec Iteration 400, KL divergence 4.9327, 50 iterations in 213.9199 sec Iteration 450, KL divergence 4.9306, 50 iterations in 213.1906 sec Iteration 500, KL divergence 4.9288, 50 iterations in 212.4853 sec --> Time elapsed: 2140.36 seconds CPU times: user 1d 1h 28min 18s, sys: 9min 56s, total: 1d 1h 38min 15s Wall time: 54min 23s .. code:: ipython3 plot(embedding_aff500_exag4, y) .. image:: output_25_0.png Initialize via downsampling --------------------------- We now perform the sample-transform trick we described above. Create train/test split ~~~~~~~~~~~~~~~~~~~~~~~ .. code:: ipython3 np.random.seed(0) .. code:: ipython3 indices = np.random.permutation(list(range(x.shape[0]))) reverse = np.argsort(indices) x_sample, x_rest = x[indices[:25000]], x[indices[25000:]] y_sample, y_rest = y[indices[:25000]], y[indices[25000:]] Create sample embedding ~~~~~~~~~~~~~~~~~~~~~~~ .. code:: ipython3 %%time sample_affinities = openTSNE.affinity.PerplexityBasedNN( x_sample, perplexity=500, n_jobs=32, random_state=0, verbose=True, ) .. parsed-literal:: ===> Finding 1500 nearest neighbors using Annoy approximate search using euclidean distance... --> Time elapsed: 15.41 seconds ===> Calculating affinity matrix... --> Time elapsed: 6.54 seconds CPU times: user 3min 39s, sys: 4.97 s, total: 3min 44s Wall time: 21.9 s .. code:: ipython3 %time sample_init = openTSNE.initialization.pca(x_sample, random_state=42) .. parsed-literal:: CPU times: user 5.35 s, sys: 657 ms, total: 6.01 s Wall time: 171 ms .. code:: ipython3 %time sample_embedding = openTSNE.TSNE(n_jobs=32, verbose=True).fit(affinities=sample_affinities, initialization=sample_init) .. parsed-literal:: -------------------------------------------------------------------------------- TSNE(early_exaggeration=12, n_jobs=32, verbose=True) -------------------------------------------------------------------------------- ===> Running optimization with exaggeration=12.00, lr=2083.33 for 250 iterations... Iteration 50, KL divergence 3.0347, 50 iterations in 4.0133 sec Iteration 100, KL divergence 3.0733, 50 iterations in 3.9727 sec Iteration 150, KL divergence 3.0691, 50 iterations in 4.0981 sec Iteration 200, KL divergence 3.0690, 50 iterations in 3.9779 sec Iteration 250, KL divergence 3.0690, 50 iterations in 3.9779 sec --> Time elapsed: 20.04 seconds ===> Running optimization with exaggeration=1.00, lr=25000.00 for 500 iterations... Iteration 50, KL divergence 1.1925, 50 iterations in 4.2847 sec Iteration 100, KL divergence 1.1550, 50 iterations in 4.4657 sec Iteration 150, KL divergence 1.1452, 50 iterations in 4.8737 sec Iteration 200, KL divergence 1.1407, 50 iterations in 5.5286 sec Iteration 250, KL divergence 1.1382, 50 iterations in 5.5675 sec Iteration 300, KL divergence 1.1366, 50 iterations in 5.9546 sec Iteration 350, KL divergence 1.1356, 50 iterations in 5.3339 sec Iteration 400, KL divergence 1.1349, 50 iterations in 5.9090 sec Iteration 450, KL divergence 1.1342, 50 iterations in 6.5074 sec Iteration 500, KL divergence 1.1339, 50 iterations in 5.7942 sec --> Time elapsed: 54.23 seconds CPU times: user 34min 13s, sys: 2min 15s, total: 36min 28s Wall time: 1min 14s .. code:: ipython3 plot(sample_embedding, y[indices[:25000]], alpha=0.5) .. image:: output_34_0.png Learn the full embedding ~~~~~~~~~~~~~~~~~~~~~~~~ .. code:: ipython3 %time rest_init = sample_embedding.prepare_partial(x_rest, k=1, perplexity=1/3) .. parsed-literal:: ===> Finding 1 nearest neighbors in existing embedding using Annoy approximate search... --> Time elapsed: 174.82 seconds ===> Calculating affinity matrix... --> Time elapsed: 0.67 seconds CPU times: user 4min 18s, sys: 32.3 s, total: 4min 51s Wall time: 2min 55s .. code:: ipython3 init_full = np.vstack((sample_embedding, rest_init))[reverse] .. code:: ipython3 plot(init_full, y) .. image:: output_38_0.png .. code:: ipython3 init_full = init_full / (np.std(init_full[:, 0]) * 10000) np.std(init_full, axis=0) .. parsed-literal:: array([1.00000000e-04, 1.06471542e-04]) .. code:: ipython3 embedding = openTSNE.TSNEEmbedding( init_full, aff50, n_jobs=32, verbose=True, random_state=42, ) .. code:: ipython3 %time embedding1 = embedding.optimize(n_iter=500, exaggeration=12) .. parsed-literal:: ===> Running optimization with exaggeration=12.00, lr=108843.92 for 500 iterations... Iteration 50, KL divergence 7.9326, 50 iterations in 48.9843 sec Iteration 100, KL divergence 7.8311, 50 iterations in 51.0335 sec Iteration 150, KL divergence 7.7772, 50 iterations in 51.2970 sec Iteration 200, KL divergence 7.7543, 50 iterations in 51.3436 sec Iteration 250, KL divergence 7.7423, 50 iterations in 51.4951 sec Iteration 300, KL divergence 7.7350, 50 iterations in 51.2477 sec Iteration 350, KL divergence 7.7299, 50 iterations in 51.5019 sec Iteration 400, KL divergence 7.7263, 50 iterations in 51.7564 sec Iteration 450, KL divergence 7.7237, 50 iterations in 51.6552 sec Iteration 500, KL divergence 7.7216, 50 iterations in 51.2847 sec --> Time elapsed: 511.60 seconds CPU times: user 2h 22min 35s, sys: 6min 28s, total: 2h 29min 4s Wall time: 8min 34s .. code:: ipython3 plot(embedding1, y) .. image:: output_42_0.png .. code:: ipython3 %time embedding2 = embedding1.optimize(n_iter=250, exaggeration=4) .. parsed-literal:: ===> Running optimization with exaggeration=4.00, lr=326531.75 for 250 iterations... Iteration 50, KL divergence 6.9325, 50 iterations in 55.0011 sec Iteration 100, KL divergence 6.8182, 50 iterations in 54.9294 sec Iteration 150, KL divergence 6.7678, 50 iterations in 54.4471 sec Iteration 200, KL divergence 6.7398, 50 iterations in 54.4368 sec Iteration 250, KL divergence 6.7219, 50 iterations in 53.8917 sec --> Time elapsed: 272.71 seconds CPU times: user 1h 13min 35s, sys: 3min 14s, total: 1h 16min 49s Wall time: 4min 34s .. code:: ipython3 plot(embedding2, y) .. image:: output_44_0.png .. code:: ipython3 %time embedding3 = embedding2.optimize(n_iter=250, exaggeration=4) .. parsed-literal:: ===> Running optimization with exaggeration=4.00, lr=326531.75 for 250 iterations... Iteration 50, KL divergence 6.7076, 50 iterations in 51.7026 sec Iteration 100, KL divergence 6.6960, 50 iterations in 51.7740 sec Iteration 150, KL divergence 6.6850, 50 iterations in 51.9709 sec Iteration 200, KL divergence 6.6749, 50 iterations in 51.9325 sec Iteration 250, KL divergence 6.6667, 50 iterations in 51.3449 sec --> Time elapsed: 258.73 seconds CPU times: user 1h 9min 44s, sys: 3min 14s, total: 1h 12min 58s Wall time: 4min 20s .. code:: ipython3 plot(embedding3, y) .. image:: output_46_0.png .. code:: ipython3 %time embedding4 = embedding3.optimize(n_iter=250, exaggeration=4) .. parsed-literal:: ===> Running optimization with exaggeration=4.00, lr=326531.75 for 250 iterations... Iteration 50, KL divergence 6.6604, 50 iterations in 51.5464 sec Iteration 100, KL divergence 6.6554, 50 iterations in 51.8160 sec Iteration 150, KL divergence 6.6514, 50 iterations in 51.7537 sec Iteration 200, KL divergence 6.6484, 50 iterations in 51.8397 sec Iteration 250, KL divergence 6.6455, 50 iterations in 51.6008 sec --> Time elapsed: 258.56 seconds CPU times: user 1h 11min 20s, sys: 3min 14s, total: 1h 14min 35s Wall time: 4min 20s .. code:: ipython3 plot(embedding4, y) .. image:: output_48_0.png Comparison to UMAP ------------------ .. code:: ipython3 from umap import UMAP .. parsed-literal:: /home/ppolicar/miniconda3/envs/tsne/lib/python3.10/site-packages/umap/distances.py:1063: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details. @numba.jit() /home/ppolicar/miniconda3/envs/tsne/lib/python3.10/site-packages/umap/distances.py:1071: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details. @numba.jit() /home/ppolicar/miniconda3/envs/tsne/lib/python3.10/site-packages/umap/distances.py:1086: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details. @numba.jit() /home/ppolicar/miniconda3/envs/tsne/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm /home/ppolicar/miniconda3/envs/tsne/lib/python3.10/site-packages/umap/umap_.py:660: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details. @numba.jit() .. code:: ipython3 umap = UMAP(n_neighbors=15, min_dist=0.1, random_state=1) .. code:: ipython3 %time embedding_umap = umap.fit_transform(x) .. parsed-literal:: OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead. .. parsed-literal:: CPU times: user 9h 39min 56s, sys: 1h 17min 49s, total: 10h 57min 45s Wall time: 1h 21min 7s .. code:: ipython3 plot(embedding_umap, y) .. image:: output_53_0.png