Explore perturbation clusters

A common question is “What other perturbations look like mine?” The easiest way to answer it is through our browsable datasets, in this case the ‘Matches’ databases, which provide cosine similarities between perturbations within a dataset, obtained from all-vs-all calculations.

The limitation of this approach is that the size of JUMP results in two challenges:
- Calculating the distances across all pairs of perturbations is intractable for most computers without a GPU (Graphics Processing Unit).
- The resultant similarity matrix is too big for web browser-based exploration, so we limit the browsable similarity dataset to the top 100 most correlated/anticorrelated pairs of perturbations.

Despite these limitations, we provide the full matrix of perturbation distances in case it is of use to data analysts. You can find this and other datasets on Zenodo. The data files of interest for this exercise are “{orf,crispr}_cosinesim_full.parquet”.
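
If you want to see every file attached to that record programmatically, you can query the standard Zenodo records API directly. This is a minimal sketch using the record ID that appears in the download URL further below:

import requests

# List the files attached to Zenodo record 13259495, the record hosting the
# cosine-distance matrices used in this how-to
record = requests.get("https://zenodo.org/api/records/13259495").json()
for entry in record["files"]:
    print(entry["key"], entry["links"]["self"])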

The following analysis shows how to query one of these distance matrices to retrieve the distances between a given perturbation and all others. One use case is testing how similar perturbations A and B are relative to A’s similarity to perturbation C, as sketched below.
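
Here is a minimal, self-contained sketch of that A/B/C comparison. The three JCP2022 IDs are simply the first three columns of the CRISPR matrix (visible in the preview further down) and are used purely as placeholders; the rows are assumed to follow the same order as the columns, the same assumption made by the sampling step later on.

import polars as pl

# Compare d(A, B) with d(A, C); the IDs below are example columns, not a recommendation
distances = pl.scan_parquet(
    "https://zenodo.org/api/records/13259495/files/crispr_cosinesim_full.parquet/content"
)
a, b, c = "JCP2022_805250", "JCP2022_804898", "JCP2022_805900"
row_a = distances.collect_schema().names().index(a)  # rows assumed to follow column order
d_ab, d_ac = distances.select(b, c).slice(row_a, 1).collect().row(0)
print(f"d(A, B) = {d_ab:.3f}, d(A, C) = {d_ac:.3f}")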


from random import choices, seed

import matplotlib.pyplot as plt
import polars as pl
import seaborn as sns

We select the CRISPR dataset for this example. As with previous examples, this is a lazy-loaded data frame, which lets us work with very large datasets without worrying about whether they will fit into memory. In these datasets the values are cosine distances ranging between 0 and 2, where 0 means that two profiles are identical, 1 means that they are orthogonal (completely uncorrelated) and 2 means that they are completely anticorrelated.
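
To make that 0–2 range concrete, here is a tiny illustration of cosine distance (one minus cosine similarity) on toy vectors; it is not part of the JUMP data itself.

import numpy as np

def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity."""
    return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 0.0])
print(cosine_distance(u, np.array([2.0, 0.0])))   # same direction      -> 0
print(cosine_distance(u, np.array([0.0, 1.0])))   # orthogonal          -> 1
print(cosine_distance(u, np.array([-1.0, 0.0])))  # opposite direction  -> 2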

distances = pl.scan_parquet(
    "https://zenodo.org/api/records/13259495/files/crispr_cosinesim_full.parquet/content"
)
distances.head().collect()
shape: (5, 7_977)
(first four of 7,977 columns shown; the remaining columns are truncated here)
JCP2022_805250  JCP2022_804898  JCP2022_805900  JCP2022_807210  …
f32             f32             f32             f32
5.9605e-7       1.247144        0.967903        1.132685        …
1.247144        5.3644e-7       0.849671        0.742207        …
0.967903        0.849671        4.1723e-7       0.907077        …
1.132685        0.742207        0.907077        6.5565e-7       …
1.052072        0.902           1.031568        1.205265        …
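
Returning to the opening question, “What other perturbations look like mine?”, a single column of this matrix is enough to rank every perturbation by its distance to a query. The sketch below uses JCP2022_805250 (the first column) as the query and assumes, as the sampling step below does, that rows follow the same order as the columns.

# Rank all perturbations by their distance to one query perturbation
query = "JCP2022_805250"
all_cols = distances.collect_schema().names()
neighbours = (
    distances.select(query)
    .collect()
    .with_columns(pl.Series("other", all_cols))  # rows assumed to follow column order
    .sort(query)
)
neighbours.head(10)  # smallest distances first; the top hit (~0) is the query itself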

Note that the only metadata in this matrix is the column names, given as JUMP IDs (JCP2022_X), meaning that we will need a mapper from these JUMP IDs to conventional names; feel free to look at the previous how-to that demonstrates that. We will now select three perturbations at random and look at their pairwise distance matrix.

seed(42)
cols = distances.collect_schema().names()
ncols = len(cols)
# Draw three column indices at random; `choices` samples with replacement, but a
# duplicate among ~8,000 columns is extremely unlikely
sampled_col_idx = sorted(choices(range(ncols), k=3))
sampled_cols = [cols[ix] for ix in sampled_col_idx]

sampled_distances = (
    distances.with_row_index()  # adds an "index" column numbering the rows
    .filter(pl.col("index").is_in(sampled_col_idx))  # keep the three sampled rows
    .select(pl.col(sampled_cols))  # and the matching three columns
    .collect()
)
sampled_distances
shape: (3, 3)
JCP2022_801809 JCP2022_805029 JCP2022_801476
f32 f32 f32
4.7684e-7 0.945577 1.091766
0.945577 5.9605e-7 0.961309
1.091766 0.961309 3.5763e-7
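
To see which genes these three JUMP IDs correspond to, we could join against the CRISPR annotation table. The sketch below assumes the metadata file published in the jump-cellpainting/datasets GitHub repository and its Metadata_JCP2022/Metadata_Symbol columns; the previous how-to covers this mapping in detail.

import pandas as pd

# Assumed location and column names of the CRISPR annotation table
crispr_meta = pd.read_csv(
    "https://raw.githubusercontent.com/jump-cellpainting/datasets/main/metadata/crispr.csv.gz"
)
crispr_meta.loc[
    crispr_meta["Metadata_JCP2022"].isin(sampled_cols),
    ["Metadata_JCP2022", "Metadata_Symbol"],
]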

Finally, we plot them as a heatmap for easier visualisation.

pandas_distances = sampled_distances.to_pandas()
pandas_distances.index = pandas_distances.columns  # label rows so both axes show JCP IDs
sns.heatmap(
    pandas_distances,
    annot=True,
    fmt=".3f",
    vmin=0,  # identical profiles
    vmax=2,  # completely anticorrelated profiles
    cmap=sns.color_palette("vlag", as_cmap=True),
)
plt.yticks(rotation=30)
plt.tight_layout()

Whilst in this case it is not a terribly interesting result, it shows that the three randomly selected perturbations are essentially uncorrelated: their pairwise distances all sit close to 1, the value expected for orthogonal (uncorrelated) profiles.