from random import choices, seed
import matplotlib.pyplot as plt
import polars as pl
import seaborn as sns
Explore perturbation clusters
A common question we aim to ask is “What other perturbations look like mine?” The easiest way to get the answer to this question is to use our browsable datasets. In this case we would like to use the ‘Matches’ databases, which provide cosine similarities between perturbations within a dataset, obtained from all-vs-all calculations.
The limitation of this approach is that the size of JUMP results in two challenges:
- Calculating the distances across all pairs of perturbations is intractable for most computers without a GPU (Graphics Processing Unit).
- The resultant similarity matrix is too big for web browser-based exploration, so we limit the browsable similarity dataset to the top 100 most correlated/anticorrelated pairs of perturbations.
Despite the aforementioned problems, we provide the full matrix of perturbation distances in case it is of use to data analysts. You can find this and other datasets on Zenodo. The data files of interest for this exercise are “{orf,crispr}_cosinesim_full.parquet”.
The following analysis focuses on showcasing how to query one of these distance matrices to find all the distances between any given perturbation and all others. One use-case of this is testing how similar perturbation A and B are relative to perturbation C’s similarity to A.
We select the CRISPR dataset for this example. As with previous examples, this is a lazy-loaded data frame. This lets us work with very big datasets without worrying about whether they will fit into memory. In these datasets, the values range between 0 and 2, where 0 means that two profiles are the same, 1 means that they are orthogonal (completely uncorrelated) and 2 means that they are completely anticorrelated.
distances = pl.scan_parquet(
    "https://zenodo.org/api/records/13259495/files/crispr_cosinesim_full.parquet/content"
)
distances.head().collect()
JCP2022_805250 | JCP2022_804898 | JCP2022_805900 | JCP2022_807210 | JCP2022_803410 | JCP2022_807794 | JCP2022_800876 | JCP2022_800830 | JCP2022_805681 | JCP2022_802504 | JCP2022_800777 | JCP2022_800497 | JCP2022_804796 | JCP2022_806564 | JCP2022_801225 | JCP2022_807865 | JCP2022_803439 | JCP2022_807178 | JCP2022_802128 | JCP2022_801787 | JCP2022_807440 | JCP2022_804038 | JCP2022_801171 | JCP2022_802228 | JCP2022_801332 | JCP2022_804693 | JCP2022_800784 | JCP2022_803011 | JCP2022_801962 | JCP2022_805986 | JCP2022_807402 | JCP2022_800462 | JCP2022_801558 | JCP2022_807179 | JCP2022_806226 | JCP2022_806465 | JCP2022_801814 | … | JCP2022_805180 | JCP2022_807889 | JCP2022_804285 | JCP2022_805626 | JCP2022_805864 | JCP2022_803826 | JCP2022_800780 | JCP2022_804308 | JCP2022_804821 | JCP2022_803189 | JCP2022_804291 | JCP2022_805854 | JCP2022_807379 | JCP2022_805728 | JCP2022_807071 | JCP2022_800371 | JCP2022_806829 | JCP2022_806376 | JCP2022_802049 | JCP2022_805085 | JCP2022_806543 | JCP2022_806740 | JCP2022_805128 | JCP2022_800014 | JCP2022_801188 | JCP2022_800488 | JCP2022_804935 | JCP2022_802784 | JCP2022_800258 | JCP2022_803251 | JCP2022_801353 | JCP2022_807322 | JCP2022_802236 | JCP2022_806494 | JCP2022_805767 | JCP2022_805144 | JCP2022_802646 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | … | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 | f32 |
5.9605e-7 | 1.247144 | 0.967903 | 1.132685 | 1.052072 | 1.073054 | 1.070843 | 1.051375 | 1.129689 | 0.772845 | 1.0463 | 1.092238 | 0.906448 | 0.927316 | 0.896206 | 1.036608 | 1.109385 | 0.904212 | 0.942388 | 1.075579 | 0.991834 | 0.915789 | 1.074332 | 1.086189 | 1.089866 | 1.011456 | 1.151555 | 1.030307 | 1.027324 | 0.933274 | 1.164992 | 1.059377 | 0.927121 | 0.859473 | 1.055 | 0.908565 | 0.91787 | … | 0.899526 | 1.067557 | 1.222073 | 1.13687 | 0.84244 | 1.301761 | 1.030211 | 0.8842 | 0.903793 | 0.941208 | 0.96786 | 0.92024 | 0.988243 | 1.038809 | 1.023377 | 0.866169 | 0.982345 | 0.862695 | 0.950073 | 1.013993 | 0.963446 | 0.964861 | 1.221614 | 1.012841 | 1.116919 | 1.044985 | 0.95041 | 0.873647 | 1.092933 | 0.9966 | 1.11305 | 0.92142 | 0.898604 | 1.068462 | 0.939565 | 0.981605 | 0.837548 |
1.247144 | 5.3644e-7 | 0.849671 | 0.742207 | 0.902 | 0.880709 | 1.000254 | 0.834797 | 0.89356 | 1.076251 | 1.022806 | 1.029713 | 1.053186 | 1.053384 | 1.129679 | 1.155893 | 0.817272 | 0.989782 | 1.048217 | 0.978221 | 1.135644 | 0.74395 | 0.988017 | 0.974628 | 0.926641 | 0.893171 | 0.802536 | 1.050146 | 1.08058 | 0.915889 | 0.746005 | 1.091021 | 1.142545 | 1.149375 | 0.891322 | 0.981572 | 0.981943 | … | 1.016775 | 0.947588 | 0.84507 | 0.9275 | 1.23627 | 0.876745 | 1.046394 | 1.011125 | 1.086978 | 1.042936 | 1.199081 | 1.228067 | 1.03397 | 0.855395 | 0.930174 | 1.223315 | 0.99053 | 1.105305 | 1.011096 | 1.020536 | 1.084031 | 1.091003 | 0.90998 | 0.842879 | 0.86072 | 0.904132 | 1.179529 | 1.222647 | 1.050518 | 1.048739 | 0.795272 | 1.157747 | 0.963267 | 0.858091 | 0.834493 | 1.040158 | 0.957812 |
0.967903 | 0.849671 | 4.1723e-7 | 0.907077 | 1.031568 | 0.982156 | 1.005138 | 1.092488 | 1.073875 | 0.870309 | 1.030525 | 1.066498 | 1.007132 | 0.851524 | 1.058993 | 1.102674 | 0.921328 | 0.976307 | 1.042816 | 1.11988 | 1.009265 | 0.828333 | 0.960826 | 0.980577 | 0.931674 | 0.935145 | 0.977908 | 0.846657 | 0.916294 | 0.986605 | 0.968405 | 0.920526 | 1.145505 | 0.947867 | 0.895159 | 1.022881 | 0.865911 | … | 0.914141 | 0.919133 | 1.013684 | 1.079146 | 0.834806 | 0.939826 | 1.008615 | 0.962056 | 0.976553 | 1.100624 | 0.997636 | 1.123526 | 1.093243 | 0.950548 | 1.028595 | 1.082327 | 0.841474 | 1.073792 | 1.06748 | 0.992501 | 1.084669 | 0.850902 | 0.964968 | 1.112577 | 1.068541 | 0.824711 | 1.10152 | 0.929986 | 1.058774 | 1.133116 | 1.010063 | 0.967616 | 0.895663 | 0.990082 | 1.114065 | 0.897693 | 1.017657 |
1.132685 | 0.742207 | 0.907077 | 6.5565e-7 | 1.205265 | 0.890984 | 1.142175 | 0.86796 | 1.042397 | 0.958762 | 0.944834 | 1.103978 | 0.884924 | 1.070043 | 0.973253 | 1.231588 | 0.567953 | 0.995585 | 1.046111 | 1.208642 | 0.821681 | 0.803467 | 0.899464 | 0.802612 | 1.083011 | 0.953896 | 0.888156 | 0.90227 | 0.966256 | 0.826272 | 0.830262 | 1.104464 | 1.044595 | 0.926648 | 0.956805 | 0.949712 | 1.097377 | … | 0.97032 | 0.843955 | 0.68654 | 1.000185 | 1.004951 | 0.868828 | 0.721056 | 1.027855 | 0.954742 | 0.914465 | 1.154364 | 1.339245 | 0.956219 | 0.800423 | 1.099996 | 1.189476 | 0.917771 | 1.076726 | 1.169464 | 1.080106 | 1.296887 | 0.859278 | 0.977511 | 0.758992 | 0.841586 | 0.943176 | 1.098098 | 1.207128 | 1.105987 | 1.14492 | 1.192619 | 1.358119 | 0.780779 | 0.934115 | 0.818462 | 0.685961 | 0.871601 |
1.052072 | 0.902 | 1.031568 | 1.205265 | 4.1723e-7 | 1.034174 | 0.850223 | 1.247003 | 0.878785 | 1.073189 | 1.005777 | 0.913054 | 1.013503 | 1.011858 | 1.135321 | 0.810449 | 1.250481 | 0.960766 | 0.864973 | 0.733279 | 1.248159 | 1.069302 | 0.992714 | 1.240854 | 0.922457 | 0.789956 | 1.011713 | 0.99239 | 1.180302 | 1.230094 | 0.914126 | 0.922827 | 0.941237 | 0.978482 | 0.957761 | 0.867414 | 1.03808 | … | 0.944449 | 1.159692 | 1.051106 | 1.082658 | 1.102017 | 1.116237 | 1.092143 | 1.060347 | 1.016548 | 1.207051 | 1.047528 | 0.725492 | 0.964185 | 1.007569 | 0.965586 | 1.00403 | 0.848321 | 1.084304 | 0.876304 | 0.958506 | 0.938069 | 1.066701 | 1.001688 | 1.072793 | 1.077091 | 0.904107 | 1.080146 | 0.93987 | 0.70647 | 1.067201 | 0.926861 | 0.839981 | 1.144658 | 1.186504 | 1.104135 | 1.42188 | 1.148127 |
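The 0-to-2 scale of these values is what one would expect from a cosine distance, that is, one minus the cosine similarity. The sketch below illustrates that scale on made-up two-dimensional vectors; it assumes this definition and is not taken from the code that produced the matrix. The function name `cosine_distance` and the example vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


profile = [1.0, 2.0]

# Same direction (scaled copy): distance close to 0.
same = cosine_distance(profile, [2.0, 4.0])
# Perpendicular vector: distance close to 1.
orthogonal = cosine_distance(profile, [-2.0, 1.0])
# Opposite direction: distance close to 2.
opposite = cosine_distance(profile, [-1.0, -2.0])

print(same, orthogonal, opposite)
```

Floating-point rounding means the "same" case is only approximately zero, which would be consistent with the tiny non-zero values on the diagonal of the table above.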
Note that the only metadata in this matrix are the column names, which are JUMP IDs (JCP2022_X). We therefore need a mapper from these JUMP IDs to conventional names; feel free to look at the previous how-to that demonstrates this. We will now select three perturbations at random and look at their pairwise distance matrix.
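seed(42)  # fix the random seed so the same columns are sampled every run
cols = distances.collect_schema().names()
ncols = len(cols)
# Pick three column positions at random and look up their JUMP IDs
sampled_col_idx = sorted(choices(range(ncols), k=3))
sampled_cols = [cols[ix] for ix in sampled_col_idx]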
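# Keep the rows and columns at the sampled positions
# (row i and column i refer to the same perturbation)
sampled_distances = (
    distances.with_row_index()
    .filter(pl.col("index").is_in(sampled_col_idx))
    .select(pl.col(sampled_cols))
    .collect()
)
sampled_distances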
JCP2022_801809 | JCP2022_805029 | JCP2022_801476 |
---|---|---|
f32 | f32 | f32 |
4.7684e-7 | 0.945577 | 1.091766 |
0.945577 | 5.9605e-7 | 0.961309 |
1.091766 | 0.961309 | 3.5763e-7 |
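Filtering rows by index and selecting columns by name only yields a meaningful submatrix if row i and column i refer to the same perturbation. A quick sanity check, sketched below in plain Python with the values copied from the output above, is to confirm that the submatrix is symmetric and that its diagonal is close to zero. The tolerances are arbitrary choices made for this illustration.

```python
import math

# Values copied from the sampled output above (one list per column).
sampled = {
    "JCP2022_801809": [4.7684e-7, 0.945577, 1.091766],
    "JCP2022_805029": [0.945577, 5.9605e-7, 0.961309],
    "JCP2022_801476": [1.091766, 0.961309, 3.5763e-7],
}

names = list(sampled)
n = len(names)
# Rebuild the matrix row by row from the column-oriented data.
matrix = [[sampled[col][row] for col in names] for row in range(n)]

# A distance matrix between the same set of items should be symmetric...
is_symmetric = all(
    math.isclose(matrix[i][j], matrix[j][i], abs_tol=1e-6)
    for i in range(n)
    for j in range(n)
)
# ...and each item should be at (almost) zero distance from itself.
diagonal_near_zero = all(abs(matrix[i][i]) < 1e-5 for i in range(n))

print(is_symmetric, diagonal_near_zero)
```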
Finally, we plot them in a heatmap for easier visualisation.
pandas_correlation = sampled_distances.to_pandas()
# Label the rows with the same JUMP IDs as the columns
pandas_correlation.index = pandas_correlation.columns
sns.heatmap(
    pandas_correlation,
    annot=True,
    fmt=".3f",
    vmin=0,  # colour scale spans the full 0-to-2 range of the distances
    vmax=2,
    cmap=sns.color_palette("vlag", as_cmap=True),
)
plt.yticks(rotation=30)
plt.tight_layout()
Whilst this is not a terribly interesting result, all three pairwise distances are close to 1, which shows that there is essentially no correlation between these three randomly selected perturbations.