This document demonstrates dimensionality reduction, visualization, and clustering of the Blood2K dataset with scAND. Further documentation on imputation and batch-effect correction can be found at http://www.zhanglab-amss.org/homepage/software.html
import pandas as pd
import numpy as np
import os
import sys
import time
import scanpy as sc
import matplotlib.pyplot as plt
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.metrics.cluster import normalized_mutual_info_score
from sklearn.metrics.cluster import homogeneity_score
from sklearn.cluster import KMeans
sc.__version__
'1.7.1'
sys.path.append('../../scAND-code/')
import scAND
import scAND_utils
np.random.seed(2019)
The input of scAND contains three parts: a count table, a list of cell names, and a list of peaks.
Count_df = pd.read_csv('Processed_Data/Filter_df.txt', sep='\t')
cells = pd.read_csv('Processed_Data/Filter_Cells.txt', sep='\t', header=None)
cells = cells[0]
peaks = pd.read_csv('Processed_Data/Filter_Peaks.txt', sep='\t', header=None)
peaks = peaks[0]
print('Number of peaks: %d' %peaks.shape[0])
print('Number of cells: %d' %cells.shape[0])
Number of peaks: 134962
Number of cells: 2034
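The three parts are typically assembled into a sparse peaks-by-cells matrix. A minimal sketch with toy data (the actual column layout of `Filter_df.txt` is an assumption here; the real files are only indexed by the cell and peak lists in the same way):

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Toy stand-ins for the three input parts: a long-format count table
# plus the cell and peak name lists it indexes into.
count_df = pd.DataFrame({'peak_idx': [0, 1, 2, 0],
                         'cell_idx': [0, 0, 1, 1],
                         'count':    [2, 1, 3, 1]})
cells = pd.Series(['cellA', 'cellB'])
peaks = pd.Series(['chr1:100-600', 'chr1:900-1400', 'chr2:50-550'])

# Assemble a peaks-by-cells sparse matrix from the (peak, cell, count) triplets.
mat = coo_matrix((count_df['count'], (count_df['peak_idx'], count_df['cell_idx'])),
                 shape=(peaks.shape[0], cells.shape[0]))
print(mat.shape)   # (3, 2)
print(mat.sum())   # total fragment count: 7
```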
We introduced a leave-out imputation strategy to select the parameter β. Specifically, we randomly set 10% of the entries (by default) to 0 and calculated the L2 distance between the true data and the data imputed with scAND.
For large-scale datasets, we recommend using peaks from a single chromosome.
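The masking step behind this strategy can be sketched as follows. `leave_out_l2` and the identity `impute_fn` are illustrative stand-ins, not part of the scAND API, and a 50% rate is used here only so the tiny toy matrix has entries to mask:

```python
import numpy as np

rng = np.random.default_rng(2019)

def leave_out_l2(mat, impute_fn, leave_out_rate=0.1):
    """Mask a fraction of the nonzero entries, impute, and score the
    imputation by the L2 distance on the held-out entries."""
    rows, cols = np.nonzero(mat)
    n_mask = int(leave_out_rate * rows.size)
    idx = rng.choice(rows.size, size=n_mask, replace=False)
    masked = mat.copy()
    masked[rows[idx], cols[idx]] = 0  # hide the selected observed entries
    imputed = impute_fn(masked)
    # L2 distance between the true and imputed values on the held-out entries
    return np.linalg.norm(mat[rows[idx], cols[idx]] - imputed[rows[idx], cols[idx]])

toy = np.array([[2., 0., 1.],
                [0., 3., 1.]])
# Identity "imputation" for illustration; in practice this would run scAND
# at a candidate beta, and the beta with the smallest distance is kept.
dist = leave_out_l2(toy, impute_fn=lambda m: m, leave_out_rate=0.5)
```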
used_peaks = None
used_beta, Impute_Distance = scAND_utils.Estimate_beta(Count_df, cells, peaks, used_peaks=used_peaks)
----------Graph Normalization...
----------Info of Imputation
	Leave Out Rate: 10 %
	Number of Used Fragments: 12209031
	Number of Tested Fragments: 1356559
----------Run scAND...
----------Info of Data:
	Number of Cells: 2034
	Number of Peaks: 134962
----------Constructing Graph without Binarization...
----------Calculating Eigendecomposition...
----------Eigen-decomposition reweighting...
----------Calculating imputation distance
100%|██████████████████████████████████████████| 20/20 [01:16<00:00, 3.82s/it]
The estimated beta is 0.90.
The signature of the Run_scAND() function:
Run_scAND(Count_df, d, weights, cells, peaks, random_seed=2019, L2_norm=True, Binary=True, Graph_norm=True, return_peaks=False, verbose=True)
beta_list = np.arange(20)/20
Rep_cells = scAND.Run_scAND(Count_df, d=50, weights=beta_list, cells=cells, peaks=peaks, random_seed=2019)
----------Info of Data:
	Number of Cells: 2034
	Number of Peaks: 134962
----------Constructing Binary Graph...
	Dim of Graph: (136996, 136996)
----------Graph Normalization...
----------Calculating Eigendecomposition...
----------Eigen-decomposition reweighting...
scAND_utils.Elbow_plot(Rep_cells, used_beta, log=True)
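A generic version of the elbow heuristic that such a plot supports, run here on a synthetic decay curve (not the notebook's actual eigenvalues): keep dimensions up to the last one at which the per-component scale still drops sharply on a log scale.

```python
import numpy as np

# Synthetic decay curve: five components that fall off steeply, then a
# flat tail, mimicking a typical eigenvalue spectrum.
scale = np.r_[10.0 ** -np.arange(5), np.full(15, 10.0 ** -4.5)]

drops = -np.diff(np.log(scale))              # size of each step down (log scale)
# The elbow is just after the last "large" drop (>10% of the biggest drop).
elbow = int(np.nonzero(drops > 0.1 * drops.max())[0][-1]) + 1
print('elbow at dimension', elbow)           # elbow at dimension 5
```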
used_dim = 20
print('The estimated beta is %.2f.' %used_beta)
print('The estimated dimension is %d.' %used_dim)
The estimated beta is 0.90. The estimated dimension is 20.
used_rep = scAND.Get_Result(Rep_cells, beta=used_beta, dim=used_dim, L2_norm=True)
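With L2_norm=True the cell representation is presumably rescaled row-wise to unit length (an assumption about scAND's option, sketched here on toy values), which makes Euclidean distances between cells behave like cosine distances:

```python
import numpy as np

rep = np.array([[3.0, 4.0],
                [0.0, 2.0]])

# Divide each cell's embedding by its L2 norm.
norms = np.linalg.norm(rep, axis=1, keepdims=True)
rep_l2 = rep / norms
print(np.linalg.norm(rep_l2, axis=1))  # each row now has norm 1
```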
metadata = pd.read_csv('Processed_Data/metadata.txt', sep='\t', header=None, index_col=0)
metadata.columns = ['label']
metadata.head()
| | label |
|---|---|
| BM1077-CLP-Frozen-160106-13 | CLP |
| BM1077-CLP-Frozen-160106-14 | CLP |
| BM1077-CLP-Frozen-160106-2 | CLP |
| BM1077-CLP-Frozen-160106-21 | CLP |
| BM1077-CLP-Frozen-160106-27 | CLP |
num_clusters = np.unique(metadata['label']).shape[0]
print('Number of labels: ', num_clusters)
Number of labels: 9
adata = sc.AnnData(used_rep)
adata.var_names_make_unique()
adata.obs['label'] = metadata['label']
sc.pp.neighbors(adata, n_neighbors=30, random_state=2019)
sc.tl.umap(adata)
fig, ax = plt.subplots(figsize=(3, 3))
sc.pl.umap(adata, color='label', title='scAND', s=15, ax=ax)
sc.pp.neighbors(adata, random_state=2019)
sc.tl.tsne(adata, random_state=2019)
fig, ax = plt.subplots(figsize=(3, 3))
sc.pl.tsne(adata, color='label', title='scAND', s=15, ax=ax)
# Louvain Clustering
this_resolution, adata = scAND_utils.getNClusters_Louvain(adata, num_clusters, range_max=3)
fig, ax = plt.subplots(figsize=(3, 3))
sc.pl.tsne(adata, color='louvain', title='Louvain', s=15, ax=ax)
step 0 got 19 at resolution 1.5
step 1 got 15 at resolution 0.75
step 2 got 12 at resolution 0.375
step 3 got 9 at resolution 0.1875
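The log above is consistent with a bisection over the Louvain resolution (higher resolution yields more clusters). A sketch of such a search; `get_n_clusters` and the toy `cluster_fn` are hypothetical stand-ins, not scAND_utils' actual code:

```python
def get_n_clusters(cluster_fn, target, range_min=0.0, range_max=3.0, max_steps=20):
    """Bisect the resolution interval until cluster_fn(resolution) hits target."""
    for step in range(max_steps):
        res = (range_min + range_max) / 2
        n = cluster_fn(res)
        print('step %d got %d at resolution %g' % (step, n, res))
        if n == target:
            return res
        if n > target:
            range_max = res   # too many clusters: lower the resolution
        else:
            range_min = res   # too few clusters: raise the resolution
    return res

# Toy monotone stand-in for "run Louvain at this resolution, count clusters";
# in the notebook cluster_fn would call sc.tl.louvain on adata.
found = get_n_clusters(lambda r: int(4 + 10 * r), target=9)
```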
# KMeans Clustering
kmeans = KMeans(n_clusters=num_clusters, random_state=2019).fit(adata.X)
adata.obs['kmeans'] = pd.Series(kmeans.labels_, index=adata.obs.index).astype('category')
fig, ax = plt.subplots(figsize=(3, 3))
sc.pl.tsne(adata, color='kmeans', title='Kmeans', s=15, ax=ax)
ARI_df = pd.Series(index=['ARI of Louvain', 'ARI of Kmeans'])
ARI_df['ARI of Louvain'] = adjusted_rand_score(adata.obs['louvain'], adata.obs['label'])
ARI_df['ARI of Kmeans'] = adjusted_rand_score(adata.obs['kmeans'], adata.obs['label'])
ARI_df
ARI of Louvain    0.625283
ARI of Kmeans     0.558721
dtype: float64
NMI_df = pd.Series(index=['NMI of Louvain', 'NMI of Kmeans'])
NMI_df['NMI of Louvain'] = normalized_mutual_info_score(adata.obs['louvain'], adata.obs['label'])
NMI_df['NMI of Kmeans'] = normalized_mutual_info_score(adata.obs['kmeans'], adata.obs['label'])
NMI_df
NMI of Louvain    0.696250
NMI of Kmeans     0.683736
dtype: float64
HS_df = pd.Series(index=['Homogeneity Score of Louvain', 'Homogeneity Score of Kmeans'])
HS_df['Homogeneity Score of Louvain'] = homogeneity_score(adata.obs['louvain'], adata.obs['label'])
HS_df['Homogeneity Score of Kmeans'] = homogeneity_score(adata.obs['kmeans'], adata.obs['label'])
HS_df
Homogeneity Score of Louvain    0.713858
Homogeneity Score of Kmeans     0.666992
dtype: float64
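The three metric tables above can also be collected in a single pass. `score_clusterings` is a hypothetical helper, shown here on toy labels and following the same call order as the notebook (prediction first, then the reference label):

```python
import pandas as pd
from sklearn.metrics.cluster import (adjusted_rand_score,
                                     normalized_mutual_info_score,
                                     homogeneity_score)

def score_clusterings(labels_true, **clusterings):
    """Collect ARI, NMI, and homogeneity for several clusterings in one table."""
    metrics = {'ARI': adjusted_rand_score,
               'NMI': normalized_mutual_info_score,
               'Homogeneity': homogeneity_score}
    return pd.DataFrame({name: {m: fn(pred, labels_true) for m, fn in metrics.items()}
                         for name, pred in clusterings.items()})

# Toy labels for illustration; in the notebook this would compare
# adata.obs['label'] against adata.obs['louvain'] and adata.obs['kmeans'].
truth = [0, 0, 1, 1, 2, 2]
df = score_clusterings(truth,
                       perfect=[1, 1, 0, 0, 2, 2],   # same partition, renamed
                       merged=[0, 0, 0, 0, 1, 1])    # two clusters merged
print(df)
```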