This document covers dimensionality reduction, visualization, and clustering of the Blood2K dataset with scAND. Further documentation on imputation and batch-effect correction is available at http://www.zhanglab-amss.org/homepage/software.html

Load data

The input to scAND consists of three parts:

  1. A pandas.Series() holding the cell barcodes.
  2. A pandas.Series() holding the peak regions.
  3. A pandas.DataFrame() with three columns: Peaks, Cells, Counts (peak and cell indices start at 1). Each row of the DataFrame represents one entry of the scATAC-seq data.
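
As a rough illustration of the three inputs described above, the following sketch builds them by hand for a toy dataset. The barcodes, the peak region strings, and their "chr_start_end" format are made up for illustration; only the column names Peaks, Cells, Counts and the 1-based indexing come from the description above.

```python
import pandas as pd

# Hypothetical toy data: 3 cells and 4 peaks (names are illustrative only).
cells = pd.Series(["AAACGAAT-1", "AAAGGCTA-1", "AACTGGTC-1"])
peaks = pd.Series(["chr1_100_600", "chr1_900_1400", "chr2_50_550", "chr2_800_1300"])

# One row per entry of the scATAC-seq data. Peaks and Cells are 1-based
# indices into `peaks` and `cells`, respectively.
Count_df = pd.DataFrame({
    "Peaks": [1, 2, 2, 3, 4],
    "Cells": [1, 1, 2, 3, 3],
    "Counts": [1, 2, 1, 1, 3],
})

# Sanity checks: every index must point at an existing peak or cell.
assert Count_df["Peaks"].between(1, len(peaks)).all()
assert Count_df["Cells"].between(1, len(cells)).all()
```

In practice these objects would be read from files rather than typed in; the checks at the end are a cheap way to catch 0-based indices before running the model.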

Estimate the value of beta

We introduced a leave-out imputation strategy for selecting the parameter β. Specifically, we randomly set 10% (by default) of the entries to 0 and calculate the L2 distance between the true data and the data imputed by scAND.
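
The leave-out idea can be sketched generically as follows. This is not scAND's own implementation: `leave_out_error` and `toy_impute` are hypothetical names, `toy_impute` is a simple placeholder standing in for scAND imputation, and two details are assumptions made here (only nonzero entries are held out, and the distance is measured at the held-out positions). The sketch also assumes that the β with the smallest distance is the one to prefer.

```python
import numpy as np

def leave_out_error(X, impute, frac=0.1, seed=2019):
    """Zero out a fraction of the nonzero entries of X, impute, and return the
    L2 distance between true and imputed values at the held-out positions."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(X)
    n_hold = max(1, int(round(frac * rows.size)))
    pick = rng.choice(rows.size, size=n_hold, replace=False)
    r, c = rows[pick], cols[pick]
    X_masked = X.astype(float).copy()
    X_masked[r, c] = 0.0              # leave these entries out
    X_hat = impute(X_masked)          # imputed matrix, same shape as X
    return float(np.linalg.norm(X[r, c] - X_hat[r, c]))

def toy_impute(X, beta):
    """Placeholder imputer (NOT scAND): blend each row with its mean."""
    row_mean = X.mean(axis=1, keepdims=True)
    return (1 - beta) * X + beta * row_mean

# Toy binary peak-by-cell matrix and a small grid of candidate beta values.
X = (np.random.default_rng(0).random((20, 30)) < 0.3).astype(float)
betas = [0.1, 0.5, 0.9]
errors = {b: leave_out_error(X, lambda M, b=b: toy_impute(M, b)) for b in betas}
best_beta = min(errors, key=errors.get)
```

Using the same seed for every β keeps the held-out entries identical across candidates, so the errors are directly comparable.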

For large-scale datasets, we recommend using peaks from a single chromosome.
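
One possible way to restrict the input to a single chromosome is sketched below. The helper `subset_chromosome` is hypothetical, and it assumes that each peak region string begins with the chromosome name followed by "_", ":" or "-" (for example "chr1_100_600"); adapt the pattern if the regions are formatted differently. Peak indices are renumbered so that they still start at 1.

```python
import numpy as np
import pandas as pd

def subset_chromosome(Count_df, peaks, chrom):
    """Keep only peaks on `chrom` and renumber the Peaks column from 1."""
    # Assumed region format: "<chrom>_<start>_<end>" or "<chrom>:<start>-<end>".
    peak_chrom = peaks.str.extract(r"^([^_:\-]+)")[0]
    keep = (peak_chrom == chrom).to_numpy()
    old_idx = np.flatnonzero(keep) + 1                     # 1-based kept peaks
    new_idx = pd.Series(np.arange(1, old_idx.size + 1), index=old_idx)
    sub = Count_df[Count_df["Peaks"].isin(old_idx)].copy()
    sub["Peaks"] = sub["Peaks"].map(new_idx).astype(int)   # old -> new index
    return sub.reset_index(drop=True), peaks[keep].reset_index(drop=True)

# Toy example with illustrative region names.
peaks = pd.Series(["chr1_100_600", "chr1_900_1400", "chr2_50_550", "chr2_800_1300"])
Count_df = pd.DataFrame({
    "Peaks": [1, 2, 2, 3, 4],
    "Cells": [1, 1, 2, 3, 3],
    "Counts": [1, 2, 1, 1, 3],
})
sub_df, sub_peaks = subset_chromosome(Count_df, peaks, "chr2")
```

The cell indices are left untouched, so the original series of cell barcodes can be reused with the subset.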

Run scAND

Run_scAND(Count_df, d, weights, cells, peaks, random_seed=2019, L2_norm=True, Binary=True, Graph_norm=True, return_peaks=False, verbose=True)

Parameters of the Run_scAND() function:

  1. Count_df: A pandas.DataFrame() with three columns: Peaks, Cells, Counts (peak and cell indices start at 1). Each row of the DataFrame represents one entry of the scATAC-seq data.
  2. d: The number of dimensions of the low-dimensional scAND representation.
  3. weights: The parameter β of the scAND model, given as a list() or an np.array(). scAND can compute results for several values of β in a single run at very little additional computational cost.
  4. cells: A pandas.Series() holding the cell barcodes.
  5. peaks: A pandas.Series() holding the peak regions.
  6. random_seed: The random seed.
  7. Binary: Logical, should binarization be applied to the scATAC-seq matrix?
  8. Graph_norm: Logical, should graph normalization be applied to the adjacency matrix of the network?
  9. return_peaks: Logical, should the scAND representation of peaks be returned?
  10. return_Norm_factor: Logical, should the normalization factor of the graph normalization process be returned?
  11. verbose: Logical, should the function print progress messages while running?
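
Before calling Run_scAND(), it can help to check that the inputs match the layout described in the parameter list. The helper below, `check_scand_inputs`, is a hypothetical convenience function and not part of scAND; it only validates the column names, the 1-based index ranges, and the shape of `weights`.

```python
import numpy as np
import pandas as pd

def check_scand_inputs(Count_df, cells, peaks, weights):
    """Raise ValueError if the inputs do not match the documented layout.
    Returns `weights` as a 1-D float array."""
    required = ["Peaks", "Cells", "Counts"]
    missing = [c for c in required if c not in Count_df.columns]
    if missing:
        raise ValueError(f"Count_df is missing columns: {missing}")
    if not Count_df["Peaks"].between(1, len(peaks)).all():
        raise ValueError("Peaks indices must lie in 1..len(peaks)")
    if not Count_df["Cells"].between(1, len(cells)).all():
        raise ValueError("Cells indices must lie in 1..len(cells)")
    w = np.asarray(weights, dtype=float)
    if w.ndim != 1 or w.size == 0:
        raise ValueError("weights must be a non-empty list or 1-D array")
    return w

# Toy inputs (illustrative values only).
cells = pd.Series(["cell_a", "cell_b"])
peaks = pd.Series(["chr1_100_600", "chr1_900_1400"])
Count_df = pd.DataFrame({"Peaks": [1, 2, 2], "Cells": [1, 1, 2], "Counts": [1, 1, 2]})
weights = check_scand_inputs(Count_df, cells, peaks, [0.1, 0.5])

# With scAND available, the call would then follow the signature shown above:
# Run_scAND(Count_df, d, weights, cells, peaks, ...)
```

Passing several values in `weights` matches the note above that multiple β values can be evaluated in one run.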

Visualization and Clustering