EpiScanpy – Epigenomics single cell analysis in python¶
EpiScanpy is a toolkit to analyse single-cell open chromatin (scATAC-seq) and single-cell DNA methylation (for example scBS-seq) data. EpiScanpy is the epigenomic extension of the widely used scRNA-seq analysis tool Scanpy (Genome Biology, 2018) [Wolf18]. For more information, see the Scanpy documentation.
The documentation for epiScanpy is available here. If epiScanpy is useful to your research, consider citing epiScanpy.
Report issues and access the code on GitHub.
Version 0.2.0 August 7, 2020¶
This release deals with compatibility problems with the latest version of Scanpy. Additionally, it contains new features to quickly build custom count matrices (bld_mtx_fly), to convert snap files into h5ad files (snap2anndata), and to build gene activity matrices (geneactivity).
Version 0.1.8 November 5, 2019¶
Release of new processing functions and quality controls.
Version 0.1.7 November 5, 2019¶
Release for the SCOG epiScanpy Hackathon in Saarbrücken.
This version is not fully compatible with previous versions.
Version 0.1.0 May 10, 2019¶
Initial release.
Tutorials¶
Single cell ATAC-seq¶
To get started, we recommend epiScanpy’s analysis pipeline for scATAC-seq data from Buenrostro et al. [Buenrostro18]. The dataset consists of ~3000 cells of human PBMCs. This tutorial focuses on preprocessing, clustering, identification of cell types via known marker genes, and trajectory inference. The tutorial can be found here.

To see how to build count matrices from ATAC-seq BAM files for different sets of annotations (such as enhancers), see the tutorial here.
A tutorial providing a function to quickly build custom count matrices from standard 10x single-cell ATAC output will be available soon.



An additional tutorial on processing and clustering count matrices from the Cusanovich mouse scATAC-seq atlas [Cusanovich18] can be found here.
Single cell DNA methylation¶
Here you can find a tutorial for the preprocessing, clustering and identification of cell types for single-cell DNA methylation data using the publicly available data from Luo et al. [Luo17].
The first tutorial shows how to build the count matrices for the different feature spaces (windows, promoters) in different cytosine contexts. Here is the tutorial.
A second tutorial, which will be available very soon, shows how to use these count matrices and compare the results. The data come from mouse brain (frontal cortex).




Usage Principles¶
Import the epiScanpy API as:
import episcanpy.api as epi
import anndata as ad
Workflow¶
The first step is to build the count matrix. Because single-cell epigenomic data types have different characteristics (count data in ATAC-seq versus methylation level in DNA methylation, for example), epiScanpy implements -omic specific approaches to build the count matrix.
All functions that build count matrices (for ATAC, methylation or other data) are accessed through epi.ct (ct = count).
The first step is to load an annotation and then build the count matrix, which is either methylation or ATAC-seq specific, e.g.:
epi.ct.load_features(file_features, **tool_params) # to load annotation files
epi.ct.build_count_mtx(cell_file_names, omic="ATAC") # to build the ATAC-seq count matrix
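As a rough illustration of what building a count matrix involves, the sketch below counts, for each cell, the reads that overlap each feature. It is plain Python and not epiScanpy's implementation: the names `overlaps`, `build_counts`, `features` and `reads_per_cell` are hypothetical, and coordinates are assumed to be 0-based, half-open intervals.

```python
# Hypothetical illustration, not the epi.ct implementation.
# Features and reads are (chromosome, start, end), 0-based and half-open.

def overlaps(a, b):
    """Return True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def build_counts(features, reads_per_cell):
    """Return {cell: [read count per feature]}, in the order of `features`."""
    return {
        cell: [sum(overlaps(read, feat) for read in reads) for feat in features]
        for cell, reads in reads_per_cell.items()
    }

features = [("chr1", 0, 100), ("chr1", 100, 200), ("chr2", 0, 100)]
reads_per_cell = {
    "cell_a": [("chr1", 10, 60), ("chr1", 90, 140), ("chr2", 5, 55)],
    "cell_b": [("chr1", 150, 200)],
}
counts = build_counts(features, reads_per_cell)
print(counts)
```

A read spanning a feature boundary (here chr1:90-140) is counted in both features it touches. Real data would use sparse matrices and indexed files rather than nested loops.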
If you have an already built matrix, you can load it with any additional metadata (such as cell annotations or batches).
The count matrix, whether constructed or loaded, is stored together with any additional information (such as cell annotations or batches) as an AnnData object. All functions for quality control and preprocessing are called using epi.pp (pp = preprocessing).
To visualise how common the features are and how coverage is distributed across the count matrix, use:
epi.pp.commoness_features(adata, **processing_params)
epi.pp.coverage_cells(adata, **processing_params)
To remove low-quality cells and rarely covered features, you can use the following functions:
epi.pp.filter_cells(adata, min_features=10)
epi.pp.filter_features(adata, min_cells=10)
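The logic behind these two thresholds can be sketched on a small list-of-lists matrix (rows are cells, columns are features). This is an illustration under simple assumptions, not the epi.pp code: the helpers below are hypothetical re-implementations that only share their names and the `min_features` / `min_cells` parameters with the calls above.

```python
# Hypothetical illustration of the filtering logic; not the epi.pp implementation.

def filter_cells(matrix, min_features):
    """Keep rows (cells) with at least `min_features` non-zero entries."""
    return [row for row in matrix if sum(v > 0 for v in row) >= min_features]

def filter_features(matrix, min_cells):
    """Keep columns (features) that are non-zero in at least `min_cells` rows."""
    n_features = len(matrix[0]) if matrix else 0
    keep = [j for j in range(n_features)
            if sum(row[j] > 0 for row in matrix) >= min_cells]
    return [[row[j] for j in keep] for row in matrix]

matrix = [
    [1, 0, 1, 1],  # 3 open features
    [0, 0, 1, 0],  # 1 open feature: removed with min_features=2
    [1, 1, 1, 0],  # 3 open features
]
cells_kept = filter_cells(matrix, min_features=2)
filtered = filter_features(cells_kept, min_cells=2)
print(filtered)
```

The order matters: filtering cells first changes how many cells cover each feature, so the feature filter is applied to the already reduced matrix.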
To reduce the feature space to the most variable features:
epi.pl.cal_var(adata)
epi.pp.select_var_feature(adata, max_score=0.2, nb_features=50000)
The next step is the calculation of tSNE, UMAP, PCA, etc. Because epiScanpy is embedded into Scanpy, we mostly use Scanpy functions, which are called using sc.tl (tl = tool) [Wolf18]; see the Scanpy usage principles (https://scanpy.readthedocs.io/en/latest/basic_usage.html). For example, cell-cell distance calculations and low-dimensional representations operate on the adata object, which stores n_obs observations (cells) of n_vars variables (expression, methylation, chromatin features). For each tool in sc.tl, there is typically an associated plotting function in sc.pl (pl = plot):
epi.pp.pca(adata, n_comps=100, svd_solver='arpack')
epi.pp.neighbors(adata, n_neighbors=15)
epi.tl.tsne(adata, **tool_params)
epi.pl.tsne(adata, **plotting_params)
There are also epiScanpy-specific tools and plotting functions that can be accessed using epi.tl and epi.pl:
epi.tl.silhouette(adata, **tool_params)
epi.pl.silhouette(adata, **plotting_params)
epi.pl.prct_overlap(adata, **plotting_params)
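For readers unfamiliar with silhouette scores, the following sketch computes the standard per-sample silhouette value, s = (b - a) / max(a, b), where a is the mean distance to the other points of the same cluster and b is the smallest mean distance to any other cluster. It is a plain-Python illustration of the textbook definition, not necessarily how epi.tl.silhouette computes it; the function names and the convention of returning 0.0 for degenerate cases are assumptions.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def silhouette_samples(points, labels):
    """Return one silhouette value per point (textbook definition)."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Distances from point i to every other point, grouped by cluster.
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(euclidean(p, q))
        own = by_cluster.pop(label, [])
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster, or only one cluster overall
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_samples(points, [0, 0, 1, 1])  # compact, well separated
bad = silhouette_samples(points, [0, 1, 0, 1])   # clusters cut across the gap
print(good, bad)
```

Values close to 1 indicate cells that sit well inside their cluster, while negative values suggest a cell is closer to another cluster than to its own.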
Data structure¶
Similarly to Scanpy, the methylation and ATAC-seq matrices are stored as AnnData objects. For more information on the data structure, see the anndata documentation (https://anndata.readthedocs.io/en/latest/).
System Requirements¶
Hardware requirements¶
The epiScanpy package requires only a standard computer with enough RAM to support the in-memory operations.
Software requirements¶
OS requirements¶
This package is supported on macOS and Linux. It has been tested on the following systems:
- macOS: Mojave and Catalina (10.14 to 10.15.4)
Python Dependencies¶
EpiScanpy requires a working version of Python (>= 3.6).
Additionally, epiScanpy depends on the following Python packages:
anndata
matplotlib
numpy
pandas
pyliftOver
pysam
scanpy
scipy
scikit-learn
seaborn
bamnostic
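A quick way to see which of these dependencies are missing from the current environment is sketched below, using only the standard library. The import names are assumptions where they differ from the package names above: scikit-learn is imported as `sklearn`, and the pyliftOver entry is assumed to be importable as `pyliftover`. The helper `missing_packages` is hypothetical.

```python
import importlib.util

# Import names assumed for the packages listed above
# ("scikit-learn" -> "sklearn", "pyliftOver" -> "pyliftover").
IMPORT_NAMES = [
    "anndata", "matplotlib", "numpy", "pandas", "pyliftover", "pysam",
    "scanpy", "scipy", "sklearn", "seaborn", "bamnostic",
]

def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(IMPORT_NAMES)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

This only checks that a module can be located; it does not verify versions or that the import succeeds.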
Installation¶
Anaconda¶
If you do not have a working Python 3.5 or 3.6 installation, consider installing Miniconda (see Installing Miniconda). Then run:
conda install seaborn scikit-learn statsmodels numba
conda install -c conda-forge python-igraph louvain
conda create -n scanpy python=3.6 scanpy
Finally, run:
conda install -c annadanese episcanpy
Pull epiScanpy from PyPI (consider using pip3 to access Python 3):
pip install episcanpy
Github¶
You can also install epiScanpy directly from GitHub:
pip install git+https://github.com/colomemaria/epiScanpy
API¶
Import epiScanpy’s high-level API as:
import episcanpy.api as epi
Count Matrices: CT¶
Loading data, loading annotations, building count matrices, and filtering of lowly covered methylation variables and lowly covered cells.
Building count matrices¶
- Quickly build a count matrix from a tsv/tbi file.
- Build a count matrix on the fly.
Load features¶
To build a count matrix for either methylation or open chromatin data, loading the segmentation of the genome of interest or the set of features of interest is a prerequisite.
- Transform a bed file into a usable set of units to measure methylation levels.
- Generate windows/bins of the given size for the appropriate genome (default choice is human).
- If the loaded features are too small or of different sizes, normalise them to a single given size by extending the feature coordinates in both directions.
- Plot the different feature sizes in a histogram.
- Extract the names of the loaded features, specifying the chromosome they originated from.
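Two of the operations described above, tiling a genome into fixed-size windows and normalising features to a single size, can be sketched as follows. This is a plain-Python illustration, not the epi.ct implementation: `make_windows`, `resize_feature` and the toy chromosome sizes are hypothetical, and coordinates are assumed to be 0-based, half-open.

```python
# Hypothetical illustration; not the epi.ct implementation.

def make_windows(chrom_sizes, window_size):
    """Return (chrom, start, end) windows tiling each chromosome."""
    windows = []
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, window_size):
            # The last window is truncated at the chromosome end.
            windows.append((chrom, start, min(start + window_size, size)))
    return windows

def resize_feature(feature, size):
    """Return a feature of `size` bp centred on the midpoint of the input."""
    chrom, start, end = feature
    mid = (start + end) // 2
    new_start = max(0, mid - size // 2)  # do not extend before position 0
    return (chrom, new_start, new_start + size)

windows = make_windows({"chrA": 250}, window_size=100)
resized = resize_feature(("chrA", 120, 140), size=100)
print(windows, resized)
```

For brevity, `resize_feature` does not clip at the chromosome end; a real implementation would need the chromosome sizes for that.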
Reading methylation file¶
Functions to read methylation files, extract methylation levels and build the count matrices:
- Build a methylation count matrix for a given annotation.
- Read the file from which to extract methylation levels and (assuming an Ecker/Methylpy-like format) extract the number of methylated reads and the total number of reads for the covered cytosines in the right genomic context (CG or CH). The sample_name parameter is the name of the file to read.
- Read the raw count matrix and convert it into an AnnData object.
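The quantity extracted from a methylation file can be illustrated with a small sketch: for a given feature and cytosine context, the methylation level is the number of methylated reads divided by the total number of reads over the covered cytosines. The record layout below (chromosome, position, context, methylated reads, total reads) is a simplified assumption and not the actual file format, and `methylation_level` is a hypothetical helper.

```python
# Hypothetical illustration with a simplified record layout:
# (chromosome, position, context, methylated_reads, total_reads)

def methylation_level(records, feature, context):
    """Return methylated/total reads for cytosines of `context` in `feature`.

    Returns None when no cytosine of that context is covered in the feature.
    """
    chrom, start, end = feature
    methylated = total = 0
    for r_chrom, pos, r_context, n_meth, n_total in records:
        if r_chrom == chrom and start <= pos < end and r_context == context:
            methylated += n_meth
            total += n_total
    return methylated / total if total else None

records = [
    ("chr1", 10, "CG", 3, 4),
    ("chr1", 20, "CH", 0, 5),
    ("chr1", 30, "CG", 1, 4),
    ("chr1", 150, "CG", 2, 2),
]
level_cg = methylation_level(records, ("chr1", 0, 100), "CG")
level_ch = methylation_level(records, ("chr1", 0, 100), "CH")
uncovered = methylation_level(records, ("chr2", 0, 100), "CG")
print(level_cg, level_ch, uncovered)
```

Features without coverage yield a missing value rather than zero, which is why methylation matrices typically need an imputation step before clustering.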
Reading open chromatin(ATAC) file¶
ATAC-seq specific functions to build count matrices and load data:
- Build a count matrix one set of features at a time.
- Convert a regular ATAC matrix into a sparse AnnData.
General functions¶
Functions that are not -omic specific:
- Convert a regular ATAC matrix into a sparse AnnData.
Preprocessing: PP¶
Imputing missing data (methylation), filtering lowly covered cells or variables, correction for batch effect.
- Histogram of the number of open features (in the case of ATAC-seq data) per cell.
- Correlation between a given PC and a covariate.
- Display how often a feature is measured as open (for ATAC-seq).
- Display how often a feature is measured as open (for ATAC-seq).
- Compute a variability score to rank the most variable features across all cells.
- Show distribution plots of cells sharing features and variability score.
- Compute a variability score to rank the most variable features across all cells.
- Convert the count matrix into a binary matrix.
- Automatically compute PCA coordinates, loadings and variance decomposition, a neighborhood graph of observations, t-distributed stochastic neighborhood embedding (tSNE) and Uniform Manifold Approximation and Projection (UMAP).
- Load observational metadata in adata.obs.
- Load sparse matrix (including matrices corresponding to 10x data) as AnnData objects.
- Filter cell outliers based on counts and numbers of genes expressed.
- Filter features based on number of cells or counts.
- Normalize counts per cell.
- Principal component analysis [Pedregosa11].
- Normalize total counts per cell.
- Regress out unwanted sources of variation.
- Subsample to a fraction of the number of observations.
- Downsample counts from count matrix.
- Compute a neighborhood graph of observations [McInnes18].
- Transform adata.X from a matrix or array to a csc sparse matrix.
- Transform adata.X from a matrix or array to a csc sparse matrix.
Methylation matrices¶
Methylation specific count matrices.
- Impute missing values in methylation level matrices.
- Read the raw count matrix and convert it into an AnnData object.
- Temporary function to load and impute a methylation count matrix into an AnnData object.
Tools: TL¶
- Wrapper around Scanpy's sc.tl.rank_genes_groups function.
- Automatically compute PCA coordinates, loadings and variance decomposition, a neighborhood graph of observations, t-distributed stochastic neighborhood embedding (tSNE) and Uniform Manifold Approximation and Projection (UMAP).
- Convert a list of known cell type markers from the literature to a dictionary. Input: list of known marker genes. The first row is considered the header.
- Use markers of a given cell type to plot peak openness for peaks in promoters of the given markers. Input: cell type, cell type markers, peak promoter intersections.
- Deprecated - Please use epi.tl.var_features_to_genes instead.
- Once you called the most variable features.
- Merge values of peaks/windows/features overlapping gene bodies + 2 kb upstream.
- Diffusion Maps [Coifman05] [Haghverdi15] [Wolf18].
- Force-directed graph drawing [Islam11] [Jacomy14] [Chippada18].
- t-SNE [Maaten08] [Amir13] [Pedregosa11].
- Embed the neighborhood graph using UMAP [McInnes18].
- Infer progression of cells through geodesic distance along the graph [Haghverdi16] [Wolf19].
- Cluster cells into subgroups [Blondel08] [Levine15] [Traag17].
- Cluster cells into subgroups [Traag18].
- Compute kmeans clustering using X_pca fits.
- Compute hierarchical clustering using X_pca fits.
- Test different settings of louvain to obtain the target number of clusters.
- Compute a hierarchical clustering for the given groupby categories.
- Compute Adjusted Rand Index.
- Compute Adjusted Mutual Information.
- Compute homogeneity score.
- Compute silhouette scores.
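As an illustration of one of the clustering metrics listed above, the Adjusted Rand Index can be computed from the contingency table of two labelings using the standard pair-counting formula. The sketch below is plain Python and is not the epi.tl implementation; the helper names are hypothetical, and returning 1.0 for degenerate inputs is an assumed convention.

```python
from collections import Counter

def comb2(n):
    """Number of unordered pairs that can be drawn from n items."""
    return n * (n - 1) // 2

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same cells."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # assumed convention for trivial input
    # Pairs of cells grouped together in both, in A only, and in B only.
    sum_cells = sum(comb2(c) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb2(c) for c in Counter(labels_a).values())
    sum_b = sum(comb2(c) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb2(n)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)

ari_split = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
ari_renamed = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
print(ari_split, ari_renamed)
```

The index depends only on which cells are grouped together, so renaming cluster labels leaves the score at 1, while splitting one cluster lowers it.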
Plotting: PL¶
The plotting module episcanpy.plotting largely parallels the tl.* and a few of the pp.* functions.
For most tools and for some preprocessing functions, you’ll find a plotting function with the same name.
- Scatter plot in PCA coordinates.
- Plot PCA results.
- Plot the variance ratio.
- Scatter plot in tSNE basis.
- Scatter plot in UMAP basis.
- Plot ranking of features.
- Plot ranking of features for all tested comparisons.
- Plot ranking of features using a dot plot.
- Plot ranking of features using a stacked violin plot.
- Plot ranking of features using a matrix plot.
- Plot ranking of features using a heatmap plot.
- Plot ranking of features using a heatmap plot.
- Show distribution plots of cells sharing features and variability score.
- Violin plot.
- Scatter plot along observations or variables axes.
- Plot rankings.
- Hierarchically-clustered heatmap.
- Heatmap of the expression values of genes.
- Make a dot plot of the expression values of var_names.
- Create a heatmap of the mean expression values per cluster of each var_names. If groupby is not given, the matrixplot assumes that all data belongs to a single category.
- In this type of plot, each var_name is plotted as a filled line plot where the y values correspond to the var_name values and x is each of the cells.
- Plot a dendrogram of the categories defined in groupby.
- Plot the correlation matrix computed as part of sc.tl.dendrogram.
- Percentage or cell count corresponding to the overlap of different cell types between two sets of annotations/clusters.
- Heatmap of the cluster correspondence between two sets of annotations.
- Plot the output of tl.silhouette as a silhouette plot.
- Compute silhouette scores and plot them.
- Show distribution plots of cells sharing features and variability score.
- Compute a variability score to rank the most variable features across all cells.
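The overlap between two sets of annotations mentioned in the plotting functions above can be tabulated before plotting. The sketch below computes, for each label of the first annotation, the percentage of its cells assigned to each label of the second. It is a plain-Python illustration; `percent_overlap` is a hypothetical helper, and normalising by the first annotation is an assumption that may differ from what epi.pl.prct_overlap does.

```python
from collections import Counter

def percent_overlap(annot_a, annot_b):
    """Return {(a, b): percentage of cells labelled a that are also labelled b}."""
    pair_counts = Counter(zip(annot_a, annot_b))
    totals = Counter(annot_a)
    return {(a, b): 100.0 * count / totals[a]
            for (a, b), count in pair_counts.items()}

# Toy annotations: one entry per cell, in the same cell order.
cell_types = ["B", "B", "T", "T", "T", "NK"]
clusters = ["0", "0", "1", "1", "0", "2"]
overlap = percent_overlap(cell_types, clusters)
for (cell_type, cluster), pct in sorted(overlap.items()):
    print(cell_type, cluster, round(pct, 1))
```

Pairs that never co-occur are simply absent from the result, which corresponds to empty cells in a heatmap.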
References¶
- Angerer16
Angerer et al. (2016), destiny – diffusion maps for large-scale single-cell data in R, Bioinformatics.
- Cusanovich18
Cusanovich, D. A. et al. A Single-Cell Atlas of In Vivo Mammalian Chromatin Accessibility. Cell 174, 1309–1324.e18 (2018).
- Luo17
Luo, C. et al. Single-cell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex. Science 357, 600–604 (2017).
- Wolf18
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
- Buenrostro18
Buenrostro, J. D. et al. Integrated Single-Cell Analysis Maps the Continuous Regulatory Landscape of Human Hematopoietic Differentiation. Cell 173 (2018).