Quality Control for Functional Genomics

This tutorial is a continuation of Data preprocessing.

Learning outcomes

  • apply signal-structure-agnostic quality control methods used in functional genomics to an example ATAC-seq data set

  • become accustomed to working on the Rackham cluster



Figure: concept map of the data analysis workflow (workflow-proc.png)

The aim of this part of the data analysis workflow is to perform general, signal-structure-agnostic (i.e. peak-independent) quality control (lower-right part of the concept map). This includes:

  • assessment of read coverage along the genome;

  • replicate congruency.

Basic read count statistics were already collected in Data preprocessing.


Important

We assume that the environment and directory structure have already been set up in Data preprocessing.

Cumulative Enrichment

Cumulative enrichment, also known as BAM fingerprint, is a way of assessing the quality of signal concentrated predominantly in a small fraction of the genome (such as peaks detected in ATAC-seq and ChIP-seq). It determines how well the signal in the sample can be differentiated from the background.

Cumulative enrichment is obtained by sampling indexed BAM files and plotting a profile of cumulative read coverages for each sample. All reads overlapping a window (bin) of the specified length are counted; these counts are sorted and the cumulative sum is plotted.
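The sort-and-accumulate step described above can be sketched on a handful of made-up bin counts. This is only an illustration of the idea, not how deepTools implements plotFingerprint; the function name cumulative_fraction and the toy counts are assumptions introduced here.

```shell
# Toy sketch: turn per-bin read counts (one per line on stdin) into a
# cumulative-enrichment profile. Hypothetical helper, not part of deepTools.
# Output columns: fraction of bins considered, fraction of all reads they hold.
cumulative_fraction() {
  LC_ALL=C sort -n | LC_ALL=C awk '
    { total += $1; cum[NR] = total }          # running sum over the sorted counts
    END {
      if (total == 0) exit                    # nothing to normalise
      for (i = 1; i <= NR; i++)
        printf "%.2f\t%.2f\n", i / NR, cum[i] / total
    }'
}

# Five made-up bins: most reads fall into a single bin, as in a sample with strong peaks.
printf '%s\n' 0 1 1 2 16 | cumulative_fraction
```

A profile that stays close to zero for most bins and rises steeply at the end means that the reads are concentrated in a few bins; uniform coverage would instead give a straight diagonal.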

To compute cumulative enrichment for the processed BAM files in our ATAC-seq data set, run the commands below. They assume we are in the directory analysis; if you have followed the previous tutorial, move one directory level up first (cd ..). Here we use the files preprocessed earlier:

mkdir deepTools
cd deepTools

#link necessary files to avoid long paths in commands
ln -s ../../data_proc/* .

module load deepTools/3.3.2

plotFingerprint --bamfiles ENCFF363HBZ.chr14.proc.bam ENCFF398QLV.chr14.proc.bam ENCFF045OAB.chr14.proc.bam ENCFF828ZPN.chr14.proc.bam \
 --binSize=1000 --plotFile NKcellsATAC_chr14.fingerprint.pdf \
 --labels ENCFF363HBZ ENCFF398QLV ENCFF045OAB ENCFF828ZPN -p 5 &> fingerprint.log
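plotFingerprint needs an index for every BAM file. If the command fails (see fingerprint.log), a quick check such as the sketch below lists BAM files that lack an index. missing_index is a hypothetical helper written for this illustration, and it assumes the index sits next to the BAM file as either sample.bam.bai or sample.bai.

```shell
# Hypothetical helper: print every BAM file given as an argument
# that has no index file next to it.
missing_index() {
  for bam in "$@"; do
    # accept both naming conventions: sample.bam.bai and sample.bai
    if [ ! -e "${bam}.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
      printf '%s\n' "$bam"
    fi
  done
}

# Check the files used in this tutorial (run inside the deepTools directory).
missing_index *.chr14.proc.bam
```

Any file listed can be indexed with samtools index, provided samtools is available in your environment.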

You can copy the resulting file to your local system to view it.

Have a look at NKcellsATAC_chr14.fingerprint.pdf, read the section "What the plots tell you" in the deepTools documentation, and answer:

  • does it indicate a good sample quality, i.e. signal present in narrow regions?

Replicate Clustering

To assess the overall similarity between libraries from different samples, one can compute sample clustering heatmaps using multiBamSummary (in bins mode) and plotCorrelation from deepTools.

In this method the genome is divided into bins of a specified size (--binSize parameter) and the reads mapped to each bin are counted. The resulting signal profiles are used to cluster the libraries and identify groups with similar signal profiles.
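The binning step can be illustrated with a few made-up read start positions. This is a simplified sketch, not the multiBamSummary implementation: count_per_bin is a hypothetical helper, each read is assigned to a single bin by its start coordinate, and bins without reads are not printed.

```shell
# Hypothetical helper: count reads per fixed-size bin.
# stdin: one 0-based read start position per line; $1: bin size in bp.
# Output columns: bin start coordinate, number of reads starting in that bin.
count_per_bin() {
  LC_ALL=C awk -v size="$1" '
    { n[int($1 / size)]++ }                   # bin index = position / bin size, rounded down
    END { for (b in n) printf "%d\t%d\n", b * size, n[b] }
  ' | LC_ALL=C sort -n
}

# Made-up positions counted in 5000 bp bins.
printf '%s\n' 120 4999 5000 7300 7400 15010 | count_per_bin 5000
```

Running this on each library gives one column of counts per sample; these columns are the signal profiles that are compared between libraries.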

We chose to compute pairwise Spearman correlation coefficients for this step, as they are based on the ranks of the bins rather than on the signal values, which makes them less sensitive to a few bins with extreme counts.
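To see what working on ranks means in practice, the sketch below computes a Spearman coefficient for two made-up columns of bin counts: each column is ranked (tied values share the mean rank) and the Pearson correlation of the ranks is reported. spearman is a hypothetical helper for illustration only; plotCorrelation performs this calculation for you.

```shell
# Hypothetical helper: Spearman correlation of two columns of bin counts.
# stdin: one bin per line, two whitespace-separated counts (sample 1, sample 2).
spearman() {
  LC_ALL=C awk '
    # average rank of element i within array v (tied values share the mean rank)
    function rank(v, i, n,    j, less, same) {
      less = 0; same = 0
      for (j = 1; j <= n; j++) {
        if (v[j] < v[i]) less++
        else if (v[j] == v[i]) same++
      }
      return less + (same + 1) / 2
    }
    { x[NR] = $1; y[NR] = $2 }
    END {
      n = NR
      for (i = 1; i <= n; i++) {
        rx = rank(x, i, n); ry = rank(y, i, n)
        sx += rx; sy += ry; sxx += rx * rx; syy += ry * ry; sxy += rx * ry
      }
      # Pearson correlation computed on the ranks
      den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
      if (den > 0) printf "%.2f\n", (n * sxy - sx * sy) / den
    }'
}

# Made-up counts for five bins; the second sample has one extreme bin.
printf '1\t2\n2\t4\n3\t5\n4\t9\n5\t400\n' | spearman
```

The extreme count of 400 in the second column has no effect on the coefficient, because only the ordering of the bins within each sample matters.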

To save time, in this part we use the BAM files filtered previously.

multiBamSummary bins --bamfiles ENCFF363HBZ.chr14.proc.bam ENCFF398QLV.chr14.proc.bam ENCFF045OAB.chr14.proc.bam ENCFF828ZPN.chr14.proc.bam \
 --labels ENCFF363HBZ ENCFF398QLV ENCFF045OAB ENCFF828ZPN \
 --outFileName multiBamArray_NKcellsATAC_chr14.npz --binSize 5000 -p 5 &> multiBamSummary.log


plotCorrelation --corData multiBamArray_NKcellsATAC_chr14.npz \
 --plotFile NKcellsATAC_chr14_correlation_bin.pdf --outFileCorMatrix NKcellsATAC_chr14_correlation_bin.txt \
 --whatToPlot heatmap --corMethod spearman

You can copy the resulting file to your local system to view it.

What do you think?

  • which samples are similar?

  • are the clustering results as you would have expected them to be?

In addition to these general procedures, several specialised, assay-specific quality metrics exist, which probe signal characteristics related to each method. These are key QC metrics for evaluating the experiment and should always be collected during the QC step. The method-specific tutorials are: ATACseq and ChIPseq.

We can now continue with the ATACseq-specific QC methods.