Quality Control for Functional Genomics

This tutorial is a continuation of Data preprocessing.

Learning outcomes

  • apply quality control methods that are agnostic to signal structure and commonly used in functional genomics, using ATAC-seq data as an example

  • become accustomed to working on the Rackham cluster



Figure: concept map of the data analysis workflow (workflow-proc.png).

The aim of this part of the data analysis workflow is to perform general quality control that is agnostic to signal structure, i.e. peak-independent (lower-right part of the concept map). This includes:

  • assessment of read coverage along the genome;

  • replicate congruency.

Basic read count statistics were already collected in Data preprocessing.


Important

We assume that the environment and directory structure have already been set up in Data preprocessing.

Cumulative Enrichment

Cumulative enrichment, also known as a BAM fingerprint, is a way of assessing the quality of a signal concentrated predominantly in a small fraction of the genome (such as peaks detected in ATAC-seq and ChIP-seq). It determines how well the signal in the sample can be differentiated from the background.

Cumulative enrichment is obtained by sampling indexed BAM files and plotting a profile of cumulative read coverages for each sample. All reads overlapping a window (bin) of the specified length are counted; these counts are sorted and the cumulative sum is plotted.
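The calculation described above can be sketched on toy numbers. This is a minimal illustration only, not the deepTools implementation: the bin counts are made up and the function name cumulative_fraction is hypothetical.

```shell
# Toy illustration of a fingerprint curve; NOT the deepTools implementation.
# Input on stdin: one read count per bin (made-up numbers), one per line.
cumulative_fraction() {
    sort -n | awk '
        { counts[NR] = $1; total += $1 }
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += counts[i]
                # column 1: fraction of bins considered so far (lowest counts first)
                # column 2: fraction of all reads contained in those bins
                printf "%.2f\t%.2f\n", i / NR, running / total
            }
        }'
}

# Eight background bins and two bins with strong signal (hypothetical counts)
fingerprint=$(printf '%s\n' 1 0 2 1 40 1 2 1 50 2 | cumulative_fraction)
echo "$fingerprint"
```

In this toy example the 80% of bins with the lowest coverage hold only 10% of the reads, so the curve stays low and rises steeply at the end. Evenly distributed reads would instead give a curve close to the diagonal.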

We now compute cumulative enrichment for the processed BAM files in our ATAC-seq data set. The commands assume that we are in the directory analysis; if you have followed the previous tutorial, first move one directory level up with cd .. before running them. Here we use the files preprocessed earlier:

mkdir deepTools
cd deepTools

#link necessary files to avoid long paths in commands
ln -s ../../data_proc/* .

module load deepTools/3.5.6

plotFingerprint --bamfiles SRR17296554.filt.chr1.bam SRR17296555.filt.chr1.bam SRR17296556.filt.chr1.bam SRR17296557.filt.chr1.bam \
 --binSize=1000 --plotFile Invivo_proc.fingerprint.pdf \
 --smartLabels -p 5 &> fingerprint.log
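plotFingerprint samples indexed BAM files, so a missing index is one thing worth ruling out if the command fails (check fingerprint.log). The sketch below lists BAM files without an index next to them. The helper name missing_bam_indexes is hypothetical, and it assumes the index is stored as sample.bam.bai or sample.bai in the same directory.

```shell
# Hypothetical helper: print BAM files in a directory that have no index file.
# Assumes the index sits next to the BAM file as sample.bam.bai or sample.bai.
missing_bam_indexes() {
    local dir=${1:-.}
    local bam
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue   # skip if the pattern matched nothing
        if [ ! -e "$bam.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
            echo "$bam"
        fi
    done
}

# Check the current directory; no output means every BAM file has an index.
# A missing index can be created with: samtools index <file.bam>
missing_bam_indexes .
```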

You can copy the resulting file to your local system to view it.

Have a look at Invivo_proc.fingerprint.pdf, read "What the plots tell you" in the deepTools documentation, and answer:

  • does it indicate a good sample quality, i.e. signal present in narrow regions?

The plot below shows the non-subset data and includes more samples, as presented in the article.

Replicate Clustering

To assess the overall similarity between libraries from different samples, one can compute sample clustering heatmaps using multiBamSummary (in bins mode) and plotCorrelation from deepTools.

In this method the genome is divided into bins of a specified size (--binSize parameter) and the reads mapped to each bin are counted. The resulting signal profiles are used to cluster the libraries and identify groups with similar signal profiles.
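The binning step can be illustrated on a handful of made-up read positions. This is a simplified sketch, not what multiBamSummary does internally: it assigns each read to a bin by a single position rather than by overlap, and the function name bin_counts is hypothetical.

```shell
# Toy illustration of binning; NOT the multiBamSummary implementation.
# Input on stdin: one read position per line (made-up numbers).
# Output: bin start coordinate and number of reads in that bin, tab-separated.
bin_counts() {
    local bin_size=$1
    awk -v size="$bin_size" '
        BEGIN { max = 0 }
        {
            b = int($1 / size)          # index of the bin containing this position
            bins[b]++
            if (b > max) max = b
        }
        END {
            # report every bin up to the last occupied one, including empty bins
            for (b = 0; b <= max; b++) printf "%d\t%d\n", b * size, bins[b] + 0
        }'
}

# Six hypothetical read positions counted in 5000 bp bins
counts=$(printf '%s\n' 120 4800 5100 5200 9999 15500 | bin_counts 5000)
echo "$counts"
```

Each sample yields one such vector of counts per bin, and these vectors are what gets compared between samples.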

The commands below compute pairwise Pearson correlation coefficients (--corMethod pearson). Spearman correlation (--corMethod spearman) is an alternative that is based on the rank of each bin rather than its signal value.

In this part we use BAM files prepared before the workshop to save time.
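As a toy illustration of comparing two binned profiles, the sketch below computes the Pearson correlation coefficient, the method passed to --corMethod in the commands that follow. It is not how deepTools computes correlations internally; the counts are made up and the function name pearson is hypothetical.

```shell
# Toy illustration; NOT how deepTools computes correlations internally.
# Input on stdin: two tab-separated columns, the read counts per bin
# for two samples (made-up numbers).
pearson() {
    awk -F'\t' '
        { n++; sx += $1; sy += $2; sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2 }
        END {
            num = n * sxy - sx * sy
            den = sqrt(n * sxx - sx * sx) * sqrt(n * syy - sy * sy)
            printf "%.3f\n", num / den
        }'
}

# Hypothetical counts in five bins for two replicates
r=$(printf '%s\t%s\n' 1 2  2 1  3 4  4 3  5 5 | pearson)
echo "Pearson r = $r"
```

For these five bins the coefficient is 0.800. plotCorrelation computes such a coefficient for every pair of samples and clusters the samples on the resulting matrix.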

multiBamSummary bins --bamfiles SRR17296554.filt.chr1.bam SRR17296555.filt.chr1.bam SRR17296556.filt.chr1.bam SRR17296557.filt.chr1.bam \
 --smartLabels \
 --outFileName Invivo_proc.npz --binSize 5000 -p 5 &> multiBamSummary.log


plotCorrelation --corData Invivo_proc.npz \
 --plotFile Invivo_proc_correlation_bin.pdf --outFileCorMatrix Invivo_proc_correlation_bin.txt \
 --whatToPlot heatmap --corMethod pearson --plotNumbers

# set the minimum value of the colour scale to 0.95
# (note: this overwrites the output files of the previous command)
plotCorrelation --corData Invivo_proc.npz \
 --plotFile Invivo_proc_correlation_bin.pdf --outFileCorMatrix Invivo_proc_correlation_bin.txt \
 --whatToPlot heatmap --corMethod pearson --plotNumbers -min 0.95

You can copy the resulting file to your local system to view it.

What do you think?

  • which samples are similar?

  • are the clustering results as you would have expected them to be?

Figure 2. Correlation of binned ATAC-seq signal in the Batf KO in vivo data set. Data subset to chr1. Panels: Pearson correlation; Pearson correlation, min 0.95.

Figure 3. Correlation of binned ATAC-seq signal in the Batf KO in vivo and in vitro data sets. Non-subset data. Panels: Pearson correlation; Spearman correlation.

In addition to these general procedures, several specialised assay-specific quality metrics exist, which probe signal characteristics related to each method. These are key QC metrics for evaluating the experiment and should always be collected during the QC step. The method-specific tutorials are ATACseq and ChIPseq.

We can now continue with the ATACseq-specific QC methods.