Quality Control for ATAC-seq

This tutorial is a continuation of General QC.

Learning outcomes

  • assess quality of the ATAC-seq libraries with a range of quality metrics

  • work interactively with ATAC-seq signal using Integrative Genome Viewer (IGV)



The aim of this part of the data analysis workflow is to collect ATAC-seq specific quality metrics:

  • fragment length distribution;

  • presence of signal in nuclesome-free regions (NFR) and mononucleosome fractions;

  • enrichment of signal in transcription start site (TSS) regions.


Important

We assume that the environment and directory structure has been already set in Data preprocessing.

Fragment Length Distribution

In ATAC-seq experiments, tagmentation of Tn5 transposases produces signature size pattern of fragments derived from nucleosome-free regions (NFR), mononucleosome, dinucleosome, trinucleosome and longer oligonucleosome from open chromatin regions (Figure below, adapted from Li et al ).

Please note the pre-processed BAM files need to be used to get an unbiased distribution of insert fragment size in the ATAC-seq library.

../../../_images/Lietal_atacTn5.png

To compute fragment length distribution for processed bam file in our ATAC-seq data set (assuming we are in drectory analysis):

mkdir QC
cd QC

ln -s ../processedData/ENCFF045OAB.chr14.blacklist_M_filt.mapq5.dedup.bam  .
ln -s ../processedData/ENCFF045OAB.chr14.blacklist_M_filt.mapq5.dedup.bam.bai  .

module load picard/2.23.4

java -Xmx32G -jar $PICARD_HOME/picard.jar CollectInsertSizeMetrics \
 -I ENCFF045OAB.chr14.blacklist_M_filt.mapq5.dedup.bam \
 -O ENCFF045OAB.chr14.proc.fraglen.stats \
 -H ENCFF045OAB.chr14.proc.fraglen.pdf -M 0.5

You can copy the resulting file to your local system to view it.

Have a look at ENCFF045OAB.chr14.proc.fraglen.pdf, and answer

  • does it indicate a good sample quality? is the chromatin structure preserved?

  • what do the periodic peaks correspond to?

Generating this key QC plot is only possible for PE libraries. Can you tell what the peaks at approximately 50bp, 200bp, 400bp and 600bp correspond to?

To give some context compare to plots on Figure 2.

Figure 2. Examples of insert size distribution for ATAC-seq experiments.

Naked DNA

Failed ATAC-seq

Noisy ATAC-seq

Successful ATAC-seq

../../../_images/Screenshot_sizeDistribution_Naked.png ../../../_images/Screenshot_sizeDistribution_Failed.png ../../../_images/Screenshot_sizeDistribution_Failed2.png ../../../_images/Screenshot_sizeDistribution_Good.png


Data Preparation for Probing Signal at TSS

We will be working in R in this section. First, we load the required version together with libraries:

module load R_packages/4.1.1

We activate R console upon typing R in the terminal.

It is possible to use a more sophisticated graphical interface to R, however, some steps in this tutorial can be time consuming, therefore using a simple terminal interface, while not very conveninet, may be more efficient in this case.

We begin by loading necessary libraries:

library(ATACseqQC)
library(BSgenome.Hsapiens.UCSC.hg38)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(ChIPpeakAnno)
library(Rsamtools)

We can now give the path to the processed bam file:

bamFile="ENCFF045OAB.chr14.blacklist_M_filt.mapq5.dedup.bam"
bamFileLabels <- "ENCFF045OAB"

We collect library statistics:

bam_qc=bamQC(bamFile, outPath = NULL)

We can now inspect the statistics:

bam_qc[1:10]

The output:

bam_qc[1:10]

$totalQNAMEs
[1] 1439244

$duplicateRate
[1] 0

$mitochondriaRate
[1] 0

$properPairRate
[1] 1

$unmappedRate
[1] 0

$hasUnmappedMateRate
[1] 0

$notPassingQualityControlsRate
[1] 0

$nonRedundantFraction
[1] 0.9999958

$PCRbottleneckCoefficient_1
[1] 0.9999979

$PCRbottleneckCoefficient_2
[1] 479746

Most of these values are meaningless at this point, as we have already processed the bam file (i.e. removed duplicated fragments etc.) To compare the statistics for the unprocessed file, please see below.

Shiftig and Splitting Aligned Reads

Tagmentation by Tn5 transposase produces 5’ overhang of 9 base long, the coordinates of reads mapping to the positive and negative strands need to be shifted by + 4 and - 5, respectively, to account for the 9-bp duplication created by DNA repair of the nick by Tn5 transposase and achieve base-pair resolution of TF footprint and motif-related analyses.

We perform it at this point to plot signal at TSS, and we save the resulting object for later use.

We create a directory where the processed bam files will be saved:

## files will be saved into outPath respective to the working directory
outPath <- "splitBam"
dir.create(outPath)

First, we collect information on which SAM/BAM tags are present in our bam file:

possibleTag <- combn(LETTERS, 2)
possibleTag <- c(paste0(possibleTag[1, ], possibleTag[2, ]),
                 paste0(possibleTag[2, ], possibleTag[1, ]))

bamTop100 <- scanBam(BamFile(bamFile, yieldSize = 100),
                     param = ScanBamParam(tag = possibleTag))[[1]]$tag
tags <- names(bamTop100)[lengths(bamTop100)>0]

We shift the coordinates only for alignments on chr14, which is where most of our data is:

seqlev <- "chr14"
which <- as(seqinfo(Hsapiens)[seqlev], "GRanges")

We create an object with genomic alignments:

gal <- readBamFile(bamFile, tag=tags, which=which,asMates=TRUE, bigFile=TRUE)

This object is empty, because we used bigFile=TRUE - this is expected, so do not be alarmed.

The function shiftGAlignmentsList in the ATACseqQC package is used for shifting the alignments:

shiftedBamFile <- file.path(outPath, "shifted.bam")
gal1 <- shiftGAlignmentsList(gal, outbam=shiftedBamFile)

### save the GRanges object for future use
saveRDS(gal1, file = "gal1.rds", ascii = FALSE, version = NULL,compress = TRUE, refhook = NULL)

Next, we split the shifted alignments into different fractions by length (nucleosome free, mononucleosome, dinucleosome, and trinucleosome).

Shifted reads that do not fit into any of the above bins can be discarded.

Splitting reads is a time-consuming step because we are using random forest to classify the fragments based on fragment length and GC content.

By default, we assign the top 10% of short reads (reads below 100_bp) as nucleosome-free regions and the top 10% of intermediate length reads as (reads between 180 and 247 bp) mononucleosome. This serves as the training set to classify the rest of the fragments.

We need genomic locations of TSS:

txs <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
txs <- txs[seqnames(txs) %in% "chr14"]
genome <- Hsapiens

We split the alignments (this process takes a few minutes):

objs <- splitGAlignmentsByCut(gal1, txs=txs, genome=genome, outPath = outPath)

When done, we save the object for later use:

saveRDS(objs, file = "objs.rds", ascii = FALSE, version = NULL,compress = TRUE, refhook = NULL)

Finally, we have prepared the data for plotting the signal in NFR and mononuclesome fraction and calculating signal distribution at TSS.


Signal in NFR and Mononucleosome Fractions

Files we are going to use and TSS coordinates:

bamFiles <- file.path(outPath,
                     c("NucleosomeFree.bam",
                     "mononucleosome.bam",
                     "dinucleosome.bam",
                     "trinucleosome.bam"))

TSS <- promoters(txs, upstream=0, downstream=1)
TSS <- unique(TSS)

Calculate and log2 transform the signal around TSS:

librarySize <- estLibSize(bamFiles)

NTILE <- 101
dws <- ups <- 1010
sigs <- enrichedFragments(gal=objs[c("NucleosomeFree",
                                     "mononucleosome",
                                     "dinucleosome",
                                     "trinucleosome")],
                          TSS=TSS,
                          librarySize=librarySize,
                          seqlev=seqlev,
                          TSS.filter=0.5,
                          n.tile = NTILE,
                          upstream = ups,
                          downstream = dws)

sigs.log2 <- lapply(sigs, function(.ele) log2(.ele+1))

We can now save the heatmap:

pdf("Heatmap_splitbam.pdf")
featureAlignedHeatmap(sigs.log2, reCenterPeaks(TSS, width=ups+dws),
                      zeroAt=.5, n.tile=NTILE)

dev.off()
  • What are the differences in the signal profile in these two fractions? Why do we observe them?


Signal at TSS

We can now calculate signal distribution at TSS:

out <- featureAlignedDistribution(sigs,
                                  reCenterPeaks(TSS, width=ups+dws),
                                  zeroAt=.5, n.tile=NTILE, type="l",
                                  ylab="Averaged coverage")

## rescale the nucleosome-free and nucleosome signals to 0~1 for plotting
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
out <- apply(out, 2, range01)

And plot it:

pdf("TSSprofile_splitbam.pdf")
        matplot(out, type="l", xaxt="n",
        xlab="Position (bp)",
        ylab="Fraction of signal")
        axis(1, at=seq(0, 100, by=10)+1,
     labels=c("-1K", seq(-800, 800, by=200), "1K"), las=2)
        abline(v=seq(0, 100, by=10)+1, lty=2, col="gray")
dev.off()


  • What are the differences in the signal profile in these two fractions? Why do we observe them?

Note

To finish working in R type q(). Do not save workspace image - in this case, to save space.


Signal Visualisation Using IGV

In this part we will look more closely at our data, which is a good practice, as data summaries can be at times misleading. In principle we could look at the data on Uppmax using installed tools but it is much easier (and glitch-free) to work with genome browser locally. If you have not done this before the course, install Interactive Genome Browser IGV.

We would like to visualise processed alignments (bam and corresponding bai) at several loci with strong signal. We can also view the bam files split into nucleosome-free, mono- di- and tri- nucleosome fractions. Data has been mapped to hg38, so choose the appropriate reference.

We will need the following files:

  • atacseq/analysis/processedData/ENCFF045OAB.chr14.blacklist_M_filt.mapq5.dedup.bam and bai

  • atacseq/analysis/QC/splitBam/NucleosomeFree.bam and bai

  • atacseq/analysis/QC/splitBam/mononucleosome.bam and bai

  • atacseq/analysis/QC/splitBam/dinucleosome.bam and bai

  • atacseq/analysis/QC/splitBam/trinucleosome.bam and bai



You will have to zoom in to view the alignments and coverage tracks.

Change the default viewing settings in IGV by shift-clicking onto a track name (left panel):

bam tracks:

  • alignment view to Squished

  • colour alignments to insert size and pair orientation

coverage tracks:

  • pay attention to track scale; it is set to Auto; the tracks won’t show at the same scale, you can harmonise the scale if you want to see the differences in signal height


You can move long the chromosome 14 and you will spot locations with high signal density.

Examples:

  • chr14:61,513,568-61,543,546

  • chr14:90,365,635-90,395,613

  • chr14:92,508,638-92,538,616

  • chr14:93,095,621-93,125,599


An example is shown on the Figure below (this figure includes also detected peak intervals - we will discuss them in detail in the following sections).

../../../_images/igv_qc_split1.png

Bonus question

  • Why does the NFR track show unusually high fraction of discordant alignments (labeled green)?

After the QC performed in this tutorial and in general QC, we can now move to ATAC-seq data analysis.