Differential Accessibility in ATAC-seq

Learning outcomes

  • to perform GC-aware normalisation of ATAC-seq data using EDASeq

  • to detect differentially accessible regions using edgeR

Introduction

In this tutorial we use an R / Bioconductor packages EDAseq and edgeR to perform normalisation and analysis of differential accessibility in ATAC-seq data.

Data & Methods

We will build upon the main lab ATACseq data analysis:

  • first we will summarise reads to detected peaks using subset data as an example to obtain a counts table;

  • we will use the counts table encompassing complete data for differential accessibility analysis;

Setting-up

We need access to file nk_merged_peaks.saf which we created in the ATACseq data analysis, part “Merged Peaks” and bam files with alignments.

Assuming we start at atacseq/analysis:

mkdir counts_table
cd counts_table

ln -s ../peaks/consensus/nk_merged_peaks.saf
ln -s ../../data_proc/* .

If you haven’t followed the peak calling and merging lab, you can continue from this point by linking necessary files:

ln -s ../../results/peaks/consensus/nk_merged_peaks.saf
ln -s ../../data_proc/* .

Data Summarisation

We can now summarise the reads allowing for 20% overlap of the read length with peak feature (--fracOverlap 0.2) and counting fragments rather than reads (-p for PE):

module load subread/2.0.0
featureCounts -p -F SAF -a nk_merged_peaks.saf --fracOverlap 0.2 -o nk_merged_peaks_macs3.counts ENCFF363HBZ.chr14.proc.bam ENCFF398QLV.chr14.proc.bam ENCFF828ZPN.chr14.proc.bam ENCFF045OAB.chr14.proc.bam

Let’s take a look inside the counts table using head nk_merged_peaks_macs3.counts.

nk_merged_peaks_macs3.counts

head nk_merged_peaks_macs3.counts
# Program:featureCounts v2.0.0; Command:"featureCounts" "-p" "-F" "SAF" "-a" "nk_merged_peaks.saf" "--fracOverlap" "0.2" "-o" "nk_merged_peaks_macs3.counts" "ENCFF363HBZ.chr14.proc.bam" "ENCFF398QLV.chr14.proc.bam" "ENCFF828ZPN.chr14.proc.bam" "ENCFF045OAB.chr14.proc.bam"
Geneid  Chr     Start   End     Strand  Length  ENCFF363HBZ.chr14.proc.bam      ENCFF398QLV.chr14.proc.bam      ENCFF828ZPN.chr14.proc.bam      ENCFF045OAB.chr14.proc.bam
nk_merged_macs3_1       chr14   19161216        19161474        .       259     12      8       36      16
nk_merged_macs3_2       chr14   19161804        19162012        .       209     9       6       45      32
nk_merged_macs3_3       chr14   19239901        19240289        .       389     22      18      64      38
nk_merged_macs3_4       chr14   19384255        19384509        .       255     3       11      35      27
nk_merged_macs3_5       chr14   19488513        19488925        .       413     26      17      95      71
nk_merged_macs3_6       chr14   20305439        20306101        .       663     339     372     143     97
nk_merged_macs3_7       chr14   20332839        20333570        .       732     262     228     199     135
nk_merged_macs3_8       chr14   20342750        20343788        .       1039    2555    2424    1774    1226

We should remove the first line starting with #, as it can interfere with the way R reads in data:

awk '(NR>1)' nk_merged_peaks_macs3.counts > nk_merged_peaks_macs3.counts.tsv

Differential Accessibility

Please note that in the following exercise we use a counts table generated using a different peak set, hence some small differences to peaks called during the course may be present.

You can continue working in the atacseq/analysis/counts directory. This directory contains merged peaks called earlier using macs3 callpeak as well as count tables derived from summarising of non-subset data (we won’t need the count tables for this exercise). We will use file nk_merged_peaks_macs3.counts and annotation libraries, which are preinstalled. We access them via a module R_packages.

We now load R and packages:

module load R_packages/4.1.1

We activate R console upon typing R in the terminal.

We begin by loading necessary libraries:

library(edgeR)
library(EDASeq)

library(GenomicAlignments)
library(GenomicFeatures)

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(wesanderson)

library(Hmisc)
library(dplyr)

txdb = TxDb.Hsapiens.UCSC.hg38.knownGene

ff = FaFile("/proj/epi2023/atacseq_proc/hg38ucsc/hg38.fa")


We can read in the data, format it and define experimental groups:

cnt_table = read.table("nk_merged_peaks_macs3.counts", sep="\t", header=TRUE, blank.lines.skip=TRUE)
rownames(cnt_table)=cnt_table$Geneid

#update colnames of this count table
colnames(cnt_table)=c("Geneid","Chr","Start","End","Strand","Length","ENCFF363HBZ","ENCFF398QLV","ENCFF045OAB","ENCFF828ZPN")


groups = factor(c(rep("NK",2),rep("NKstim",2)))

#this data frame contains only read counts to peaks on assembled chromosomes
reads.peak = cnt_table[,c(7:10)]

We now prepare data with GC content of the peak regions for GC-aware normalisation.

gr = GRanges(seqnames=cnt_table$Chr, ranges=IRanges(cnt_table$Start, cnt_table$End), strand="*", mcols=data.frame(peakID=cnt_table$Geneid))

peakSeqs = getSeq(x=ff, gr)

gcContentPeaks = letterFrequency(peakSeqs, "GC",as.prob=TRUE)[,1]

#divide into 20 bins by GC content
gcGroups = Hmisc::cut2(gcContentPeaks, g=20)
mcols(gr)$gc = gcContentPeaks

Figure below shows that the accessibility measure of a particular genomic region is associated with its GC content. However, the slope and shape of the curves may differ between samples, which indicates that GC content effects are sample–specific and can therefore bias between–sample comparisons.

To visualise GC bias in peaks:

lowListGC = list()
for(kk in 1:ncol(reads.peak)){
  set.seed(kk)
  lowListGC[[kk]] = lowess(x=gcContentPeaks, y=log1p(reads.peak[,kk]), f=1/10)
}

names(lowListGC)=colnames(reads.peak)

dfList = list()
for(ss in 1:length(lowListGC)){
  oox = order(lowListGC[[ss]]$x)
  dfList[[ss]] = data.frame(x=lowListGC[[ss]]$x[oox], y=lowListGC[[ss]]$y[oox], sample=names(lowListGC)[[ss]])
}
dfAll = do.call(rbind, dfList)
dfAll$sample = factor(dfAll$sample)

p1.1 = ggplot(dfAll, aes(x=x, y=y, group=sample, color=sample)) +
  geom_line(size = 1) +
  xlab("GC-content") +
  ylab("log(count + 1)") +
  theme_classic()

pdf("GCcontent_peaks.pdf")
## plot just the average GC content
p1.1
dev.off()

We can see that GC content has an effect on counts within the peaks.

We have seen from analyses presented on lecture slides and in https://www.biorxiv.org/content/10.1101/2021.01.26.428252v2 that full quantile normalisation (FQ-FQ) implemented in package EDASeq is one of the methods which can mitigate the GC bias in detection of DA regions.

We’ll detect differentially accessible regions using edgeR. We will input the normalised GC content as an offset to edgeR.

To calculate the offsets, which correct for library size as well as GC content (full quantile normalisation in both cases):

reads.peak=as.matrix(reads.peak)

dataOffset = withinLaneNormalization(reads.peak,y=gcContentPeaks,num.bins=20,which="full",offset=TRUE)
dataOffset = betweenLaneNormalization(reads.peak,which="full",offset=TRUE)

We now use the statistical framework of edgeR. We do not perform the internal normalisation (TMM) as usually, and instead we provide the offsets calculated by EDASeq.

design = model.matrix(~groups)

d = DGEList(counts=reads.peak, group=groups)

keep = filterByExpr(d)

> summary(keep)
        Mode   FALSE    TRUE
logical      21   54743


d=d[keep,,keep.lib.sizes=FALSE]

d$offset = -dataOffset[keep,]
d.eda = estimateGLMCommonDisp(d, design = design)
d.eda = estimateGLMCommonDisp(d, design = design)
fit = glmFit(d.eda, design = design)
lrt.EDASeq = glmLRT(fit, coef = 2)

DA_res=as.data.frame(topTags(lrt.EDASeq, nrow(lrt.EDASeq$table)))

The top DA peaks in stimulated vs non-stimulated NK cells:

> head(DA_res)

                         logFC   logCPM       LR       PValue          FDR
nk_merged_macs3_30535  6.404743 5.289554 442.7384 2.744592e-98 2.274389e-93
nk_merged_macs3_14734  6.403253 4.939915 432.3314 5.052034e-96 2.093260e-91
nk_merged_macs3_16907  6.199114 4.881266 415.2994 2.573898e-92 7.109792e-88
nk_merged_macs3_43844  7.262906 4.361125 398.6341 1.092157e-88 2.262621e-84
nk_merged_macs3_18163  5.626103 6.097144 397.4212 2.005906e-88 3.324509e-84
nk_merged_macs3_46357 -5.894601 5.021572 392.9164 1.918571e-87 2.649803e-83

Let’s add more peak information:

DA_res$Geneid = rownames(DA_res)
DA.res.coords = left_join(DA_res,cnt_table[1:4],by="Geneid")

        > head(DA.res.coords)
      logFC   logCPM       LR       PValue          FDR                Geneid
1  6.404743 5.289554 442.7384 2.744592e-98 2.274389e-93 nk_merged_macs3_30535
2  6.403253 4.939915 432.3314 5.052034e-96 2.093260e-91 nk_merged_macs3_14734
3  6.199114 4.881266 415.2994 2.573898e-92 7.109792e-88 nk_merged_macs3_16907
4  7.262906 4.361125 398.6341 1.092157e-88 2.262621e-84 nk_merged_macs3_43844
5  5.626103 6.097144 397.4212 2.005906e-88 3.324509e-84 nk_merged_macs3_18163
6 -5.894601 5.021572 392.9164 1.918571e-87 2.649803e-83 nk_merged_macs3_46357
    Chr     Start       End
1 chr17    642297    643906
2 chr11  86292675  86294054
3 chr12  24838503  24839731
4  chr2 157477051 157477910
5 chr12  68155671  68157629
6  chr2 241985117 241985981

We can now save the results:

write.table(DA.res.coords, "nk_DA_stim_vs_ctrl.tsv", quote = FALSE, sep = "\t",
    eol = "\n", na = "NA", dec = ".", row.names = FALSE,
    col.names = TRUE, fileEncoding = "")

We can check how well the GC correction worked:

gcGroups.sub=gcGroups[keep]
dfEdgeR = data.frame(logFC=log(2^lrt.EDASeq$table$logFC), gc=gcGroups.sub)

pedgeR = ggplot(dfEdgeR) +
  aes(x=gc, y=logFC, color=gc) +
  geom_violin() +
  geom_boxplot(width=0.1) +
  scale_color_manual(values=wesanderson::wes_palette("Zissou1", nlevels(gcGroups), "continuous")) +
  geom_abline(intercept = 0, slope = 0, col="black", lty=2) +
  ylim(c(-1,1)) +
  ggtitle("log2FCs in bins by GC content, FQ-FQ normalisation") +
  xlab("GC-content bin") +
  theme_bw()+
  theme(aspect.ratio = 1)+
  theme(axis.text.x = element_text(angle = 45, vjust = .5),
        legend.position = "none",
        axis.title = element_text(size=16))

ggsave(filename="log2FC_vs_GCcontent.EDAseq.pdf",plot=pedgeR ,path=".",device="pdf")


It seems that FQ-FQ normalisation did not completely remove the effect of GC content on log2FC in thie dataset. However, these effects are somewhat mitigated, you can compare this plot to one obtained by using the standard TMM normalisation.


Part of the reason why the GC effects are not completely removed in this case may be that the fold change calculation/ DA analysis is not performed on properly replicated data; we should have at least 3 replicates per condition, and we only have two.