Differential Accessibility in ATAC-seq

Date:

2025-09-16


Learning outcomes

  • to interrogate GC bias in ATAC-seq data

  • to select appropriate scaling normalisation of ATAC-seq data

  • to detect differentially accessible regions using edgeR

Note

We continue working with data from (Tsao et al. 2022). We will use the count table derived from non subset data (already prepared).



Introduction

In this tutorial we use an R / Bioconductor packages edgeR (Robinson and Oshlack 2010), (Chen, Lun, and Smyth 2016), EDASeq (Risso et al. 2011) and cqn (Hansen, Irizarry, and WU 2012) to perform normalisation and analysis of differential accessibility in ATAC-seq data.



Data & Methods

We will build upon the main lab ATACseq data analysis:

  • we will interrogate GC bias in peaks and in adjacent non-overlapping genomic bins;

  • we will use the counts table encompassing complete data for differential accessibility analysis;


Setting Up

You can continue working in the directory atacseq/analysis/DA. This directory contains count tables derived from summarising of non-subset data to merged peaks called by genrich in the joint mode. We will use file AB_Batf_KO_invivo.genrich_joint.merged_peaks.featureCounts.


We can link additional files:

mkdir bam_dir
find /proj/epi2025/atacseq_proc/data_proc/ -name \*.ba* -exec ln -vs "{}" bam_dir/ ';'


We access the R environment via:

module load R_packages/4.3.1

We activate R console upon typing R in the terminal.


We begin by loading necessary libraries:

library(tidyverse)
library(dplyr)
library(kableExtra)

library(ggplot2)
library(wesanderson)

library(GenomicRanges)
library(Hmisc)
library(Biostrings)
library(regioneR)
library(bamsignals)
require(MASS)

library(BSgenome.Mmusculus.UCSC.mm39)

library(edgeR)
library(limma)
library(SummarizedExperiment)
library(EDASeq)
library(cqn)


workdir=getwd()

To set working directory to your desired path you can use these commands:

workdir="/path/to/workdir"

workdir=setwd()


Note

We take advantage of the module system on Rackham in this tutorial. The code was tested under R 4.3.1 The lab was developed under different R version, as stated in session info.


Data

We can now load data. We will subset the count table to only contain the peaks on assembled chromosomes.

count_table_fname="AB_Batf_KO_invivo.genrich_joint.merged_peaks.featureCounts"
cnt_table_pth=file.path(file.path(workdir,"counts"),count_table_fname)

cnt_table=read.table(cnt_table_pth, sep="\t", header=TRUE, blank.lines.skip=TRUE)
rownames(cnt_table)=cnt_table$Geneid
rownames(cnt_table)=c(gsub("AB_Batf_KO_invivo.genrich_joint.","",rownames(cnt_table)))
colnames(cnt_table)=c(colnames(cnt_table)[1:6],gsub(".filt.bam","",colnames(cnt_table)[7:10]))

colnames(cnt_table)[7:10]=c("B1_WT_Batf-floxed","B2_WT_Batf-floxed","A1_Batf_cKO","A2_Batf_cKO")

#remove peaks not on the assembled chromosomes
cnt_table_chr=cnt_table|>
  dplyr::filter(Chr%in%c(1:19) | Chr%in%c("X","Y"))

reads.peak=cnt_table_chr[,c(7:10)]

head(reads.peak)
##                B1_WT_Batf-floxed B2_WT_Batf-floxed A1_Batf_cKO A2_Batf_cKO
## merged_peaks_1               299               238         325         330
## merged_peaks_2               106                83         162         174
## merged_peaks_3                19                24          25          21
## merged_peaks_4                27                31          40          29
## merged_peaks_5               114               101          65         151
## merged_peaks_6               129               137         120         204
  • All peaks: n = 65027.

  • Peaks on assembled chromosomes: n = 64879. These peaks will be used for further analysis.



GC Bias

GC Bias in Genomic Bins

To ivestigate the GC bias in adjacent genomic bins (background), we start with creating the GRanges object holding the tiled genome intervals. We will do it for one chromosome only (chr1), to save compute time.

chr.lengths = seqlengths(Mmusculus)[1:21]
chr.lengths.chr1=chr.lengths[1]

#tiles
tiles_chr1=GenomicRanges::tileGenome(chr.lengths.chr1,tilewidth=5000, cut.last.tile.in.chrom=TRUE)

#sequence
tileSeqs=BSgenome::getSeq(Mmusculus,tiles_chr1)

#GCcontent
gcContentTiles=Biostrings::letterFrequency(tileSeqs, "GC",as.prob=TRUE)[,1]
mcols(tiles_chr1)$gc=gcContentTiles


We need to tweak chromosome names to match the genome reference used for read mapping:

# tiles_chr1
# remove chr from granges obj
seqlevels(tiles_chr1)=gsub("chr","",seqlevels(tiles_chr1))


We can now count reads in all bam files in the data set, and plot them.

bam_dir=file.path(workdir,"bam_dir")
bam_fnames=list.files(bam_dir,pattern = "\\.bam$",)

par(mfrow = c(2, length(bam_fnames)/2 ) )

for (bam_fname in bam_fnames){

    bam_path=file.path(bam_dir,bam_fname)

    tiles_bam=tiles_chr1

    tiled_counts=bamCount(bam_path, tiles_bam, verbose=FALSE)
    mcols(tiles_bam)$readcount=tiled_counts

    smoothScatter(tiles_bam$gc, log2(tiles_bam$readcount+1),
      main=paste("Logcounts vs GC in bins",bam_fname,sep="\n"), ylab="log(counts+1)", xlab="GC content")

}


../../../_images/unnamed-chunk-7-1.png

We can see that the signal of logcounts vs GC content looks very similar in all libraries.


GC Bias in Peaks

To ivestigate the GC bias in peaks (signal), we start with creating the GRanges object holding the peak intervals.

We need to prefix the chromosome name by “chr” (per UCSC convention) in the first step to be able to use a BSgenome object from the Bioconductor package BSgenome.Mmusculus.UCSC.mm39. Please note this only works with assembled chromosomes; the non-assembled contigs follow different naming conventions in Ensembl (the source of the reference assembly for read mapping) and UCSC (the source of BSgenome package).

peaks_gr=GRanges(seqnames=paste0("chr",cnt_table_chr$Chr), ranges=IRanges(cnt_table_chr$Start, cnt_table_chr$End), strand="*", mcols=data.frame(peakID=rownames(cnt_table_chr)))

We now prepare data with GC content of the peak regions for GC-aware normalisation.

peakSeqs=BSgenome::getSeq(Mmusculus,peaks_gr)

gcContentPeaks=Biostrings::letterFrequency(peakSeqs, "GC",as.prob=TRUE)[,1]

#divide into 20 bins by GC content
gcGroups=Hmisc::cut2(gcContentPeaks, g=20)
mcols(peaks_gr)$gc=gcContentPeaks
mcols(peaks_gr)$gc_group=gcGroups

peaks_gr
## GRanges object with 64879 ranges and 3 metadata columns:
##           seqnames            ranges strand |       mcols.peakID        gc
##              <Rle>         <IRanges>  <Rle> |        <character> <numeric>
##       [1]     chr1   3050939-3052959      * |     merged_peaks_1  0.392875
##       [2]     chr1   3053048-3054634      * |     merged_peaks_2  0.379962
##       [3]     chr1   3054861-3055532      * |     merged_peaks_3  0.345238
##       [4]     chr1   3057260-3057785      * |     merged_peaks_4  0.376426
##       [5]     chr1   3059375-3061360      * |     merged_peaks_5  0.402316
##       ...      ...               ...    ... .                ...       ...
##   [64875]     chrY 90814281-90815165      * | merged_peaks_64875  0.505085
##   [64876]     chrY 90815739-90816707      * | merged_peaks_64876  0.430341
##   [64877]     chrY 90818033-90819321      * | merged_peaks_64877  0.493406
##   [64878]     chrY 90819900-90820364      * | merged_peaks_64878  0.369892
##   [64879]     chrY 90821996-90824312      * | merged_peaks_64879  0.469141
##                gc_group
##                <factor>
##       [1] [0.234,0.396)
##       [2] [0.234,0.396)
##       [3] [0.234,0.396)
##       [4] [0.234,0.396)
##       [5] [0.396,0.417)
##       ...           ...
##   [64875] [0.505,0.514)
##   [64876] [0.417,0.431)
##   [64877] [0.487,0.496)
##   [64878] [0.234,0.396)
##   [64879] [0.461,0.470)
##   -------
##   seqinfo: 21 sequences from an unspecified genome; no seqlengths

Figure below shows that the accessibility measure of a particular genomic region is associated with its GC content. In this data set, the curves are almost identical for all samples, indicating no difference in GC bias between samples.

However, in some cases the slope and shape of the curves may differ between samples, which indicates that GC content effects are sample–specific and can therefore bias between–sample comparisons.

We start by creating a data frame with gc contents and read count in each peak in each sample as well as perform lowess (locally weighted scatterplot smoothing) regression to fit the trend:

lowListGC = list()
for(kk in 1:ncol(reads.peak)){
  set.seed(kk)
  lowListGC[[kk]] = lowess(x=gcContentPeaks, y=log1p(reads.peak[,kk]), f=1/10)
}

names(lowListGC)=colnames(reads.peak)

dfList = list()
for(ss in 1:length(lowListGC)){
  oox = order(lowListGC[[ss]]$x)
  dfList[[ss]] = data.frame(x=lowListGC[[ss]]$x[oox], y=lowListGC[[ss]]$y[oox], sample=names(lowListGC)[[ss]])
}
dfAll = do.call(rbind, dfList)
dfAll$sample = factor(dfAll$sample)

We can now plot the relationship of logcounts vs GC content:

plotGCHex <- function(gr, counts){
  counts2 <- counts
  df <- as_tibble(cbind(counts2,gc=mcols(gr)$gc))
  df <- gather(df, sample, value, -gc)
  ggplot(data=df, aes(x=gc, y=log(value+1)) ) +
    ylab("log(count + 1)") + xlab("GC-content") +
    geom_hex(bins = 50) + theme_bw()
}

plot_GC_bias=plotGCHex(peaks_gr, rowMeans(reads.peak)) +
  theme(axis.title = element_text(size=16)) +
  labs(fill="Nr. of peaks") +
  geom_line(aes(x=x, y=y, group=sample, color=sample), data=dfAll, linewidth=1) +
  scale_color_discrete()
../../../_images/unnamed-chunk-12-1.png



Differential Accessibility

We can define experimental groups:

groups=factor(c(rep("ctrl",2),rep("KO_Batf",2)))
groups
## [1] ctrl    ctrl    KO_Batf KO_Batf
## Levels: ctrl KO_Batf

design=model.matrix(~groups)
rownames(design)=colnames(reads.peak)
design
##                   (Intercept) groupsKO_Batf
## B1_WT_Batf-floxed           1             0
## B2_WT_Batf-floxed           1             0
## A1_Batf_cKO                 1             1
## A2_Batf_cKO                 1             1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$groups
## [1] "contr.treatment"


We’ll detect differentially accessible regions using edgeR. As we do not observe strong effects of GC content on signal neither in peaks nor in genomic bins, we decided to use the scaling normalisation by trimmed mean of M-values (TMM) (Robinson and Oshlack 2010).

We start by creating DGEList, the object edgeR uses to store data for calculations. Before we start the DA analysis, it’s advisable to remove peaks with very low counts.

reads.dge = DGEList(counts=reads.peak, group=groups)
keep = filterByExpr(reads.dge)
reads.dge=reads.dge[keep,,keep.lib.sizes=FALSE]

summary(keep)
##    Mode   FALSE    TRUE
## logical     418   64461

reads.dge
## An object of class "DGEList"
## $counts
##                B1_WT_Batf-floxed B2_WT_Batf-floxed A1_Batf_cKO A2_Batf_cKO
## merged_peaks_1               299               238         325         330
## merged_peaks_2               106                83         162         174
## merged_peaks_3                19                24          25          21
## merged_peaks_4                27                31          40          29
## merged_peaks_5               114               101          65         151
## 64456 more rows ...
##
## $samples
##                     group lib.size norm.factors
## B1_WT_Batf-floxed    ctrl 43359738            1
## B2_WT_Batf-floxed    ctrl 33327965            1
## A1_Batf_cKO       KO_Batf 43438468            1
## A2_Batf_cKO       KO_Batf 46400831            1


These steps perform the standard edgeR workflow for differential analysis:

reads.dge.tmm = normLibSizes(reads.dge)


We can inspect sample grouping on multidimensional scaling (MDS) plot before proceeding:

plotMDS(reads.dge.tmm)
../../../_images/unnamed-chunk-16-1.png


All looks as expected, we can proceed with the differential analysis:

reads.dge.tmm = estimateDisp(reads.dge.tmm, design)
qlf.fit.tmm=glmQLFit(reads.dge.tmm, design, robust=TRUE)
qlf.ftest.tmm=glmQLFTest(qlf.fit.tmm, coef=2)
DA_res.qlf.tmm=as.data.frame(topTags(qlf.ftest.tmm, nrow(qlf.ftest.tmm$table)))
DA_res.qlf.tmm=DA_res.qlf.tmm|>dplyr::mutate(peakID=rownames(DA_res.qlf.tmm))


This results in a table with results of DA analysis:

head(DA_res.qlf.tmm)
##                        logFC   logCPM        F       PValue          FDR
## merged_peaks_28038 -1.610768 6.074221 756.5472 1.722979e-90 1.110650e-85
## merged_peaks_51767 -1.508490 6.159745 710.5749 3.363346e-87 1.084023e-82
## merged_peaks_2997  -1.517878 5.964106 638.5088 9.578557e-82 2.058145e-77
## merged_peaks_1873  -1.157643 6.593339 534.2112 4.151681e-73 6.690538e-69
## merged_peaks_36974 -1.141022 6.596217 524.9430 2.716748e-72 3.502486e-68
## merged_peaks_40709 -1.638902 5.206780 468.3368 4.081673e-67 4.385146e-63
##                                peakID
## merged_peaks_28038 merged_peaks_28038
## merged_peaks_51767 merged_peaks_51767
## merged_peaks_2997   merged_peaks_2997
## merged_peaks_1873   merged_peaks_1873
## merged_peaks_36974 merged_peaks_36974
## merged_peaks_40709 merged_peaks_40709


We should also take a look at the diagnostic plots to verify that they look as expected. The MA plot is a visualisation that plots the log-fold-change between experimental groups (M) against the average expression across all the samples (A) for each gene. The expectation is that for the bulk of the peaks the log2FC is close to 0 (the cloud of points is centred on 0 on the Y-axis) and minimal signal amplitude related effects.

plotMD(qlf.ftest.tmm)
../../../_images/unnamed-chunk-19-1.png


At this point we can add the information from peak annotation Peak Annotation

If you are still in the same R session, you can skip the step below.

If you started a new R session, you can read in the table with peak annotations:

peak_annots_pth=file.path(workdir,"Allpeaks_annot.Ensembl.rds")

peakAnno_df=readRDS(peak_annots_pth)

#rename the column with peakID for data frame joining and remove redundant columns
peakAnno_df=peakAnno_df|>dplyr::rename(peakID=mcols.peakID)|>dplyr::select(!c(seqnames,strand.x,start,end,width,strand.y))


Note

If you did not follow the Peak Annotation lab, you copy the saved file from ../../results/DA/objects/Allpeaks_annot.Ensembl.rds


We can now join the tables with peak annotations and DA results:

peaks_gr_df=as.data.frame(peaks_gr)|>dplyr::rename(peakID=mcols.peakID)

DA_res_table=DA_res.qlf.tmm |>
  dplyr::left_join(peakAnno_df,by="peakID")|>
  dplyr::left_join(peaks_gr_df,by="peakID")|>
  dplyr::select(seqnames,start,end,peakID,gc,logFC,FDR,annotation,geneChr,geneStart,geneEnd,geneStrand,geneId,transcriptId,external_gene_name,distanceToTSS)
head(DA_res_table)
##   seqnames     start       end             peakID     logFC          FDR
## 1       17  66268427  66269247 merged_peaks_28038 -1.610768 1.110650e-85
## 2        6 122504236 122505014 merged_peaks_51767 -1.508490 1.084023e-82
## 3        1 155076669 155077704  merged_peaks_2997 -1.517878 2.058145e-77
## 4        1  95195320  95196614  merged_peaks_1873 -1.157643 6.690538e-69
## 5        2 162944874 162945676 merged_peaks_36974 -1.141022 3.502486e-68
## 6        3 138125917 138126743 merged_peaks_40709 -1.638902 4.385146e-63
##          gc                                                    annotation
## 1 0.4360536                                             Distal Intergenic
## 2 0.4801027 Intron (ENSMUST00000032210/ENSMUSG00000030116, intron 8 of 8)
## 3 0.5009653                                                        3' UTR
## 4 0.4617761                                             Distal Intergenic
## 5 0.5031133                                             Distal Intergenic
## 6 0.4087062     Exon (ENSMUST00000161312/ENSMUSG00000037797, exon 4 of 6)
##   geneChr geneStart   geneEnd geneStrand             geneId       transcriptId
## 1      17  66261129  66265392          1 ENSMUSG00000139744 ENSMUST00000355127
## 2       6 122499458 122505594          1 ENSMUSG00000030116 ENSMUST00000126357
## 3       1 155070767 155077993          1 ENSMUSG00000026470 ENSMUST00000194158
## 4       1  95183688  95184535          2 ENSMUSG00000099592 ENSMUST00000190584
## 5       2 162934819 162934943          1 ENSMUSG00002076785 ENSMUST00020181897
## 6       3 138121256 138136653          1 ENSMUSG00000037797 ENSMUST00000013458
##   external_gene_name distanceToTSS
## 1            Gm65735          7298
## 2              Mfap5          4778
## 3               Stx6          5902
## 4             Gm5264        -10785
## 5            Gm56299         10055
## 6               Adh4          4661


We can save the complete results of peak annotation, GC content and DA analysis:

saveRDS(DA_res_table, file = "FiltPeaks.DA.TMM.annot.rds")


You can now follow with other downstream tutorials listed under Downstream Analyses


GC Bias Correction

Plotting Log2FC in GC Bins

When a strong effect of GC content on signal is observed, a GC aware scaling normalisation can be considered. It is important to perform all diagnostic plots, however, to verify whether it does not distort the data in an unexpected manner. One should always be aware that the GC bias, although technical, may also reflect sample biology, therefore removing it may lead to signal loss.

We can first verify whether there is a GC bias in log2FC detection using GC agnostic TMM scaling.

We can plot log2FC distribution in GC content bins.

For this we will need the GC bins we calculated before, so we need to join that information to the results of DA analysis:

peak_info_df=as.data.frame(peaks_gr)|>
  dplyr::rename(peakID=mcols.peakID)

df_GCbias=DA_res.qlf.tmm |>
    dplyr::left_join(peak_info_df, by="peakID") |>
    dplyr::select(logFC,gc_group)

Let’s plot the log2FC in GC bins:

plot_lfc_GC_TMM = ggplot(df_GCbias) +
  aes(x=gc_group, y=logFC, color=gc_group) +
  geom_violin(width=0.95) +
  geom_boxplot(width=0.15, color="grey20") +
  scale_color_manual(values=wesanderson::wes_palette("Zissou1", nlevels(df_GCbias$gc_group), "continuous")) +
  geom_abline(intercept = 0, slope = 0, col="black", lty=2) +
  #ylim(c(-1,1)) + ## this was in the original code from EDAseq paper; it calculates medians for values within the ylim interval - not from the entire data
  coord_cartesian(ylim=c(-1,1)) +
  ggtitle(paste0("log2FCs in bins by GC content, normalisation: TMM")) +
  xlab("GC-content bin") +
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, vjust = .5),
        legend.position = "none",
        axis.title = element_text(size=16))

plot_lfc_GC_TMM
../../../_images/unnamed-chunk-24-1.png

A negligible bias in log2FC can be observed within the range of log2FC (-1,1).

You can alter the plotted range by changing coord_cartesian(ylim=c(-1,1)) to your desired range.

In any case, for this data set the systematic effect of GC contents on detected log2FC is very small, and below the reasonable size effect cutoff.


Correcting for GC contents

If required, the raw counts can be scaled in a GC aware manner, rather than using the TMM method.

Two related methods are presented below. Both perform conditional quantile scaling, and output the offsets which can then be used in edgeR statistical framework.


Full Quantile GC-GC Scaling

This method is implemented in Bioconductor package EDASeq (Risso et al. 2011).

To calculate the offsets, which correct for library size as well as GC content (full quantile normalisation in both cases):

exprsSet.eda=newSeqExpressionSet(reads.dge$counts)
peaks_gr.keep=peaks_gr[keep]
fData(exprsSet.eda)$gc=peaks_gr.keep$gc

exprsSet.eda.wl=withinLaneNormalization(exprsSet.eda,"gc",num.bins=20, which="full",offset=TRUE)
exprsSet.eda.bl=betweenLaneNormalization(exprsSet.eda.wl,which="full",offset=TRUE)

The offsets can be inspected:

head(offst(exprsSet.eda.bl))
##                B1_WT_Batf-floxed B2_WT_Batf-floxed A1_Batf_cKO A2_Batf_cKO
## merged_peaks_1         1.2280870         1.5061165   1.2441529   1.1811897
## merged_peaks_2         1.0093249         1.2584039   1.1587582   1.0848913
## merged_peaks_3         0.6129416         0.9425809   0.7350182   0.5523253
## merged_peaks_4         0.7139578         0.9909705   0.8238461   0.6286266
## merged_peaks_5         0.6563096         0.9560009   0.5779452   0.6904069
## merged_peaks_6         0.6932922         1.0369034   0.7632731   0.7825120

We will input the offset matrix to edgeR:

reads.dge.edaseq = reads.dge
reads.dge.edaseq$offset = -offst(exprsSet.eda.bl)

The statistical testing follows:

reads.dge.edaseq=estimateDisp(reads.dge.edaseq, design)
qlf.fit.edaseq=glmQLFit(reads.dge.edaseq, design, robust=TRUE)
qlf.ftest.edaseq=glmQLFTest(qlf.fit.edaseq, coef=2)
DA_res.qlf.edaseq=as.data.frame(topTags(qlf.ftest.edaseq, nrow(qlf.ftest.edaseq$table)))
DA_res.qlf.edaseq=DA_res.qlf.edaseq|>dplyr::mutate(peakID=rownames(DA_res.qlf.edaseq))

We can now plot the log2FC in GC bins, as for TMM scaling:

df_GCbias=DA_res.qlf.edaseq |>
    dplyr::left_join(peak_info_df, by="peakID") |>
    dplyr::select(logFC,gc_group)

plot_lfc_GC_edaseq = ggplot(df_GCbias) +
  aes(x=gc_group, y=logFC, color=gc_group) +
  geom_violin(width=0.95) +
  geom_boxplot(width=0.15, color="grey20") +
  scale_color_manual(values=wesanderson::wes_palette("Zissou1", nlevels(df_GCbias$gc_group), "continuous")) +
  geom_abline(intercept = 0, slope = 0, col="black", lty=2) +
  #ylim(c(-1,1)) + ## this was in the original code from EDAseq paper; it calculates medians for values within the ylim interval - not from the entire data
  coord_cartesian(ylim=c(-1,1)) +
  ggtitle(paste0("log2FCs in bins by GC content, normalisation: GC FQ-FQ")) +
  xlab("GC-content bin") +
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, vjust = .5),
        legend.position = "none",
        axis.title = element_text(size=16))

plot_lfc_GC_edaseq
../../../_images/unnamed-chunk-29-1.png


Conditional Quantile Normalization

This method is implemented in Bioconductor package cqn (Hansen, Irizarry, and WU 2012).

In calculating offsets, it can correct both for GC content as well as peak length.

#assuming we have the subset peaks_gr
peaks_gr.keep=peaks_gr[keep]


peaks=as.data.frame(cbind(
      gc=peaks_gr.keep$gc,
      length=width(peaks_gr.keep)
    ))
rownames(peaks)=peaks_gr.keep$mcols.peakID

cqn_out=cqn(counts=reads.dge$counts,lengths=peaks$length,x=peaks$gc,
             sizeFactors=reads.dge$samples$lib.size,verbose=TRUE)
## RQ fit ....
## SQN
## .

cqn_out
##
## Call:
##  cqn(counts = reads.dge$counts, x = peaks$gc, lengths = peaks$length,
##     sizeFactors = reads.dge$samples$lib.size, verbose = TRUE)
##
## Object of class 'cqn' with
##   64461 regions
##   4 samples
## fitted using smooth length

head(cqn_out$glm.offset)
##                B1_WT_Batf-floxed B2_WT_Batf-floxed A1_Batf_cKO A2_Batf_cKO
## merged_peaks_1          6.077630          5.849091    6.038654    6.142638
## merged_peaks_2          5.602194          5.392023    5.695700    5.805515
## merged_peaks_3          3.810420          3.669992    3.829517    3.872245
## merged_peaks_4          3.460506          3.172147    3.333104    3.457118
## merged_peaks_5          5.812975          5.660115    5.655834    6.000880
## merged_peaks_6          5.993356          5.874478    5.930493    6.225818

We will input the offset matrix to edgeR:

reads.dge.cqn = reads.dge
reads.dge.cqn$offset = cqn_out$glm.offset

The statistical testing follows:

reads.dge.cqn=estimateDisp(reads.dge.cqn, design)
qlf.fit.cqn=glmQLFit(reads.dge.cqn, design, robust=TRUE)
qlf.ftest.cqn=glmQLFTest(qlf.fit.cqn, coef=2)
DA_res.qlf.cqn=as.data.frame(topTags(qlf.ftest.cqn, nrow(qlf.ftest.cqn$table)))
DA_res.qlf.cqn=DA_res.qlf.cqn|>dplyr::mutate(peakID=rownames(DA_res.qlf.cqn))

We can now plot the log2FC in GC bins, as for TMM scaling:

df_GCbias=DA_res.qlf.cqn |>
    dplyr::left_join(peak_info_df, by="peakID") |>
    dplyr::select(logFC,gc_group)

plot_lfc_GC_cqn = ggplot(df_GCbias) +
  aes(x=gc_group, y=logFC, color=gc_group) +
  geom_violin(width=0.95) +
  geom_boxplot(width=0.15, color="grey20") +
  scale_color_manual(values=wesanderson::wes_palette("Zissou1", nlevels(df_GCbias$gc_group), "continuous")) +
  geom_abline(intercept = 0, slope = 0, col="black", lty=2) +
  #ylim(c(-1,1)) + ## this was in the original code from EDAseq paper; it calculates medians for values within the ylim interval - not from the entire data
  coord_cartesian(ylim=c(-1,1)) +
  ggtitle(paste0("log2FCs in bins by GC content, normalisation: cqn")) +
  xlab("GC-content bin") +
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, vjust = .5),
        legend.position = "none",
        axis.title = element_text(size=16))

plot_lfc_GC_cqn
../../../_images/unnamed-chunk-33-1.png

Note

It is advised to verify the estimated model parameters and fit using the diagnostic plots provided in edgeR i.e. plotBCV(reads.dge) and plotQLDisp(fit)



Session Info

References

Chen, Yunshun, Aaron T. L. Lun, and Gordon K. Smyth. 2016. “From Reads to Genes to Pathways: Differential Expression Analysis of RNA-Seq Experiments Using Rsubread and the edgeR Quasi-Likelihood Pipeline.” F1000Research 5 (August): 1438. https://doi.org/10.12688/f1000research.8987.2.

Hansen, K. D., R. A. Irizarry, and Z. WU. 2012. “Removing Technical Variability in RNA-Seq Data Using Conditional Quantile Normalization.” Biostatistics 13 (2): 204–16. https://doi.org/10.1093/biostatistics/kxr054.

Risso, Davide, Katja Schwartz, Gavin Sherlock, and Sandrine Dudoit. 2011. “GC-Content Normalization for RNA-Seq Data.” BMC Bioinformatics 12 (1). https://doi.org/10.1186/1471-2105-12-480.

Robinson, M. D., and A. Oshlack. 2010. “A scaling normalization method for differential expression analysis of RNA-seq data.” Genome Biol 11 (3): R25.

Tsao, Hsiao-Wei, James Kaminski, Makoto Kurachi, R. Anthony Barnitz, Michael A. DiIorio, Martin W. LaFleur, Wataru Ise, et al. 2022. “Batf-Mediated Epigenetic Control of Effector CD8 + t Cell Differentiation.” Science Immunology 7 (68). https://doi.org/10.1126/sciimmunol.abi4919.