Detection of differential binding sites using csaw¶
This is an alternative workflow for detection of differential binding / occupancy in ChIP-seq data. In contrast to working with reads counted within peaks detected in a peak calling step (as in the earlier example with DiffBind
), this approach uses a sliding window to count reads across the genome. Each window is then tested for significant differences between libraries from different conditions, using the methods in the edgeR
package. This package also offers an FDR control strategy more appropriate for ChIP-seq experiments than simple BH adjustment.
It can be used for point-source binding (TFs) as well as for broad signal (histones). However, it can only be used for cases where global occupancy levels are unchanged.
As this method is agnostic to signal structure, it requires careful choice of strategies for filtering and normalisation. Here, we show a very simple workflow. More details can be found in the Csaw User Guide.
Requirements Local¶
R version 4.0.4 (2021-02-15) -- "Lost Library Book"
csaw
edgeR
R packages required for annotation:
org.Hs.eg.db
TxDb.Hsapiens.UCSC.hg19.knownGene
Recommended (local):
R-Studio to work in
Note
This exercise was tested on Rackham using pre-installed R libraries. Local installation of recommended R packages may require additional software dependecies.
Requirements Remote (Uppmax)¶
The software is configured.
To prepare the files, assuming you are in ~/chipseq/analysis
:
mkdir csaw
cd csaw
ln -s ../../data/bam/hela/
ln -s ../../data/bam/sknsh/
Loading data and preparing the contrast to test¶
You can now load the version of R for which we tested this class along with other dependencies:
module load R_packages/4.0.4
The remaining part of the exercise is performed in R
.
dir.sknsh = "./sknsh"
dir.hela = "./hela"
hela.1=file.path(dir.hela,"ENCFF000PED.chr12.rmdup.sort.bam")
hela.2=file.path(dir.hela,"ENCFF000PEE.chr12.rmdup.sort.bam")
sknsh.1=file.path(dir.sknsh,"ENCFF000RAG.chr12.rmdup.sort.bam")
sknsh.2=file.path(dir.sknsh,"ENCFF000RAH.chr12.rmdup.sort.bam")
bam.files <- c(hela.1,hela.2,sknsh.1,sknsh.2)
We need to provide the information about the design of the experiment using model.matrix
function:
grouping <- factor(c('hela', 'hela', 'sknsh', 'sknsh'))
design <- model.matrix(~0 + grouping)
colnames(design) <- levels(grouping)
The design should look like this:
> design
hela sknsh
1 1 0
2 1 0
3 0 1
4 0 1
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$grouping
[1] "contr.treatment"
Let’s test which peaks are differentially occupied in HeLa cells vs in sknsh cells.
We prepare the information on contrast to be tested using makeContrasts
function from package limma
. This is not the only way to do so, and examples are given in csaw
and edgeR
manuals. In this case we want to test for the differences in REST binding in HeLa vs. SKNSH cell lines:
library(edgeR)
contrast <- makeContrasts(hela - sknsh, levels=design)
The contrast should look like this
> contrast
Contrasts
Levels hela - sknsh
hela 1
sknsh -1
Now we are ready to load data and create an object with counted reads:
library(csaw)
data <- windowCounts(bam.files, ext=100, width=10)
Parameters for file loading can be modified (examples in the csaw
User Guide), depending on how the data was processed. Here we explicitely input the value for fragment length as we have this information from the cross correlation analysis performed earlier during ChIP-seq data processing tutorial. It is 100 for Hela and 95 & 115 for sknsh. We’ve used 100 as it seems a reasonable averge value.
We can inspect the resulting data
object, e.g.:
> data
class: RangedSummarizedExperiment
dim: 155408 4
metadata(6): spacing width ... param final.ext
assays(1): counts
rownames: NULL
rowData names(0):
colnames: NULL
colData names(4): bam.files totals ext rlen
> data$totals
[1] 1637778 2009932 2714033 4180463
Filtering out regions with very low coverage¶
The next step is to filter out uninformative regions, i.e. windows with low read count, which represent background. There are many strategies to do it, depending on the biology of the experiment, IP efficiency and data processing. Here, we filter out lowest 99.9% of the windows, retaining the 0.1% windows with highest signal. The rationale is that for TF experiments only 0.1% of the genome can be bound, hence the remaining must represent background.
keep <- filterWindowsProportion(data)$filter > 0.999
data.filt <- data[keep,]
To investigate the effectiveness of our filtering strategy:
> summary(keep)
Mode FALSE TRUE
logical 145558 9850
Normalisation¶
Assigning reads into larger bins for normalisation:
binned <- windowCounts(bam.files, bin=TRUE, width=10000)
The TMM method trims away putative DB bins (i.e., those with extreme M-values) and computes normalization factors from the remainder to use in edgeR. The size of each library is scaled by the corresponding factor to obtain an effective library size for modelling.
Calculating the normalisation factors using a modified TMM method:
data.filt <- normFactors(binned, se.out=data.filt)
Inspecting the normalisation factors:
> data.filt$norm.factors
[1] 0.9727458 1.0718693 0.9279702 1.0335341
Detecting differentially binding (DB) sites¶
This part of the procedure follows the logic developed for transcriptomics data in edgeR
package. The steps are described in great detail in csawBook.
Detecting DB windows:
data.filt.calc <- asDGEList(data.filt)
data.filt.calc <- estimateDisp(data.filt.calc, design)
>summary(data.filt.calc$trended.dispersion)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2136 0.3565 0.4022 0.3984 0.4742 0.4892
fit <- glmQLFit(data.filt.calc, design, robust=TRUE)
results <- glmQLFTest(fit, contrast=contrast)
rowData(data.filt) <- cbind(rowData(data.filt), results$table)
Inspecting the results table:
> head(results$table)
logFC logCPM F PValue
1 -7.239404 2.165639 17.229173 3.327018e-05
2 -5.244217 2.783211 9.484909 2.074540e-03
3 -3.023888 2.755437 4.721852 2.979352e-02
4 -2.050617 2.612401 2.684560 1.013412e-01
5 -1.827703 2.459979 2.459072 1.168638e-01
6 -4.336717 2.052296 14.330442 1.538194e-04
Correcting for multiple testing¶
First we merge adjacent DB windows into longer clusters. Windows that are less than tol
apart are considered to be adjacent and are grouped into the same cluster. The chosen tol
represents the minimum distance at which two binding events are treated as separate sites.
Large values (500 - 1000 bp) reduce redundancy and favor a region-based interpretation of
the results, while smaller values (< 200 bp) allow resolution of individual binding sites.
merged <- mergeWindows(rowRanges(data.filt), tol=1000L)
Next, we apply the multiple testing correction to obtain FDR. We combine p-values across clustered tests using Simes method to control the cluster FDR.
table.combined <- combineTests(merged$id, results$table)
The resulting table.combined
object contains FDR for each cluster:
> head(table.combined)
DataFrame with 6 rows and 8 columns
num.tests num.up.logFC num.down.logFC PValue FDR direction
<integer> <integer> <integer> <numeric> <numeric> <character>
1 7 0 5 2.32891e-04 0.004039711 down
2 3 0 3 6.98933e-06 0.000482289 down
3 3 0 3 1.94804e-04 0.003679925 down
4 5 0 5 4.10817e-05 0.001168075 down
5 3 0 3 6.67458e-05 0.001720419 down
6 5 0 5 1.88055e-04 0.003620735 down
rep.test rep.logFC
<integer> <numeric>
1 1 -7.23940
2 8 -7.00091
3 13 -7.50151
4 14 -7.12133
5 19 -7.20842
6 23 -8.90988
num.tests
- the total number of windows in each cluster;fields
num.up.logFC
andnum.down.logFC
- for each log-FC column inresults$table
; contain the number of windows with log-FCs above 0.5 or below -0.5, respectively;PValue
- the combined p value;FDR
- the q-value corresponding to the combined p value;direction
- the dominant direction of change for windows in each cluster.
Each combined p value represents evidence against the global null hypothesis, i.e., all individual nulls are true in each cluster. This may be more relevant than examining each test individually when multiple tests in a cluster represent parts of the same underlying event, i.e., genomic regions consisting of clusters of windows. The BH method is then applied to control the FDR across all clusters.
Inspecting the results¶
We select statistically significant DB events at FDR 0.05:
is.sig.region <- table.combined$FDR <= 0.05
table(table.combined$direction[is.sig.region])
How many regions were detected as differentialy bound?
down up
231 201
out of
> length(table.combined$FDR)
[1] 2758
We can also obtain information on the best window in each cluster:
tab.best <- getBestTest(merged$id, results$table)
> head(tab.best)
DataFrame with 6 rows and 8 columns
num.tests num.up.logFC num.down.logFC PValue FDR direction
<integer> <integer> <integer> <numeric> <numeric> <character>
1 7 0 4 2.32891e-04 0.004310830 down
2 3 0 3 6.98933e-06 0.000529342 down
3 3 0 3 2.56536e-04 0.004535416 down
4 5 0 5 4.10817e-05 0.001168075 down
5 3 0 3 6.67458e-05 0.001720419 down
6 5 0 5 6.41895e-04 0.008158277 down
rep.test rep.logFC
<integer> <numeric>
1 1 -7.23940
2 8 -7.00091
3 11 -7.33950
4 14 -7.12133
5 19 -7.20842
6 22 -7.47709
We can inspect congruency of the replicates on multi-dimensional scaling (MDS) plots. The distance between each pair of libraries is computed as the square root of the mean squared log-fold change across the top set of bins with the highest absolute log-fold changes. A small top set visualizes the most extreme differences whereas a large set visualizes overall differences.
par(mfrow=c(2,2))
adj.counts <- cpm(data.filt.calc, log=TRUE)
for (top in c(100, 500, 1000, 5000)) {
plotMDS(adj.counts, main=top, col=c("blue", "blue", "red", "red"),labels=c("hela", "hela", "sknsh", "sknsh"), top=top)
}
Let’s save this plot:
pdf("csaw-MDS.pdf")
par(mfrow=c(2,2))
adj.counts <- cpm(data.filt.calc, log=TRUE)
for (top in c(100, 500, 1000, 5000)) {
plotMDS(adj.counts, main=top, col=c("blue", "blue", "red", "red"),labels=c("hela", "hela", "sknsh", "sknsh"), top=top)
}
dev.off()
Annotation of the results¶
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
anno <- detailRanges(merged$region, txdb=TxDb.Hsapiens.UCSC.hg19.knownGene,
orgdb=org.Hs.eg.db, promoter=c(3000, 1000), dist=5000)
merged$region$overlap <- anno$overlap
merged$region$left <- anno$left
merged$region$right <- anno$right
Creating the final object with results and annotation¶
Now we bring it all together:
all.results <- data.frame(as.data.frame(merged$region)[,1:3], table.combined, anno)
all.results, unsorted
> head(all.results)
seqnames start end num.tests num.up.logFC num.down.logFC PValue
1 chr1 10051 10360 7 0 5 2.328912e-04
2 chr1 29301 29410 3 0 3 6.989334e-06
3 chr1 100451 100560 3 0 3 1.948039e-04
4 chr1 151101 151310 5 0 5 4.108169e-05
5 chr1 246751 246860 3 0 3 6.674578e-05
6 chr1 408351 408560 5 0 5 1.880546e-04
FDR direction rep.test rep.logFC overlap left
1 0.0040397108 down 1 -7.239404 DDX11L1:+:P
2 0.0004822892 down 8 -7.000913 WASH7P:-:PE WASH7P:-:4410
3 0.0036799249 down 13 -7.501513
4 0.0011680754 down 14 -7.121331
5 0.0017204192 down 19 -7.208420
6 0.0036207355 down 23 -8.909876
right
1 DDX11L1:+:1514,WASH7P:-:4002
2 WASH7P:-:414
3
4
5
6
All significant regions are in:
sig=all.results[all.results$FDR<0.05,]
How many significantly different pekas?:
> nrow(sig)
[1] 432
To view the top of the all.results
table:
all.results <- all.results[order(all.results$PValue),]
> head(all.results)
seqnames start end num.tests num.up.logFC num.down.logFC
1726 chr2 25642751 25642760 1 1 0
822 chr1 143647051 143647060 1 0 1
876 chr1 149785201 149785210 1 0 1
386 chr1 40530701 40530710 1 1 0
2519 chr2 199778551 199778560 1 1 0
1613 chr2 8683951 8683960 1 0 1
PValue FDR direction rep.test rep.logFC
1726 7.875683e-07 0.0004407602 up 6126 7.360348
822 1.197351e-06 0.0004407602 down 3065 -6.938285
876 1.197351e-06 0.0004407602 down 3263 -6.938285
386 1.574877e-06 0.0004407602 up 1616 7.389491
2519 1.574877e-06 0.0004407602 up 8988 7.389491
1613 2.198223e-06 0.0004407602 down 5724 -6.997646
overlap left right
1726 DTNB:-:I DTNB:-:347
822 100286793:-:579
876 H2BC18:-:P,H3C13:-:PE H2BC18:-:1273
386 CAP1:+:I CAP1:+:470 CAP1:+:1177
2519
1613
We of course discourage ranking the results by p value only ;-).
Now you are ready to save the results as a table, inspect further and generate a compelling scientific hypothesis. You can also compare the outcome with results obtained from peak-based couting approach.
One final note: In this example we have used preprocessed bam files, i.e. reads mapped to the regions of spurious high signal in ChIP-seq (i.e. the ENCODE “blacklisted regions”) were removed, as were the so called duplicated reads - reads mapped to the same genomic positions. While filtering out the blacklisted regions is always recommended, removal of duplicated reads is not recommended for DB analysis, as they may represent true signal. As always, your mileage may vary, depending on the project (library sequencing depth and complexity being the factors to watch out for in this context), so exploring several options is essential for obtaining meaningful results.
relevant information from sessionInfo()
other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[2] GenomicFeatures_1.42.3
[3] org.Hs.eg.db_3.12.0
[4] AnnotationDbi_1.52.0
[5] csaw_1.24.3
[6] SummarizedExperiment_1.20.0
[7] Biobase_2.50.0
[8] MatrixGenerics_1.2.1
[9] matrixStats_0.58.0
[10] GenomicRanges_1.42.0
[11] GenomeInfoDb_1.26.7
[12] IRanges_2.24.1
[13] S4Vectors_0.28.1
[14] BiocGenerics_0.36.0
[15] edgeR_3.32.1
[16] limma_3.46.0