ChIP-seq down-stream analysis: ChIPseeker

Learning outcomes

Using ChIPseeker package

  • to profile ChIP signal by genomics location and by ChIP binding to TSS regions

  • to annotate peaks, visualise and compare annotations

  • to run and compare functional enrichment

Introduction

In this tutorial we use another package, ChIPseeker, to have a look at the ChIP profiles, annotate peaks and visualise annotations as well as to run functional enrichment. In a way, ChIPseeker can be seen as an alternative and newer workflow to ChIPpeakAnno (introduced in differential binding). It also offers additional functionality, e.g. especially when it comes to visualising ChIP profiles and comparing functional annotations.

It supports annotating ChIP peaks and provides functions to visualize ChIP peaks coverage over chromosomes and profiles of peaks binding to TSS regions. Comparison of ChIP peak profiles and annotation are also supported. Moreover, it supports evaluating significant overlap among ChIP-seq datasets. Currently, ChIPseeker contains 17,000 bed file information from GEO database. These datasets can be downloaded and compare with user’s own data to explore significant overlap datasets for inferring co-regulation or transcription factor complex for further investigation.

Data & Methods

We will build upon the main labs, using the same dataset and results from DiffBind analyses that we have saved under DiffBind.RData. The tutorial is based on the ChIPseeker package tutorial so feel free to have this open alongside to read and experiment more.


Setting-up

You can continue working in the diffBind directory. We need access to diffBind.RData object, and some libraries, whcih are preinstalled. We access them via:

module load R_packages/4.0.4

In an R session:

# Load libraries (install if needed)
library(DiffBind)
library(ChIPseeker)
library(ReactomePA)
library(clusterProfiler)
library(biomaRt)

library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene


ChIP profile

ChIP peaks coverage plot

After peak calling one may want to visualise distribution of peaks locations over the whole genome. Function covplot calculates coverage of peaks regions over chromosomes.

Let’s use data saved in DiffBind.RData objects. From this object we can easily extract peaks called for all our libraries as well as consensus peakset. In principle, we could also use ChIPseeker on raw .BED files.

# Let's start fresh removing all objects from R environment
rm(list = ls())

# loading diffBind.RData
load("diffBind.RData")

# Do you remember what objects we have saved in the diffBind.RData
ls()

# res.cnt3 object was the final one containing consensus peaks and differential binding results

# viewing all samples
dba.show(res.cnt3)

# this should show you our 8 libraries
> dba.show(res.cnt3)
          ID Tissue Factor Replicate Caller Intervals   Reads FRiP
1 REST_chip1   HeLa   REST         1 counts      5343 1637778 0.09
2 REST_chip2   HeLa   REST         2 counts      5343 1991560 0.06
3 REST_chip3 neural   REST         1 counts      5343 3197782 0.04
4 REST_chip4 neural   REST         2 counts      5343 4924672 0.05
5 REST_chip5  HepG2   REST         1 counts      5343 2988915 0.03
6 REST_chip6  HepG2   REST         2 counts      5343 4812034 0.04
7 REST_chip7  sknsh   REST         1 counts      5343 2714033 0.07
8 REST_chip8  sknsh   REST         2 counts      5343 4180463 0.04

Please note the number of intervals (i.e. peaks) is 5343. This is different from the original consensus peakset which had 6389 peaks. This original data is present in object cnt.res2. This is because 1046 peaks fall in the internal blacklisted regions. At the same time, the object which holds the results res.cnt3 contains information which peaks are detected in which sample, and this matrix is still in the original peakset format (i.e. has 6389 rows).

Information on the consensus peakset in res.cnt3:

> head(res.cnt3$called, n=3)
     REST_chip1 REST_chip2 REST_chip3 REST_chip4 REST_chip5 REST_chip6
[1,]          0          0          0          0          1          1
[2,]          0          0          0          0          1          1
[3,]          0          0          0          0          1          1
     REST_chip7 REST_chip8
[1,]          0          1
[2,]          0          1
[3,]          1          1

> nrow(res.cnt3$called)
[1] 6389

We have to do some data wrangling on this matirx to extract the rows of interest to us, i.e. rows corresponding to peaks which were NOT blacklisted. The code to do this is presented below. You may copy - paste it and you’ll arrive at the correct object to continue working. If you would like to understand what’s happening, you can inspect the objects created in each step using commands head, nrow etc.

#all peaks including blacklisted, this corresponds to our object of interest res.cnt3$called
peaks.all=res.cnt2$peaks[[1]]

#peaks in blacklists
peaks.blck=as.data.frame(res.cnt3$peaks.blacklisted[1])

#clean up colnames
library(janitor)
peaks.all=peaks.all %>% clean_names()

#merge between the two objects
library(dplyr)
peaks.all.blck=left_join(peaks.all,peaks.blck, by=c("start", "chr"="seqnames" ))

#indices of peaks NOT in blacklists > to keep in the data
peaks.all.no_blck.ind=is.na(peaks.all.blck$group)

#examine if the numbers add up
table(peaks.all.no_blck.ind)
##      peaks.all.no_blck.ind
##      FALSE  TRUE
##       1046  5343

#subset res.cnt3$called
called.peaks=res.cnt3$called[peaks.all.no_blck.ind,]

nrow(called.peaks)
## [1] 5343

To plot peaks over genomic locations we need to extract from res.cnt3 peaks of interest, e.g. consensus peaks or present in a single replicate etc. Here, we will focus on peaks present in HeLa replicates.

# extracting consensus peak set with 5343 peaks
peaks.consensus <- dba.peakset(res.cnt3, bRetrieve = T)

peaks.consensus is a GRangers object:

> peaks.consensus
GRanges object with 5343 ranges and 8 metadata columns:
       seqnames              ranges strand | REST_chip1 REST_chip2 REST_chip3
          <Rle>           <IRanges>  <Rle> |  <numeric>  <numeric>  <numeric>
     1     chr1         29190-29590      * |          0          0          0
     2     chr1       100300-100700      * |          0          0          0
     3     chr1       151013-151413      * |          0          0          0
     4     chr1       246634-247034      * |          0          0          0
     5     chr1       408268-408668      * |          0          0          0
   ...      ...                 ...    ... .        ...        ...        ...
  5339     chr2 242910501-242910901      * |    0.00000    0.00000   48.35128
  5340     chr2 243011991-243012391      * |    0.00000    0.00000    0.00000
  5341     chr2 243030594-243030994      * |    0.00000    9.46756    4.83513
  5342     chr2 243093019-243093419      * |    2.56875    0.00000    0.00000
  5343     chr2 243184803-243185203      * |   77.06258  156.21468    0.00000
       REST_chip4 REST_chip5 REST_chip6 REST_chip7 REST_chip8
        <numeric>  <numeric>  <numeric>  <numeric>  <numeric>
     1          0   58.99879    59.8714    36.0184    45.3907
     2          0    1.07271    71.7064    57.9120    14.8551
     3          0    2.14541   110.6924   120.0614    44.5654
     4          0    1.07271    75.1873    58.6182    14.0298
     5          0    4.29082   130.1854   101.6991    48.6918
   ...        ...        ...        ...        ...        ...
  5339   28.73111    1.07271     0.0000    0.00000     0.0000
  5340    0.00000    0.00000    12.5312    7.76868     0.0000
  5341    6.91675    2.14541    18.7968    9.88741    12.3793
  5342    0.00000    6.43623    67.5293   48.73080    23.9333
  5343    0.00000    2.14541   188.6644  160.31726   120.4916
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

We select interesting peaks and work on them. First let’s check the peak locations and scores along the chromosomes.

# extracting HeLA peaks
peaks.HeLa_rep1 <-  peaks.consensus[called.peaks[,1]==1] # peaks called in rep 1
peaks.HeLa_rep2 <- peaks.consensus[called.peaks[,2]==1] # peaks called in rep 2

# adding an unified affinity scores column (re-formatting data)
peaks.HeLa_rep1$Score <- peaks.HeLa_rep1$REST_chip1
peaks.HeLa_rep2$Score <- peaks.HeLa_rep2$REST_chip2

# plotting coverage for replicate 1, using affinity scores as a weight for peaks height
covplot(peaks.HeLa_rep1, weightCol = "Score")

# zooming in to a selected region is also possible
covplot(peaks.HeLa_rep1, weightCol = "Score", xlim=c(0, 1e07))

#save the plots
pdf("chipseeker-coverage-plots-HeLa-r1.pdf")
covplot(peaks.HeLa_rep1, weightCol = "Score")
covplot(peaks.HeLa_rep1, weightCol = "Score", xlim=c(0, 1e07))
dev.off()

We can also compare peaks across replicates. This should give us visual assessment of variability between replicates: peaks locations and strength should match in an ideal scenario.

# creating genomicRangesList object holding replicates 1 and 2
grL.HeLa = GRangesList(HeLa_rep1=peaks.HeLa_rep1, HeLa_rep2=peaks.HeLa_rep2, compress=FALSE)


# plotting using affinity scores as a weight for peaks height
covplot(grL.HeLa, weightCol = "Score")

# zooming in
covplot(grL.HeLa, weightCol = "Score", xlim=c(0, 1e07))

#save the plots
pdf("chipseeker-coverage-plots-HeLa-r1-r2.pdf")
covplot(grL.HeLa, weightCol = "Score")
covplot(grL.HeLa, weightCol = "Score", xlim=c(0, 1e07))
dev.off()

What do you think?

  • are these peaks reproducible?

  • which pair of replicates is most consistent, HeLa, neural, HepG2 or sknsh? (hint: you may need to generate more plots to answer this)

  • why is it good to always look at the data instead of simply trusting the output of the summary statistics, after all, we do rely on diffBind to call peaks being consistent?

Profile of ChIP peaks binding to TSS regions

For calculating the profile of ChIP peaks binding to TSS regions, we need to prepare the TSS regions, which are defined as the flanking sequence of the TSS sites. Then we can align the peaks that are mapping to these regions, and generate the tagMatrix used for plotting.

Here, we will select peaks present per cell type, i.e. found in two replicates. We will also create tagMatrix list to enable groups comparisons across cell lines.

# extracting peaks for each cell line present across replicates
peaks.HeLa <- peaks.consensus[called.peaks[,1]==1 & called.peaks[,2]==1]
peaks.neural <- peaks.consensus[called.peaks[,3]==1 & called.peaks[,4]==1]
peaks.HepG2 <- peaks.consensus[called.peaks[,5]==1 & called.peaks[,6]==1]
peaks.sknsh <- peaks.consensus[called.peaks[,7]==1 & called.peaks[,8]==1]

# getting TSS regions
promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)

# calculating tagMatrix
tagMatrix.1 <- getTagMatrix(peaks.HeLa, windows=promoter)
tagMatrix.2 <- getTagMatrix(peaks.neural, windows=promoter)
tagMatrix.3 <- getTagMatrix(peaks.HepG2, windows=promoter)
tagMatrix.4 <- getTagMatrix(peaks.sknsh, windows=promoter)

# preparing tagMatrix list to enable cell lines comparisions
tagMatrixList <- list(HeLa=tagMatrix.1, neural=tagMatrix.2, HepG2=tagMatrix.3, sknsh=tagMatrix.4)

# plotting tagMatrix heatmaps for each cell line
tagHeatmap(tagMatrixList, xlim=c(-3000, 3000), color=NULL)

# plotting average profile of ChIP peaks among different cell lines
plotAvgProf(tagMatrixList, xlim=c(-3000, 3000))

#save the plot
pdf("chipseeker-average-peak-profile.pdf")
plotAvgProf(tagMatrixList, xlim=c(-3000, 3000))
dev.off()


Peaks Annotation

Peak annotations is performed by ``annotatePeak`` function. Here, we can define TSS region, by default set to -3kb to 3kb. The output of annotatePeak is csAnno object than we can convert to GRanges with as.GRanges() function or to data frame with as.data.frame() function.

Similar to annotations with ChIPpeakAnno we will need TxDB object containing annotations, transcript-related features of a particular genome. We can use Bioconductor packages providing annotations for various model organisms. It may be however good to know that one can also prepare their own TxDb object by retrieving information from UCSC or BioMart using GenomicFeature package. Here, we will use TxDb.Hsapiens.UCSC.hg19.knownGene annotations provided by Bioconductor.

Some annotations may overlap and by default ChIPseeker annotates peaks with the priority: promoter, 5’ UTR, 3’ UTR, exon, intron, downstreamn, intergenic, where downstream is defined as the downstream of gene end. This priority can be changed with genomicAnnotationPriority parameter.

While annotating peaks we can include optional parameter annoDb containig further genome wide annotation data. If added, this will add SYMBOL, GENENAME, ENSEMBL/ENTREZID to the peaks annotations. Again, we will use Bioconductor org.Hs.eg.db for human genome wide annotation data.

# extracting all consensus peaks (repeating commands for clarity)
peaks.consensus <- dba.peakset(res.cnt3, bRetrieve = T)

# extracting peaks for each cell line present across replicates (repeating commands for clarity)
peaks.HeLa <- peaks.consensus[res.cnt3$called[,1]==1 & res.cnt3$called[,2]==1]
peaks.neural <- peaks.consensus[res.cnt3$called[,3]==1 & res.cnt3$called[,4]==1]
peaks.HepG2 <- peaks.consensus[res.cnt3$called[,5]==1 & res.cnt3$called[,6]==1]
peaks.sknsh <- peaks.consensus[res.cnt3$called[,7]==1 & res.cnt3$called[,8]==1]

# annotating peaks
peaks.HeLa_ann <- annotatePeak(peaks.HeLa, tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Hs.eg.db")
peaks.neural_ann <- annotatePeak(peaks.neural, tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Hs.eg.db")
peaks.HepG2_ann <- annotatePeak(peaks.HepG2, tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Hs.eg.db")
peaks.sknsh_ann <- annotatePeak(peaks.sknsh, tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Hs.eg.db")

# previewing annotations summary for HeLa peaks
peaks.HeLa_ann

> peaks.HeLa_ann
Annotated peaks generated by ChIPseeker
996/996  peaks were annotated
Genomic Annotation Summary:
              Feature  Frequency
9    Promoter (<=1kb) 12.6506024
10   Promoter (1-2kb)  5.2208835
11   Promoter (2-3kb)  3.9156627
4              5' UTR  0.2008032
3              3' UTR  1.5060241
1            1st Exon  0.1004016
7          Other Exon  3.3132530
2          1st Intron 11.4457831
8        Other Intron 21.2851406
6  Downstream (<=300)  1.2048193
5   Distal Intergenic 39.1566265


# previewing peaks annotations for HeLa peaks
head(as.data.frame(peaks.HeLa_ann))

We find our genomic annotations in _annotation_ column. Plots, pie and barplot, are supported to visualise these annotations.

# creating barplot for HeLa peaks genomics annotations
plotAnnoBar(peaks.HeLa_ann)

# creating vennpie plot
vennpie(peaks.HeLa_ann)

# creating upsetplot showing overlapping annotations
upsetplot(peaks.HeLa_ann)

Let’s save these plots:

pdf("chipseeker-annoplots-hela.pdf")
plotAnnoBar(peaks.HeLa_ann)
vennpie(peaks.HeLa_ann)
upsetplot(peaks.HeLa_ann)
dev.off()

We can also use plotAnnoBar to compare annotations between different datasets, here cell lines. For that, we just need to create a list containing peaks annotations of datasets to compare.

# creating list holding annotations for different cell lines
list.annotations <- list(HeLa=peaks.HeLa_ann, neural=peaks.neural_ann, HepG2=peaks.HepG2_ann, sknskh=peaks.sknsh_ann)

# creating barplot for HeLa, neural, HepG2 and sknsh peaks genomic annotations
plotAnnoBar(list.annotations)

Finally, we can also visualise distribution of TF-binding loci relative to TSS, for single annotation set or using annotations list for comparisons.

# plotting distance to TSS for HeLa peaks
plotDistToTSS(peaks.HeLa_ann)

# plotting distance to TSS for all cell lines in our annotation list
plotDistToTSS(list.annotations)


pdf("chipseeker-DistToTSS-hela.pdf")
plotDistToTSS(peaks.HeLa_ann)
plotDistToTSS(list.annotations)
dev.off()

What do you think?

  • would you expect such distribution of features?

  • do these distributions differ between cell-lines?


Functional analysis

Having obtained annotations to nearest genes, we can perform functional enrichment analysis to identify predominant biological themes among these genes by incorporating biological knowledge provided by biological ontologies, incl. GO (Gene Ontology, Ashburner et al. 2000), KEGG (Kyoto Encyclopedia of Genes and Genomes, Kanehisa et al. 2004), DO (Disease Ontology, Schriml et al. 2011) or Reactome (Croft et al. 2013).

Here, we can also use seq2gene function for linking genomic regions to genes in a many-to-many mapping. This function consider host gene (exon/intron), promoter region and flanking gene from intergenic region that may undergo control via cis-regulation.

One can build on using ChIPseeker for functional enrichment and annotation as there are several packages by the same author to identify biological themes, i.e. ReactomePA for reactome pathways enrichment, DOSE for Disease Ontology, clusterProfiler for Gene Ontology and KEGG enrichment analysis. Especially clustserProfiler comes handy when visualising and comparing biological themes, also when comparing functions derived from other omics technologies for integrative analyses.

Here, we will experiment with few functions only. We will search for enriched reactome pathways using genes annotated to peaks by nearest location and allowing for many-to-many mapping. We will also learn how to compare functional annotations between peak sets using GO terms as an example.

We will start by defying our genes background, i.e. genes on chromosome 1 and 2. For this we can use functions from biomaRt

# defining chromosomes
chrom=c(1,2)

# defining source
ensembl=useMart("ensembl")
ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)

# running query: extracting ENTREZID for genes on chromosome 1 and 2
genes.chr1chr2 <- getBM(attributes= "entrezgene_id",
        filters=c("chromosome_name"),
        values=list(chrom), mart=ensembl)

# reformatting output to character string (as required later on by clusterProfiler functions)
genes.universe <- as.character(as.numeric(as.matrix(genes.chr1chr2)))

Reactome pathway enrichment of genes defined as a) nearest feature to the peaks and b) allowing for many-to-many mapping

# a: selecting annotated peaks for functional enrichment in object
data.peaks_ann <- peaks.neural_ann

# a: fining enriched Reactome pathways using chromosome 1 and 2 genes as a background
pathway.reac1 <- enrichPathway(as.data.frame(data.peaks_ann)$geneId, universe = genes.universe)

# a: previewing enriched Reactome pathways
head(pathway.reac1)

This is the overrepresented pathway:

> head(pathway.reac1)
                       ID     Description GeneRatio BgRatio      pvalue
R-HSA-112316 R-HSA-112316 Neuronal System    29/435 61/1797 4.56747e-05
               p.adjust     qvalue
R-HSA-112316 0.02101036 0.02101036
                                                                                                                                                                 geneID
R-HSA-112316 2782/8514/57576/58512/2899/55970/5567/3737/3752/3782/777/2752/127833/3756/3776/3775/3754/3790/170850/9378/80059/347730/60482/785/3760/90134/2571/2744/1385
             Count
R-HSA-112316    29
# b: selecting peaks
data.peaks <- peaks.HeLa

# b: running seq2gene function for many-to-many mapping based on sequence regions (note: no prior peaks annotations here, many-to-many mapping is done from the sequence)
genes.m2m <- seq2gene(data.peaks, tssRegion = c(-3000, 3000), flankDistance = 3000, TxDb=txdb)

# b: finding enriched Reactome pathways given many to many mapping and chromosome 1 and 2 genes as a background
pathway.reac2 <- enrichPathway(genes.m2m, universe = genes.universe)

# b: creating dotplot to visualise enrichment results
dotplot(pathway.reac2)

#save the plot
pdf("chipseeker-dotplot-reactome-HeLa.pdf")
dotplot(pathway.reac2)
dev.off()

Let’s search for enriched GO terms, and let’s see how we can do it for all the peak sets together so we can easily compare the results on a dotplot. Also, let’s learn how to simplify the output of GO terms using simplify function, useful in cases where lots of GO terms turn-up to be significant and it becomes difficult to interpret results. simply function removes redundant GO terms obtained from encrichGO calling internally GoSemSim function to calculate similarities among GO terms and removes those highly similar terms by keeping one representative term.

# creating a gene list with ENTREZID ideas extracted from our annotation list, containing annotated peaks for all four cell lines
list.genes = lapply(list.annotations, function(i) as.data.frame(i)$geneId)
names(list.genes) <- sub("_", "\n", names(list.genes))

# running enrichedGO function to find enriched MF correlation_libraries_normalised on the gene list

compMF <- compareCluster(geneCluster = list.genes,
                       fun           = "enrichGO",
                       pvalueCutoff  = 0.05,
                       pAdjustMethod = "BH",
                       OrgDb='org.Hs.eg.db',
                       ont="MF")

# comparing results on a dotplot
dotplot(compMF)

# simplifying results although here we do not have problems with too many GO terms
compMF.flr <- simplify(compMF, cutoff = 0.7, by = "p.adjust", select_fun = min, measure = "Wang", semData = NULL)

# creating a dotplot on reduced GO terms
dotplot(compMF.flr)

And let’s save the plots:

pdf("chipseeker-GO-MF.pdf")
 dotplot(compMF)
 dotplot(compMF.flr)
dev.off()

Concluding remarks and next steps

There are different flavours to functional annotations, and what and how functional annotations should be done is context dependent, i.e. they should be adjusted given available data and biological question being asked. There are many methods out there, all relying on the available annotations and databases, being constantly improved and developed. As a rule of thumb to understand the results and be able to draw biological conclusions, it may be good to think about i) the statistical test behind the method, ii) what is compared against what (i.e. genes vs. background) and which databases are being used (i.e. Reactome, GO, DO, KEGG).

For more examples on what can be done in terms on functional annotations, we recommend reading tutorials on clusterProfiler and DOSE, where you can further learn about semantic similarity analysis, disease enrichment analysis, GSEA analysis and much more.