Peak Annotation

Learning outcomes

Using ChIPseeker package

  • to profile ATAC-seq signal by genomics location and by proximity to TSS regions

  • to annotate peaks with the nearest gene

  • to run and compare functional enrichment

Introduction

In this tutorial we use an R / Bioconductor package ChIPseeker, to have a look at the ATAC / ChIP profiles, annotate peaks and visualise annotations. We will also perform functional annotation of peaks using clusterProfiler.

The tutorial is based on the ChIPseeker package tutorial so feel free to have this open alongside to read and experiment more.

Data & Methods

We will build upon the main labs:

  • ATAC-seq: all detected peaks (merged consensus peaks);

  • ATAC-seq: differentially accessible peaks;

Setting-up

You can continue working in the atacseq/analysis/counts directory. This directory contains merged peaks called earlier using macs3 callpeak as well as count tables derived from summarising of non-subset data (we won’t need the count tables for this exercise). We will use file nk_merged_peaksid.bed and annotation libraries, which are preinstalled. We access them via:

module load R_packages/4.1.1

We activate R console upon typing R in the terminal.

We begin by loading necessary libraries:

library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

library(GenomicAlignments)
library(GenomicFeatures)

library(clusterProfiler)
library(biomaRt)
library(org.Hs.eg.db)
library(ReactomePA)

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene


Peaks Coverage Plot

After peak calling one may want to visualise distribution of peaks locations over the whole genome. Function covplot calculates coverage of peaks regions over chromosomes.

Let’s load in the data (peaks called on nun subset data) and transform the data frame to GenomicRanges object (GRanges):

pth2peaks_bed="nk_merged_peaksid.bed"

peaks.bed=read.table(pth2peaks_bed, sep="\t", header=FALSE, blank.lines.skip=TRUE)
rownames(peaks.bed)=peaks.bed[,4]

peaks.gr <- GRanges(seqnames=peaks.bed[,1], ranges=IRanges(peaks.bed[,2], peaks.bed[,3]), strand="*", mcols=data.frame(peakID=peaks.bed[,4]))

If you are not familiar with GRanges objects, this is how the structure is:

GRanges object with 83180 ranges and 1 metadata column:
          seqnames              ranges strand |          mcols.peakID
             <Rle>           <IRanges>  <Rle> |           <character>
      [1]     chr1         10003-10442      * |     nk_merged_macs3_1
      [2]     chr1         28932-29454      * |     nk_merged_macs3_2
      [3]     chr1       180755-181134      * |     nk_merged_macs3_3
      [4]     chr1       181359-181895      * |     nk_merged_macs3_4
      [5]     chr1       183598-183831      * |     nk_merged_macs3_5
      ...      ...                 ...    ... .                   ...
  [83176]     chrX 155997332-155997955      * | nk_merged_macs3_83176
  [83177]     chrX 156016605-156016865      * | nk_merged_macs3_83177
  [83178]     chrX 156025043-156025495      * | nk_merged_macs3_83178
  [83179]     chrX 156028799-156029148      * | nk_merged_macs3_83179
  [83180]     chrX 156030182-156030752      * | nk_merged_macs3_83180
  -------
  seqinfo: 91 sequences from an unspecified genome; no seqlengths

To inspect peak coverage along the chromosomes:

covplot(peaks.gr, chrs=c("chr14", "chr15"))

#to save the image to file
pdf("PeakCoverage.pdf")
covplot(peaks.gr, chrs=c("chr14", "chr15"))
dev.off()



Peak Annotation

To annotate peaks with closest genomic features:

bed.annot = annotatePeak(peaks.gr, tssRegion=c(-3000, 3000),TxDb=txdb, annoDb="org.Hs.eg.db")

Let’s inspect the results:

> bed.annot
Annotated peaks generated by ChIPseeker
82916/83180  peaks were annotated
Genomic Annotation Summary:
              Feature   Frequency
9    Promoter (<=1kb) 24.99879396
10   Promoter (1-2kb)  4.17289787
11   Promoter (2-3kb)  3.47098268
4              5' UTR  0.31598244
3              3' UTR  2.09971537
1            1st Exon  1.80905977
7          Other Exon  3.00424526
2          1st Intron 12.60191037
8        Other Intron 23.51536495
6  Downstream (<=300)  0.08321675
5   Distal Intergenic 23.92783058

Ca 25% of peaks localise to TSS, as expected in an ATAC-seq experiment.

Let’s see peak annotations:

annot_peaks=as.data.frame(bed.annot)

This is the resulting data frame:

  seqnames  start    end width strand      mcols.peakID       annotation
1     chr1  10003  10442   440      * nk_merged_macs3_1 Promoter (1-2kb)
2     chr1  28932  29454   523      * nk_merged_macs3_2 Promoter (<=1kb)
3     chr1 180755 181134   380      * nk_merged_macs3_3 Promoter (1-2kb)
4     chr1 181359 181895   537      * nk_merged_macs3_4 Promoter (<=1kb)
5     chr1 183598 183831   234      * nk_merged_macs3_5 Promoter (<=1kb)
6     chr1 190831 192057  1227      * nk_merged_macs3_6 Promoter (2-3kb)
  geneChr geneStart geneEnd geneLength geneStrand    geneId      transcriptId
1       1     11869   14409       2541          1 100287102 ENST00000456328.2
2       1     14404   29570      15167          2    653635 ENST00000488147.1
3       1    182696  184174       1479          1 102725121 ENST00000624431.2
4       1    182696  184174       1479          1 102725121 ENST00000624431.2
5       1    182696  184174       1479          1 102725121 ENST00000624431.2
6       1    187891  187958         68          2 102466751 ENST00000612080.1
  distanceToTSS         ENSEMBL    SYMBOL
1         -1427 ENSG00000223972   DDX11L1
2           116 ENSG00000227232    WASH7P
3         -1562 ENSG00000223972  DDX11L17
4          -801 ENSG00000223972  DDX11L17
5           902 ENSG00000223972  DDX11L17
6         -2873 ENSG00000278267 MIR6859-1
                                     GENENAME
1  DEAD/H-box helicase 11 like 1 (pseudogene)
2           WASP family homolog 7, pseudogene
3 DEAD/H-box helicase 11 like 17 (pseudogene)
4 DEAD/H-box helicase 11 like 17 (pseudogene)
5 DEAD/H-box helicase 11 like 17 (pseudogene)
6                             microRNA 6859-1

It can be saved to a file:

write.table(annot_peaks, "nk_merged_annotated.txt",
        append = FALSE,
        quote = FALSE,
        sep = "\t",
        row.names = FALSE,
        col.names = TRUE,
        fileEncoding = "")

We can also visualise the annotation summary:

pdf("AnnotVis.pdf")
upsetplot(bed.annot, vennpie=TRUE)
dev.off()


Distribution of loci with respect to TSS:

pdf("TSSdist.pdf")
plotDistToTSS(bed.annot, title="Distribution of ATAC-seq peaks loci\nrelative to TSS")
dev.off()


Functional Analysis

Having obtained annotations to nearest genes, we can perform functional enrichment analysis to identify predominant biological themes among these genes by incorporating knowledge provided by biological ontologies, e.g. GO (Gene Ontology, Ashburner et al. 2000) and Reactome (Croft et al. 2013).

In this tutorial we use the merged consensus peaks set. This analysis can also be performed on results of differential accessibility / occupancy.

Let’s first annotate the peaks with Reactome.

Reactome pathway enrichment of genes defined as the nearest feature to the peaks:

#finding enriched Reactome pathways using chromosome 1 and 2 genes as a background
pathway.reac <- enrichPathway(as.data.frame(annot_peaks)$geneId)

#previewing enriched Reactome pathways
head(pathway.reac)

This is the result (we skip column 8, as it is very broad - contains the gene IDs in set):

> colnames(as.data.frame(pathway.reac))
[1] "ID"          "Description" "GeneRatio"   "BgRatio"     "pvalue"
[6] "p.adjust"    "qvalue"      "geneID"      "Count"

> pathway.reac[1:10,c(1:7,9)]
                         ID
R-HSA-9012999 R-HSA-9012999
R-HSA-9013149 R-HSA-9013149
R-HSA-9013148 R-HSA-9013148
R-HSA-4420097 R-HSA-4420097
R-HSA-9006925 R-HSA-9006925
R-HSA-5683057 R-HSA-5683057
R-HSA-194138   R-HSA-194138
R-HSA-449147   R-HSA-449147
R-HSA-5663202 R-HSA-5663202
R-HSA-9013106 R-HSA-9013106
                                                                                   Description
R-HSA-9012999                                                                 RHO GTPase cycle
R-HSA-9013149                                                                RAC1 GTPase cycle
R-HSA-9013148                                                               CDC42 GTPase cycle
R-HSA-4420097                                                             VEGFA-VEGFR2 Pathway
R-HSA-9006925                                     Intracellular signaling by second messengers
R-HSA-5683057                                                   MAPK family signaling cascades
R-HSA-194138                                                                 Signaling by VEGF
R-HSA-449147                                                         Signaling by Interleukins
R-HSA-5663202 Diseases of signal transduction by growth factor receptors and second messengers
R-HSA-9013106                                                                RHOC GTPase cycle
              GeneRatio   BgRatio       pvalue     p.adjust       qvalue Count
R-HSA-9012999  424/9073 443/10856 5.713537e-16 8.678863e-13 7.415570e-13   424
R-HSA-9013149  180/9073 185/10856 1.792656e-09 1.361522e-06 1.163340e-06   180
R-HSA-9013148  155/9073 159/10856 1.512873e-08 7.660180e-06 6.545166e-06   155
R-HSA-4420097   98/9073  99/10856 3.655317e-07 1.388107e-04 1.186054e-04    98
R-HSA-9006925  287/9073 309/10856 7.154392e-07 1.882887e-04 1.608815e-04   287
R-HSA-5683057  301/9073 325/10856 8.286217e-07 1.882887e-04 1.608815e-04   301
R-HSA-194138   106/9073 108/10856 8.676899e-07 1.882887e-04 1.608815e-04   106
R-HSA-449147   421/9073 462/10856 1.150075e-06 2.183704e-04 1.865845e-04   421
R-HSA-5663202  362/9073 395/10856 1.447213e-06 2.442575e-04 2.087034e-04   362
R-HSA-9013106   74/9073  74/10856 1.631772e-06 2.478661e-04 2.117868e-04    74

We can see familar terms which can be connected to sample biology: Signaling by Interleukins, MAPK family signaling cascades.

Let’s search for enriched GO terms:

pathway.GO <- enrichGO(as.data.frame(annot_peaks)$geneId, org.Hs.eg.db, ont = "MF")

These results look in agreement with analyses using reactome:

                   ID                                Description       qvalue
GO:0004674 GO:0004674   protein serine/threonine kinase activity 1.923215e-14
GO:0030695 GO:0030695                  GTPase regulator activity 2.997318e-10
GO:0045296 GO:0045296                           cadherin binding 4.941368e-09
GO:0015631 GO:0015631                            tubulin binding 8.978474e-09
GO:0060090 GO:0060090                 molecular adaptor activity 8.978474e-09
GO:0005085 GO:0005085 guanyl-nucleotide exchange factor activity 8.979500e-09
GO:0051020 GO:0051020                             GTPase binding 2.355108e-08
GO:0003779 GO:0003779                              actin binding 6.483723e-08
GO:0031267 GO:0031267                       small GTPase binding 1.222450e-07
GO:0030674 GO:0030674     protein-macromolecule adaptor activity 1.900703e-07
GO:0042578 GO:0042578        phosphoric ester hydrolase activity 2.558341e-06

Please remember that the results of functional analysis like the one presented above can be only as good as the annotations.