Peak Annotation

Date:

2025-09-15


Learning outcomes

Using ChIPseeker package

  • to profile ATAC-seq signal by genomic location and by proximity to TSS regions

  • to annotate peaks with the nearest feature

  • to perform functional annotation


Note

We continue working with data from (Tsao et al. 2022). We will use the count table derived from non subset data (already prepared).


Introduction

In this tutorial we use an R / Bioconductor package ChIPseeker (Yu, Wang, and He 2015), to have a look at the ATAC / ChIP profiles, annotate peaks and visualise annotations. We will also perform functional annotation of peaks using clusterProfiler (Yu et al. 2012).

This tutorial is based on the ChIPseeker package tutorial so feel free to have this open alongside to read and experiment more.


Data & Methods

We will build upon the main lab:


Setting Up

You can continue working in the directory atacseq/analysis.

We will use counts table which summarises reads in peaks detected using non-subset data in file AB_Batf_KO_invivo.genrich_joint.merged_peaks.featureCounts and annotation libraries, which are preinstalled.

We can create a new directory which will be our workdir in the subsequent code:

mkdir DA
cd DA

We can link all necessary files:

ln -s ../counts

ln -s ../../results/DA/annotation/

Please note the counts directory used in this step is not the same one we created when generating the counts table in the course of the previous tutorial. If in doubt, you can link the files like this:

ln -s /proj/epi2025/atacseq_proc/analysis/counts

We access the R environment via:

module load R_packages/4.3.1


We activate R console upon typing R in the terminal.


We begin by loading necessary libraries:

library(tidyverse)
library(dplyr)
library(kableExtra)

library(GenomicRanges)

library(ChIPseeker)
library(biomaRt)
library(txdbmaker)

library(clusterProfiler)
library(org.Mm.eg.db)
library(ReactomePA)

We need to set the path to workdir first, as other paths are relative to it:

workdir=getwd()

To set working directory to your desired path (other than the current directory the R session started at) you can use these commands:

workdir="/path/to/workdir"

workdir=setwd()

We take advantage of the module system on Rackham in this tutorial. The code was tested under R 4.3.1 The lab was developed under different R version, as stated in session info.


Preparing Annotations

We will use TSS annotations from Ensembl. We can fetch them from biomaRt:

txdb_mm39 = makeTxDbFromBiomart(biomart="ensembl",
                               dataset="mmusculus_gene_ensembl",
                               circ_seqs=NULL,
                               host="https://www.ensembl.org",
                               taxonomyId=NA,
                               miRBaseBuild=NA)


ensembl <- useEnsembl(biomart = "genes", dataset = "mmusculus_gene_ensembl")
all_genes=names(transcriptsBy(txdb_mm39, "gene"))

gene_annot_mm39=getBM(attributes =
   c('ensembl_gene_id', 'ensembl_transcript_id','external_gene_name','entrezgene_id','description','gene_biotype','chromosome_name','start_position','end_position','strand','transcription_start_site'),
       filters = 'ensembl_gene_id',
       values = all_genes,
        mart = ensembl)

We have prepared these for you, so you can load them via:

annotdir=file.path(workdir,"annotation")

pth_txdbens=file.path(annotdir,"Ensembl.txdb.GRCm39.Rdata")
txdb_mm39=loadDb(pth_txdbens)

gene_annot_mm39=read.delim(file.path(annotdir,"mm39_gene_names.tab"), sep="\t", header=TRUE, quote = "")


You can inspect the objects:

txdb_mm39
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: BioMart
## # Organism: Mus musculus
## # Taxonomy ID: 10090
## # Resource URL: www.ensembl.org:443
## # BioMart database: ENSEMBL_MART_ENSEMBL
## # BioMart database version: Ensembl Genes 115
## # BioMart dataset: mmusculus_gene_ensembl
## # BioMart dataset description: Mouse genes (GRCm39)
## # BioMart dataset version: GRCm39
## # Full dataset: yes
## # miRBase build ID: NA
## # Nb of transcripts: 278396
## # Db created by: txdbmaker package from Bioconductor
## # Creation time: 2025-09-08 11:35:27 +0200 (Mon, 08 Sep 2025)
## # txdbmaker version at creation time: 1.0.1
## # RSQLite version at creation time: 2.4.3
## # DBSCHEMAVERSION: 1.2

head(gene_annot_mm39)
##      ensembl_gene_id ensembl_transcript_id external_gene_name entrezgene_id
## 1 ENSMUSG00000000103    ENSMUST00000187148               Zfy2         22768
## 2 ENSMUSG00000000103    ENSMUST00000115891               Zfy2         22768
## 3 ENSMUSG00000001700    ENSMUST00000237355            Gramd2b        107022
## 4 ENSMUSG00000001700    ENSMUST00000237422            Gramd2b        107022
## 5 ENSMUSG00000001700    ENSMUST00000235794            Gramd2b        107022
## 6 ENSMUSG00000001700    ENSMUST00000237716            Gramd2b        107022
##                                                         description
## 1 zinc finger protein 2, Y-linked [Source:MGI Symbol;Acc:MGI:99213]
## 2 zinc finger protein 2, Y-linked [Source:MGI Symbol;Acc:MGI:99213]
## 3     GRAM domain containing 2B [Source:MGI Symbol;Acc:MGI:1914815]
## 4     GRAM domain containing 2B [Source:MGI Symbol;Acc:MGI:1914815]
## 5     GRAM domain containing 2B [Source:MGI Symbol;Acc:MGI:1914815]
## 6     GRAM domain containing 2B [Source:MGI Symbol;Acc:MGI:1914815]
##     gene_biotype chromosome_name start_position end_position strand
## 1 protein_coding               Y        2106015      2170409     -1
## 2 protein_coding               Y        2106015      2170409     -1
## 3 protein_coding              18       56533409     56636864      1
## 4 protein_coding              18       56533409     56636864      1
## 5 protein_coding              18       56533409     56636864      1
## 6 protein_coding              18       56533409     56636864      1
##   transcription_start_site
## 1                  2150346
## 2                  2170409
## 3                 56533409
## 4                 56533447
## 5                 56552242
## 6                 56602339



If you would rather annotate TSS using the gene models from UCSC you can use the Bioconductor package directly:

library(TxDb.Mmusculus.UCSC.mm39.knownGene)
txdb=TxDb.Mmusculus.UCSC.mm39.knownGene

Please note that UCSC and Ensembl use different contig naming schemes, so it is advisable to use the annotation matching the genome reference used for read mapping.



Data

We can now load data. We will subset the count table to only contain the peaks on assembled chromosomes.

count_table_fname="AB_Batf_KO_invivo.genrich_joint.merged_peaks.featureCounts"
cnt_table_pth=file.path(file.path(workdir,"counts"),count_table_fname)

cnt_table=read.table(cnt_table_pth, sep="\t", header=TRUE, blank.lines.skip=TRUE, comment.char = "#")
rownames(cnt_table)=cnt_table$Geneid
rownames(cnt_table)=c(gsub("AB_Batf_KO_invivo.genrich_joint.","",rownames(cnt_table)))
colnames(cnt_table)=c(colnames(cnt_table)[1:6],gsub(".filt.bam","",colnames(cnt_table)[7:10]))

colnames(cnt_table)[7:10]=c("B1_WT_Batf-floxed","B2_WT_Batf-floxed","A1_Batf_cKO","A2_Batf_cKO")

#remove peaks not on the assembled chromosomes
cnt_table_chr=cnt_table|>
  dplyr::filter(Chr%in%c(1:19) | Chr%in%c("X","Y"))

reads.peak=cnt_table_chr[,c(7:10)]

head(reads.peak)
##                B1_WT_Batf-floxed B2_WT_Batf-floxed A1_Batf_cKO A2_Batf_cKO
## merged_peaks_1               299               238         325         330
## merged_peaks_2               106                83         162         174
## merged_peaks_3                19                24          25          21
## merged_peaks_4                27                31          40          29
## merged_peaks_5               114               101          65         151
## merged_peaks_6               129               137         120         204
  • All peaks: n = 65027.

  • Peaks on assembled chromosomes: n = 64879. These peaks will be used for further analysis.



Peak Annotation

We will be working on a GRanges object peaks_gr containing non-subset peaks (i.e. all assebled chromosome peaks).

Let’s create the object:

peaks_gr=GRanges(seqnames=cnt_table_chr$Chr, ranges=IRanges(cnt_table_chr$Start, cnt_table_chr$End), strand="*", mcols=data.frame(peakID=rownames(cnt_table_chr)))

and inspect it:

peaks_gr
## GRanges object with 64879 ranges and 1 metadata column:
##           seqnames            ranges strand |       mcols.peakID
##              <Rle>         <IRanges>  <Rle> |        <character>
##       [1]        1   3050939-3052959      * |     merged_peaks_1
##       [2]        1   3053048-3054634      * |     merged_peaks_2
##       [3]        1   3054861-3055532      * |     merged_peaks_3
##       [4]        1   3057260-3057785      * |     merged_peaks_4
##       [5]        1   3059375-3061360      * |     merged_peaks_5
##       ...      ...               ...    ... .                ...
##   [64875]        Y 90814281-90815165      * | merged_peaks_64875
##   [64876]        Y 90815739-90816707      * | merged_peaks_64876
##   [64877]        Y 90818033-90819321      * | merged_peaks_64877
##   [64878]        Y 90819900-90820364      * | merged_peaks_64878
##   [64879]        Y 90821996-90824312      * | merged_peaks_64879
##   -------
##   seqinfo: 21 sequences from an unspecified genome; no seqlengths

We’re ready to annotate the peaks to their closest feature. Regions up to 3kb from TSS are annotated as “promoter”.

peakAnno=annotatePeak(peaks_gr, tssRegion=c(-3000, 3000),TxDb=txdb_mm39)

Summary of the regions annotated to peaks:

peakAnno
## Annotated peaks generated by ChIPseeker
## 64879/64879  peaks were annotated
## Genomic Annotation Summary:
##               Feature   Frequency
## 9    Promoter (<=1kb) 36.09642565
## 10   Promoter (1-2kb)  8.69464696
## 11   Promoter (2-3kb)  6.88204196
## 4              5' UTR  0.12022380
## 3              3' UTR  1.63843462
## 1            1st Exon  0.04469859
## 7          Other Exon  3.04104564
## 2          1st Intron 10.21594044
## 8        Other Intron 18.13838068
## 6  Downstream (<=300)  0.13717844
## 5   Distal Intergenic 14.99098321

Over 30% peaks localised to TSS, as expected in an ATAC-seq experiment.



Let’s inspect peak annotations:

peakAnno_df=as.data.frame(peakAnno)

  seqnames   start     end width strand   mcols.peakID        annotation geneChr geneStart geneEnd geneLength geneStrand             geneId       transcriptId distanceToTSS
1        1 3050939 3052959  2021      * merged_peaks_1 Distal Intergenic       1   3143476 3144545       1070          1 ENSMUSG00000102693 ENSMUST00000193812        -90517
2        1 3053048 3054634  1587      * merged_peaks_2 Distal Intergenic       1   3143476 3144545       1070          1 ENSMUSG00000102693 ENSMUST00000193812        -88842
3        1 3054861 3055532   672      * merged_peaks_3 Distal Intergenic       1   3143476 3144545       1070          1 ENSMUSG00000102693 ENSMUST00000193812        -87944
4        1 3057260 3057785   526      * merged_peaks_4 Distal Intergenic       1   3143476 3144545       1070          1 ENSMUSG00000102693 ENSMUST00000193812        -85691
5        1 3059375 3061360  1986      * merged_peaks_5 Distal Intergenic       1   3143476 3144545       1070          1 ENSMUSG00000102693 ENSMUST00000193812        -82116
6        1 3066555 3069092  2538      * merged_peaks_6 Distal Intergenic       1   3143476 3144545       1070          1 ENSMUSG00000102693 ENSMUST00000193812        -74384

We may want to include more gene related information:

peakAnno_df=peakAnno_df|>left_join(gene_annot_mm39, by=c("transcriptId"="ensembl_transcript_id"))

It can be saved to a file as a table:

write.table(peakAnno_df, "Batf_WT_KO.merged_peaks.tsv",
    append = FALSE,
    quote = FALSE,
    sep = "\t",
    row.names = FALSE,
    col.names = TRUE,
    fileEncoding = "")

Or as R data object rds

saveRDS(peakAnno_df, file = "Allpeaks_annot.Ensembl.rds")

We can inspect read density at annotated TSS regions:

promoter=getPromoters(TxDb=txdb_mm39, upstream=3000, downstream=3000)
tagMatrix=getTagMatrix(peaks_gr, windows=promoter)

TSS_profile=plotAvgProf(tagMatrix, xlim=c(-3000, 3000), xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")

Plot of all peaks in relation to transcription start sites (TSS) is presented on Figure below. Expected is an enrichment of signal in the vicinity of TSS.

TSS profile

TSS profile

We can also plot summary of the annotations:

peakAnnoplot=upsetplot(peakAnno, vennpie=TRUE)
Annotation summary

Annotation summary

To save these plots:

pdf("TSSdist.pdf")
  TSS_profile
dev.off()

pdf("AnnotVis.pdf")
  peakAnnoplot
dev.off()



Functional Analysis

Having obtained annotations to nearest genes, we can perform functional enrichment analysis to identify predominant biological themes among these genes by incorporating knowledge provided by biological ontologies, e.g. GO (Gene Ontology, (Ashburner et al. 2000)) and Reactome (Griss et al. 2020).

We will perform Enrichment Analysis using functions enrichGO from clusterProfiler and enrichPathway from ReactomePA. These functions do not take size effect into account.

In this tutorial we use the merged consensus peaks set. This analysis can also be performed on results of differential accessibility / occupancy.

Let’s first annotate the peaks with Reactome.

Reactome annotations support entrez gene ID space.

Reactome pathway enrichment of genes defined as the nearest feature to the peaks:

entrez_ids=peakAnno_df$entrezgene_id
entrez_ids=entrez_ids[!is.na(unique(entrez_ids))]

pathway.reac=ReactomePA::enrichPathway(entrez_ids, organism = "mouse")

#previewing enriched Reactome pathways
colnames(as.data.frame(pathway.reac))
[1] "ID"          "Description" "GeneRatio"   "BgRatio"     "pvalue"      "p.adjust"    "qvalue"      "geneID"      "Count"

#we skip the preview of some columns which consist of long strings of gene IDs
pathway.reac[1:10,c(1:7,9)]
                         ID                                                 Description GeneRatio  BgRatio       pvalue     p.adjust       qvalue Count
R-MMU-9012999 R-MMU-9012999                                            RHO GTPase cycle  359/6685 394/8654 1.331007e-13 7.933191e-11 5.346233e-11   359
R-MMU-983168   R-MMU-983168 Antigen processing: Ubiquitination & Proteasome degradation  265/6685 284/8654 1.399152e-13 7.933191e-11 5.346233e-11   265
R-MMU-983169   R-MMU-983169      Class I MHC mediated antigen processing & presentation  310/6685 339/8654 2.043365e-12 7.723919e-10 5.205203e-10   310
R-MMU-73887     R-MMU-73887                                    Death Receptor Signaling  127/6685 133/8654 5.339536e-09 1.513759e-06 1.020132e-06   127
R-MMU-3700989 R-MMU-3700989                          Transcriptional Regulation by TP53  259/6685 288/8654 1.151777e-08 2.612231e-06 1.760401e-06   259
R-MMU-1280215 R-MMU-1280215                         Cytokine Signaling in Immune system  407/6685 467/8654 1.887907e-08 3.568145e-06 2.404598e-06   407
R-MMU-2555396 R-MMU-2555396                              Mitotic Metaphase and Anaphase  200/6685 219/8654 2.618777e-08 4.242419e-06 2.858996e-06   200
R-MMU-68882     R-MMU-68882                                            Mitotic Anaphase  199/6685 218/8654 3.117447e-08 4.418982e-06 2.977983e-06   199
R-MMU-8951664 R-MMU-8951664                                                 Neddylation  198/6685 218/8654 9.569964e-08 1.205815e-05 8.126075e-06   198
R-MMU-195258   R-MMU-195258                                        RHO GTPase Effectors  213/6685 237/8654 2.721884e-07 3.086617e-05 2.080093e-05   213

We can see familar terms which can be connected to sample biology:

  • Cytokine Signaling in Immune system

  • Class I MHC mediated antigen processing & presentation

Let’s search for enriched GO terms:

pathway.GO=enrichGO(entrez_ids, org.Mm.eg.db, ont = "BP")

pathway.GO[1:10,c(1:7,9)]

These results look roughly in agreement with analyses using reactome:

                   ID                                                       Description GeneRatio   BgRatio RichFactor FoldEnrichment   zScore     p.adjust
GO:0044772 GO:0044772                               mitotic cell cycle phase transition 390/14290 442/28905  0.8823529       1.784773 16.44034 3.595678e-64
GO:0022411 GO:0022411                                    cellular component disassembly 424/14290 496/28905  0.8548387       1.729119 16.19597 3.178532e-61
GO:0007264 GO:0007264                         small GTPase-mediated signal transduction 383/14290 439/28905  0.8724374       1.764717 15.96487 2.157761e-60
GO:0051656 GO:0051656                           establishment of organelle localization 423/14290 498/28905  0.8493976       1.718113 15.98419 1.140876e-59
GO:1901987 GO:1901987                         regulation of cell cycle phase transition 386/14290 453/28905  0.8520971       1.723574 15.34868 4.164055e-55
GO:0022613 GO:0022613                              ribonucleoprotein complex biogenesis 376/14290 442/28905  0.8506787       1.720705 15.09815 2.621815e-53
GO:0007059 GO:0007059                                            chromosome segregation 360/14290 419/28905  0.8591885       1.737918 15.04509 2.648496e-53
GO:0002764 GO:0002764                      immune response-regulating signaling pathway 413/14290 497/28905  0.8309859       1.680871 15.13974 4.541242e-53
GO:0043161 GO:0043161 proteasome-mediated ubiquitin-dependent protein catabolic process 342/14290 394/28905  0.8680203       1.755782 14.93600 5.798655e-53
GO:0002757 GO:0002757                      immune response-activating signaling pathway 405/14290 486/28905  0.8333333       1.685619 15.07275 9.214974e-53
  • immune response-regulating signaling pathway

  • immune response-activating signaling pathway

Similar strategy can be used for analysis of subset peaks, i.e. differentially accessible peaks, peaks following similar level changes over a range of conditions etc.

Please remember that the results of functional analysis like the one presented above can be only as accurate as the annotations.



Session Info



References

Ashburner, M., C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, et al. 2000. “Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.” Nat Genet 25 (1): 25–29.

Griss, J., G. Viteri, K. Sidiropoulos, V. Nguyen, A. Fabregat, and H. Hermjakob. 2020. “ReactomeGSA - Efficient Multi-Omics Comparative Pathway Analysis.” Mol Cell Proteomics 19 (12): 2115–25.

Tsao, Hsiao-Wei, James Kaminski, Makoto Kurachi, R. Anthony Barnitz, Michael A. DiIorio, Martin W. LaFleur, Wataru Ise, et al. 2022. “Batf-Mediated Epigenetic Control of Effector CD8 + t Cell Differentiation.” Science Immunology 7 (68). https://doi.org/10.1126/sciimmunol.abi4919.

Yu, Guangchuang, Li-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “clusterProfiler: An r Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.

Yu, Guangchuang, Li-Gen Wang, and Qing-Yu He. 2015. “ChIPseeker: An r/Bioconductor Package for ChIP Peak Annotation, Comparison and Visualization.” Bioinformatics 31 (14): 2382–83. https://doi.org/10.1093/bioinformatics/btv145.