ATAC-seq¶

ATAC-seq (Assay for Transposase-Accessible Chromatin with high-throughput sequencing) is a method for determining chromatin accessibility across the genome. It utilizes a hyperactive Tn5 transposase to insert sequencing adapters into open chromatin regions. High-throughput sequencing then yields reads that indicate these regions of increased accessibility.

Figure 1. Overview of ATAC-seq (Buenrostro et al., 2015).¶

The sample processing workflow is similar to that of ChIP-seq. There are some small changes, however, and we’ll discuss them here.

First of all, the sample should be sequenced using the paired end (PE) protocol. Single end sequencing for ATAC-seq is strongly discouraged, for reasons mentioned below.

Data
Alignment Processing and QC
Peak Calling

Data ¶

In this tutorial we will use data from the study of Buenrostro et al. 2013, the first paper on the ATAC-seq method. The data is from a human cell line of purified CD4+ T cells. The original dataset had 2 x 200 million reads and would be too big to process in a training session, so the original dataset was downsampled to 200,000 randomly selected reads. Additionally, about 200,000 reads pairs that will map to chromosome 22 were included to have a good profile on this chromosome, similar (although downsampled) to what you might get with a typical ATAC-seq sample.

Prior to aligning reads to the reference genome, the reads must be properly trimmed off adapters - because we will allow “dovetailing” (with the mates seemingly extending “past” each other) of read pairs during alignment:

<--------------------Mate 1-----------------------
AGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGC
  CTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCT
  ----------------------Mate 2--------------------->

Read alignment can be performed using any aligner which performs global i.e. end-to-end alignment, such as bwa-mem, bowtie, bowtie2 (newer versions). In this tutorial we used bowtie2 for read mapping, for details see the header of bam file. We start from the bam file which contains subset reads aligned to hg38 assembly.

How to check bam header?

#load samtools
module load bioinfo-tools
module load samtools/1.8

#view header only
samtools view -H name.bam

Alignment Processing and QC ¶

First, you need to link the bam file in your working directory. This file has been filtered off alignments with low quality score and contains only properly paired read pairs. You can refer to bowtie2 manual for details. The file is also sorted and inexed.

mkdir -p ~/atacseq/bam
cd ~/atacseq/bam

ln -s /proj/g2021025/nobackup/atacseq/data/SRR891268_hg38.bowtie2.q30.sorted.bam .
ln -s /proj/g2021025/nobackup/atacseq/data/SRR891268_hg38.bowtie2.q30.sorted.bam.bai .

First, we would like to know how many fragments mapped to chrM, as reads derived from mitochondrial DNA represent noise in ATAC-seq datasets and can substantially inflate the background level in peak identification.

The output is TAB-delimited with each line consisting of reference sequence name, sequence length, number of mapped read-segments and number of unmapped read-segments.

module load bioinfo-tools
module load samtools/1.8

samtools idxstats SRR891268_hg38.bowtie2.q30.sorted.bam >SRR891268.idxstats.txt

#total fragments
awk '{sum += $3} END {print sum}' SRR891268.idxstats.txt

#chrM fragments
awk '$1 ~ /chrM/ {print $3}' SRR891268.idxstats.txt

> 165586/437490
[1] 0.3784909

You can see that almost 40% of the fragments mapped to mitochondrial DNA. This is often the case in ATAC-seq experiments (depending on the sample preparation protocol, it is possible to remove these fragments from the library prior to sequencing) and should be taken into account when planning the experiment. We remove these reads in the next step.

The alignment processing steps are similar to ChIP-seq data processing. In this example we do not filter out reads mapping to blaclisted regions (found in Encode accession ENCFF356LFX), this step may be necessary, depending on the dataset.

samtools view -h SRR891268_hg38.bowtie2.q30.sorted.bam | awk '($3 != "chrM")' | samtools view -Shbo SRR891268_hg38.bowtie2.q30.sorted.noM.bam -

samtools index SRR891268_hg38.bowtie2.q30.sorted.noM.bam

samtools stats SRR891268_hg38.bowtie2.q30.sorted.noM.bam >SRR891268.stats.txt

The last command collects statistics from BAM files and outputs in a text format. To see the summary:

grep ^SN SRR891268.stats.txt | cut -f 2-

# the interesting part
insert size average:    231.6
insert size standard deviation: 188.8

parsing BAM statistics

(base) [agata@rackham3 bam]$ grep ^SN SRR891268.stats.txt | cut -f 2-
raw total sequences:    271904
filtered sequences:     0
sequences:      271904
is sorted:      1
1st fragments:  135952
last fragments: 135952
reads mapped:   271904
reads mapped and paired:        271904  # paired-end technology bit set + both mates mapped
reads unmapped: 0
reads properly paired:  271904  # proper-pair bit set
reads paired:   271904  # paired-end technology bit set
reads duplicated:       0       # PCR or optical duplicate bit set
reads MQ0:      0       # mapped and MQ=0
reads QC failed:        0
non-primary alignments: 0
total length:   13222531        # ignores clipping
bases mapped:   13222531        # ignores clipping
bases mapped (cigar):   13222531        # more accurate
bases trimmed:  0
bases duplicated:       0
mismatches:     25158   # from NM fields
error rate:     1.902661e-03    # mismatches / bases mapped (cigar)
average length: 48
maximum length: 50
average quality:        38.3
insert size average:    231.6
insert size standard deviation: 188.8
inward oriented pairs:  118009
outward oriented pairs: 1287
pairs with other orientation:   0
pairs on different chromosomes: 0

You will remove duplicated reads (which likely are PCR duplicates) and collect detailed insert size metrics.

module load picard/2.23.4

java -Xmx64G -jar $PICARD_HOME/picard.jar MarkDuplicates -I SRR891268_hg38.bowtie2.q30.sorted.noM.bam -O SRR891268_hg38.bowtie2.q30.sorted.noM.rmdup.bam -M dedup_metrics.txt -VALIDATION_STRINGENCY LENIENT -REMOVE_DUPLICATES true -ASSUME_SORTED true

java -Xmx64G -jar $PICARD_HOME/picard.jar CollectInsertSizeMetrics -I SRR891268_hg38.bowtie2.q30.sorted.noM.rmdup.bam -O SRR891268_insert_size_metrics.txt -H SRR891268_insert_size_histogram.pdf -M 0.5

View the resulting histogram of insert sizes SRR891268_insert_size_histogram.pdf. Generating this important QC plot is only possible for PE libraries. Could you guess what the peaks at approximately 50bp, 200bp, 400bp and 600bp correspond to?

To give some context compare to plots on Figure 2.

Figure 2. Examples of insert size distribution for ATAC-seq experiments.¶
Naked DNA	Failed ATAC-seq	Noisy ATAC-seq	Successful ATAC-seq

Peak Calling ¶

We have now finished the data preprocessing. Next, to find regions corresponding to potential open chromatin regions, we want to identify regions where reads have piled up (peaks) greater than the background read coverage.

The tools which are currently used are Genrich and MACS2. Genrich has a mode dedicated to ATAC-Seq (however, Generich is still not published), and MACS2 which is designed for ChIP-seq rather than ATAC-seq; both are presented here. The differences between these two are discussed here.

It is very important at this point that we center the reads on the 5’ extremity (read start site) as this is where Tn5 cuts. You want your peaks around the nucleosomes and not directly on the nucleosome. However, if we only assess the coverage of the start sites of the reads, the data would be too sparse and it would be impossible to call peaks. Thus, we will extend the start sites of the reads by 100bp (50 bp in each direction) to assess coverage. This is performed automatically by Genrich, and using command line options extsize and shift in MACS2.

Figure 3. Scheme of ATAC-Seq reads relative to nucleosomes.¶

When Tn5 cuts an accessible chromatin locus it inserts adapters separated by 9bp, see Figure 4. This means that to have the read start site reflect the centre of where Tn5 bound, the reads on the positive strand should be shifted 4 bp to the right and reads on the negative strand should be shifted 5 bp to the left as in Buenrostro et al. 2013.

Figure 4. Nextera Library Construction.¶

Genrich ¶

Genrich can apply the read shifts when ATAC-seq mode -j is selected.

mkdir ../genrich
cd ../genrich

# we link the pre-processed bam file
ln -s ../bam/SRR891268_hg38.bowtie2.q30.sorted.noM.rmdup.bam


# in case not already loaded
module load bioinfo-tools
module load samtools/1.8

# sort the bam file by read name (required by generich)
samtools sort -n -o SRR891268_hg38.nsort.bam -T sort.tmp SRR891268_hg38.bowtie2.q30.sorted.noM.rmdup.bam


/sw/courses/epigenomics/ATACseq_bulk/software/Genrich/Genrich -j -t SRR891268_hg38.nsort.bam  -o SRR891268_genrich.narrowPeak

The output file produced by Genrich is in ENCODE narrowPeak format, listing the genomic coordinates of each peak called and various statistics. You are already familiar with this format from the tutorial on ChIP-seq data processing.

chr start end name score strand signalValue pValue qValue peak

signalValue - Measurement of overall (usually, average) enrichment for the region.
pValue - Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.
qValue - Measurement of statistical significance using false discovery rate (-log10). Use -1 if no qValue is assigned.

How many peaks were detected?

wc -l SRR891268_genrich.narrowPeak
130617 SRR891268_genrich.narrowPeak

Unfortunately, Genrich does not work very well with our small training dataset (every covered region is called a peak). This is because most of the data is on chr22 whereas the background model was built on the whole genome (Genrich consideres length of all reference seqences included in bam header).

Let’s try again with properly prepared bam file, i.e such that the header contains only the chromosome of interest (chr22).

# we need indexed bam
samtools index SRR891268_hg38.bowtie2.q30.sorted.noM.rmdup.bam

#subset bam and change header
samtools view -h SRR891268_hg38.bowtie2.q30.sorted.noM.rmdup.bam chr22 | grep -P "@HD|@PG|chr22" | samtools view -Shbo SRR891268_hg38.chr22_rh.bam


# sort by read name
samtools sort -n -o SRR891268_hg38.nsort.chr22_rh.bam -T sort.tmp SRR891268_hg38.chr22_rh.bam

# call peaks
/sw/courses/epigenomics/ATACseq_bulk/software/Genrich/Genrich -j -t SRR891268_hg38.nsort.chr22_rh.bam  -o SRR891268_chr22_genrich.narrowPeak

#how many peaks
wc -l SRR891268_chr22_genrich.narrowPeak
1017 SRR891268_chr22_genrich.narrowPeak

SRR891268_chr22_genrich.narrowPeak

base) [agata@rackham3 genrich]$ head SRR891268_chr22_genrich.narrowPeak
chr22   10780284        10780524        peak_0  1000    .       281.787170      3.991808        -1      65
chr22   10780861        10781385        peak_1  601     .       314.802521      3.122990        -1      64
chr22   11035932        11036136        peak_2  1000    .       340.313049      4.330980        -1      119
chr22   11628851        11629079        peak_3  947     .       216.004501      3.597563        -1      140
chr22   17056082        17056924        peak_4  479     .       403.322388      3.122990        -1      59
chr22   17084605        17085951        peak_5  1000    .       1812.207520     6.269803        -1      347
chr22   17087584        17087987        peak_6  775     .       312.297943      3.597563        -1      268
chr22   17098352        17098515        peak_7  1000    .       202.232758      3.991808        -1      88
chr22   17158386        17159859        peak_8  1000    .       1956.280884     5.569584        -1      864
chr22   17171258        17172080        peak_9  1000    .       1497.645020     6.110136        -1      630

MACS ¶

We need to convert BAM file to BEDPE to correctly apply read shifts to center fragments on the insertion sites.

mkdir ../macs
cd ../macs

module load BEDTools/2.25.0
bedtools bamtobed -bedpe -i ../genrich/SRR891268_hg38.nsort.chr22_rh.bam >SRR891268_22_pe.bed

module load MACS/2.2.6
macs2 callpeak -t SRR891268_22_pe.bed -n SRR891268_macs_chr22_bedpe -f BEDPE -g 50818468 --nomodel --extsize 100 --shift -50 --call-summits

We chose genome size -g 50818468 - because it is the length of chromosome 22, which is the only one included in the bam file.

Please note that we selected --extsize 100 to match the behaviour of Genrich. Normally --extsize 200 would be selected. --shift needs to be minus half of the size of --extsize to be centered on the 5’, so normally -100. --shift -100 --extsize 200 will amplify the cutting sites’ enrichment from ATAC-seq data. So in the end, the peak is where Tn5 transposase likes to attack.

How many peaks were detected?

wc -l SRR891268_macs_chr22_bedpe_peaks.narrowPeak
126 SRR891268_macs_chr22_bedpe_peaks.narrowPeak

Not an impressive number of peaks, and an order of magnitude less than what Genrich has detected.

SRR891268_macs_chr22_bedpe_peaks.narrowPeak

(base) [agata@rackham3 macs]$ head SRR891268_macs_chr22_bedpe_peaks.narrowPeak
chr22   17084956        17085006        SRR891268_macs_chr22_bedpe_peak_1       34      .       5.56950 7.93216 3.42459 25
chr22   17159228        17159296        SRR891268_macs_chr22_bedpe_peak_2       15      .       4.02477 5.24479 1.59371 8
chr22   17304005        17304074        SRR891268_macs_chr22_bedpe_peak_3       47      .       6.19006 9.93479 4.79986 34
chr22   17594848        17594902        SRR891268_macs_chr22_bedpe_peak_4       39      .       5.81619 8.61915 3.91236 26
chr22   17628785        17628887        SRR891268_macs_chr22_bedpe_peak_5       24      .       4.44050 6.62287 2.46661 23
chr22   18906171        18906221        SRR891268_macs_chr22_bedpe_peak_6       19      .       3.69554 5.74542 1.91218 18
chr22   19122587        19122695        SRR891268_macs_chr22_bedpe_peak_7       45      .       6.38978 9.49147 4.53140 71
chr22   19144817        19144867        SRR891268_macs_chr22_bedpe_peak_8       44      .       6.06021 9.42748 4.49093 22
chr22   19419994        19420044        SRR891268_macs_chr22_bedpe_peak_9       16      .       3.63053 5.38505 1.67296 15
chr22   19432330        19432380        SRR891268_macs_chr22_bedpe_peak_10      25      .       4.86770 6.74392 2.54948 33

Comparing results of MACS and Genrich ¶

How many peaks actually overlap?

cd ..

bedtools intersect -a macs/SRR891268_macs_chr22_bedpe_peaks.narrowPeak  -b genrich/SRR891268_chr22_genrich.narrowPeak  -f 0.50 -r >peaks_common.bed

wc -l peaks_common.bed
125 peaks_common.bed

Inspetion of the peak tracks in IGV reveals small differences in peaks called by MACS and Genrich. The very shallow signal in this example does not produce peaks of good quality by neither method. Usually MACS tends to detect many shorter peaks whereas Genrich tends to merge these shorter peaks into longer intervals. This short example demonstartes how important is to obtain data with sufficient sequencing depth, to avoid issues with analysis (or to arrive at having data impossible to analyse).

Below is zoom on chr22:46,033,366-46,038,084 one of the locations where both MACS and Genrich found a peak.

Figure 5. Visalisation of read alignments and peaks detected by Genrich and MACS.¶

(Tutorial modified from https://training.galaxyproject.org/topics/epigenetics/tutorials/atac-seq/tutorial.html)

ATAC-seq¶

Data¶

Alignment Processing and QC¶

Peak Calling¶

Genrich¶

MACS¶

Comparing results of MACS and Genrich¶