Signal visualisation

Learning outcomes

  • Using deepTools to visualise ChIP / ATAC signal in relation to annotated TSS

  • Using pyGenomeTracks to plot genome browser tracks



Before we start

Analysis of ATAC/ ChIP-seq data wouldn’t be complete without visualising the signal in many ways at various stages of the analysis.

Signal can be visualised and summarised in relation to annotated features, such as in deepTools part of this tutorial, and this may serve as a diagnostic tool (i.e. we expect to see higher density of signal in transcription start site regions), for data exploration (i.e. we can detect features with various signal distribution patterns) or for plotting final figures.

We can also use tools to produce visually attractive plots of genomic tracks, such as ones we have inspected using Integrative Genome Browser during the tutorials. We show you how to start with these type of visualisations in section on pyGenomeTracks.

In both cases the options for plot customisation are many, and we encourage you to explore the parameters.


Signal visualisation with deepTools

We can visualise signal in relation to annotated features.

One such kind of features relevant for TFs are transcription start sites (TSS). In this exercise we use ChIP-seq data from ChIPseq lab. The data we are going to plot is from one replicate of ChIP-seq experiment investigating binding of REST transcriptional represssor in HeLa cells. and annotations for chromosomes 1 and 2. To do so we will:

  • produce coverage track in bedgraph format;

  • convert bedgraph to bigWig using UCSC utilities;

  • calculate scores per genome regions using the bigWig file;

  • plot a heatmap of scores associated with genomic regions

Setting up

We first link necessary files. Assuming we are in the home directory:

mkdir vis
cd vis

mkdir deepTools
cd deepTools

ln -s /proj/epi2023/vis/deepTools/chrom.sizes.hg19
ln -s /proj/epi2023/vis/deepTools/refGene_hg19_TSS_chr12_sorted_corr.bed
ln -s /proj/epi2023/vis/deepTools/ENCFF000PED.chr12.rmdup.sort.bam
ln -s /proj/epi2023/vis/deepTools/ENCFF000PED.chr12.rmdup.sort.bam.bai

Normalised coverage tracks

Let’s start from generating a normalised coverage track in a format called bedgraph from bam file. The bam file contains data subset to chr1 and chr2, hence we use effectiveGenomeSize 492449994 (length of chr1 and chr2). The data is SE, so we extend the reads to average fragment length ..(which we know from the Cross correlation) (which we know from the Cross correlation) using parameter --extendReads 110. Last but not least, we scale the track to average 1x coverage using option --normalizeUsing RPGC.

#if required
module load bioinfo-tools

module load deepTools/3.3.2

bamCoverage --bam ENCFF000PED.chr12.rmdup.sort.bam \
 --outFileName ENCFF000PED.chr12.cov.norm1x.bedgraph \
 --normalizeUsing RPGC --effectiveGenomeSize 492449994 --extendReads 110 \
 --binSize 50 --outFileFormat bedgraph

This track can be used for various visualisations and comparisons.

We can convert it to another format bigWig:

module load ucsc-utilities/v398

bedGraphToBigWig ENCFF000PED.chr12.cov.norm1x.bedgraph chrom.sizes.hg19 hela_1.bw

module unload ucsc-utilities

Hint

If these above steps did not work, you can link the precomputed coverage tracks:

ln -s /proj/epi2023/vis/deepTools/ENCFF000PET.cov.norm1x.bedgraph
ln -s /proj/epi2023/vis/deepTools/hela_1.bw

Plotting signal in relation to TSS

We begin by summarising coverage in bins in relation to a set of reference points, TSS in our case.

We can compute the matrix of scores for visualisation using computeMatrix. This tool calculates scores per genome regions and prepares an intermediate file that can be used with plotHeatmap and plotProfiles.

Typically, the genome regions are genes (or TSS as in this case), but any other regions defined in a BED file can be used. computeMatrix accepts multiple score files (bigWig format) and multiple regions files (BED format). This tool can also be used to filter and sort regions according to their score.

module load deepTools/3.3.2

computeMatrix reference-point -S hela_1.bw \
-R refGene_hg19_TSS_chr12_sorted_corr.bed -b 5000 -a 5000 \
--outFileName matrix.tss.dat --outFileNameMatrix matrix.tss.txt \
--referencePoint=TSS -p 10

We can now create a heatmap for scores associated with genomic regions, i.e. plot the binding profile around TSS

plotHeatmap --matrixFile matrix.tss.dat \
--outFileName tss.hela_1.pdf \
--sortRegions descend --sortUsing mean

Have a look at tss.hela_rep1.pdf. Can this plot be improved?

This is a very basic plot. We can add on to it, for example we can cluster genes based on the signal profile around TSS. For more possibilities please check plotHetmap.

plotHeatmap --matrixFile matrix.tss.dat \
--outFileName tss.hela_rep1_k3.pdf \
--sortRegions descend --sortUsing mean \
--kmeans 3


You can also use the same tools to plot signal along a scaled gene body using computeMatrix scale-regions. More examples are given on deepTools homepage.


Plotting Tracks using pyGenomeTracks

pyGenomeTracks can be used to plot browser tracks in multiple formats. In this tutorial we will plot ATAC-seq data from ATACseq lab along with detected peaks and gene models.

The process has two steps, first we define the formats in file track.ini and next we plot the desired regions. Although there is a learning curve to using pyGenomeTracks, and it often requires few tries to get the settings right, this is a convenint and reproducible manner to produce identically formatted plots of multiple regions.

Let’s link the necessary files and produce coverage tracks (assuming we are in vis):

mkdir pyGT
cd pyGT

ln -s /proj/epi2023/vis/pyGT/ENCFF398QLV.chr14.norm1x.bedgraph
ln -s /proj/epi2023/vis/pyGT/ENCFF045OAB.chr14.norm1x.bedgraph

ln -s /proj/epi2023/vis/pyGT/nk_genrich.bed
ln -s /proj/epi2023/vis/pyGT/nk_stim_genrich.bed
ln -s /proj/epi2023/vis/pyGT/nk_macs_broad.bed
ln -s /proj/epi2023/vis/pyGT/nk_stim_macs_broad.bed
ln -s /proj/epi2023/vis/pyGT/ENCFF045OAB.macs.broad_peaks.broadPeak
ln -s /proj/epi2023/vis/pyGT/ENCFF045OAB.genrich.narrowPeak
ln -s /proj/epi2023/vis/pyGT/ENCFF398QLV.macs.broad_peaks.broadPeak
ln -s /proj/epi2023/vis/pyGT/ENCFF398QLV.genrich.narrowPeak

ln -s /proj/epi2023/vis/pyGT/hg38.refGene.gtf

cp /proj/epi2023/vis/pyGT/tracks1.ini .
cp /proj/epi2023/vis/pyGT/tracks2.ini .

We can now create the track.ini file. You can check possible options in compatible tracks

We will visualise the following:

  • data as bedgraph

  • peaks as bed (narrowPeak and broadPeak)

  • gene models as gtf

We need to know the paths to files, let’s check the current directory:

pwd

In my case it was /proj/epi2022/nobackup/agata/tests/vis/pyGT, yours will be different, so substitute acccordingly.

Let’s build a simple ini file:

[x-axis]
where = top

[spacer]
height = 0.3

[bedgraph]
file = /proj/epi2022/nobackup/agata/tests/vis/pyGT/ENCFF398QLV.chr14.norm1x.bedgraph
# height of the track in cm (optional value)
height = 4
title = ATAC-seq NK
color = mediumturquoise
min_value = 0

[spacer]
height = 0.3

[bedgraph]
file = /proj/epi2022/nobackup/agata/tests/vis/pyGT/ENCFF045OAB.chr14.norm1x.bedgraph
# height of the track in cm (optional value)
height = 4
title = ATAC-seq stimulated NK
color = forestgreen
min_value = 0

[spacer]
height = 0.5

[bed]
file = /proj/epi2022/nobackup/agata/tests/vis/pyGT/nk_genrich.bed
# height of the track in cm (optional value)
height = 4
title = consensus Genrich peaks in NK
color = lightpink

[spacer]
height = 0.3

[bed]
file = /proj/epi2022/nobackup/agata/tests/vis/pyGT/nk_stim_genrich.bed
# height of the track in cm (optional value)
height = 4
title = consensus Genrich peaks in stimulated NK
color = crimson

[spacer]
height = 0.5

[genes]
file = /proj/epi2022/nobackup/agata/tests/vis/pyGT/hg38.refGene.gtf
file_type = gtf
title = gene models
style = flybase
arrow_interval = 3
display = stacked
fontsize = 10
gene_rows = 10
height = 7
all_labels_inside = true
merge_transcripts = true

Hint

It is generally not advised to edit files used on Linux systems in word processing editors such as MsWord and similar (due to meta characters added by them for formatting purposes - they may not be visible, but they are present in text copied directly from such editors. For generating the *ini files in this example, and general script writing, it is recommended to use text editors developed for programming. Not only they do not add any invisible characters to the text, but often include convenient utilities such as syntax highlighting for a wide choice of programming languages.

One example of such editor is Sublime.

You can copy the contents of tracks.ini to the editor, modify the paths and paste back to rackham:

#create a file and open a simple editor
nano tracks.ini

# now copy the file contents

#to close and save the file
Ctrl-X
#to save under given name press Y, then "enter"

This file is available as tracks1.ini.

pyGenomeTracks is installed via a conda environment, so we activate it first

#unloading module python may be necessary
module unload python
module load conda/latest
conda activate /sw/courses/epigenomics/software/conda/pygenometracks3_6

Let’s plot one of the regions we have viewed in the ATAC-seq peak detection part chr14:93,095,621-93,125,599

pyGenomeTracks --tracks tracks1.ini --region chr14:93095621-93125599  --trackLabelFraction 0.2 --dpi 130 -o plot1.png

We can plot wider region using the same settings

pyGenomeTracks --tracks tracks1.ini --region chr14:93295621-93725599  --trackLabelFraction 0.2 --dpi 130 -o plot2.png

Let’s tweak some settings in the ini file.

  • We can add max_value to bedgraph tracks to use the same scale for both samples;

  • We can change the style of gene models display to style = UCSC

Modified file:

[x-axis]
where = top

[spacer]
height = 0.3

[bedgraph]
file = /proj/epi2022/nobackup/agata/tests/vis/pyGT/ENCFF398QLV.chr14.norm1x.bedgraph
# height of the track in cm (optional value)
height = 4
title = ATAC-seq NK
color = mediumturquoise
min_value = 0
max_value = 600

[spacer]
height = 0.3

[bedgraph]
file = /proj/epi2022/nobackup/agata/tests/vis/pyGT/ENCFF045OAB.chr14.norm1x.bedgraph
# height of the track in cm (optional value)
height = 4
title = ATAC-seq stimulated NK
color = forestgreen
min_value = 0
max_value = 600

[spacer]
height = 0.5

[bed]
file = /proj/epi2022/nobackup/agata/tests/vis/pyGT/nk_genrich.bed
# height of the track in cm (optional value)
height = 1
title = consensus Genrich peaks in NK
color = lightpink

[spacer]
height = 0.3

[bed]
file = /proj/epi2022/nobackup/agata/tests/vis/pyGT/nk_stim_genrich.bed
# height of the track in cm (optional value)
height = 1
title = consensus Genrich peaks in stimulated NK
color = crimson

[spacer]
height = 0.5

[genes]
file = /proj/epi2022/nobackup/agata/tests/vis/pyGT/hg38.refGene.gtf
file_type = gtf
title = gene models
style = UCSC
arrow_interval = 3
display = stacked
fontsize = 10
gene_rows = 10
height = 7
all_labels_inside = true
merge_transcripts = true

This file is available as tracks2.ini.

Hint

To modify a file we can use a simple text editor present on most unix / linux distributions nano.

Type nano tracks2.ini and you can edit the file. To save press Ctrl-X, confirm y, change file name if you like.

We can plot much wider region using these new settings:

pyGenomeTracks --tracks tracks2.ini --region chr14:91295621-95725599  --trackLabelFraction 0.2 --dpi 130 -o plot3.png

And so on, until we are satisfied with the figure.

There are more files linked in your working directory, you can try to visualise some of them. Try to select other regions, too.

Important List of available colours can be found at https://matplotlib.org/stable/gallery/color/named_colors.html .