In-situ methods: CUT&RUN and ChIP data for CTCF

Background 

CTCF (CCCTC-binding factor) is a very general architectural protein that mediates interchromosomal or intrachromosomal interactions. By recruiting additional structural proteins like condensin, they create chromatin loops and contact points. CTCF is a sequence-specific binding factor with a well-known sequence motif. There are 55,000-65,000 theoretical binding sites in the human genome (predicted by the consensus sequence). Of these, 5000 appear to be occupied by CTCF in every tissue/cell type, whereas up to 60% are used cell-type specifically. DNA methylation modulates CTCF binding (lowering CTCF affinity). Given the plethora of knowledge and datasets available one would think that there is a universal agreement on where CTCF binds in which cell type. However, different ChIP-Seq data and alternative profiling methods actually give different qualitative and certainly quantitative answers.

So we will look at a comparison of CUT&RUN and different ChIP datasets to understand why and how different datasets report different binding sites.

Fig. 1: Features of CTCF binding sites in the genome [1].

Datasets 

The datasets used correspond to two publications, Skene and Henikoff [2] and Pugacheva et al. [3]. They can be found on GEO (GSE84474 and GSE137216). See below a table with the samples included in this example:

Table 1. Files in Skene et al. 2017 used in this exercise.
GEO Accession	Sample
GSM2247150	NBIS_Skene2017_K562_CTCF_ChIP_loMN
GSM2247152	NBIS_Skene2017_K562_CTCF_ChIP_medMN
GSM2247154	NBIS_Skene2017_K562_CTCF_ChIP_hiMN
GSM2433137	NBIS_Skene2017_K562_CTCF_CnR_15s
GSM2433139	NBIS_Skene2017_K562_CTCF_CnR_3m
GSM2433140	NBIS_Skene2017_K562_CTCF_CnR_45s
GSM2433141	NBIS_Skene2017_K562_CTCF_CnR_5s
GSM2433142	NBIS_Skene2017_K562_CTCF_CnR_9m

In Skene and Henikoff, they compare CUT&Run to their own ChIP (special optimized native ChIP for CTCF they say). In Pugacheva et. all, they perform a crosslinking ChIP for CTCF. They’ve been using different antibodies just to be sure that they are ChIPing the right thing.

Table 2. Files in Pugacheva et al. 2020 used in this exercise.
GEO Accession	Sample
GSM4073237	NBIS_Pugacheva2020_K562_ChIP_CTCF_MonoC_Abs
GSM4073240	NBIS_Pugacheva2020_K562_ChIP_CTCF_MonoN_Abs_Rep1
GSM4073238	NBIS_Pugacheva2020_K562_ChIP_CTCF_RabbitC_Abs
GSM4073242	NBIS_Pugacheva2020_K562_ChIP_Mix_of_CTCF_PolyC_Abs
GSM4073239	NBIS_Pugacheva2020_K562_ChIP_IGG_Abs_Control

You can find many more CTCF datasets In the Cistrome database!

Data preprocessing 

Primary analysis of the initial FASTQ files was performed beforehand. Reads were mapped with bowtie2 with default parameters. Resulting BAM files were deduplicated using Picard and blacklisted regions were removed. Peak files were generated using MACS2 using very standard parameters: --qval cutoff 0.01. In this case no replicates were used to call the peaks. BigWig files were generated with deepTools to 1x coverage. Resulting files are available under /sw/courses/epigenomics/quantitative_chip_simon/K562_CTCF_CnR

Attention

You can download bigWig and peak annotations. Most of what we are going to do can be done locally on a regular laptop. When this is not the case, Uppmax-specific instructions will be given. In case something does not work properly, the output of most of these is available also in the workshop folder and in this documentation.

When running things on Uppmax, copy the files to your home directory:

cd
mkdir -p cnr_chip/bw
cd cnr_chip/bw

cp /sw/courses/epigenomics/quantitative_chip_simon/K562_CTCF_CnR/bw/*.bw .

# Or you can create symlinks for the bigWig files instead:
for i in /sw/courses/epigenomics/quantitative_chip_simon/K562_CTCF_CnR/bw/*.bw; do ln -s ${i}; done

Manual inspection of bigwig profiles with IGV 

Besides quality control tools and exploratory analyses, it is usually a good idea to have a look at the data without any further processing to have an idea about how it looks like. It can help spot technical issues or processing problems.

Attention

When running these commands, the same relative path structure is expected to be kept. So if you made a directory for this tutorial, be sure you are in that directory when starting.

If you have not done it yet, download a copy of the bigWig and peaks to your local laptop.

# Wherever you have this directory, it's from now on your main working directory
mkdir cnr_chip
cd cnr_chip

mkdir bw
mkdir peaks

scp <youruser>@rackham.uppmax.uu.se:/sw/courses/epigenomics/quantitative_chip_simon/K562_CTCF_CnR/bw/*.bw ./bw/

scp <youruser>@rackham.uppmax.uu.se:/sw/courses/epigenomics/quantitative_chip_simon/K562_CTCF_CnR/peaks/*.narrowPeak ./peaks/

Open IGV and import the bigWig files (file > load from file…). You should see something like this:

Q: Note down some observations concerning following questions:

How are the samples similar and how are they different?
Can you make a prediction about the peak calling result?

Go to this site chr1:27,389,733-27,973,162 (corresponding to Figure 8 ).

Q: Are peaks shared, or unique to a specific technique?

Quality Control 

Quality Control metrics are very important to understand whether the experiment worked or not and spot possible caveats that may come up before further analysis. As mentioned in other tutorials, some metrics that can be useful are:

Library complexity estimates and duplication rates.
Sample clustering (PCA, correlation).
Insert size distribution.
Cumulative enrichment.

Correlation plots 

One way to look at similarity between ChIP dataset is to partition the signal in bins of a fixed size and compute a pairwise correlation value between distributions.

This can be done using deepTools.

This is a time consuming step that would need to be done on Uppmax. Move to the directory where you copied bigWig files on Uppmax. If you followed the same names, it would be:

cd ~/cnr_chip

Load deepTools (and bioinfo-tools) module beforehand:

module load bioinfo-tools
module load deepTools

deepTools needs that you first compute a bin matrix. From this, many other things can be done, such as correlation plots and PCA:

# Labels basically follow alphabetical order (which is the way blob *.bw will expand).
# But if you want to be totally sure, it’s better to spell out the order, as there may be
# some surprises: https://unix.stackexchange.com/questions/368318/does-the-bash-star-wildcard-always-produce-an-ascending-sorted-list

multiBigwigSummary bins -b \
    ./bw/NBIS_Pugacheva2020_K562_ChIP_CTCF_MonoC_Abs.GRCh38.bw \
    ./bw/NBIS_Pugacheva2020_K562_ChIP_CTCF_MonoN_Abs_Rep1.GRCh38.bw \
    ./bw/NBIS_Pugacheva2020_K562_ChIP_CTCF_RabbitC_Abs.GRCh38.bw \
    ./bw/NBIS_Pugacheva2020_K562_ChIP_IGG_Abs_Control.GRCh38.bw \
    ./bw/NBIS_Pugacheva2020_K562_ChIP_Mix_of_CTCF_PolyC_Abs.GRCh38.bw \
    ./bw/NBIS_Skene2017_K562_CTCF_ChIP_hiMN.GRCh38.bw \
    ./bw/NBIS_Skene2017_K562_CTCF_ChIP_loMN.GRCh38.bw \
    ./bw/NBIS_Skene2017_K562_CTCF_ChIP_medMN.GRCh38.bw \
    ./bw/NBIS_Skene2017_K562_CTCF_CnR_15s.GRCh38.bw \
    ./bw/NBIS_Skene2017_K562_CTCF_CnR_3m.GRCh38.bw \
    ./bw/NBIS_Skene2017_K562_CTCF_CnR_45s.GRCh38.bw \
    ./bw/NBIS_Skene2017_K562_CTCF_CnR_5s.GRCh38.bw \
    ./bw/NBIS_Skene2017_K562_CTCF_CnR_9m.GRCh38.bw \
    -bs 5000 -p 4 -o ./bins_table.npz --outRawCounts ./bins_table.tab \
    --labels \
    Pugacheva2020_ChIP_MonoC \
    Pugacheva2020_ChIP_MonoN \
    Pugacheva2020_ChIP_RabbitC \
    Pugacheva2020_ChIP_Control \
    Pugacheva2020_ChIP_Mix \
    Skene2017_ChIP_hiMN \
    Skene2017_ChIP_loMN \
    Skene2017_ChIP_medMN \
    Skene2017_CnR_15s \
    Skene2017_CnR_3m \
    Skene2017_CnR_452 \
    Skene2017_CnR_5s \
    Skene2017_CnR_9m

Note

The parameter --outRawCounts is not necessary and usually not generated, as the same values are saved in bins_table.npz in a way they occupy less space. But raw counts are text, so you can basically peek at the values directly using head or more.

If something does not work properly or you are not sure if a command is going to work, it is a good idea to use the --region parameter, which will do the analysis only on a given genomic region, and will run faster (failing fast is a good philosophy that will save you loads of time on the computer).

For instance, if you add --region chr1:300000:900000 to the previous command:

multiBigwigSummary bins -b \
    ./bw/NBIS_Pugacheva2020_K562_ChIP_CTCF_MonoC_Abs.GRCh38.bw \
    ./bw/NBIS_Pugacheva2020_K562_ChIP_CTCF_MonoN_Abs_Rep1.GRCh38.bw \
    ./bw/NBIS_Pugacheva2020_K562_ChIP_CTCF_RabbitC_Abs.GRCh38.bw \
    ./bw/NBIS_Pugacheva2020_K562_ChIP_IGG_Abs_Control.GRCh38.bw \
    ./bw/NBIS_Pugacheva2020_K562_ChIP_Mix_of_CTCF_PolyC_Abs.GRCh38.bw \
    ./bw/NBIS_Skene2017_K562_CTCF_ChIP_hiMN.GRCh38.bw \
    ./bw/NBIS_Skene2017_K562_CTCF_ChIP_loMN.GRCh38.bw \
    ./bw/NBIS_Skene2017_K562_CTCF_ChIP_medMN.GRCh38.bw \
    ./bw/NBIS_Skene2017_K562_CTCF_CnR_15s.GRCh38.bw \
    ./bw/NBIS_Skene2017_K562_CTCF_CnR_3m.GRCh38.bw \
    ./bw/NBIS_Skene2017_K562_CTCF_CnR_45s.GRCh38.bw \
    ./bw/NBIS_Skene2017_K562_CTCF_CnR_5s.GRCh38.bw \
    ./bw/NBIS_Skene2017_K562_CTCF_CnR_9m.GRCh38.bw \
    -bs 5000 -p 4 -o ./bins_table.npz --outRawCounts ./bins_table.tab \
    --labels \
    Pugacheva2020_ChIP_MonoC \
    Pugacheva2020_ChIP_MonoN \
    Pugacheva2020_ChIP_RabbitC \
    Pugacheva2020_ChIP_Control \
    Pugacheva2020_ChIP_Mix \
    Skene2017_ChIP_hiMN \
    Skene2017_ChIP_loMN \
    Skene2017_ChIP_medMN \
    Skene2017_CnR_15s \
    Skene2017_CnR_3m \
    Skene2017_CnR_452 \
    Skene2017_CnR_5s \
    Skene2017_CnR_9m \
    --region chr1:300000:900000

It will run immediately if everything is working, and will produce a tiny dataset with 120 bins.

The .npz matrix is then used by deepTools to produce other plots. For our correlation plot:

plotCorrelation --corData ./bins_table.npz \
    --plotFile ./correlation_spearman_all.png \
    --whatToPlot heatmap \
    --corMethod spearman \
    --plotNumbers

This will generate a correlation plot based on genome-wide 5kb bins.

Q: Check out how the datasets cluster - does it make sense? Is the overall clustering following the biological target/control or underlying batch effect?

Note

If something here does not work, you can get the bins files from: cp /sw/courses/epigenomics/quantitative_chip_simon/K562_CTCF_CnR/tmp/bins* ..

Cumulative enrichment 

Also known as fingeprint plots, these give a feeling about the signal to noise ratio of each signal. You can understand more about what they exactly mean in deepTools documentation.

You can plot this with deepTools as well. This requires the BAM files and takes quite a bit to compute. You can symlink the bam files from: /sw/courses/epigenomics/quantitative_chip_simon/K562_CTCF_CnR/bam/ the same way as before:

# Or your preferred folder
cd ~/cnr_chip
mkdir bam
cd bam

# In this case symlink is strongly preferred, as these files are bigger
for i in /sw/courses/epigenomics/quantitative_chip_simon/K562_CTCF_CnR/bam/*; do ln -s ${i}; done

Attention

This is a time consuming step that is recommended to do on Uppmax. You can also leave this for later and look at the results.

You can plot them.

cd ~/cnr_chip

plotFingerprint -b \
  ./bam/NBIS_Skene2017_K562_CTCF_ChIP_hiMN.GRCh38.bam \
  ./bam/NBIS_Skene2017_K562_CTCF_ChIP_loMN.GRCh38.bam \
  ./bam/NBIS_Skene2017_K562_CTCF_ChIP_medMN.GRCh38.bam \
  ./bam/NBIS_Skene2017_K562_CTCF_CnR_15s.GRCh38.bam \
  ./bam/NBIS_Skene2017_K562_CTCF_CnR_3m.GRCh38.bam \
  ./bam/NBIS_Skene2017_K562_CTCF_CnR_45s.GRCh38.bam \
  ./bam/NBIS_Skene2017_K562_CTCF_CnR_5s.GRCh38.bam \
  ./bam/NBIS_Skene2017_K562_CTCF_CnR_9m.GRCh38.bam \
  -o ./fingerprint_Skene2017.pdf \
  --labels \
  Skene2017_ChIP_hiMN \
  Skene2017_ChIP_loMN \
  Skene2017_ChIP_medMN \
  Skene2017_CnR_15s \
  Skene2017_CnR_3m \
  Skene2017_CnR_452 \
  Skene2017_CnR_5s \
  Skene2017_CnR_9m \
  -p 4

Note

Again, here the --region trick also works to plot a subset of the data.

The resulting plot should look like:

Q: What do you think it means in terms of quality of experiments? Do you see different groups of samples? Go to IGV and browse through the tracks. How does each group look? Fingerprint supports the difference CUT&RUN authors argue - That it has less background.

Clearly, the CUT&Run data scores better by QC compared to the CTCF ChIP presented in this paper. But how about comparing to the Pugacheva CTCF ChIP datasets?

plotFingerprint -b \
  ./bam/NBIS_Pugacheva2020_K562_ChIP_CTCF_MonoC_Abs.GRCh38.bam \
  ./bam/NBIS_Pugacheva2020_K562_ChIP_CTCF_MonoN_Abs_Rep1.GRCh38.bam \
  ./bam/NBIS_Pugacheva2020_K562_ChIP_CTCF_RabbitC_Abs.GRCh38.bam \
  ./bam/NBIS_Pugacheva2020_K562_ChIP_IGG_Abs_Control.GRCh38.bam \
  ./bam/NBIS_Pugacheva2020_K562_ChIP_Mix_of_CTCF_PolyC_Abs.GRCh38.bam \
  -o ./fingerprint_Pugacheva2020.pdf \
  --labels Pugacheva2020_ChIP_MonoC Pugacheva2020_ChIP_MonoN Pugacheva2020_ChIP_RabbitC     Pugacheva2020_ChIP_Control Pugacheva2020_ChIP_Mix \
  -p 4

Q: How does Pugacheva CTCF ChIP measure up with the CUT&Run? Given that the Skene CTCF ChIP was done under native conditions (no crosslinker) and Pugacheva CTCF ChIP (formaldehyde crosslinked), what do you think could be the problem with the native ChIP?

Peak calling 

Peaks were called with MACS2 using standard parameters.

Attention

This is a step that could be better fine-tuned to specific experimental settings.

Again, it is usually a good idea to visually inspect the tracks, so you can have a feeling on whether the peaks were correctly called and how the samples look like.

Q: Given the QC you did above, does the peak calling confirm the quality differences amongst the samples? Does a higher signal/noise ratio allow to identify more peaks? Are peaks more confidently called?

Number of peaks per sample 

A simple wc count allows you to quickly check how many peaks you got:

NBIS_Pugacheva2020_K562_ChIP_Mix_of_CTCF_PolyC_Abs_peaks.narrowPeak
NBIS_Pugacheva2020_K562_ChIP_CTCF_MonoC_Abs_peaks.narrowPeak
NBIS_Pugacheva2020_K562_ChIP_CTCF_RabbitC_Abs_peaks.narrowPeak
NBIS_Pugacheva2020_K562_ChIP_CTCF_MonoN_Abs_Rep1_peaks.narrowPeak
 NBIS_Pugacheva2020_K562_ChIP_IGG_Abs_Control_peaks.narrowPeak

NBIS_Skene2017_K562_CTCF_CnR_9m_peaks.narrowPeak
NBIS_Skene2017_K562_CTCF_CnR_3m_peaks.narrowPeak
NBIS_Skene2017_K562_CTCF_CnR_452_peaks.narrowPeak
NBIS_Skene2017_K562_CTCF_ChIP_loMN_peaks.narrowPeak
NBIS_Skene2017_K562_CTCF_ChIP_medMN_peaks.narrowPeak
NBIS_Skene2017_K562_CTCF_CnR_15s_peaks.narrowPeak
NBIS_Skene2017_K562_CTCF_CnR_5s_peaks.narrowPeak
NBIS_Skene2017_K562_CTCF_ChIP_hiMN_peaks.narrowPeak

Peaks overlap using intervene 

intervene is an easy-to-use tool to look for overlaps between BED files. It relies on bedtools, but it saves some work when looking at different sets of files. You can install it using pip as they explain.

Attention

There is no intervene module on Uppmax. If you want to run it there, you can activate a conda environment that is precomputed: conda activate /sw/courses/epigenomics/quantitative_chip_simon/condaenv/intervene. Otherwise you can download the peaks files to your local computer and install intervene there, if you prefer.

You can generate venn diagrams (pairwise or more). For example, we may want to look at how much two of the CTCF ChIP peaks from Pugacheva 2020 agree:

cd cnr_chip
mkdir peaks
cp /sw/courses/epigenomics/quantitative_chip_simon/K562_CTCF_CnR/peaks/*.narrowPeak peaks/

conda activate /sw/courses/epigenomics/quantitative_chip_simon/condaenv/intervene

intervene venn --in ./peaks/NBIS_Pugacheva2020_K562_ChIP_CTCF_Mono*.narrowPeak

This will output a intervene_results folder with a pdf file:

As you can see there is a lot of overlap but MonoC dataset called much more peaks than MonoN.

Note that this can also be refined by setting overlap thresholds. See intervene documentation for the possibilities. The default behavior is to count any overlap as overlap.

A way of looking broadly at a set of BED files an the overlap between them is to do pairwise comparison:

intervene pairwise --in ./peaks/*.narrowPeak -o pairwise_results

will generate a plot like this in ./pairwise_results/.

Very different numbers of peaks!

Q: How do you think the number of peaks relates to the fingerprint? (the ones with the most accentuated fingerprint are the ones showing larger amounts of peaks, however the difference is not such for some of the CnR samples. Why can that be?

This has actually been noted in the CnR paper. They say that too scarce background read coverage can throw off traditional peak callers, thus they develop their own peak caller for CnR data.

Q: Now that you have peaks, think about what you could do with the peak information. How to make sense of the peaks? Would you use the dataset with the most or the least peaks for downstream analysis?

Note

MACS doesn’t just give peaks, it also assigns a score. High/confident peaks have high score. Small peaks have low score. It is not apparent from the analysis above, but it is quite likely that if you would pick the top 5000 scored peaks from each dataset, the overlap would be better.

Comparison between methods 

As an example, we are going to look at one peak set representative for each method: CnR_45s, CTCF_ChIP_medMN from Skene 2017 and ChIP_MonoC from Pugacheva 2020.

We can look at the overlap between them in a venn diagram:

There are quite some differences between approaches, and one of the peak sets is very small compared to the others.

Note

These venn diagrams are not size proportional. Automatically drawing size-proportional set intersections is a complex problem. eulerr is an R package that does a really nice job at approximating this. However it’s only for drawing, not for computing the actual intersections.

Heatmaps 

The fact that there are loci marked as peak in one dataset that do not appear in another does not mean that there is no signal there. It could be that a peak is not called due to lack of statistical power, the signal being weaker or a combination of other factors. Remember the example in the slides - some peaks don’t make the threshold, but still show enrichment:

So another way to look at this is to plot the signal of every dataset in a given peakset, for instance. This was run on a subset of 20000 peaks from Skane2017_CnR_45s.

Warning

These plots were generated in the past using seqplots. Unfortunately, this package is now deprecated. You can still use older versions of R and this package, or just generate plots like this using another tool, like deepTools. You can see more in their documentation. And there are also examples in other tutorials, such as the MINUTE-ChIP data tutorial in this section.

The subset file this was created on can be copied from the tmp folder:

scp <youruser>@rackham.uppmax.uu.se:/sw/courses/epigenomics/quantitative_chip_simon/K562_CTCF_CnR/tmp/Skene2017_CTCF_CnR_45s_small.bed .

Or from your Uppmax node:

cp /sw/courses/epigenomics/quantitative_chip_simon/K562_CTCF_CnR/tmp/Skene2017_CTCF_CnR_45s_small.bed .

Q: how does the data compare? Note the difference in scale amongst the samples. What other features are different? Which dataset would be better to identify the CTCF binding motif?

Note

In Skene and Henikoff paper, the authors make a biological interpretation regarding the difference in peaks seen in native ChIP and CUT&Run. They call the peaks uniquely present in CUT&Run “indirect binding sites” because they infer that those peaks are not directly bound by CTCF. The tagging of sites potentially in intact nuclei works through space, thus tagging not only the loci that are bound by CTCF, but also those regions that are nearby in space (Hi-C contacts). So it is important to note that CUT&Run may report ‘indirect’ or ‘shadow’ peaks that do not represent bona fide binding sites. For CTCF in principle the distinction should be easy since the ‘shadow’ peaks should not have CTCF binding motif.