.. the role below allows raw HTML syntax to be used, for example :raw-html:`<br />`
.. role:: raw-html(raw)
    :format: html


==============================
Public ChIP-seq resources
==============================

**Aim of this exercise**

- To see what public datasets are out there

- To be able to find and download public ChIP-seq data in useful formats

This exercise is mostly run in a web browser, so it’s easiest to run it on your local computer.

.. Contents
.. =========

.. contents::
    :local:


1. ENCODE
===========

`ENCODE project website <https://www.encodeproject.org>`_

Besides ENCODE data, the ENCODE data portal also contains data from the Roadmap Epigenome project, as well as the modENCODE & modERN projects (for fly and worm).

To explore this data repository, go to the `ENCODE project website <https://www.encodeproject.org>`_ and select *Data* and then *Search*. Say that we want to see all data sets for the histone mark **H3K27me3**. Start by typing ``"H3K27me3"`` in the search box in the top right corner. **How many results do you see?** This number refers to everything in the ENCODE database: experiments, series of experiments, publications etc.

You can select subsets of the results from the panel on the left. Select *Experiment* to only see experiments. **How many results do you see now? Are all these ChIP-seq experiments?**

Now, let’s make a finer selection: Select only released experiments (from *Quality*), and then only experiments on human material. **How many experiments do we have now?**

Perhaps we are only interested in data from the brain. Under *Organ*, select *brain*. **How many experiments do we have now?**

You can see a list of all experiments to the right. Click on the first one, and open the page in a new browser window or tab. This will take you to a page describing this experiment, and what protocols and analysis pipelines were used. If you scroll down and select the tab *File details* you will see a list of all files that are available for download.

**Do you know what these files are?** Try downloading some if you want to. But since some of these files are large, remember to remove them when you are done looking at them.

Now we have seen how to find and download a single ChIP-seq data set. If we instead want to download all H3K27me3 data from the human brain mapped to the human genome, go back to the ENCODE search page, where you had selected the relevant experiments. Then click the button *Download*. This will download a file to your computer.

Open this file. It contains URLs to all data files for the selected experiments. If we want to download e.g. only the bed files with peaks that are stable in both replicates of each experiment, we need to do some extra steps.
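
To get a quick overview of this file from the command line, something like the following sketch can be used. It assumes that the downloaded file is called ``files.txt`` (adjust this if your browser saved it under another name) and that every data file URL ends in the file name. ``summarise_url_list`` is a helper name made up for this example.

.. code-block:: bash

	# Hypothetical helper: summarise a download list from the ENCODE portal.
	# Assumes line 1 is the metadata URL and every other line is a data file URL.
	summarise_url_list() {
	    local list=$1
	    echo "Metadata URL: $(head -n 1 "$list")"
	    echo "Number of data files per file type:"
	    # Keep the file name (after the last "/"), then drop the part before the first "."
	    tail -n +2 "$list" | sed 's|.*/||; s|^[^.]*\.||' | sort | uniq -c
	}

	# Assumed file name; only runs if the file is present
	if [ -f files.txt ]; then summarise_url_list files.txt; fi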

First, we download the metadata table. The URL is the first line in the file you just downloaded:

.. code-block:: bash

	wget "https://www.encodeproject.org/metadata/?status=released&biosample_ontology.organ_slims=brain&replicates.library.biosample.donor.organism.scientific_name=Homo+sapiens&searchTerm=H3K27me3&type=Experiment&files.analyses.status=released&files.preferred_default=true" -O meta_data.tsv
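
Before filtering, it can help to see which columns the table contains. This is a minimal sketch, assuming that the first line of ``meta_data.tsv`` is a tab-separated header. ``list_columns`` is a helper name made up here.

.. code-block:: bash

	# Hypothetical helper: print each column header together with its column number
	list_columns() {
	    # One header per line; -ba numbers every line, including empty headers
	    head -n 1 "$1" | tr '\t' '\n' | nl -ba
	}

	# Only runs if the table has been downloaded
	if [ -f meta_data.tsv ]; then list_columns meta_data.tsv; fi

The numbers printed in front of the headers are the column numbers that ``cut -f`` expects.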


You can open this file, e.g. in Excel, to have a look. We want to select all lines corresponding to bed files with stable/replicated peaks, using the GRCh38 genome, and save these in a new file ``meta_data_peak_files.tsv``:

.. code-block:: bash

	grep bed.gz meta_data.tsv | grep "stable\|replicated" | grep GRCh38 > meta_data_peak_files.tsv
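
Note that ``grep`` matches against the whole line, so a word such as "stable" would also be matched if it appeared in some other column. A column-aware alternative could look roughly like the sketch below. The helper ``filter_by_column`` is made up for this example, and the header name ``Output type`` in the usage lines is an assumption about the metadata table, so check the actual header of ``meta_data.tsv`` before relying on it.

.. code-block:: bash

	# Hypothetical helper: keep the header line plus all rows where the column
	# named $2 matches the regular expression $3
	filter_by_column() {
	    awk -F'\t' -v name="$2" -v pattern="$3" '
	        NR == 1 {
	            for (i = 1; i <= NF; i++) if ($i == name) col = i
	            if (!col) { print "no column named " name > "/dev/stderr"; exit 1 }
	            print; next
	        }
	        $col ~ pattern
	    ' "$1"
	}

	# "Output type" is an assumed header name; show the first few matching lines
	if [ -f meta_data.tsv ]; then
	    filter_by_column meta_data.tsv "Output type" "stable|replicated" | head -n 3
	fi

Unlike the ``grep`` pipeline, the output of this helper keeps the header line, so remove it (e.g. with ``tail -n +2``) before extracting URLs.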


From this file we can now get the URLs, which are in column 48:

.. code-block:: bash

	cut -d$'\t' -f48 meta_data_peak_files.tsv > meta_data_peak_files_urls.txt
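
A hard-coded column number gives the wrong column if the table layout differs from the one assumed here, so it can be safer to look the column up by its header. This sketch assumes that the URL column in ``meta_data.tsv`` is headed ``File download URL``. ``column_index`` is a made-up helper name.

.. code-block:: bash

	# Hypothetical helper: print the number of the column whose header is exactly $2
	column_index() {
	    head -n 1 "$1" | tr '\t' '\n' | grep -n -x -F -- "$2" | cut -d: -f1
	}

	if [ -f meta_data.tsv ] && [ -f meta_data_peak_files.tsv ]; then
	    # "File download URL" is an assumed header name
	    url_col=$(column_index meta_data.tsv "File download URL")
	    # Only extract the column if the header was actually found
	    if [ -n "$url_col" ]; then
	        cut -f"$url_col" meta_data_peak_files.tsv > meta_data_peak_files_urls.txt
	    fi
	fi

The header is read from ``meta_data.tsv`` because the filtered table no longer contains the header line.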


Finally, we can now download all these files:

.. code-block:: bash

	wget -i meta_data_peak_files_urls.txt


This will download all peak files. These will still have non-informative names, e.g. ``ENCFF591RMN.bed.gz``. To see which experiment each file corresponds to, look in ``meta_data_peak_files.tsv``.
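
For a single file, this lookup can also be done on the command line. The sketch below assumes that the file accession is in the first column of the table and that ``meta_data.tsv`` still has its header line. ``describe_file`` is a made-up helper name.

.. code-block:: bash

	# Hypothetical helper: print "header: value" for every non-empty field of the
	# row whose first column equals the accession $2
	describe_file() {
	    awk -F'\t' -v acc="$2" '
	        NR == 1 { for (i = 1; i <= NF; i++) header[i] = $i; next }
	        $1 == acc { for (i = 1; i <= NF; i++) if ($i != "") print header[i] ": " $i }
	    ' "$1"
	}

	# Look up the example file mentioned above; only runs if the table is present
	if [ -f meta_data.tsv ]; then describe_file meta_data.tsv ENCFF591RMN; fi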


Roadmap epigenomics
-----------------------

There are several ways to download data from the Roadmap Epigenomics web site. But since these data sets are also available through ENCODE, it's probably easiest to use the ENCODE web site. In case you want to have a look, these pages also host the Roadmap Epigenomics data:

`Roadmap Epigenomics <https://www.ncbi.nlm.nih.gov/geo/roadmap/epigenomics/>`_


2. Cistrome
================

`Cistrome <http://cistrome.org/db/#/>`_

Cistrome is another database where ChIP-seq data has been collected and processed uniformly. This is a good complement to the big projects (ENCODE & Roadmap Epigenomics) since it collects data from many smaller studies.

As an example, we will look for data on the three human transcription factors *Grhl1*, *Grhl2* and *Grhl3*. Grhl stands for “Grainy head like”, which means that these proteins are similar to the *Grainy head* protein first found in the fruit fly. They are involved in development and wound healing, and have been implicated in hearing loss and cancer.

To see which ChIP-seq data sets are available for the Grhl proteins, go to `Cistrome <http://cistrome.org/db/#/>`_, type "Grhl" in the search box and click on *Search*. We can then refine the search further by selecting *Homo sapiens* under *Species*. **Which Grhl proteins do we find data for?**

There is also an option to filter data on quality measures. To try this, click on *Options* and then *Samples passing peak quality controls*. **Which Grhl proteins do we still have data for after this filtering?**

Now, select the first data set in the list. It will be highlighted in blue. Scroll down to see details about this sample. Select the tab *QC reports*. **Can you make sense of this information? Does this look like an experiment that worked?**

In the Cistrome database, motif finding programs were run on all transcription factor data sets. Select the tab *QC motifs*, and have a look at some of the top motifs. **Do they look similar to the known Grhl site? Can you find what the Grhl site is?** Hint: Go to JASPAR and search for "Grhl".

It’s also possible to download files from Cistrome. For each experiment, bigwig files with the read coverage signal and bed files with peaks are available. Note that batch download is not available for manually selected sets of experiments, only for predefined batches such as all human transcription factor data or all mouse chromatin data.

.. ----

.. Written by: Jakub Westholm
.. rst by: Agata Smialowska