PiGx ChIP-seq

Introduction

PiGx ChIPseq is an analysis pipeline for preprocessing, peak calling and reporting for ChIP sequencing experiments. It is easy to use and produces high quality reports. The inputs are fastq files containing reads from the sequencing experiment, and a configuration file which describes the experiment. In addition to quality control of the experiment, the pipeline enables to set up multiple peak calling analysis and allows the generation of a UCSC track hub in an easily configurable manner.

Workflow

PiGx ChIPseq implements best practices for preprocessing and analysis of ChIPseq data. Figure 1 provides an overview of the different steps of the pipeline, as well as the outputs.

First, raw reads are trimmed using TrimGalore! to ensure a minimum read quality, and removal of adapter sequences. Next, reads are aligned to a reference genome using Bowtie2, and peaks are called for any combination of samples using MACS2. The reproducibility of peaks among experiments is controlled using IDR. Coverage bigWig files are calculated for the aligned reads using rtracklayer, and peak files are converted to bigBed format files using bedToBigBed, to be integrated together into a UCSC track hub to be viewed in the UCSC Genome Browser. Finally, quality metrics are calculated for the ChIP experiments, such as inter strand cross correlation and GC content, and signal profiles are calculated for genomic features using genomation. Together with the frequency of reads in peaks and the distribution of peaks in genomic features these metrics are used to compile a custom ChIP quality control report.

Figure 1: An overview of the PiGx ChIPseq workflow

Output

BAM files
bigWig files
Peak files (narrowPeak, broadPeak)
UCSC track hub folder
Read QC reports
ChIP QC Report

Installation

Pre-built binaries for PiGx are available through GNU Guix, the functional package manager for reproducible, user-controlled software management. Install the complete pipeline bundle with the following command:

guix install pigx

You can also install just the ChIPseq pipeline with

guix install pigx-chipseq

You can also install it from source manually. PiGx uses the GNU build system.
Please make sure that all required dependencies are installed and then follow these steps after unpacking the latest release tarball:

./configure --prefix=/some/where
                    make install

Quick Start

To check wether the pipeline was properly installed, run PiGx ChIPseq on a minimal test dataset, which we provide here.

Once downloaded run these commands to extract the data:

mkdir test_dir
                    tar -xzf test_data.tar.gz --directory test_dir/
                    cd test_dir

Along with the data this tarball includes a settings file and a sample sheet file. The pipeline can now be started using this command, which takes about 3 minutes to finish (system with 4 cores and 32GB of RAM).

pigx chipseq -s test_dir/settings.yaml test_dir/sample_sheet.csv

Inside test_dir a new directory out is created, which includes specific directories containing output data for the respective step of the pipeline.

The ChIP_Seq_Report.html inside the Reports directory gives a general overview about the ChIP experiment and the peak calling.

Preparing Input

In order to run the pipeline, the user must supply

a sample sheet
a settings file

both files are described below.

In order to generate template settings and sample sheet files, type

pigx chipseq --init

in the shell, and a boilerplate sample_sheet.csv and settings.yaml will be written to your current directory.

Sample Sheet

The sample sheet is a tabular file in csv format or an exel table (xls/xlsx) defining the samples used in any subsequent analysis.

SampleName	Read	Read2
ChIP1	ChIP.fq.gz
Cont1	Cont.fq.gz
ChIP2	ChIP.fq.gz
Cont2	Cont.fq.gz
ChIPpe	ChIPpe_R1.fq.gz	ChIPpe_R2.fq.gz

SampleName is the name for the sample
Read/Read2 are the fastq file names of paired end reads
- the location of these files is specified in settings.yaml
- for single-end data, leave the Read2 column in place, but have it empty

The creation of the sample sheet is straight forward using the following command:

pigx chipseq --init=sample-sheet

This creates a template that can be filled with your own samples.

Technical Replicates

The sample sheet offers support for technical replicates, by repeating the sample name (first column) for different input files (second,third column). The quality check will be performed for any input file and replicates will be merged during the mapping.

SampleName	Read	Read2
ChIPpe	ChIPpe_R1.fq.gz	ChIPpe_R2.fq.gz
ChIPpe	ChIPpe_t2_R1.fq.gz	ChIPpe_t2_R2.fq.gz

Settings File

The settings file is a file in yaml format specifying general settings and the details of the analysis.

The settings file is straight forward using the following command:

pigx chipseq --init=settings

This settings file will have all lines commented out, you have to remove the trailing # from Locations, General and Analysis sections, then you can be modified it to fit your analysis.

It has the following required sections:

Locations

Here you define paths to be used in the pipeline, some of the items are required and some optional (can stay blank):

item	required	description
input-dir	yes	directory of the input files (`fastq` files)
output-dir	yes	output directory for the pipeline
genome-file	yes	path to the reference genome in `fasta` forma
index-dir	no	directory containing pre-built mapping indices for the given reference genome (created with `bowtie2-build`)
gff-file	no	location of a `GTF` file with genome annotations for the given reference genome

Reference and annotations

The user must supply paths pointing to a reference genome for the organism, as well as annotated genes. These might for example be downloaded from:

UCSC: The reference genome can be retrieved by selecting the group Mapping and Sequencing with track Assembly, then choosing region genome and output format Sequence and finally clicking Get output . The gene annotations can be retrieved by selecting the group Genes and Gene Predictions with track NCBI Refseq or Ensembl Genes, then choosing region genome and output format GTF - gene transfer format and finally clicking Get output .

Ensembl, where the reference genomoe is listed under DNA FASTA, the gene annotations under Gene sets GTF.

The user is free to choose any resource for these annotation files, however it is very important to note that the chromosome naming styles must match between different annotation files (e.g. chromosome 21 is named as chr21 in UCSC style, but 21 in NCBI/Ensemble style).

General

These are settings which apply to all analysis (unless adjusted in single analysis):

item	required	description
assembly	yes	version of reference genome (e.g. hg19,mm9, …)
params	no	list of default parameters for tools and scripts (for tools check respective manual for available parameters)

Execution

The execution section in the settings file allows the user to specify whether the pipeline is to be submitted to a cluster, or run locally, and the degree of parallelism. For a full list of possible parameters, see the execution section of the settings file created with pigx chipseq --init=settings.

A minimal settings file could look like this, but please consider that no analysis will be performed without adding analysis information :

locations:
                      input-dir: in/reads/
                      output-dir: out/
                      genome-file: genome/my_genome.fa
                      index-dir:
                      gff-file: genome/mm_chr19.gtf
                    
                    general:
                      assembly: hg19
                      params:
                        extend: 200
                        scale_bw: 'yes'
                        bowtie2:
                            k: 1
                        idr:
                            idr-threshold: 0.1
                        macs2:
                            g: hs
                            keep-dup: auto
                            q: 0.05
                        extract_signal:
                            expand_peak: 200
                            bin_num: 20
                    
                    execution:
                      submit-to-cluster: no
                      jobs: 6
                      nice: 19

Analysis Sections

The analysis part of the setting file describes the experiment. It has following sections:

section	required	description
peak_calling	yes	defines which samples will be used to detect regions of enriched binding ( multiple combinations and variations are possible, see here for details )
idr	no	specifies pairs of peak calling analysis that are compared to determine the reproducibilty of the general experiment (see here for details)
hub	no	describes the general layout of a UCSC hub that can be created from the processed data and allows the visual inspection of results at a UCSC genome browser (see here for details)
feature_combination	no	defines for a list of peak calling and/or idr analysis the combination of regions shared among this list (see here for details)

The creation of these sections is straight forward considering the following snippets as template. Comments and examples within the snippets provide guidance of what is possible and what to take care of.

Peak Calling

The previously defined samples are used for subsequent peak calling analysis to detect regions of enriched binding. In this section any number of comparisons can be defined, while multiple combinations and variations are possible. In terms of peak calling the ChIP (also called treatment) is the sample in which we want to detect enriched regions compared to the Cont(rol) (or background) sample. Each analysis can be run with a unique set of parameters and default parameters for all analysis can be defined in the settings file , check available parameters and description here. For more information have a look at the publication for the software we are using “Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol (2008) vol. 9 (9) pp. R137”.

# define peak calling analysis
                    peak_calling:
                        # analysis can have any name, but the names have to be unique 
                        Peaks1: 
                            # sample(s) to be used as treatment sample 
                            ChIP: ChIP1
                            # sample(s) to be used as control sample
                            Cont: Cont1
                            params:
                                macs2:
                                    # each analysis can be adjusted independently
                                    # add/modify available parameters of the analysis
                                    nomodel: ''
                                    extsize: 300
                        Peaks2:
                            ChIP: ChIP2
                            Cont: Cont2
                            params:
                                macs2:
                                    # each analysis can be adjusted independently
                                    nomodel: ''
                                    extsize: 147
                        Peaks4:
                            ChIP:
                                # multiple samples can be used as treatment
                                - ChIP1
                                - ChIP2
                            Cont:
                                # multiple samples can be used as control
                                - Cont1
                                - Cont2
                            params:
                                macs2:
                                    nomodel: ''
                    
                        Peaks5:
                            # the number of samples per group can differ 
                            ChIP: ChIP2
                            Cont:
                                - Cont1
                                - Cont2
                            params:
                                macs2:
                                    nomodel: ''
                    
                        Peaks6:
                            # analysis can be performed without control
                            ChIP: ChIP1
                            Cont:
                            params:
                                macs2:
                                    nomodel: ''

(optional) IDR

Assuming that the some samples are (biological/technical) replicates, in order to measure the consistency between them use the irreproducible discovery rate (IDR) “Li, Q., Brown, J. B., Huang, H., & Bickel, P. J. (2011). Measuring reproducibility of high-throughput experiments. The annals of applied statistics, 5(3), 1752-1779.”, which is in general a good (but very stringent) quality control.

idr:
                        # idr analysis can have any name, but the names have to be unique 
                        ChIP_IDR:
                            # define the pair of samples, add more combinations for more replicates
                            ChIP1: Peaks1
                            ChIP2: Peaks2

(optional) Hub

In the hub section the general layout of a UCSC Track Hubs is described with some minimal items. The track hub is generated from the processed data and allows the visual inspection of results at a UCSC genome browser (for supported genomes).

IMPORTANT: In order to use the hub feature, make sure you downloaded the reference genome in UCSC style (see here for instructions ).

The required items to define the hub are the following:

item	example	description
name	PiGx_Hub	name of the hub directory
shortLabel	PiGx_Short	short name of hub is displayed as name above track groups
longLabel	PiGx_Hub_Long	descriptive longer label for hub is displayed as hub description
email	my.mail[at]domain.com	whom to contact for questions about the hub or data
descriptionUrl	pigx_hub.html	URL to HTML page with a description of the hub’s contents
super_tracks	see below	specification of hub layout (track groups, tracks)

This is a small example how this could look like:

hub:
                        name: PiGx_Hub
                        shortLabel: PiGx_Short
                        longLabel: PiGx_Hub_Long
                        email: my.mail@domain.com
                        descriptionUrl: pigx_hub.html
                        super_tracks:
                            # track groups can have any name, but the names have to be unique 
                            Tracks1:
                                # tracks can have any name, but the names have to be unique 
                                track11:
                                    # to add peaks as a track, define "type: macs" 
                                    name: Peaks1
                                    type: macs
                                track12:
                                    # to add coverage signal as a track, define "type: bigwig"
                                    name: ChIP1
                                    type: bigWig
                                # descriptive longer label for track group is
                                # displayed as description in track settings
                                long_label: Tracks1_long

(optional) Feature Combination

To find the combination of enriched binding regions, which is shared among a set of peak calling and/or idr analysis results, define a feature in the feature_combination section. Only items defined in the peak_calling and idr sections can be used here.

feature_combination:
                        # features can have any name, but the names have to be unique
                        Feature1:
                            # define feature based on only one result
                            - ChIP_IDR
                        Feature2:
                            # define feature based on more than one result
                            - Peaks6
                            - Peaks5
                        Feature3:
                            # define feature based on different analysis types
                            - ChIP_IDR
                            - Peaks5

Running the pipeline

To run PiGx on your experimental data, first prepare the sample sheet and settings file (see above), and then from the terminal type

$ pigx-chipseq -s settings.yaml sample_sheet.csv

If you are not sure wether you set everything up correctly, use the dryrun option which will only show what work would be performed, but does not actually run the pipeline.

$ pigx-chipseq -s settings.yaml sample_sheet.csv -n

To see all available options type the --help option

Output Description

PiGx ChIPseq creates an output folder, with a specific directory structure (for details see here) that contains the following outputs and more.

Quality control

General quality control metrics are computed using FastQC and MultiQC. The MultiQC report is particularly useful, collating quality control metrics from many steps of the pipeline in a single html report, which may be found under the Reports folder in the PiGx output folder.

A custom ChIP quality control report can be found in the Reports foler as well, presenting useful quality metrics for the ChIP experiments.

Formatted UCSC Hub

The UCSC_Hub directory contains a folder with the name specified in the hub section of the analysis part of the settings file. This folder might be copied to a web-accessible server to host the hub itself, which can then be accessed using the UCSC Genome Browser.

Output Folder Structure

The pipeline will create a specific directory structure, the respective contents are explained below:

|-- Analysis
                    |-- Annotation
                    |-- BigWig
                    |-- Bowtie2_Index
                    |-- FastQC
                    |-- Log
                    |-- Mapped
                    |-- Peaks
                    |-- Reports
                    |-- Trimmed
                    |-- UCSC_HUB

Folder	description
Analysis	Contains RDS files with intermediary analysis steps. RDS are binary files which efficiently store R objects.
Annotation	Formatted GTF annotation.
BigWig	Symbolic links to the bigWig signal files.
Bowtie2_Index	Processed genome file along with the Bowtie2_Index.
FastQC	FastQC sequencing quality report.
Log	Detailed output from execution of each step of the pipeline.
Mapped	Mapped reads in .bam format, and corresponding bigWig files.
Peaks	Peaks called with MACS2. Depending on the parameters, contains either narrowPeak or broadPeak format. The file sample_qsort.bed contains uniformly processed peaks, sorted by their corresponding p value.
Reports	Contains MultiQC and ChIP quality reports in html format.
Trimmed	Trimgalore adaptor and quality trimmed files.
UCSC_Hub	Contains a completely formatted UCSC hub, with track descriptions, peaks and bigWig tracks.

Troubleshooting

Execution on a cluster

Currently, PiGx only supports Sun Grid Engine for cluster execution. If you’re uncertain about your cluster, try typing qsub in the shell (Sun Grid Engine uses qsub to submit jobs).

Disappearing jobs on the cluster

PiGx ChIPseq comes with sensible defaults for resource requests when running on a cluster, but based on the genome version and other parameters, these might not be sufficient and your cluster might terminate your jobs. The cluster resource requests may be overridden in the settings file. See the execution section of the settings file created with pigx chipseq --init=settings.

Guix locale error

If you get a warning:

##########################

                    perl: warning: Setting locale failed.
                    perl: warning: Please check that your locale settings:
                        LANGUAGE = (unset),
                        LC_ALL = (unset),
                        LC_MEASUREMENT = "de_DE.UTF-8",
                        LC_PAPER = "de_DE.UTF-8",
                        LC_MONETARY = "de_DE.UTF-8",
                        LC_NAME = "de_DE.UTF-8",
                        LC_ADDRESS = "de_DE.UTF-8",
                        LC_NUMERIC = "de_DE.UTF-8",
                        LC_TELEPHONE = "de_DE.UTF-8",
                        LC_IDENTIFICATION = "de_DE.UTF-8",
                        LC_TIME = "de_DE.UTF-8",
                        LANG = "en_US.UTF-8"
                        are supported and installed on your system.
                    perl: warning: Falling back to the standard locale ("C").

                    ##########################

and the pipeline breaks at multiqc, printing this into the Log :

####################################################
                    Traceback (most recent call last):
                      File "/gnu/store/gaijkjaiv4hirsazalpbjp3ifymybkzb-multiqc-1.5/bin/.multiqc-real", line 767, in <module>
                        multiqc()
                      File "/gnu/store/2h1kcjw1r1306chd45452kwqzq2xb001-python-click-6.7/lib/python3.6/site-packages/click/core.py", line 722, in __call__
                        return self.main(*args, **kwargs)
                      File "/gnu/store/2h1kcjw1r1306chd45452kwqzq2xb001-python-click-6.7/lib/python3.6/site-packages/click/core.py", line 676, in main
                        _verify_python3_env()
                      File "/gnu/store/2h1kcjw1r1306chd45452kwqzq2xb001-python-click-6.7/lib/python3.6/site-packages/click/_unicodefun.py", line 118, in _verify_python3_env
                        'for mitigation steps.' + extra)
                    RuntimeError: Click will abort further execution because Python 3 was configured to use ASCII as encoding for the environment.  Consult http://click.pocoo.org/python3/for mitigation steps.

                    Additional information: on this system no suitable UTF-8
                    locales were discovered.  This most likely requires resolving
                    by reconfiguring the locale system.

                    Click discovered that you exported a UTF-8 locale
                    but the locale system could not pick up from it because
                    it does not exist.  The exported locale is "en_US.UTF-8" but it
                    is not supported

                    #######################################################

Then you have to install glibc-locales

guix install glibc-locales

and export this global variable to use these guix locales:

export GUIX_LOCPATH=$HOME/.guix-profile/lib/locale

Or you might consider adding this directly to your ~/.bashrc or ~/.bash_profile.

echo -e "# use guix locales\nexport GUIX_LOCPATH="'$HOME'"/.guix-profile/lib/locale"  >>  ~/.bashrc

Manage dependencies for manual installation

By default the configure script expects tools to be in a directory listed in the PATH environment variable. If the tools are installed in a location that is not on the PATH you can tell the configure script about them with variables. Run ./configure --help for a list of all variables and options.

You can prepare a suitable environment with Conda or with GNU Guix. Inspect the configure.ac file for an up-to-date list of required packages.

Assuming you have Guix installed, the following command spawns a sub-shell in which all dependencies are available:

guix environment -l guix.scm

FAQ

Q: I get the following error:

Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying to create the same files in the following directory:
                    /home/agosdsc/projects/pigx_chipseq/test_dir/out
                    If you are sure that no other instances of snakemake are running on this directory, the remaining lock was likely caused by a kill signal or a power loss. It can be removed with the --unlock argument.

What happend and what should I do?

A: The pipeline crashed at some point, possible reasons are mentioned by the error. Do as the error message proposes and pass the --unlock argument once and then run the pipeline again without --unlock.

Q: command name? In my installation, the command appears to be pigx-chipseq and not pigx chipseq as the docs say.

A: The command that you have to use depends on the way the pipeline was installed, either as part of the pigx bundle (guix install pigx) or as a single pipeline (guix install pigx-chipseq)

Q: The docs mention how to deal with technical replicates, but how should biological replicates be handled?

A: Usually biological replicates are kept seperately, which means you should give them distinct names in the sample sheet to allow for distinct alignments, and then for peak-calling you pick up the sample names from the sample sheet to reenter them in the ‘analysis’ section of the settings file.

Q: I get an error when trying to run the pipeline:

SystemExit in line 26 of /gnu/store/ik8kz81k720a8sy8nc4119jy5njb98gd-pigx-chipseq-0.0.20/libexec/pigx_chipseq/scripts/Check_Config.py:
                    ERROR: Config file is not properly formated:
                    Genome fasta headers contain whitespaces.
                     Please reformat the headers
                    
                      File "/gnu/store/ik8kz81k720a8sy8nc4119jy5njb98gd-pigx-chipseq-0.0.20/libexec/pigx_chipseq/Snake_ChIPseq.py", line 38, in <module>
                      File "/gnu/store/ik8kz81k720a8sy8nc4119jy5njb98gd-pigx-chipseq-0.0.20/libexec/pigx_chipseq/scripts/Check_Config.py", line 26, in validate_config

A: Our pipeline does not like if in the FASTA headers of the reference genome there is more information than the chromosome name, so please use this oneliner to format your genome-file:

cat refGenome.fa  | awk '{ print $1}’ >  refGenome_noWhiteSpace.fa

Questions

If you have further questions please e-mail: pigx@googlegroups.com or use the web form to ask questions https://groups.google.com/forum/#!forum/pigx/