
Bioinformatic pipeline for the analysis of Hi-C data


Hi-C processing pipeline

About the pipeline

This pipeline is heavily inspired by the juicer pipeline and can be seen as a wrapper around it. For details about the internal workings of the pipeline, please check the juicer repository.

The main use of this pipeline is to run the analysis of multiple Hi-C samples and organize their results in a coherent way.

Using a pre-made setup

If the pipeline was already installed in your system, simply run the following:

install_hic_pipeline.sh

If requested, log out of the system, log back in, and run install_hic_pipeline.sh again to complete the setup. Otherwise, move on to the next step.

Testing the pipeline

Test that the pipeline works by running:

run_hic_pipeline.sh test_input.csv

and

run_merge_hic_pipeline.sh test_mega.csv

where test_input.csv and test_mega.csv are sample files that can be downloaded from this GitHub repository. Remember to modify the genome_sequence and chromsizes fields accordingly, so that they point to your juicer installation.

1) Processing single Hi-C samples

This is the main step of the pipeline.

./run_hic_pipeline.sh <input-samples.csv>

Input format

The run_hic_pipeline.sh script accepts as input a .csv file (with header) containing one line for each sample to be processed. The required columns are:

  • sample_path: path where the sample's results will be stored
  • raw_path: path to the sample's fastq files. Fastq files are assumed to be paired-end and located in the same folder, with Read1 and Read2 denoted by _R1_ and _R2_ in the file names.
  • restriction_enzyme: which restriction enzyme was used (MboI, HindIII, Arima, etc.)
  • genome_assembly: genome assembly (hg19, mm10, etc.)
  • genome_sequence (OPTIONAL): path to the fasta file of the reference genome
  • chromsizes (OPTIONAL): path to the chromosome sizes file of the reference genome

You can check the test_input.csv file for reference.
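As an illustration, an input file might look like the following (all paths and sample names here are hypothetical; see test_input.csv in the repository for an actual example):

```csv
sample_path,raw_path,restriction_enzyme,genome_assembly,genome_sequence,chromsizes
/results/sampleA,/fastq/sampleA,MboI,hg19,/refs/hg19/hg19.fa,/refs/hg19/hg19.chrom.sizes
/results/sampleB,/fastq/sampleB,MboI,hg19,/refs/hg19/hg19.fa,/refs/hg19/hg19.chrom.sizes
```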

Reference genomes

The genome_assembly column of the input file should match one of the genome assemblies available on your system. Additionally, the REFERENCES_PATH environment variable should be defined and should point to the location of all the available references. If this variable is not defined, you have to manually specify the genome_sequence and chromsizes fields in the input file.
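The fallback between REFERENCES_PATH and the explicit CSV columns could be sketched as follows (an assumed reconstruction, not the pipeline's actual code; the directory layout under REFERENCES_PATH is also an assumption):

```shell
#!/bin/sh
# Sketch of reference resolution (assumed logic, not the pipeline's actual code).
# If REFERENCES_PATH is set, the fasta is looked up by assembly name;
# otherwise the genome_sequence column must be filled in.
resolve_genome_sequence() {
    assembly="$1"        # e.g. "hg19"
    explicit_fasta="$2"  # value of the genome_sequence column (may be empty)
    if [ -n "${REFERENCES_PATH:-}" ]; then
        echo "${REFERENCES_PATH}/${assembly}/${assembly}.fa"
    elif [ -n "$explicit_fasta" ]; then
        echo "$explicit_fasta"
    else
        echo "ERROR: define REFERENCES_PATH or set genome_sequence in the input file" >&2
        return 1
    fi
}
```

The same lookup would apply to the chromsizes file.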

Steps

The pipeline will run the following analyses:

  1. Fastq quality control with fastqc
  2. Read alignment and .hic file generation using juicer
  3. Conversion of .hic files to .mcool files for compatibility with cooler format
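For reference, the three steps correspond roughly to commands like these (illustrative only: tool choices and flags are assumptions, and the pipeline invokes its tools with its own parameters):

```shell
fastqc sample_R1_001.fastq.gz sample_R2_001.fastq.gz  # 1. fastq quality control
juicer.sh -g hg19 -s MboI -d /results/sampleA         # 2. alignment and .hic generation
hic2cool convert inter_30.hic sample.mcool -r 0       # 3. .hic -> .mcool (hic2cool is one common converter)
```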

2) Aggregating replicates into mega-maps

If you have multiple replicates of the same experiment, you will most likely want to merge them into a single file to increase data depth and improve downstream analyses. To do that, run:

./run_merge_hic_pipeline.sh <mega-samples.csv>

Input format

The run_merge_hic_pipeline.sh script accepts as input a .csv file (with header) containing one line for each aggregated Hi-C map. The required columns are:

  • sample_path: path where the aggregated map's results will be stored
  • restriction_enzyme: which restriction enzyme was used (MboI, HindIII, Arima, etc.). Note that this implies you cannot merge Hi-C samples generated with different restriction enzymes
  • genome_assembly: genome assembly (hg19, mm10, etc.). For obvious reasons, you cannot merge Hi-C samples generated with different genome assemblies
  • replicate_paths: paths to the replicate sample results (generated by the previous step), separated by colons (:)
  • chromsizes (OPTIONAL): path to the chromosome sizes file of the reference genome. As with restriction_enzyme and genome_assembly, all replicates must share the same file.

You can check the test_mega.csv file for reference.
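As an illustration, a mega-map input file might look like the following (all paths here are hypothetical; see test_mega.csv in the repository for an actual example):

```csv
sample_path,restriction_enzyme,genome_assembly,replicate_paths,chromsizes
/results/megaA,MboI,hg19,/results/rep1:/results/rep2:/results/rep3,/refs/hg19/hg19.chrom.sizes
```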

The same rules apply to the genome_assembly field as for the single Hi-C processing input file (see above).
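For instance, splitting the colon-separated replicate_paths value could look like this (a hypothetical helper, not the pipeline's actual code):

```shell
#!/bin/sh
# Hypothetical helper: print each replicate path on its own line,
# splitting the replicate_paths column on ':'.
list_replicates() {
    old_ifs="$IFS"
    IFS=':'
    for rep in $1; do   # unquoted on purpose: IFS=':' drives the split
        echo "$rep"
    done
    IFS="$old_ifs"
}

list_replicates "/results/rep1:/results/rep2"
```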

Installing the pipeline from scratch

Clone the repository and enter the folder:

git clone https://github.com/CSOgroup/hic_pipeline.git
cd hic_pipeline

Install the dependencies using conda/mamba, creating a new environment (hic_pipeline):

./install_hic_pipeline.sh

If requested, log out of the system, log back in, and run install_hic_pipeline.sh again to complete the setup. Otherwise, move on to the next step.
