This pipeline is heavily inspired by the juicer pipeline; for any information about its internal workings, please check the juicer repository. You can think of this pipeline as a wrapper around it.

Its main purpose is to run the analysis of multiple Hi-C samples and organize their results in a coherent way.
If the pipeline was already installed in your system, simply run the following:

```bash
install_hic_pipeline.sh
```

If requested, log out of the system, log back in, and run `install_hic_pipeline.sh` again to complete the setup. If not, move on to the next step.
Test that the pipeline works by running:

```bash
run_hic_pipeline.sh test_input.csv
```

and

```bash
run_merge_hic_pipeline.sh test_mega.csv
```

where `test_input.csv` and `test_mega.csv` are sample files that can be downloaded from this GitHub repository. Remember to modify the `genome_sequence` and `chromsizes` fields accordingly, so that they point to your juicer installation.
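As a minimal sketch of how you might do that (it assumes the test files ship with placeholder paths; both the placeholders and the replacement paths below are hypothetical):

```bash
# Hypothetical placeholder and replacement paths; adjust both to your setup.
sed -i 's|/path/to/genome.fa|/data/references/hg19/hg19.fa|' test_input.csv
sed -i 's|/path/to/chrom.sizes|/data/references/hg19/hg19.chrom.sizes|' test_input.csv test_mega.csv
```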
This is the main step of the pipeline.
```bash
./run_hic_pipeline.sh <input-samples.csv>
```

The `run_hic_pipeline.sh` script accepts a `.csv` file (with header) as input, with one line for each sample to be processed. The required columns are:
- `sample_path`: path to the sample results
- `raw_path`: path to the sample fastq files. Fastq files are assumed to be paired-end and located in the same folder, with Read1 and Read2 denoted by `_R1_` and `_R2_` in the file names.
- `restriction_enzyme`: which restriction enzyme to use (`MboI`, `HindIII`, `Arima`, etc.)
- `genome_assembly`: genome assembly (`hg19`, `mm10`, etc.)
- `genome_sequence` (OPTIONAL): path to the fasta file for the reference genome
- `chromsizes` (OPTIONAL): path to the chromosome sizes file for the reference genome

You can check the `test_input.csv` file for reference; a minimal example is also sketched below.
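A minimal sketch of what such a file might look like (every path and sample name below is hypothetical; adjust them to your setup):

```bash
# Hypothetical two-sample input file, then the corresponding pipeline call.
cat > my_samples.csv << 'EOF'
sample_path,raw_path,restriction_enzyme,genome_assembly,genome_sequence,chromsizes
/results/hic/sampleA,/fastq/sampleA,MboI,hg19,/refs/hg19/hg19.fa,/refs/hg19/hg19.chrom.sizes
/results/hic/sampleB,/fastq/sampleB,MboI,hg19,/refs/hg19/hg19.fa,/refs/hg19/hg19.chrom.sizes
EOF

./run_hic_pipeline.sh my_samples.csv
```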
The `genome_assembly` column of the input file should match one of the genome assemblies available on your system. Additionally, the `REFERENCES_PATH` environment variable should be defined on your system and should point to the location of all the available references. If this variable is not defined, you have to manually specify the `genome_sequence` and `chromsizes` fields in the input file.
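For example, a minimal sketch of defining the variable (the directory layout in the comment is an assumption about a typical setup, not something mandated by the pipeline):

```bash
# Hypothetical: one sub-folder per assembly under a common references root,
# e.g. /data/references/hg19/ and /data/references/mm10/.
export REFERENCES_PATH=/data/references
```

You would typically add the `export` line to your shell profile (e.g. `~/.bashrc`) so that it persists across sessions.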
The pipeline will run the following analyses:

- Fastq quality control with `fastqc`
- Read alignment and `.hic` file generation using `juicer`
- Conversion of `.hic` files to `.mcool` files for compatibility with the cooler format (see the sketch below)
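For instance, once a run has finished you could inspect the resulting multi-resolution cooler with the `cooler` command-line tool; a minimal sketch, assuming the `.mcool` file follows the standard `resolutions` layout and that a 10 kb resolution was generated:

```bash
# List the data collections (resolutions) stored in the .mcool file
cooler ls /path/to/sample.mcool

# Show metadata (bin size, contact count, ...) for one resolution
cooler info /path/to/sample.mcool::/resolutions/10000
```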
If you have multiple replicates of the same experiment, you will most likely want to merge them into a single file, to increase data depth for downstream analyses. To do that, you can run:

```bash
./run_merge_hic_pipeline.sh <mega-samples.csv>
```
The `run_merge_hic_pipeline.sh` script accepts a `.csv` file (with header) as input, with one line for each aggregated Hi-C map. The required columns are:

- `sample_path`: path to the aggregated map results
- `restriction_enzyme`: which restriction enzyme to use (`MboI`, `HindIII`, `Arima`, etc.). Notice that this implies that you cannot merge Hi-C samples generated with different restriction enzymes.
- `genome_assembly`: genome assembly (`hg19`, `mm10`, etc.). For obvious reasons, you cannot merge Hi-C samples generated with different genome assemblies.
- `replicate_paths`: paths to the replicate sample results (generated by the previous step), separated by a colon (`:`)
- `chromsizes` (OPTIONAL): path to the chromosome sizes file for the reference genome. As with `genome_assembly`, all merged samples must share the same chromosome sizes.
You can check the `test_mega.csv` file for reference; a minimal example is also sketched below. The same rules apply to the `genome_assembly` field as for the single Hi-C processing input file (see above).
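A minimal sketch of what such a file might look like (all paths are hypothetical; note the colon-separated `replicate_paths`, here reusing the two samples from the earlier example):

```bash
# Hypothetical aggregated-map file merging two replicates, then the call.
cat > my_mega.csv << 'EOF'
sample_path,restriction_enzyme,genome_assembly,replicate_paths,chromsizes
/results/hic/merged_AB,MboI,hg19,/results/hic/sampleA:/results/hic/sampleB,/refs/hg19/hg19.chrom.sizes
EOF

./run_merge_hic_pipeline.sh my_mega.csv
```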
Clone the repository and enter the folder:

```bash
git clone https://github.com/CSOgroup/hic_pipeline.git
cd hic_pipeline
```

Install the dependencies using conda/mamba, creating a new environment (`hic_pipeline`):

```bash
./install_hic_pipeline.sh
```

If requested, log out of the system, log back in, and run `install_hic_pipeline.sh` again to complete the setup. If not, move on to the next step.
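As a quick sanity check after re-login (a sketch; it assumes conda is on your `PATH` and that the installer created the `hic_pipeline` environment and put the pipeline scripts on your `PATH`, as the steps above suggest):

```bash
conda env list | grep hic_pipeline   # the environment created by the installer
which run_hic_pipeline.sh            # the entry point should now be reachable
```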