Data

Hi-C data

No Hi-C dataset is included in this repository. You need to provide the Hi-C data yourself to run demo1_prepCorr_HiC.

Obtaining

You can always use your own Hi-C data. Alternatively, Hi-C datasets can be downloaded from public databases. For example, we obtained ours from Rao et al. (2014) on GEO. Specifically, we used the following files, which contain Hi-C data from five different cell types:

GSE63525_GM12878_primary_intrachromosomal_contact_matrices.tar.gz
GSE63525_HUVEC_primary_intrachromosomal_contact_matrices.tar.gz
GSE63525_NHEK_primary_intrachromosomal_contact_matrices.tar.gz
GSE63525_K562_primary_intrachromosomal_contact_matrices.tar.gz
GSE63525_KBM7_primary_intrachromosomal_contact_matrices.tar.gz

Each tarball contains multiple subdirectories (see their readme file provided on the same page, in the link above). We chose a resolution and a chromosome number (for our main-text analysis, we used chromosome 10 at 50kb resolution), and kept data from all pairs (subdirectory "MAPQG0").

Data Formatting (pre-pre-processing)

The input Hi-C data for demo1 should be passed in a MAT file that contains a symmetric matrix with non-negative entries. Upper- or lower-triangular matrices are also allowed; the script symmetrizes them. NaN entries are allowed, but they will be replaced by zeros. The matrix can have empty rows/columns (all zeros); these are removed during the pre-processing.
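If you want to check a matrix against these expectations before handing it to demo1, the following Python/NumPy sketch mirrors the rules described above. It is not the repository's code: the function name `sanitize_contact_matrix` is hypothetical, and the exact order of operations in demo1 may differ.

```python
import numpy as np

def sanitize_contact_matrix(M):
    """Illustrative check of the input rules (hypothetical helper, not demo1 itself).

    Returns the cleaned matrix and a boolean mask of the rows/columns kept.
    """
    M = np.array(M, dtype=float)
    if M.ndim != 2 or M.shape[0] != M.shape[1]:
        raise ValueError("expected a square matrix")

    # NaN entries are replaced by zeros.
    M[np.isnan(M)] = 0.0
    if np.any(M < 0):
        raise ValueError("entries must be non-negative")

    # A strictly triangular input is mirrored across the diagonal.
    upper_only = not np.any(np.tril(M, -1))
    lower_only = not np.any(np.triu(M, 1))
    if upper_only or lower_only:
        M = M + M.T - np.diag(np.diag(M))
    elif not np.allclose(M, M.T):
        raise ValueError("matrix is neither symmetric nor triangular")

    # Empty rows/columns (all zeros) are dropped.
    keep = M.sum(axis=1) > 0
    return M[np.ix_(keep, keep)], keep

# Toy input: upper triangular, with one NaN and one empty row/column.
toy = [[1.0, 2.0, 0.0],
       [0.0, 0.0, 0.0],
       [0.0, 0.0, np.nan]]
clean, kept = sanitize_contact_matrix(toy)
```

The returned mask records which bins survived, which can help when mapping results back to genomic positions.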

Some simple formatting may be necessary, depending on how the original data is stored. For example, if you downloaded the Rao et al. (2014) dataset following the directions above, you need to:

  • Re-arrange the Hi-C count vector (third column in the "RAWobserved" file) into a symmetric square matrix.
  • Normalize the Hi-C matrix such that the row-sums are uniform, for example using the Knight-Ruiz algorithm. (If you are using the Rao et al. (2014) dataset, a pre-computed normalization vector is provided as their "KRnorm" file. Note that they chose to fix the row-sums to the average row-sum of the original Hi-C matrix, rather than to 1, which is more typical.)
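The two steps above can be sketched in Python/NumPy as follows. This is a minimal illustration under stated assumptions, not the procedure used by the authors: it assumes that the first two columns of the "RAWobserved" file are bin start coordinates in base pairs, that the "KRnorm" file holds one value per bin, and that a normalized entry is the raw count divided by the product of the two corresponding vector entries. Please verify these points against Rao et al.'s readme file. The function names are hypothetical.

```python
import numpy as np

def triplets_to_matrix(rows, cols, counts, resolution, n_bins=None):
    """Build a dense symmetric matrix from sparse (i, j, count) triplets.

    Assumes rows/cols are bin start coordinates in base pairs, so that
    integer division by the resolution gives a 0-based bin index.
    """
    i = np.asarray(rows, dtype=np.int64) // resolution
    j = np.asarray(cols, dtype=np.int64) // resolution
    counts = np.asarray(counts, dtype=float)
    if n_bins is None:
        n_bins = int(max(i.max(), j.max())) + 1
    M = np.zeros((n_bins, n_bins))
    M[i, j] = counts
    M[j, i] = counts  # mirror each entry across the diagonal
    return M

def apply_norm_vector(M, norm):
    """Divide entry (i, j) by norm[i] * norm[j] (assumed convention).

    Entries that come out non-finite (e.g. NaN in the vector) are set to zero.
    """
    norm = np.asarray(norm, dtype=float)
    with np.errstate(invalid="ignore", divide="ignore"):
        out = M / np.outer(norm, norm)
    out[~np.isfinite(out)] = 0.0
    return out

# Toy example: 3 bins at a made-up resolution of 10 bp.
rows = [0, 0, 10, 20]
cols = [0, 10, 20, 20]
counts = [4, 2, 6, 8]
raw = triplets_to_matrix(rows, cols, counts, resolution=10, n_bins=3)
normed = apply_norm_vector(raw, norm=[2.0, 1.0, 2.0])
```

With real files, the three columns and the normalization vector could be read with `numpy.loadtxt`. Passing `n_bins=len(norm)` keeps the matrix and the vector the same size even when the last bins of the chromosome have no contacts.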

See Rao et al.'s readme file for documentation of the data formats.

Importing to script

Make sure that the file path is correctly specified in demo1. To match the script as is, store the input Hi-C data in a variable named HiC_test in Matlab, and save it to Data/HiC_test.mat.
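If the formatting was done in Python rather than in Matlab, one way to produce such a file is `scipy.io.savemat`, which writes a MAT file that Matlab's `load` can read. The sketch below assumes SciPy is installed; the helper name `save_hic_mat` is hypothetical, while the default path and variable name are the ones stated above.

```python
import os
import numpy as np
from scipy.io import savemat, loadmat

def save_hic_mat(matrix, path="Data/HiC_test.mat", var_name="HiC_test"):
    """Save a matrix to a MAT file under the given variable name (hypothetical helper)."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)  # create Data/ if it does not exist
    savemat(path, {var_name: np.asarray(matrix, dtype=float)})

def load_hic_mat(path="Data/HiC_test.mat", var_name="HiC_test"):
    """Read the matrix back, e.g. to confirm the file content before running demo1."""
    return loadmat(path)[var_name]
```

A call such as `save_hic_mat(M)` from the repository root would then write the file where demo1 expects it, provided the path in the script has not been changed.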

Correlation matrix

For demonstration purposes, demo2_MultiCD generates and runs on a simulated correlation matrix; no external correlation matrix input is necessary in this case.

For application to real data, demo2 works with the correlation matrix generated by the pre-processing of Hi-C data (output of demo1). To save the correlation matrix to a file, uncomment the last lines in demo1. Make sure that the path to this file is correctly specified in demo2.