diHOCOMOCO

This code was used to prepare https://github.com/VorontsovIE/diHOCOMOCO/releases/tag/beforeHocomoco11 and https://github.com/VorontsovIE/diHOCOMOCO/releases/tag/hocomoco11 motif collection. Its purpose is to assess and choose the best motif across multiple motifs. All the stages are described in Rakefile but it is not recommended to run it, it won't do what you can expect.

General workflow is the following:

put PCM motifs, alignments and ChIP-seq control FASTA into certain folders. Please, prepare data on this stage manually. Existing scripts heavily rely on data layout of previous steps.
prepare PWMs from PCMs
calculate average local dinucleotide background frequencies in ChIP-seq datasets of the corresponding TF
estimate score-to-pvalue conversion for each model using calculated local background
Calculate ROC AUC and logROC AUC values for each TF motif variant and each TF dataset. Cross-species experiments are also performed.
Aggregate logROC AUC values into weighted logROC AUC (with weights corresponding to estimated dataset qualities). Motif packs (slices) to be aggregated are selected by manual curration (see motif list slices in curation folder). Each slice corresponds to similar patterns of the same TF. Most TFs have a single slice, but sometimes there exist different binding modes (e.g. mono-box and double-box binding). Datasets of secondary species are also taken into account with lower weight (in a sort of pseudocount addition procedure).
Choose the best motif for each motif slice - motif with maximal weighted logROC AUC value. Collect all the data (pcm, pwm, thresholds, alignments) for these motifs.

Folder layout is the following:

Models should be put into models/pcm/(mono or di)/all/{TF}/ folder.
Controls should be put into control/control/ folder.
Alignments go to models/words/mono/ folders (in subfolders hocomoco_legacy and chipseq).

Model naming is the following:

{TF name}~{collection}~{motif name}.pcm for mononucleoide motifs or {TF name}~{collection}~{motif name}.dpcm for dinucleotide motifs. E.g. KLF4_HUMAN~CM~KLF4_HUMAN.PEAKS034751.pics.S.pcm
Collection is one of the following abbreviations: CM (novel ChIP-seq beasd mononucleotide motifs), CD (novel ChIP-seq based dinucleotide motifs), HL (Hocomoco10 legacy motifs), DIHL (dinucleotide Hocomoco10 legacy motifs) and some others.

ChIP-seq control datasets are multi-FASTA files named like this: control/control/KLF4_MOUSE.PEAKS037825.sissrs.control.mfa.

These steps listed as separate tasks in Rakefie can be performed using rake scripts. For example to perform final_collection_summary task, one should execute rake final_collection_summary command.

Some of the tasks work in a different way: they just print other commands that should be run. One can pipe that commands into bash or (better) to parallel (a program from GNU moreutils package). For example to run 8 threads of logROC AUC calculations in parallel one can invoke the following command: rake calculate_occurence_scores_mono | parallel -j 8

After running all the steps, resulting collection will appear in the folder final_bundle.

Supplementary data for figures in paper were prepared using scripts in data_for_figures.sh.

Name		Name	Last commit message	Last commit date
Latest commit History 294 Commits
TFOntologies		TFOntologies
clusterize		clusterize
curation		curation
final_collection		final_collection
lib		lib
pmflogo		pmflogo
rakelib		rakelib
.gitignore		.gitignore
Gemfile		Gemfile
Gemfile.lock		Gemfile.lock
README.md		README.md
Rakefile		Rakefile
ape-3.0.2.jar		ape-3.0.2.jar
ape.jar		ape.jar
banned_models.txt		banned_models.txt
calculate_auc.rb		calculate_auc.rb
calculate_tpr_by_motif.sh		calculate_tpr_by_motif.sh
check_result.tsv		check_result.tsv
data_for_figures.sh		data_for_figures.sh
renameMultifastaSequences.rb		renameMultifastaSequences.rb
retracted_factors.txt		retracted_factors.txt
revcomp_models.txt		revcomp_models.txt
sarus-01Mar2018.jar		sarus-01Mar2018.jar
sarus.jar		sarus.jar
secondary_models.txt		secondary_models.txt
todo.txt		todo.txt
uniprot_HomoSapiens_and_MusMusculus_lots_of_infos.tsv		uniprot_HomoSapiens_and_MusMusculus_lots_of_infos.tsv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

diHOCOMOCO

About

Releases 2

Packages

Languages

VorontsovIE/diHOCOMOCO

Folders and files

Latest commit

History

Repository files navigation

diHOCOMOCO

About

Topics

Resources

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

Packages