diff --git a/README.rst b/README.rst
index 1fec174..1377f41 100644
--- a/README.rst
+++ b/README.rst
@@ -35,7 +35,7 @@ Workflow
 Documentation
 -------------
 
-To see the full documentation of MAmoitf, please refer to: http://bioinfo.sibs.ac.cn/shaolab/mamotif/index.php
+To see the full documentation of MAmotif, please refer to: http://mamotif.readthedocs.io/en/latest/
 
 Installation
 ------------
@@ -137,8 +137,8 @@ MAnorm output
 
 MAmotif will invoke MAnorm and output the normalization results and MA-plot for samples under comparison.
 
-Motif output
-^^^^^^^^^^^^
+MotifScan output
+^^^^^^^^^^^^^^^^
 
 MAmotif will also output tables to summarize the enrichment of motifs and the motif target number and motif-score
 of each peak region.
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 5981b3d..7310322 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -4,7 +4,7 @@ MAmotif
 .. image:: https://travis-ci.org/shao-lab/MAmotif.svg?branch=master
    :alt: Travis Build
    :target: https://travis-ci.org/shao-lab/MAmotif
-.. image:: https://readthedocs.org/projects/mamoitf/badge/?version=latest
+.. image:: https://readthedocs.org/projects/mamotif/badge/?version=latest
    :alt: Documentation Status
    :target: http://mamotif.readthedocs.io/en/latest/?badge=latest
 .. image:: https://img.shields.io/pypi/v/mamotif.svg
diff --git a/docs/source/tutorial.rst b/docs/source/tutorial.rst
index 3fbf3de..80826de 100644
--- a/docs/source/tutorial.rst
+++ b/docs/source/tutorial.rst
@@ -9,3 +9,354 @@ Tutorial
 
 Installation
 ============
+
+Like many other Python packages and bioinformatics software, MAmotif can be obtained easily from PyPI_ or Bioconda_ (WIP).
+The commands below show a convenient way to install the latest release of MAmotif; alternatively, you can install it
+from source code.
+
+Prerequisites
+-------------
+
+.. tip::
+    MAmotif is implemented for **Python 2.7**; support for **Python 3.X** is planned for future updates.
+
+* **Python 2.7**
+* setuptools
+* numpy
+* pandas
+* statsmodels
+* scipy
+* matplotlib
+
+Install with pip
+----------------
+The latest release of MAmotif is available on PyPI_. You can install it via ``pip``::
+
+    $ pip install mamotif
+
+.. _PyPI: https://pypi.python.org/pypi/MAmotif
+
+Install with conda (WIP)
+------------------------
+
+You can also install MAmotif with conda_ through the Bioconda_ channel::
+
+    $ conda install -c bioconda mamotif
+
+.. _conda: https://conda.io/docs/
+.. _Bioconda: https://bioconda.github.io/
+
+Install from source code
+------------------------
+
+It's highly recommended to install MAmotif with ``pip`` or ``conda``. If you prefer to install it from source code,
+please follow the steps below:
+
+The source code of MAmotif is hosted on GitHub_, and setuptools_ is required for installation.
+
+.. _setuptools: https://setuptools.readthedocs.io/en/latest/
+.. _GitHub: https://github.com/shao-lab/MAmotif
+
+First, clone the repository of MAmotif::
+
+    $ git clone https://github.com/shao-lab/MAmotif.git
+
+Then, install MAmotif in the source directory::
+
+    $ cd MAmotif
+    $ python setup.py install
+
+.. note::
+    * You may need to install all dependencies listed in ``requirements.txt``.
+    * You may need to modify ``$PATH`` and ``$PYTHONPATH`` manually to make it work.
+
+Galaxy Installation
+-------------------
+
+WIP
+
+Usage of MAmotif
+================
+
+To check whether MAmotif is properly installed, you can check its version with the ``-v/--version`` option::
+
+    $ mamotif -v
+    $ mamotif --version
+
+Command-Line Usage
+------------------
+
+You need to build some prerequisites before running MAmotif:
+
+Build genomes
+^^^^^^^^^^^^^
+
+Preprocess sequences and genome-wide nucleotide frequency for the corresponding genome assembly.
+
+::
+
+    $ genomecompile [-h] [-v] -G hg19.fa -o hg19_genome
+
+**Note:** You only need to run this command once for each genome.
+
+Options
+"""""""
+
+-h, --help  Show help message and exit.
+-v, --version  Show version number and exit.
+-G  **[Required]** Genome sequences in fasta format.
+-o  **[Required]** Path to write the output files.
+
+Build motifs (Optional)
+^^^^^^^^^^^^^^^^^^^^^^^
+
+**Note:** MAmotif provides some preprocessed motif PWM files under **data/motif** of the MotifScan package.
+
+Build the motif PWM and motif-score cutoffs for custom motifs that are not included in our pre-compiled motif collection:
+
+::
+
+    $ motifcompile [-h] [-v] -M motif_pwm_demo.txt -g hg19_genome -o hg19_motif
+
+Options
+"""""""
+
+-h, --help  Show help message and exit.
+-v, --version  Show version number and exit.
+-M  **[Required]** Raw motif PFM (Position Frequency Matrix) file.
+-g  **[Required]** Path of pre-compiled genome directory (generated by `genomecompile`).
+-o  **[Required]** Prefix of output file.
+
+Run MAmotif
+^^^^^^^^^^^
+
+MAmotif provides a console script ``mamotif`` for running the program. The basic usage is as follows:
+
+::
+
+    $ mamotif --p1 sample1_peaks.bed --p2 sample2_peaks.bed --r1 sample1_reads.bed --r2 sample2_reads.bed -g hg19_genome \
+        -m hg19_motif_p1e-4.txt -o sample1_vs_sample2
+
+.. tip::
+    Please use ``-h/--help`` for the details of all options.
+
+Options
+"""""""
+
+-h, --help  Show help message and exit.
+-v, --version  Show version number and exit.
+--p1  **[Required]** Peaks file of sample1.
+--p2  **[Required]** Peaks file of sample2.
+--r1  **[Required]** Reads file of sample1.
+--r2  **[Required]** Reads file of sample2.
+--s1  Reads shiftsize of sample1. Default: 100
+--s2  Reads shiftsize of sample2. Default: 100
+-g  **[Required]** Path of pre-compiled genome directory (generated by `genomecompile`).
+-m  **[Required]** Pre-compiled motif file (generated by `motifcompile`).
+-a  Gene annotation file, which is used to generate random controls when performing enrichment analysis.
+-w  Width of window to calculate read density. Default: 1000
+-d  Summit-to-summit distance cutoff for common peaks. Default: ``-w``/2
+-n  Number of simulations to test the enrichment of peak overlap between two samples.
+--m_cutoff  *M-value* cutoff to distinguish biased (sample-specific) peaks from unbiased peaks.
+-p  *P-value* cutoff to define biased peaks.
+-l  Motif list file.
+-r  Perform MAmotif on {all,promoter,distal} regions.
+--upstream  Upstream distance to TSS to define promoter regions.
+--downstream  Downstream distance to TSS to define promoter regions.
+--peak_length  The length of input regions to perform motif scan around peak summit/midpoint.
+--negative  Use the negative test (sample2 vs sample1).
+--correction  Type of multiple test correction [benjamin, bonferroni].
+-s  Detailed output mode. Write the normalization results for original peaks and the genome coordinates
+    of target sites for each motif.
+-o  Comparison name, which is used as the folder name and prefix of output files.
+
+Input Format
+============
+
+Format of Peaks file
+--------------------
+
+Standard **BED** and **MACS xls** formats are supported. Other supported formats are listed below::
+
+    * 3-column tab-separated format
+
+    # chr   start   end
+    chr1    2345    4345
+    chr1    3456    5456
+    chr2    6543    8543
+
+    * 4-column tab-separated format
+
+    # chr   start   end     summit
+    chr1    2345    4345    254
+    chr1    3456    5456    127
+    chr2    6543    8543    302
+
+.. note::
+    The fourth column **summit** is the position relative to **start**.
+
+
+Format of Reads file
+--------------------
+
+Only **BED** format is supported for now. More formats will be supported in future updates.
+
+Format of Motif PWM file
+------------------------
+
+MAmotif supports JASPAR_ 2014/2016/2018 motif matrix format.
+
+JASPAR2014::
+
+    >MA0004.1 Arnt
+    4   19  0   0   0   0
+    16  0   20  0   0   0
+    0   1   0   20  0   20
+    0   0   0   0   20  0
+
+JASPAR2016/2018::
+
+    >MA0004.1 Arnt
+    A [  4 19  0  0  0  0 ]
+    C [ 16  0 20  0  0  0 ]
+    G [  0  1  0 20  0 20 ]
+    T [  0  0  0  0 20  0 ]
+
+.. _JASPAR: http://jaspar.genereg.net/
+
+Format of Gene annotation file
+------------------------------
+
+MAmotif supports RefSeq_ format for gene annotation.
+
+.. _RefSeq: http://genome.ucsc.edu/cgi-bin/hgTables
+
+MAmotif Output
+==============
+
+After MAmotif finishes running, all output files are written to the directory you specified with the ``-o`` argument.
+
+Main output
+-----------
+
+::
+
+    1.Motif Name
+    2.Target Number: Number of motif-present peaks
+    3.Average of Target M-value: Average M-value of motif-present peaks
+    4.Deviation of Target M-value: M-value Std of motif-present peaks
+    5.Non-target Number: Number of motif-absent peaks
+    6.Average of Non-target M-value: Average M-value of motif-absent peaks
+    7.Deviation of Non-target M-value: M-value Std of motif-absent peaks
+    8.T-test Statistics: T-Statistics for M-values of motif-present peaks against motif-absent peaks
+    9.T-test P-value: Right-tailed P-value of T-test
+    10.T-test P-value By Benjamin correction
+    11.RankSum-test Statistics
+    12.RankSum-test P-value
+    13.RankSum-test P-value By Benjamin correction
+    14.Maximal P-value: Maximal corrected P-value of T-test and RankSum-test
+
+MAnorm output
+-------------
+
+MAmotif will invoke MAnorm and output the normalization results and MA-plot for samples under comparison.
+
+1. output_prefix_all_MAvalues.xls
+
+This is the main output of MAnorm, which contains the M-A values and normalized read density of each peak.
+Common peaks from the two samples are merged together::
+
+    1.chr: chromosome name
+    2.start: start position of the peak
+    3.end: end position of the peak
+    4.summit: summit position of the peak (relative to start)
+    5.m_value: M value (log2 Fold change) of normalized read densities under comparison
+    6.a_value: A value (average signal strength) of normalized read densities under comparison
+    7.p_value
+    8.peak_group: indicates which sample the peak comes from
+    9.normalized_read_density_in_sample1
+    10.normalized_read_density_in_sample2
+
+
+.. note::
+    Coordinates in the .xls file are **1-based**.
+
+2. output_filters/
+
+   * sample1_biased_peaks.bed
+   * sample2_biased_peaks.bed
+   * output_name_unbiased_peaks.bed
+
+3. output_tracks/
+
+   * output_name_M_values.wig
+   * output_name_A_values.wig
+   * output_name_P_values.wig
+
+4. output_figures/
+
+   * output_name_MA_plot_before_normalization.png
+   * output_name_MA_plot_after_normalization.png
+   * output_name_MA_plot_with_P-value.png
+   * output_name_read_density_on_common_peaks.png
+
+MotifScan output
+----------------
+
+MAmotif will also output tables to summarize the enrichment of motifs and the motif target number and motif-score
+of each peak region.
+
+If you specified ``-s`` with MAmotif, it will also output the genome coordinates of every motif target site.
+
+1. motif_enrichment.csv
+
+Enrichment of motifs in the given peaks compared to random regions. All analyzed motifs are listed and sorted by
+enrichment p-value in ascending order.
+
+2. peak_motif_score.csv
+
+The table has two parts: the first 5 columns hold region information derived from the region file that the user
+specified, and the remaining columns hold the motif scores. Each motif has a score measuring
+the binding affinity for each region sequence.
+
++------+-------+-------+--------+-------+------------+-------------+-----+
+| chr  | start | end   | summit | score | IRF2.score | GATA2.score | ... |
++======+=======+=======+========+=======+============+=============+=====+
+| chr1 | 10012 | 10256 | 10135  | 64.21 | 0.82       | 0.35        | ... |
++------+-------+-------+--------+-------+------------+-------------+-----+
+| ...  |       |       |        |       |            |             |     |
++------+-------+-------+--------+-------+------------+-------------+-----+
+
+3. peak_motif_tarnum.csv
+
+This table also gives per-region detail: the number of target sites of each motif in each region. The file structure
+is similar to peak_motif_score.csv, except that the motif columns report the motif target number instead of the motif
+score.
+
++------+-------+-------+--------+-------+-------------+--------------+-----+
+| chr  | start | end   | summit | score | IRF2.number | GATA2.number | ... |
++======+=======+=======+========+=======+=============+==============+=====+
+| chr1 | 10012 | 10256 | 10135  | 64.21 | 0.82        | 0.35         | ... |
++------+-------+-------+--------+-------+-------------+--------------+-----+
+| ...  |       |       |        |       |             |              |     |
++------+-------+-------+--------+-------+-------------+--------------+-----+
+
+4. motif_target_sites/*
+
+Only appears when option ``-s`` is on. The directory contains the target site information of all candidate motifs.
+Each motif has its own file, named [motif_name]_target_site.txt. The first 3 columns are the genome coordinates of
+the motif target site. The 4th column is the corresponding target sequence, and the motif score of
+this motif occurrence is given in the last column.
+
++------+-------+-------+----------+-------------+
+| chr  | start | end   | sequence | motif score |
++======+=======+=======+==========+=============+
+| chr1 | 10012 | 10256 | AATCGAT  | 0.57        |
++------+-------+-------+----------+-------------+
+| ...  |       |       |          |             |
++------+-------+-------+----------+-------------+
+
+5. plot/
+
+In this directory, a motif enrichment plot and a plot of the motif distribution relative to the peak summit/center
+are generated for each motif.
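
As a rough illustration of how the ``peak_motif_score.csv`` table described above might be consumed downstream, the sketch below splits each row into its region part (the first 5 columns) and its motif-score part, then picks the highest-scoring motif per region. It is not part of MAmotif. The comma delimiter, the ``.score`` column suffix handling, the helper name ``load_motif_scores`` and the sample values are all assumptions made for illustration. The sketch targets Python 3.

```python
import csv
import io

# Hypothetical excerpt of peak_motif_score.csv. The comma delimiter and the
# values are assumptions for illustration, not real MAmotif output.
SAMPLE = """chr,start,end,summit,score,IRF2.score,GATA2.score
chr1,10012,10256,10135,64.21,0.82,0.35
chr1,20500,20780,20640,33.10,0.10,0.91
"""

REGION_COLUMNS = 5  # chr, start, end, summit, score


def load_motif_scores(handle):
    """Return (motif_names, rows); each row is ((chr, start, end), {motif: score})."""
    reader = csv.reader(handle)
    header = next(reader)
    # Strip the trailing ".score" suffix from each motif column name.
    motif_names = [name.rsplit(".", 1)[0] for name in header[REGION_COLUMNS:]]
    rows = []
    for record in reader:
        region = (record[0], int(record[1]), int(record[2]))
        values = record[REGION_COLUMNS:]
        scores = {motif: float(value) for motif, value in zip(motif_names, values)}
        rows.append((region, scores))
    return motif_names, rows


motif_names, rows = load_motif_scores(io.StringIO(SAMPLE))
# Highest-scoring motif for each region.
best = {region: max(scores, key=scores.get) for region, scores in rows}
```

For a real file, pass an opened file handle instead of the in-memory sample; if the actual delimiter or column naming differs, adjust ``csv.reader`` and the suffix handling accordingly.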