
Python package

yaowen edited this page Dec 4, 2021 · 7 revisions

MetaLogo Python Package

MetaLogo provides a standalone package so that users can draw figures on their own computer or server. There are two ways to make sequence logos with the MetaLogo package: import MetaLogo into your Python scripts and create logos with specific parameters, or run MetaLogo directly in a system terminal and pass arguments to customize the logos. Both ways share the same set of parameters, which this tutorial explains.

Run MetaLogo in your terminal

Once you have installed MetaLogo, you can run it in your terminal like this:

$ metalogo --seq_file MetaLogo/examples/example.fa --output_dir . --output_name test.png --withtree

If you do not want to install MetaLogo as a package in your system, you can also directly run MetaLogo like below:

$ python -m MetaLogo.MetaLogo.entry --seq_file MetaLogo/examples/example.fa --output_dir . --output_name test.png --withtree

Note that this command must be run from the directory that contains the MetaLogo source code (the MetaLogo project directory), and that all the requirements must already be installed.

If the command runs successfully, you will get a plot named test.png in your current directory.
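When a script needs to launch the command-line tool, the call above can be assembled from a dictionary of parameters. The following is a minimal sketch rather than part of MetaLogo: `build_metalogo_command` is a hypothetical helper, and the sketch assumes the `metalogo` executable is on your PATH. Only the option names are taken from this page.

```python
import subprocess


def build_metalogo_command(options, flags=()):
    """Build an argv list for the metalogo command line.

    Hypothetical helper for illustration only.
    options: dict of parameters that take a value, e.g. {'seq_file': 'a.fa'}
    flags:   names of switches without a value, e.g. ('withtree',)
    """
    cmd = ["metalogo"]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    for name in flags:
        cmd.append(f"--{name}")
    return cmd


cmd = build_metalogo_command(
    {
        "seq_file": "MetaLogo/examples/example.fa",
        "output_dir": ".",
        "output_name": "test.png",
    },
    flags=("withtree",),
)
print(" ".join(cmd))

# Uncomment to actually run the tool (requires an installed MetaLogo):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with values such as a task name that contains spaces.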

server_top

Below are the parameters you can pass into MetaLogo.

usage: metalogo [-h] [--config CONFIG]
                [--type {Horizontal,Circle,Radiation,Threed}]
                [--seq_file SEQ_FILE] [--seq_file_type {fasta,fastq}]
                [--sequence_type {auto,dna,rna,aa}] [--task_name TASK_NAME]
                [--min_length MIN_LENGTH] [--max_length MAX_LENGTH]
                [--group_strategy {auto,length,identifier}]
                [--clustering_method CLUSTERING_METHOD]                            
                [--group_resolution GROUP_RESOLUTION]
                [--group_limit GROUP_LIMIT]
                [--group_order {length,length_reverse,identifier,identifier_reverse}]
                [--color_scheme {basic_dna_color,basic_rna_color,basic_aa_color}]
                [--color_scheme_json_str COLOR_SCHEME_JSON_STR]
                [--color_scheme_json_file COLOR_SCHEME_JSON_FILE]
                [--height_algorithm {bits,bits_without_correction,probabilities}]
                [--align] [--padding_align]
                [--align_metric {dot_product,js_divergence,cosine,entropy_bhattacharyya}]
                [--connect_threshold CONNECT_THRESHOLD]
                [--gap_score GAP_SCORE]
                [--display_range_left DISPLAY_RANGE_LEFT]
                [--display_range_right DISPLAY_RANGE_RIGHT] [--withtree]
                [--logo_margin_ratio LOGO_MARGIN_RATIO]
                [--column_margin_ratio COLUMN_MARGIN_RATIO]
                [--char_margin_ratio CHAR_MARGIN_RATIO] [--hide_version_tag]
                [--hide_left_axis] [--hide_right_axis] [--hide_top_axis]
                [--hide_bottom_axis] [--hide_x_ticks] [--hide_y_ticks]
                [--hide_z_ticks] [--x_label X_LABEL] [--y_label Y_LABEL]
                [--z_label Z_LABEL] [--show_group_id] [--show_grid]
                [--title_size TITLE_SIZE] [--label_size LABEL_SIZE]
                [--tick_size TICK_SIZE] [--group_id_size GROUP_ID_SIZE]
                [--figure_size_x FIGURE_SIZE_X]
                [--figure_size_y FIGURE_SIZE_Y] [--auto_size]
                [--align_color ALIGN_COLOR] [--align_alpha ALIGN_ALPHA]
                [--output_dir OUTPUT_DIR] [--output_name OUTPUT_NAME]
                [--fa_output_dir FA_OUTPUT_DIR] [--uid UID]
                [--logo_format {png,pdf}] [--analysis]
                [--clustalo_bin CLUSTALO_BIN] [--fasttree_bin FASTTREE_BIN]
                [--fasttreemp_bin FASTTREEMP_BIN]
                [--treecluster_bin TREECLUSTER_BIN] [-v]
optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       The config file contain sequences (default: None)
  --type {Horizontal,Circle,Radiation,Threed}
                        Choose the layout type of sequence logo (default: Horizontal)
  --seq_file SEQ_FILE   The input file contain sequences (default: None)
  --seq_file_type {fasta,fastq}
                        The type of input file (default: fasta)
  --sequence_type {auto,dna,rna,aa}
                        The type of sequences (default: auto)
  --task_name TASK_NAME
                        The title to displayed on the figure (default: MetaLogo)
  --min_length MIN_LENGTH
                        The minimum length of sequences to be included (default: 8)
  --max_length MAX_LENGTH
                        The maximum length of sequences to be included (default: 20)
  --group_strategy {auto,length,identifier}
                        The strategy to separate sequences into groups (default: auto)
  --clustering_method CLUSTERING_METHOD
                        The method for tree clustering (default: max)
  --group_resolution GROUP_RESOLUTION
                        The resolution for sequence grouping (default: 0.5)
  --group_limit GROUP_LIMIT
                        The limit for group number (default: 20)
  --group_order {length,length_reverse,identifier,identifier_reverse}
                        The order of groups (default: length)
  --color_scheme {basic_dna_color,basic_rna_color,basic_aa_color}
                        The color scheme (default: basic_dna_color)
  --color_scheme_json_str COLOR_SCHEME_JSON_STR
                        The json string of color scheme (default: None)
  --color_scheme_json_file COLOR_SCHEME_JSON_FILE
                        The json file of color scheme (default: None)
  --height_algorithm {bits,bits_without_correction,probabilities}
                        The algorithm for character height (default: bits)
  --align               If show alignment of adjacent sequence logo (default: False)
  --padding_align       If padding logos to make multiple logo alignment (default: False)
  --align_metric {dot_product,js_divergence,cosine,entropy_bhattacharyya} 
                        The metric for align score (default: dot_product)
  --connect_threshold CONNECT_THRESHOLD
                        The align threshold (default: 0.8)
  --gap_score GAP_SCORE
                        The gap score for alignment (default: -1.0)
  --display_range_left DISPLAY_RANGE_LEFT
                        The start position of display range (Global alignment with padding required) (default: 0)
  --display_range_right DISPLAY_RANGE_RIGHT
                        Then end position of display range (Global alignment with padding requirement) (default: -1)
  --withtree            If show tree besides sequence logo (default: False)
  --logo_margin_ratio LOGO_MARGIN_RATIO
                        Margin ratio between the logos (default: 0.1)
  --column_margin_ratio COLUMN_MARGIN_RATIO
                        Margin ratio between the columns (default: 0.05)
  --char_margin_ratio CHAR_MARGIN_RATIO
                        Margin ratio between the chars (default: 0.05)
  --hide_version_tag    If show version tag of MetaLogo (default: False)
  --hide_left_axis      If hide left axis (default: False)
  --hide_right_axis     If hide right axis (default: False)
  --hide_top_axis       If hide top axis (default: False)
  --hide_bottom_axis    If hide bottom axis (default: False)
  --hide_x_ticks        If hide ticks of X axis (default: False)
  --hide_y_ticks        If hide ticks of Y axis (default: False)
  --hide_z_ticks        If hide ticks of Z axis (default: False)
  --x_label X_LABEL     The label for X axis (default: None)
  --y_label Y_LABEL     The label for Y axis (default: None)
  --z_label Z_LABEL     The label for Z axis (default: None)
  --show_group_id       If show group ids (default: False)
  --show_grid           If show background grid (default: False)
  --title_size TITLE_SIZE
                        The size of figure title (default: 20)
  --label_size LABEL_SIZE
                        The size of figure xy labels (default: 10)
  --tick_size TICK_SIZE
                        The size of figure ticks (default: 10)
  --group_id_size GROUP_ID_SIZE
                        The size of group labels (default: 10)
  --figure_size_x FIGURE_SIZE_X
                        The width of figure (default: 20)
  --figure_size_y FIGURE_SIZE_Y
                        The height of figure (default: 10)
  --auto_size           Let MetaLogo determine the size of figures (default: False)
  --align_color ALIGN_COLOR
                        The color of alignment (default: blue)
  --align_alpha ALIGN_ALPHA
                        The transparency of alignment (default: 0.2)
  --output_dir OUTPUT_DIR
                        Output path of figure (default: figure_output)
  --output_name OUTPUT_NAME
                        Output name of figure (default: test.png)
  --fa_output_dir FA_OUTPUT_DIR
                        Output path of fas (default: sequence_input)
  --uid UID             Task id (default:
                        a878a8bd-6818-4bf8-91ad-f148d67f849a)
  --logo_format {png,pdf}
                        The format of figures (default: png)
  --analysis            If perform basic analysis on data (default: False)
  --clustalo_bin CLUSTALO_BIN
                        The path of clustalo bin (default:
                        dependencies/clustalo)
  --fasttree_bin FASTTREE_BIN
                        The path of fasttree bin (default:
                        dependencies/FastTree)
  --fasttreemp_bin FASTTREEMP_BIN
                        The path of fasttreeMP bin (default:
                        dependencies/FastTreeMP)
  --treecluster_bin TREECLUSTER_BIN
                        The path of treecluster bin (default: TreeCluster.py)
  -v, --version         show program's version number and exit

Most of the parameters are self-explanatory, but several of them need further explanation.

  --group_strategy {auto,length,identifier}
                        The strategy to separate sequences into groups
                        (default: auto)

This parameter specifies how sequences are grouped. By default, MetaLogo groups sequences by phylogenetic tree: multiple sequence alignment and phylogenetic tree construction are performed automatically to cluster the sequences.

However, you can also group sequences with another strategy. MetaLogo can read group information from sequence names. Below is an example:

>seq1 group@1-fisrtgroup
AATATACAGATACCCATAC
>seq2 group@2-secondgroup
ATACAATACCCACAGATAC

You need to add a 'group@\d-\S' pattern to your sequence names. In this pattern, 'group@' is fixed and is followed by a number, a dash, and a string that indicates the group. If you then set --group_strategy to 'identifier', MetaLogo will draw a separate sequence logo for each group. Note that all sequences within a group must have the same length. Below is the output for an identifier-grouped input (probabilities as height, 3D layout):

$ cat test.fa
>seq1 group@1-fisrtgroup
AATATACAGATACCCATAC
>seq2 group@2-secondgroup
ATACAATACCCACAGATAC

$ metalogo --seq_file test.fa --output_dir . --output_name test.png --height_algorithm probabilities --group_strategy identifier --type Threed --show_group_id

identifier_grouping
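The naming convention described above can be checked before running MetaLogo. The sketch below is not MetaLogo's own parser: `group_by_identifier` is a hypothetical helper, the regular expression uses `\d+` and `\S+` on the assumption that multi-digit ids and multi-character labels are intended, and skipping names without a tag is also an assumption.

```python
import re
from collections import defaultdict

# Assumption: ids may have several digits and labels several characters.
GROUP_PATTERN = re.compile(r"group@(\d+)-(\S+)")


def group_by_identifier(records):
    """Group (name, sequence) pairs by the number that follows 'group@'.

    Names without a group tag are skipped (an assumption of this sketch).
    Raises ValueError if sequences within one group differ in length.
    """
    groups = defaultdict(list)
    for name, seq in records:
        match = GROUP_PATTERN.search(name)
        if match is None:
            continue
        groups[int(match.group(1))].append(seq)
    for group_id, seqs in groups.items():
        if len({len(s) for s in seqs}) != 1:
            raise ValueError(f"group {group_id} mixes sequence lengths")
    return dict(groups)


records = [
    ("seq1 group@1-fisrtgroup", "AATATACAGATACCCATAC"),
    ("seq2 group@2-secondgroup", "ATACAATACCCACAGATAC"),
]
groups = group_by_identifier(records)
print(sorted(groups))
```

Running a check like this first makes the equal-length requirement fail early, with a message that names the offending group.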

--group_order {length,length_reverse,identifier,identifier_reverse}
              The order of groups (default: length)

This parameter specifies how the groups are ordered. 'length' sorts groups by sequence length, 'length_reverse' sorts groups by sequence length in decreasing order, 'identifier' sorts groups by the group id given in the sequence names (the number that follows 'group@'), and 'identifier_reverse' sorts by group id in decreasing order.

--max_length, --min_length

These two parameters specify the range of sequence lengths to include when drawing logos. When the range of sequence lengths in the input is too wide to visualize well, users can restrict which lengths are used for the sequence logos.
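The effect of the two length limits can be pictured as a simple filter. This is a sketch under assumptions: `filter_by_length` is a hypothetical helper, the defaults 8 and 20 are taken from the usage text above, and treating both bounds as inclusive is a guess rather than documented behavior.

```python
def filter_by_length(records, min_length=8, max_length=20):
    """Keep (name, sequence) pairs whose length lies within the limits.

    Both bounds are treated as inclusive, which is an assumption.
    """
    return [
        (name, seq)
        for name, seq in records
        if min_length <= len(seq) <= max_length
    ]


records = [
    ("short", "ACGT"),                        # 4 characters, dropped
    ("ok", "ATACAGATACACATCACAG"),            # 19 characters, kept
    ("long", "TTGGAGCGATGCGCCCGGACATC"),      # 23 characters, dropped
]
kept = filter_by_length(records)
print([name for name, _ in kept])
```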

--color_scheme_json_file

This parameter specifies a JSON file that defines the color scheme for the sequence logo. There are three built-in schemes, namely basic_dna_color, basic_rna_color and basic_aa_color. Users can also supply a custom scheme as a JSON-formatted Python dict. Below is an example:

$ cat color.json 
{"A": "red", "T": "blue", "G": "yellow", "C": "green"}
$ metalogo --seq_file MetaLogo/examples/ectf.fa --color_scheme_json_file color.json --output_dir .

custom_color
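A color file like the one above can also be generated from Python instead of being written by hand. The sketch below uses only the standard `json` module; the temporary directory and the variable names are illustrative, and using the `json.dumps` output as the value of `--color_scheme_json_str` is an assumption based on that option's help text.

```python
import json
import os
import tempfile

# Character-to-color mapping, same content as the color.json example.
custom_color = {"A": "red", "T": "blue", "G": "yellow", "C": "green"}

# A JSON string, presumably usable with --color_scheme_json_str.
json_string = json.dumps(custom_color)

# A JSON file for --color_scheme_json_file, written to a temporary folder.
workdir = tempfile.mkdtemp()
color_path = os.path.join(workdir, "color.json")
with open(color_path, "w") as handle:
    json.dump(custom_color, handle)

# Read the file back to confirm that it is valid JSON.
with open(color_path) as handle:
    loaded = json.load(handle)
print(loaded == custom_color)
```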

--height_algorithm {bits,bits_without_correction,probabilities}
                  The algorithm for character height (default: bits)

This parameter tells MetaLogo whether to use probabilities or information content for the y axis of sequence logos. If a group contains only one sequence, the information content of every position is zero because of the error correction. This is why we sometimes use probabilities as height in this tutorial.

$ metalogo --seq_file MetaLogo/examples/ectf.fa --output_dir . --height_algorithm probabilities

prob
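To see why a single-sequence group ends up with zero height under 'bits', consider the small-sample correction commonly used for sequence logos, e(n) = (s - 1) / (2 ln(2) n), where s is the alphabet size and n the number of sequences. The sketch below assumes MetaLogo's correction has this form and that negative values are clipped to zero; both are assumptions, and `column_heights` is a hypothetical helper.

```python
import math
from collections import Counter


def column_heights(column, alphabet_size=4, correct=True):
    """Return (information_content, {char: height}) for one logo column.

    column: the characters observed at one position, one per sequence.
    The correction term and the clipping at zero are assumptions.
    """
    n = len(column)
    probs = {char: count / n for char, count in Counter(column).items()}
    entropy = -sum(p * math.log2(p) for p in probs.values())
    correction = (alphabet_size - 1) / (2 * math.log(2) * n) if correct else 0.0
    info = max(0.0, math.log2(alphabet_size) - entropy - correction)
    return info, {char: p * info for char, p in probs.items()}


single, _ = column_heights("A")            # one sequence in the group
many, _ = column_heights("A" * 100)        # one hundred identical sequences
raw, _ = column_heights("A", correct=False)
print(single, round(many, 3), raw)
```

With one DNA sequence the correction term is about 2.16 bits, which exceeds the 2-bit maximum, so the corrected information content is clipped to zero while the probability of the observed character is still 1.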

--align               If show alignment of adjacent sequence logo (default: False)

When you pass this parameter to MetaLogo, it will try to align each pair of adjacent groups and highlight the similar positions.

$ metalogo --seq_file MetaLogo/examples/ectf.fa --output_dir . --height_algorithm probabilities --align

align

For the alignment metric and threshold, see the --align_metric and --connect_threshold parameters.

--padding_align       If padding logos to make multiple logo alignment
                      (default: False)

This parameter is only valid for user-defined grouping. If --group_strategy is set to 'auto', it has no effect. With length grouping or identifier grouping, it makes MetaLogo perform a multiple logo alignment across all groups rather than only between adjacent groups.

$ metalogo --seq_file MetaLogo/examples/ectf.fa --output_dir . --height_algorithm probabilities --group_strategy length --align --padding_align --show_grid --connect_threshold 0.6

padding_align

--align_metric {dot_product,js_divergence,cosine,entropy_bhattacharyya}
                The metric for align score (default: dot_product)

This parameter specifies the algorithm used to measure the similarity of positions between sequence logos. Detailed information can be found in our paper.
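For intuition, three of these metrics can be written down for a pair of positions, each represented as a probability vector over the alphabet. These are textbook definitions, not MetaLogo's code; the exact formulas MetaLogo uses are in the paper. Turning the Jensen-Shannon divergence into a similarity as 1 minus the divergence is an assumption of this sketch, and entropy_bhattacharyya is left out because its definition is not given on this page.

```python
import math


def dot_product(p, q):
    """Sum of element-wise products of two probability vectors."""
    return sum(a * b for a, b in zip(p, q))


def cosine(p, q):
    """Cosine of the angle between the two vectors."""
    norm = math.sqrt(dot_product(p, p)) * math.sqrt(dot_product(q, q))
    return dot_product(p, q) / norm if norm else 0.0


def js_similarity(p, q):
    """1 minus the base-2 Jensen-Shannon divergence (1 means identical)."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 1 - (kl(p, mid) + kl(q, mid)) / 2


# Probabilities of A, C, G, T at two positions.
pos1 = [0.7, 0.1, 0.1, 0.1]
pos2 = [0.0, 1.0, 0.0, 0.0]
for metric in (dot_product, cosine, js_similarity):
    print(metric.__name__, round(metric(pos1, pos1), 3), round(metric(pos1, pos2), 3))
```

Note that the dot product of a position with itself is below 1 unless one character has probability 1, so the same --connect_threshold value can behave differently from one metric to another.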

--connect_threshold CONNECT_THRESHOLD
                    The align threshold (default: 0.8)

This parameter specifies the threshold for connecting two positions between two adjacent groups according to the logo alignment. If the threshold is positive (>0), MetaLogo connects two positions when their similarity score is larger than the threshold. If the threshold is negative (<0), MetaLogo connects two positions when their similarity score is in the top (ratio*100)% of all pairs, where ratio equals -1*threshold.
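The two modes of the threshold can be sketched as follows. `connected_pairs` is a hypothetical helper, not MetaLogo's implementation; how a fractional pair count is rounded and what happens when the threshold is exactly 0 are not stated on this page, so truncation and "connect nothing" are assumptions.

```python
def connected_pairs(scores, threshold=0.8):
    """Select which position pairs to connect.

    scores: dict mapping (position_in_logo_a, position_in_logo_b) to a
            similarity score.
    threshold > 0: keep pairs whose score is larger than the threshold.
    threshold < 0: keep the top (-threshold * 100)% highest-scoring pairs.
    """
    if threshold > 0:
        return [pair for pair, score in scores.items() if score > threshold]
    ratio = -threshold
    keep = int(len(scores) * ratio)  # truncation is an assumption
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]


scores = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.5, (1, 1): 0.85}
print(connected_pairs(scores, 0.8))     # absolute cutoff
print(connected_pairs(scores, -0.25))   # top 25% of all pairs
```

A negative threshold is convenient when scores from different metrics are not on a comparable scale, because it selects a fixed fraction of pairs instead of an absolute cutoff.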

--align_color ALIGN_COLOR
                      The color of alignment (default: blue)
--align_alpha ALIGN_ALPHA
                      The transparency of alignment (default: 0.2)

These two parameters specify the color and transparency of the connections between logos.

--analysis            If perform basic analysis on data (default: False)

Below is an example of logo alignment.

$ metalogo --seq_file examples/ectf.fa --show_group_id --align --padding_align --connect_threshold -0.3 --task_name 'Logo alignment' --show_grid

logo_alignment

Below is an example of logo alignment without global multiple logo alignment and padding.

$ metalogo --seq_file examples/ectf.fa --show_group_id --align --connect_threshold -0.3 --task_name 'Logo alignment' --show_grid

logo_alignment

--logo_margin_ratio LOGO_MARGIN_RATIO
                      Margin ratio between the logos (default: 0.1)
--column_margin_ratio COLUMN_MARGIN_RATIO
                      Margin ratio between the columns (default: 0.05)
--char_margin_ratio CHAR_MARGIN_RATIO
                      Margin ratio between the chars (default: 0.05)

These three parameters specify the proportional margins between different items.

margins

Other parameters are easy to understand according to their names. Most of them are helpful for users to plot custom sequence logos.

If you pass --analysis, MetaLogo will perform basic analysis on the data you input and output related figures in the output directory. Please check the MetaLogo paper or Web Server for details.

MetaLogo saves all intermediate results; you can specify their location with --fa_output_dir. The files include:

    server.d359d94e-8619-4ff0-8b03-62995a023877.dep.fa  # de-duplicated fasta
    server.d359d94e-8619-4ff0-8b03-62995a023877.fasttree.cluster  # tree clustering result
    server.d359d94e-8619-4ff0-8b03-62995a023877.fasttree.rawid.tree  # phylogenetic tree with raw sequence name
    server.d359d94e-8619-4ff0-8b03-62995a023877.fasttree.tree  # phylogenetic tree with new sequence name
    server.d359d94e-8619-4ff0-8b03-62995a023877.grouping.fa  # grouping details
    server.d359d94e-8619-4ff0-8b03-62995a023877.msa.fa  # multiple sequence alignment results
    server.d359d94e-8619-4ff0-8b03-62995a023877.msa.rawid.fa  # multiple sequence alignment results with raw sequence name
    server.d359d94e-8619-4ff0-8b03-62995a023877.treedists.csv  # sequence distances in the phylogenetic tree

Import MetaLogo into your scripts

After installing MetaLogo as a Python package, you can easily import it into your scripts or notebooks. Below is a simple example.

    from MetaLogo import logo
    sequences = [['seq1','ATACAGATACACATCACAG'],['seq2','ATACAGAGATACCAACAGAC'],['seq3','ATACAGAGTTACCCACGGAC']]
    bin_args = {
        'clustalo_bin':'../MetaLogo/dependencies/clustalo',
        'fasttree_bin':'../MetaLogo/dependencies/FastTree',
        'fasttreemp_bin':'../MetaLogo/dependencies/FastTreeMP',
        }
    lg = logo.LogoGroup(sequences,height_algorithm='probabilities',group_strategy='length', **bin_args)
    lg.draw()
    lg.savefig('test.png')

LogoGroup accepts nearly the same parameters as the standalone MetaLogo entry point described above.

    LogoGroup(self,  seqs=None, ax=None, group_order='length', group_strategy='length', group_resolution=0.5,
         clustering_method = 'max',
         start_pos = (0,0), logo_type = 'Horizontal', init_radius=1, 
         logo_margin_ratio = 0.1, column_margin_ratio = 0.05, char_margin_ratio = 0.05,
         align = True, align_metric='sort_consistency', connect_threshold=0.8, 
         radiation_head_n = 5, threed_interval = 4, color = basic_dna_color, task_name='MetaLogo',
         x_label = 'Position', y_label = 'bits',z_label = 'bits', show_grid = True, show_group_id = True,
         display_range_left = 0, display_range_right = -1,
         hide_left_axis=False, hide_right_axis=False, hide_top_axis=False, hide_bottom_axis=False,
         hide_x_ticks=False, hide_y_ticks=False, hide_z_ticks=False, 
         title_size=20, label_size=10, tick_size=10, group_id_size=10,align_color='blue',align_alpha=0.1,
         figure_size_x=-1, figure_size_y=-1,gap_score=-1, padding_align=False, hide_version_tag=False,
         sequence_type = 'auto', height_algorithm = 'bits',omit_prob = 0,
         seq_file = '', fa_output_dir = '.', output_dir = '.', uid = '',
         withtree = False,group_limit=20, target_sequence = '',
         clustalo_bin = '', fasttree_bin = '', fasttreemp_bin = '', treecluster_bin = '',
         auto_size=True,
         *args, **kwargs):

For sequences, pass a sequence array to LogoGroup as the first positional parameter. Each item in this array is a pair of a sequence name and its DNA or protein sequence. Alternatively, you can provide a sequence file through the seq_file parameter. For the color scheme, you need to pass a Python dict to LogoGroup (the color parameter) rather than a scheme name or a JSON-formatted string.
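If your sequences are stored in a FASTA file but you want to build the sequence array yourself, for example to filter or rename records first, a small reader is enough. The sketch below is a hypothetical helper, not part of MetaLogo; it returns the `[name, sequence]` pairs described above, and the temporary file is only there to make the example self-contained.

```python
import os
import tempfile


def read_fasta(path):
    """Read a FASTA file into a list of [name, sequence] pairs."""
    records, name, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if name is not None:
                    records.append([name, "".join(chunks)])
                name, chunks = line[1:], []
            else:
                chunks.append(line)
    if name is not None:
        records.append([name, "".join(chunks)])
    return records


# Write a small example file so the sketch can run on its own.
fasta_path = os.path.join(tempfile.mkdtemp(), "test.fa")
with open(fasta_path, "w") as handle:
    handle.write(">seq1\nATACAGATAC\nACATCACAG\n>seq2\nATACAGAGATACCAACAGAC\n")

sequences = read_fasta(fasta_path)
print(sequences)

# The result has the shape LogoGroup expects as its first argument, e.g.:
# lg = logo.LogoGroup(sequences, color={'A': 'red', 'T': 'blue', 'G': 'yellow', 'C': 'green'})
```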

The following figure shows the structure of MetaLogo: the class inheritance and the order in which methods are executed when a logo is drawn.

structure

When you use MetaLogo in your project, you can get the matplotlib ax object as follows:

    lg = logo.LogoGroup(sequences,height_algorithm='probabilities',**bin_args)
    lg.draw()
    ax = lg.ax  # the axes belong to the LogoGroup instance, not to the logo module

If you set withtree to True, a second matplotlib ax object (for the tree) is also available.

    lg = logo.LogoGroup(sequences,withtree=True,**bin_args)
    lg.draw()
    ax = lg.ax        # axes holding the sequence logos
    ax_tree = lg.ax0  # axes holding the tree

You can also pass an ax to the LogoGroup init function when you create a LogoGroup instance. Below is an example.

    import matplotlib.pyplot as plt
    from MetaLogo import logo
    from MetaLogo.colors import basic_dna_color_scheme,basic_aa_color_scheme,basic_rna_color_scheme

    sequences = [
                    ['seq1','ATACAGATACACATCACAG'],
                    ['seq2','ATGCAGACACAGATCATAG'],
                    ['seq3','ATACAGAGATACCAACAGAC'],
                    ['seq4','ATACAGAGTTACCCACGGAC'],
                    ['seq5','TTGGAGCGATGCGCCCGGACATC'],
                    ['seq6','TTGGAGCAAAGGCCGCGAATATC'],
                    ['seq7','CTAGAGATGC'],
                    ['seq8','ATAAACAAAC'],
                ]

    ax1 = plt.subplot(221)
    ax2 = plt.subplot(222)
    ax3 = plt.subplot(223)
    ax4 = plt.subplot(224,projection='3d')

    paras = {
        'height_algorithm':'probabilities',
        'padding_align':True,
        'task_name':'',
        'x_label':'',
        'y_label':'',
        'z_label':'',
        'hide_x_ticks':True,
        'hide_y_ticks':True,
        'hide_z_ticks':True,
        'hide_version_tag':True
    }
    custom_color = {'A':'red','T':'blue','G':'red','C':'black'}
    lg_horizontal = logo.LogoGroup(sequences,logo_type='Horizontal',color=basic_aa_color_scheme, ax=ax1,**paras)
    lg_circle = logo.LogoGroup(sequences,logo_type='Circle',ax=ax2,color=basic_dna_color_scheme,**paras)
    lg_radiation = logo.LogoGroup(sequences,logo_type='Radiation',color=basic_rna_color_scheme, ax=ax3,**paras)
    lg_3d = logo.LogoGroup(sequences,logo_type='Threed',color=custom_color, ax=ax4,**paras)

    lg_horizontal.draw()
    lg_circle.draw()
    lg_radiation.draw()
    lg_3d.draw()

subplot

If you want some basic analysis of your data, you can call several MetaLogo functions. Below are some examples from entry.py.

    fig = logogroup.get_grp_counts_figure().figure
    count_name = f'{args.output_dir}/{base_name}.counts.png'
    fig.savefig(count_name,bbox_inches='tight')
    plt.close(fig)

    fig = logogroup.get_seq_lengths_dist().figure
    lengths_name = f'{args.output_dir}/{base_name}.lengths.png'
    fig.savefig(lengths_name,bbox_inches='tight')
    plt.close(fig)


    fig = logogroup.get_entropy_figure()
    entropy_name = f'{args.output_dir}/{base_name}.entropy.png'
    fig.savefig(entropy_name,bbox_inches='tight')
    plt.close(fig)

    boxplot_entropy_name = f'{args.output_dir}/{base_name}.boxplot_entropy.png'
    fig = logogroup.get_boxplot_entropy_figure().figure
    fig.savefig(boxplot_entropy_name,bbox_inches='tight')
    plt.close(fig)

    if args.padding_align or args.group_strategy=='auto':
        clustermap_name = f'{args.output_dir}/{base_name}.clustermap.png'
        fig = logogroup.get_correlation_figure()
        if fig:
            fig.savefig(clustermap_name,bbox_inches='tight')
