Running hominid on your data

What does hominid do?

hominid performs Lasso regression to identify correlations between host genetic variation (SNPs) and microbiome taxonomic abundances.

hominid implements Lasso regression using the scikit-learn library. The Lasso regularization parameter is tuned by the LARS algorithm using default parameters and five-fold cross-validation. We use the coefficient of determination, R2, from the Lasso model as a measure of correlation.
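As an illustration of this regression step, here is a minimal sketch using scikit-learn's LassoLarsCV. It is not hominid's own code: the toy data, the variable names, and the choice of treating the SNP genotype as the response variable (with taxon abundances as predictors) are our assumptions.

```python
# Illustrative sketch only, not hominid's internal code.
# Fits a LARS-path Lasso with five-fold cross-validation and reports R2 on held-out data.
import numpy as np
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_taxa = 55, 141                  # sizes from the example below
X = rng.random((n_samples, n_taxa))          # transformed taxon abundances (toy data)
genotype = rng.integers(0, 3, n_samples)     # one SNP coded 0/1/2 (toy data)

X_train, X_test, y_train, y_test = train_test_split(
    X, genotype, test_size=0.2, random_state=0)

model = LassoLarsCV(cv=5).fit(X_train, y_train)
print("held-out R2:", model.score(X_test, y_test))   # can be negative on test data
```

Note that `score` on held-out data can be negative, which is why the R2 values in the output file can be negative (see "Output file format" below).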

hominid is an MPI program that can run on a single core or scale to multiple cores for better performance. An MPI implementation (for example, OpenMPI or MPICH) is required to run hominid even on a single core. hominid will use as many processes as are specified on the mpirun command line or PBS batch script, but performance scales only as long as one core is available for each process. See the examples for a sample command line suitable for one computer and a PBS script template suitable for most clusters.
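The sketch below shows the general MPI pattern of dividing SNPs across processes with mpi4py. It is illustrative only and is not hominid's actual parallelization code; the `analyze` function and the SNP identifiers are placeholders.

```python
# Illustrative mpi4py sketch (not hominid's implementation):
# each rank takes every size-th SNP and analyzes its share independently.
from mpi4py import MPI

def analyze(snp_id):
    # placeholder for the per-SNP Lasso regression
    return len(snp_id)

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

snp_ids = ["rs1", "rs2", "rs3", "rs4", "rs5"]    # placeholder SNP identifiers
my_snps = snp_ids[rank::size]                    # round-robin split across ranks
my_results = [(snp, analyze(snp)) for snp in my_snps]

all_results = comm.gather(my_results, root=0)    # collect everything on rank 0
if rank == 0:
    print(sum(all_results, []))
```

With more processes than available cores, the extra processes simply compete for the same cores, which is why performance stops scaling at that point.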

With an input taxon table of 55 samples and 141 taxa/covariates, on a Core i5 4690K chip, hominid took 8.8 CPU seconds per SNP on average (not counting SNPs that were not analyzed because they did not meet the filtering thresholds; see the "Preprocessing input data" section below).

hominid can be run either with or without permutation

  • Without permutation, hominid performs Lasso regression on your data.
  • With permutation, hominid permutes the sample labels. Run with permutation when you want to do permutation testing to calculate p-values.
    • "uniform_permutation" permutes all sample labels
    • "group_permutation" permutes sample labels from male subjects separately from female subjects (not mixing male and female subjects).
      If this option is selected, there must be a column 'Sex' included in the taxon table with different numeric data values (e.g., 0 and 1) for male and female subjects. Otherwise the program will abort.

Note that different runs of hominid will produce slightly different output because of the reshufflings and permutations.
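A minimal sketch of the two permutation schemes (our own illustration with numpy and pandas, assuming a 'Sex' covariate coded 0/1; this is not hominid's code):

```python
# Illustrative only: uniform vs. group permutation of sample labels.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
samples = ["S1", "S2", "S3", "S4", "S5", "S6"]
sex = pd.Series([0, 0, 0, 1, 1, 1], index=samples, name="Sex")  # 0/1 coding as required

# uniform_permutation: shuffle all sample labels at once
uniform = rng.permutation(samples)

# group_permutation: shuffle male and female labels separately, then recombine
group = np.array(samples, dtype=object)
for value in sex.unique():
    idx = np.flatnonzero(sex.values == value)   # positions of one sex group
    group[idx] = group[rng.permutation(idx)]    # shuffle labels within that group only

print("uniform:", list(uniform))
print("group:  ", list(group))
```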

Preprocessing input data

hominid does not preprocess or filter the input OTU/taxon table.

hominid does filter the input SNPs and omits SNPs having (a sketch of these checks follows this list):

  • a total sample count less than 50
  • a total sample count per genotype less than or equal to 5
  • a minor allele frequency less than 0.2
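Here is a sketch of these filters applied to a single SNP's genotype vector (coded 0/1/2 with NA for missing calls, as described under "Input file formats" below). It is our own illustration, not hominid's code, and edge cases (for example, a genotype class with zero samples) may be handled differently by the program.

```python
# Illustrative sketch of the SNP filters listed above (not hominid's code).
import pandas as pd

def passes_filters(genotypes):
    """genotypes: pandas Series of 0/1/2 values with NaN for missing calls."""
    g = genotypes.dropna()
    if len(g) < 50:                          # total sample count less than 50
        return False
    if (g.value_counts() <= 5).any():        # a genotype with <= 5 samples
        return False
    alt_freq = g.sum() / (2 * len(g))        # 0/1/2 counts copies of the alternate allele
    maf = min(alt_freq, 1 - alt_freq)
    return maf >= 0.2                        # minor allele frequency threshold

print(passes_filters(pd.Series([0, 1, 2] * 20)))        # True
print(passes_filters(pd.Series([0] * 55 + [2] * 5)))    # False: rare genotype, low MAF
```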

Input file formats

To see sample input files in the correct formats, see example SNP input and example taxon table input.

  • OTU/taxon table format:

    • columns = samples
    • rows = OTUs/taxa/covariates
    • data values = relative abundances
  • SNP modified VCF file format: similar to the standard VCF format except:

    • Delete all the meta-information lines, the ones that start with "##".
    • Keep the one header line, the line that starts with "CHROM POS ID REF ALT", but delete the leading "#".
    • The entries in the "CHROM" column begin with "chr", e.g., "chr7".
    • Add an extra "GENE" column after the "ALT" column. If your SNPs are annotated with genes, this would be the column in which to put the gene annotation. If not, use dummy gene IDs in the GENE column (e.g., dummy1, dummy2, dummy3, ...).
    • The SNP genotypes are encoded as 0, 1, 2, or NA (a sketch of this recoding follows this list):
      • 0/0 --> 0
      • 0/1 --> 1
      • 1/1 --> 2
      • ./. --> NA
      • 0/. --> NA
    • Columns from the SNP input file that hominid does not use are retained in the output file. Adding SNP annotation columns to the input modified vcf file is a good way to carry along the SNP annotation throughout the entire analysis, rather than annotating the SNPs at the end of the analysis.
  • Sample IDs need to match between the OTU/taxon table and the SNP modified vcf file.
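A sketch of the genotype recoding described above (illustrative only; hominid expects the already-recoded values in the input file):

```python
# Illustrative recoding of VCF-style genotype calls into 0/1/2/NA,
# following the table above; not hominid's own code.
import pandas as pd

def recode_genotype(call):
    mapping = {"0/0": 0, "0/1": 1, "1/1": 2}
    return mapping.get(call, pd.NA)      # "./.", "0/.", and anything else become NA

calls = pd.Series(["0/0", "0/1", "1/1", "./.", "0/."])
print(calls.map(recode_genotype).tolist())   # [0, 1, 2, <NA>, <NA>]
```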

Command-line arguments

The command-line arguments are all required and are expected in this order:

  • input file path

    path to OTU or taxon table file

  • path to SNP data file

    in modified vcf format (see "Input file formats" section, above, for details)

  • path to output file

    this file will be overwritten if it exists

  • transformation of the input abundance data

    Supported values are:

    • no_transform
    • arcsinsqrt
    • normalize

    In our work, we used "arcsinsqrt", the arcsin square root transformation, on our relative abundance values.

  • number of SNPs to test

    To analyze all input SNPs, set this value to -1.

  • number of train/test cycles per SNP

    This is the number of times to shuffle the samples into test (20% of samples) and training (80%) sets. We used "100". The average and median R2 values and associated bootstrap confidence intervals reported in the output file are calculated from these train/test splits. (An illustrative sketch of the transformation and the train/test cycles follows this argument list.)

  • permutation method

    whether or how to perform a permutation on the sample IDs. (See above section "hominid can be run either with or without permutation" for more details.) Supported values are:

    • no_permutation
    • uniform_permutation
    • group_permutation

For a sample hominid command, see the test script.
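To make the transformation and the train/test cycles concrete, here is an end-to-end sketch. It is not hominid's implementation: the toy data are ours, the SNP genotype is treated as the response (our assumption, as in the earlier sketch), and simple percentiles stand in for hominid's bootstrap intervals.

```python
# Illustrative sketch of the "arcsinsqrt" transform and repeated 80/20 train/test cycles;
# not hominid's own implementation.
import numpy as np
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
abundances = rng.random((55, 141))            # relative abundances in [0, 1] (toy data)
X = np.arcsin(np.sqrt(abundances))            # arcsin sqrt transform
y = rng.integers(0, 3, 55)                    # one SNP coded 0/1/2 (toy data)

r2_scores = []
for cycle in range(100):                      # "number of train/test cycles per SNP"
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=cycle)
    model = LassoLarsCV(cv=5).fit(X_tr, y_tr)
    r2_scores.append(model.score(X_te, y_te))   # R2 on the held-out 20%

print("median R2:", np.median(r2_scores))
print("2.5/97.5 percentiles:", np.percentile(r2_scores, [2.5, 97.5]))
```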

Output file format

hominid takes the input modified vcf file and merely adds its results as extra columns on the left of the output table (all the columns to the left of "CHROM"). Columns in the input modified vcf file that are not used by hominid carry along to the output. Adding SNP annotation columns to the input modified vcf file is a good way to carry along the SNP annotation throughout the entire analysis, rather than annotating the SNPs at the end.

The output columns of interest include:

  • "rsq_median": R2 (median value from 100 shufflings of the samples).
    • Note that R2 can be negative because the training data was used to fit the model and the R2 value is calculated on the test data.  (You can't get a negative R2 if you use the same data to both fit the model and calculate R2.)
  • "rsq_pibsp_median_95ci_lo" and "rsq_pibsp_median_95ci_hi": 95th percentile confidence interval for R2.

If output columns 7 to 21 (starting with column "rsq_mean" and ending with "cv_kurtosis") are all "NA" for a given SNP, then that SNP didn't meet the SNP preprocessing thresholds and hominid did not process it. These SNPs will need to be filtered out before running the next step in the pipeline, stability_selection.py.
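One way to drop those unprocessed SNPs is sketched below (our own pandas example; the file name and the assumption of a tab-delimited output file are placeholders to adjust for your data):

```python
# Illustrative filter for SNPs that hominid skipped (all-NA result columns);
# assumes a tab-delimited output file, which may need adjusting.
import pandas as pd

results = pd.read_csv("hominid_output.txt", sep="\t")
result_cols = results.columns[6:21]                  # columns 7 to 21, "rsq_mean" .. "cv_kurtosis"
kept = results.dropna(subset=result_cols, how="all")  # drop rows where all of these are NA
kept.to_csv("hominid_output_filtered.txt", sep="\t", index=False)
```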

There will be many lines (including some warnings) printed to stderr, showing the progress of hominid.

Including other covariates in the regression

You can include other features as additional covariates by adding the covariate as another row to the OTU/taxon table.

For binary categorical covariates like sex, set the data values to 0 and 1.

If you include a continuous variable (e.g., age, BMI), be aware that the data values will be transformed in the same way that the taxon abundances are (e.g., by arcsin sqrt transformation).
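For example, a binary 'Sex' covariate can be appended as an extra row of the taxon table, with samples as columns (a small pandas sketch with toy data; reading and writing your actual files is up to you):

```python
# Illustrative addition of a 'Sex' covariate row to a taxon table
# whose columns are samples and rows are taxa/covariates.
import pandas as pd

taxa = pd.DataFrame(
    {"S1": [0.2, 0.8], "S2": [0.5, 0.5], "S3": [0.9, 0.1]},
    index=["taxon_A", "taxon_B"],
)
sex = pd.Series({"S1": 0, "S2": 1, "S3": 0}, name="Sex")   # 0/1 coding, one value per sample
taxa.loc["Sex"] = sex.reindex(taxa.columns)                # append as an extra covariate row
print(taxa)
```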
