-
Notifications
You must be signed in to change notification settings - Fork 3
Running hominid on your data
hominid
performs Lasso regression to identify correlations between
host genetic variation (SNPs) and microbiome taxonomic abundances.
hominid
implements
Lasso regression using the scikit-learn library.
The Lasso regularization parameter is tuned by the LARS algorithm
using default parameters and five-fold cross-validation. We use the
coefficient of determination, R2, from the Lasso model as
a measure of correlation.
hominid
is a MPI program that can run on a single core or scale to multiple
cores for better performance. An MPI implementation (for example OpenMPI
or MPICH) is required to run hominid
even on a single core. hominid
will
use as many processes as are specified on the mpirun command line or PBS
batch script, but performance will only scale so long as one core is available
for each process. See the examples for
a sample command-line
suitable for one computer and a
PBS script template suitable for most clusters.
With an input taxon table of 55 samples and 141 taxa/covariates,
on a Core i5 4690k chip, it took 8.8 cpu seconds
to run hominid on one SNP on average
(not counting the SNPs that were
not analyzed by the hominid
program because they didn't meet
the filtering thresholds (see the "Preprocessing input data" section, below).
- Without permutation,
hominid
performs Lasso regression on your data. - With permutation,
hominid
permutes the sample labels. Run with permutation when you want to do permutation testing to calculate q-values.- "uniform_permutation" permutes all sample labels
- "group_permutation" permutes sample labels from male subjects separately
from female subjects (not mixing male and female subjects).
If this option is selected, there must be a column 'Sex' included in the taxon table with different numeric data values (e.g., 0 and 1) for male and female subjects. Otherwise the program will abort.
Note that different runs of hominid
will produce slightly different
output because of the reshufflings and permutations.
hominid
does not preprocess nor filter the input OTU/taxon table.
hominid
does filter the input SNPs and omits SNPs having:
- a total sample count less than 50
- total sample count per genotype less than or equal to 5
- minor allele frequency is less than the value set by --maf-lower-cutoff option [default 0.2]
To see sample input files in the correct formats, see example SNP input and example taxon table input.
-
OTU/taxon table format:
- columns = samples
- rows = OTUs/taxa/covariates
- data values = relative abundances
-
SNP modified VCF file format: Similar to a vcf format except:
- Delete all the meta-information lines, the ones that start with "##".
- Keep the one header line, the line that starts with "CHROM POS ID REF ALT", but delete the leading "#".
- The entries in the "CHROM" column begin with "chr", e.g., "chr7".
- Add an extra "GENE" column after the "ALT" column. If your SNPs are annotated with genes, this would be the column in which to put the gene annotation. If not, use dummy gene IDs in the GENE column (e.g., dummy1, dummy2, dummy3, ...).
- The SNP alleles are encoded as 0, 1, 2, or NA:
- 0/0 --> 0
- 0/1 --> 1
- 1/1 --> 2
- ./. --> NA
- 0/. --> NA
- Columns from the SNP input file that
hominid
does not use are retained in the output file. Adding SNP annotation columns to the input modified vcf file is a good way to carry along the SNP annotation throughout the entire analysis, rather than annotating the SNPs at the end of the analysis.
-
Sample IDs need to match between the OTU/taxon table and the SNP modified vcf file.
The command-line arguments are all required and are expected in this order:
-
input file path
path to OTU or taxon table file
-
path to SNP data file
in modified vcf format (see "Input file formats" section, above, for details)
-
path to output file
this file will be overwritten if it exists
-
transformation of the input abundance data
Supported values are:
- no_transform
- arcsinsqrt
- normalize
In our work, we used "arcsinsqrt" for arcsin sqrt transformation on our relative abundance values.
-
number of SNPs to test
To analyze all input SNPs, set this value to -1.
-
number train/test cycles per SNP
This is the number of times to shuffle the sample into test (20% of samples) and training (80%) sets. We used "100". The average and median R2 values and associated bootstrap confidence intervals reported in the output file are calculated from these train/test splits.
-
permutation method
whether or how to perform a permutation on the sample IDs. (See above section "hominid can be run either with or without permutation" for more details.) Supported values are:
- no_permutation
- uniform_permutation
- group_permutation
For a sample hominid
command, see the test script.
hominid
takes the input modified vcf file and merely adds its
results as extra columns on the left of the output table (all the
columns to the left of "CHROM").
Columns in the input modified vcf file that are not used by
hominid
carry along to the output. Adding SNP annotation
columns to the input modified vcf file is a good way
to carry along the SNP annotation throughout the entire analysis, rather
than annotating the SNPs at the end.
The output columns of interest include:
- "rsq_median": R2 (median value from 100 shufflings
of the samples).
- Note that R2 can be negative because the training data was used to fit the model and the R2 value is calculated on the test data. (You can't get a negative R2 if you use the same data to both fit the model and calculate R2.)
- "rsq_pibsp_median_95ci_lo" and "rsq_pibsp_median_95ci_hi": 95th percentile confidence interval for R2.
If output columns 7 to 21 (starting with column "rsq_mean" and ending with
"cv_kurtosis") are all "NA" for a given SNP,
then that SNP didn't meet the SNP preprocessing
thresholds and hominid
did not process it.
These SNPs will need to be filtered out before running the
next step in the pipeline, stability_selection.py.
There will be many lines (including some warnings) printed
to stderr, showing the progress of hominid
.
You can include other features as additional covariates by adding the covariate as another row to the OTU/taxon table.
For binary categorical covariates like sex, set the data values to 0 and 1.
If you include a continuous variable (e.g., age, BMI), be aware that the data values will be transformed in the same way that the taxon abundances are (e.g., by arcsin sqrt transformation).