
Gaoyuan-Li/optICA_updated


optICA_updated

python scikit-learn MPI4py

jupyter

This repository contains scripts for running Independent Component Analysis (ICA) across a range of dimensions and selecting the optimal dimensionality (optICA). It is updated from SBRG/modulome-workflow/4_optICA and incorporates the latest enhancements and fixes.

Update - 2/28/2024

General Updates

Updated adjust_csv.py to check the size of each CSV file before reading it, preventing errors from pd.read_csv.
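A minimal sketch of such a size check, assuming the failure being guarded against is an empty (zero-byte) CSV file, which pd.read_csv rejects with an EmptyDataError. The helper name csv_has_content is hypothetical and is not taken from adjust_csv.py.

```python
import os


def csv_has_content(path):
    """Return True if `path` is an existing file larger than zero bytes.

    Hypothetical helper for illustration; the real check in adjust_csv.py
    may use a different threshold or different logic.
    """
    return os.path.isfile(path) and os.path.getsize(path) > 0
```

A caller could then skip unreadable files, for example `if csv_has_content(p): df = pd.read_csv(p, index_col=0)`.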

Update - 2/27/2024

General Updates

Updated adjust_csv.py to adjust_csv_MPI.py for parallel computing.

Update - 2/26/2024

General Updates

Fixed NumPy sparse matrix handling and the order in which folders are created, for compatibility with large numbers of cores.

Update - 2/21/2024

General Adaptations

The scripts have been adapted for compatibility with Python 3.12 and scikit-learn 1.4.1.

Timeout Mechanism

A timeout mechanism has been implemented. It stops any processor that exceeds a predefined time limit, preventing runs from getting stuck on a particular processor.

Timeout setting: 1 hour (3600 seconds) by default. The limit can be adjusted with the -time option.
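The general idea of a per-run time limit can be sketched as follows. This is not the mechanism used in this repository's scripts, which is not described here; it is a generic illustration using the standard library's subprocess timeout, and run_with_timeout is a hypothetical name.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=3600):
    """Run `cmd` as a child process with a time limit.

    Returns True if the command finished within `timeout_s` seconds and
    False if it was stopped for exceeding the limit. Illustrative only;
    the default of 3600 mirrors the README's 1-hour default.
    """
    try:
        subprocess.run(cmd, timeout=timeout_s, check=False)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return False
```

Returning a flag rather than raising lets the caller record which runs timed out and continue with the remaining ones.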

LOGFILE Name Update

The format of the LOGFILE name now includes both the date and time.

Format: LOGFILE_yyyy-mm-dd_hh-mm-ss.log
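As an illustration, a name in this format can be produced with datetime.strftime. The function name logfile_name is hypothetical, and the sketch assumes zero-padded fields and a 24-hour clock, which the format string above does not state explicitly.

```python
from datetime import datetime


def logfile_name(now=None):
    """Build a log file name of the form LOGFILE_yyyy-mm-dd_hh-mm-ss.log."""
    # Assumption: local time, zero-padded fields, 24-hour clock.
    if now is None:
        now = datetime.now()
    return now.strftime("LOGFILE_%Y-%m-%d_%H-%M-%S.log")
```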

Processor Cores Utilization

The script now recognizes and utilizes the number of threads available in the system instead of being limited to the number of physical cores.
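For reference, Python reports the number of logical CPUs (hardware threads), not physical cores, through os.cpu_count(). The sketch below shows one way to query it; whether the scripts here use this call is an assumption, and available_threads is a hypothetical name.

```python
import os


def available_threads():
    """Return the number of logical CPUs, falling back to 1 if unknown."""
    # os.cpu_count() may return None when the count cannot be determined.
    return os.cpu_count() or 1
```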

compute_distance.py

The compute_distance.py script has been updated to remain functional even when the timeout occurs. This ensures that the script handles timeout events gracefully without crashing or causing data loss.

Usage

Usage is the same for the py37 and py312 versions.

Usage: run_ica.sh [ARGS] FILE

Arguments
  -i|--iter <n_iter>           Number of random restarts (default: 100)
  -t|--tolerance <tol>         Tolerance (default: 1e-7)
  -n|--n-cores <n_cores>       Number of cores to use (default: 8)
  -max|--max-dim <max_dim>     Maximum dimensionality for search (default: n_samples)
  -min|--min-dim <min_dim>     Minimum dimensionality for search (default: 20)
  -s|--step-size <step_size>   Dimensionality step size (default: n_samples/25)
  -o|--outdir <path>           Output directory for files (default: current directory)
  -l|--logfile                 Name of log file to use if verbose is off (default: ica.log)
  -v|--verbose                 Send output to stdout rather than writing to file
  -h|--help                    Display help information
  -time|--time-out             Timeout for each ICA run in seconds (default: 3600)

Example Usage

./run_ica.sh -n 16 -min 100 -max 300 -i 96 -v -time 3600 -o ../_aeruPHAGE_p_aeru ../log_tpm_p_aeru.csv

Conda environment

Install the conda environment from the yml file that matches your Python version.

Change the 'prefix' field in the yml file before installing it.

conda env create -f optICA_updated_py37.yml
conda env create -f optICA_updated_py312.yml

Notes

OptICA may take dozens of hours to run using default arguments, depending on the size of your dataset. This can be accelerated by:

  1. using more processors (i.e. a supercomputer),
  2. loosening the tolerance (e.g. -t 1e-3), or
  3. increasing the dimensionality step size (e.g. --step-size 20).

Also, if your dataset has over 500 samples, we recommend limiting the maximum dimensionality to the number of unique conditions in your dataset.

The run_ica.sh script produces three files and a subdirectory:

  • M.csv: The M matrix
  • A.csv: The A matrix
  • dimension_analysis.pdf: Plot showing the optimal ICA dimensionality
  • ica_runs/: A subdirectory containing all the M and A matrices for all dimensions
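After a run, the presence of the outputs listed above can be checked with a short script. This is a convenience sketch, not part of the repository; missing_outputs is a hypothetical helper, and it assumes the outputs are written directly into the directory passed with -o.

```python
from pathlib import Path

# Output names as listed in this README.
EXPECTED_FILES = ("M.csv", "A.csv", "dimension_analysis.pdf")
EXPECTED_DIRS = ("ica_runs",)


def missing_outputs(outdir):
    """Return the names of expected outputs not found in `outdir`."""
    out = Path(outdir)
    missing = [name for name in EXPECTED_FILES if not (out / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (out / name).is_dir()]
    return missing
```

An empty list means all three files and the ica_runs/ subdirectory are present.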
