This repository contains scripts for running Independent Component Analysis (ICA) from a set of dimensions and get optimal dimensionality (optICA). It is updated from SBRG/modulome-workflow/4_optICA, incorporating the latest enhancements and fixes.
Update the adjust_csv.py to check the size of csv files first, preventing error from pd.read_csv
Update the adjust_csv.py to adjust_csv_MPI.py for paralleling computing
Fixed numpy sparse matrix and the order of creating folders to get compatible with large amount of cores
The scripts have been adapted for compatibility with Python 3.12 and scikit-learn 1.4.1
A timeout mechanism has been implemented. This feature is designed to stop any processor that exceeds a predefined time limit, preventing potential stuck on certain processor.
Timeout Setting: 1 hour by default - The timeout limit can be adjusted as needed in the script settings by -time
The format of the LOGFILE name now includes both the date and time.
Format: LOGFILE_yyyy-mm-dd_hh-mm-ss.log
The script now recognizes and utilizes the number of threads available in the system instead of being limited to the number of physical cores.
The compute_distance.py
script has been updated to remain functional even when the timeout occurs. This ensures that the script handles timeout events gracefully without crashing or causing data loss.
Usage of the py37 version and py312 version are the same
Usage: run_ica.sh [ARGS] FILE
Arguments
-i|--iter <n_iter> Number of random restarts (default: 100)
-t|--tolerance <tol> Tolerance (default: 1e-7)
-n|--n-cores <n_cores> Number of cores to use (default: 8)
-max|--max-dim <max_dim> Maximum dimensionality for search (default: n_samples)
-min|--min-dim <min_dim> Minimum dimensionality for search (default: 20)
-s|--step-size <step_size> Dimensionality step size (default: n_samples/25)
-o|--outdir <path> Output directory for files (default: current directory)
-l|--logfile Name of log file to use if verbose is off (default: ica.log)
-v|--verbose Send output to stdout rather than writing to file
-h|--help Display help information
-time|--time-out Timeout for each ICA run in seconds (default: 3600)
./run_ica.sh -n 16 -min 100 -max 300 -i 96 -v -time 3600 -o ../_aeruPHAGE_p_aeru ../log_tpm_p_aeru.csv
Please install the conda environment using the yml file
Change the 'prefix' before install it
conda env create -f optICA_updated_py37.yml
conda env create -f optICA_updated_py312.yml
OptICA may take dozens of hours to run using default arguments, depending on the size of your dataset. This can be accelerated by
- using more processors (i.e. a supercomputer),
- loosening the tolerance (e.g.
-t 1e-3
), or - increasing the dimensionality step size (e.g.
--step-size 20
).
Also, if your dataset has over 500 datasets, we recommend limiting the maximum dimensionality to the number of unique conditions in your dataset.
The run_ica.sh
script produces three files and a subdirectory:
M.csv
: The M matrixA.csv
: The A matrixdimension_analysis.pdf
: Plot showing the optimal ICA dimensionalityica_runs/
: A subdirectory containing all the M and A matrices for all dimensions