Evaluation of several gene selection methods (including ensemble gene selection methods).
```
│  main.py
│
├─cache
│  ├─geneData                 # store selected genes
│  └─preprocessedData         # store preprocessed datasets
│
├─common_utils
│      __init__.py
│      utils.py               # common utils
│
├─config
│      __init__.py
│      datasets_config.py
│      experiments_config.py
│      methods_config.py
│
├─data_loader
│      __init__.py
│      dataset.py             # load and preprocess datasets
│      utils.py               # utils used in loading and preprocessing data
│
├─experiments
│      __init__.py
│      metrics.py             # metrics used in batch correction, cell classification and cell clustering
│      recorders.py           # record the evaluation results and sink them to disk
│      run_experiments.py     # run each experiment by calling the corresponding function
│
├─figures                     # store the UMAP and t-SNE figures
│
├─other_steps
│      __init__.py
│      classification.py      # cell classification algorithms
│      clustering.py          # cell clustering algorithms
│      correction.py          # batch correction algorithms
│
├─records                     # store the evaluation results and recorders
└─selection
       __init__.py
       fisher_score.py
       methods.py             # all feature selection algorithms
       nearest_centroid.py
       utils.py               # utils used in feature selection
```
Method | Language | Reference |
---|---|---|
Random Forest | Python | [1] |
XGBoost | Python | [2] |
LightGBM | Python | [3] |
Nearest Shrunken Centroid | Python | [4] |
scGeneFit | Python | [5] |
CellRanger | Python | [6] |
Fisher Score | Python | [7] |
Mutual Information | Python | [8] |
Method | Language | Reference |
---|---|---|
Variance | Python | [9] |
CV | Python | [10] |
Seurat | Python | [11] |
Deviance | R | [12] |
M3Drop | R | [13] |
scmap | R | [14] |
FEAST | R | [15] |
scran | R | [16] |
triku | Python | [17] |
sctransform | R | [18] |
GiniClust3 | Python | [19] |
pagest | Python |
The function that detects outliers in Besca.
The normalization method in Seurat and the implementation in Scanpy.
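For orientation, Seurat-style log-normalization (the method Scanpy reimplements) scales each cell's counts to a common total and then applies `log1p`. A minimal pure-Python sketch of the idea, where the scale factor of `1e4` is Seurat's default and the function name is ours, not part of this repository:

```python
import math

def log_normalize(counts, scale_factor=1e4):
    """Seurat-style log-normalization: scale each cell (row) to a
    common total count, then apply log1p to each entry."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        normalized.append([math.log1p(c / total * scale_factor) for c in cell])
    return normalized

# Two cells with the same gene proportions but different sequencing depth:
counts = [[1, 2, 7], [10, 20, 70]]
norm = log_normalize(counts)
```

Because the two cells have identical gene proportions, their normalized rows coincide, which is exactly the depth effect this step removes.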
Before the evaluation, you should specify the paths to the data (and to marker genes, if you want to run the marker discovery experiment) in `config/datasets_config.py`:
```python
class DatasetConfig:
    def __init__(self):
        self.data_path = "/path/to/datasets/"
        self.marker_path = "/path/to/marker/genes/"  # optional
```
Then you can run a certain experiment with a single line of code:
```python
from experiments.run_experiments import run_cell_clustering, run_cell_classification

run_cell_clustering(fs_methods=['var', 'feast'])  # single FS methods
run_cell_classification(fs_methods=['lgb+rf'])    # ensemble FS method
```
All the records will be stored in the directory `records/`. The recorders in `.pkl` format are in `records/pkl/`, and the tables are in `records/xlsx/`.
Here we present an easy way to evaluate new feature selection methods on all the datasets we used. If you only want to test on a few datasets, please check the notebook for examples.
1. Add new methods to the function `single_select_by_batch()` in `selection/methods.py`:

   ```python
   elif method == 'deviance':
       selected_genes_df = deviance_compute_importance(adata)
   elif method == 'abbreviation_1':
       selected_genes_df = your_new_function_1(adata)
   elif method == 'abbreviation_2':
       selected_genes_df = your_new_function_2(adata)
   else:
       raise NotImplementedError(f"No implementation of {method}!")
   ```
   - Input of your new functions: an `AnnData` object, in which `adata.X` is the scaled data after log-normalization, and `adata.raw` is the data after quality control but before normalization. The log-normalized data is in `adata.layers['log-normalized']`, and the normalized data is in `adata.layers['normalized']`.
   - Output of your new functions: a dataframe. The first column, named `Gene`, contains the gene names. The second column is optional; it contains the score of each gene (if scores exist). The higher the score, the more important the gene.
2. Modify the method configuration in `config/methods_config.py`:

   - In `self.formal_names`:

     ```python
     'feast': 'FEAST',
     'abbreviation_1': 'formal_name_1',
     'abbreviation_2': 'formal_name_2',
     'rf+fisher_score': 'RF+\nFisher Score',
     ```
   - Unsupervised methods should be added to `self.unsupervised`, and supervised methods to `self.supervised`:

     ```python
     self.unsupervised = ['abbreviation_1', 'var', 'cv2', ...]
     self.supervised = ['abbreviation_2', 'rf', 'lgb', 'xgb', ...]
     ```
3. Then you can run the functions as shown in the examples:

   ```python
   from experiments.run_experiments import run_cell_clustering

   run_cell_clustering(fs_methods=['abbreviation_1', 'abbreviation_2'])
   ```
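Before wiring a new method into `selection/methods.py`, it can help to check that it returns a dataframe matching the contract above (first column named `Gene`, optional score column with higher meaning more important). A minimal sketch using pandas; the function name, the random scoring, and the `adata=None` placeholder argument are ours, not part of the repository:

```python
import numpy as np
import pandas as pd

def your_new_function_1(adata=None, n_genes=5):
    """Placeholder selector: scores hypothetical genes at random and
    returns them in the required two-column format (a real method
    would compute scores from the AnnData object)."""
    rng = np.random.default_rng(0)
    genes = [f"gene_{i}" for i in range(n_genes)]
    scores = rng.random(n_genes)
    df = pd.DataFrame({"Gene": genes, "Score": scores})
    # Higher score = more important, so sort in descending order.
    return df.sort_values("Score", ascending=False).reset_index(drop=True)

selected_genes_df = your_new_function_1()
```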