scMalignantFinder is a Python package for analyzing cancer single-cell RNA-seq datasets and distinguishing malignant cells from their normal counterparts. It is trained on over 400,000 high-quality single-cell transcriptomes, uses curated pan-cancer gene signatures for calibration, and selects features by taking the union of the differentially expressed genes identified in each dataset. For more details, please refer to the corresponding publication.
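To make the feature-selection idea concrete, here is a minimal sketch of taking the union of per-dataset gene lists. The function name `union_of_degs`, the dataset names, and the gene placeholders are hypothetical and only illustrate the idea; this is not the package's own code.

```python
def union_of_degs(degs_per_dataset):
    """Combine per-dataset gene lists into one sorted, de-duplicated feature list."""
    features = set()
    for genes in degs_per_dataset.values():
        features.update(genes)
    # Sorting gives a stable feature order that can be reused at prediction time.
    return sorted(features)


# Toy input: placeholder dataset and gene names, not real results.
degs_per_dataset = {
    "dataset_1": ["GENE_A", "GENE_B"],
    "dataset_2": ["GENE_B", "GENE_C"],
}
features = union_of_degs(degs_per_dataset)
print(features)
```

A gene that is differentially expressed in only one dataset is still kept, which is what distinguishes a union from an intersection.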
Enhanced Flexibility for Test Input
- Test data can now be provided as a path to an .h5ad file or directly as an AnnData object.
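One way to accept either form of input is sketched below. The helper `resolve_test_input` is hypothetical and is not part of scMalignantFinder; it only assumes that `anndata.read_h5ad` is available for reading .h5ad files.

```python
import os


def resolve_test_input(test_input):
    """Return an in-memory dataset for either a file path or an already-loaded object."""
    if isinstance(test_input, (str, os.PathLike)):
        # Imported lazily so that anndata is only needed when a path is given.
        import anndata as ad

        return ad.read_h5ad(test_input)
    # Assumed to already be an AnnData object, so it is returned unchanged.
    return test_input
```

With a helper like this, downstream code can work with a single in-memory object regardless of how the caller supplied the data.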
Dynamic Feature Handling
- When a pretrained model is used, any features missing from the test dataset are temporarily filled with zeros during prediction to ensure compatibility.
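The zero-filling step can be pictured with the following sketch, which aligns an expression matrix to an ordered feature list using NumPy. The function `align_to_features` and the gene placeholders are hypothetical; the package's internal implementation may differ.

```python
import numpy as np


def align_to_features(X, genes, ordered_features):
    """Reorder the columns of X to match ordered_features; absent genes become zero columns."""
    gene_to_col = {gene: i for i, gene in enumerate(genes)}
    aligned = np.zeros((X.shape[0], len(ordered_features)), dtype=X.dtype)
    for j, feature in enumerate(ordered_features):
        col = gene_to_col.get(feature)
        if col is not None:
            aligned[:, j] = X[:, col]
    return aligned


# Toy data: two cells, two measured genes; GENE_C is missing from the test data.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
genes = ["GENE_A", "GENE_B"]
ordered_features = ["GENE_B", "GENE_C", "GENE_A"]

aligned = align_to_features(X, genes, ordered_features)
print(aligned)  # the GENE_C column is all zeros
```

Working on a copy keeps the fill temporary: the original matrix is left untouched, and only the aligned array is passed to the model.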
New Features
- Predictions now include a malignancy probability for each cell, stored in result_adata.obs["malignancy_probability"].
We recommend using a conda environment to install scMalignantFinder.
- Create and activate a conda environment
conda create -n scmalignant python=3.10.10
conda activate scmalignant
- Install scMalignantFinder from PyPI:
pip install scMalignantFinder
Optional: scMalignantFinder includes a built-in pan-cancer cell type annotation tool, scATOMIC. If you want to perform basic cell type annotation before identifying malignant cells, follow the official scATOMIC tutorial to install it in the same conda environment.
A pretrained model and an ordered feature list are provided in the model directory. Users can also train the model themselves, using either the original training data or their own dataset.
- Training data: Download the training data used in the original study from here, or use your own dataset to train the model.
- Feature file: The feature list file can be downloaded from here.
- Example test data:
### Load package
from scMalignantFinder import classifier
# Initialize model
model = classifier.scMalignantFinder(
    test_input="path/to/test_data.h5ad",         # Path to test data or AnnData object
    pretrain_path=None,                          # Path to pretrained model
    train_h5ad_path="/path/to/train_data.h5ad",  # Path to training data
    feature_path="/path/to/features.txt",        # Path to feature list
    model_method="LogisticRegression",           # ML method: LogisticRegression, RandomForest, XGBoost
    norm_type=True,                              # Normalize test data (default: True)
    n_thread=1                                   # Number of threads for parallel processing
)
# Load data
model.load()
# Predict malignancy
result_adata = model.predict()
# View results
print(result_adata.obs["scMalignantFinder_prediction"].head())
## Example output for scMalignantFinder_prediction:
Index
KUL01-T_AAACCTGGTCTTTCAT Malignant
KUL01-T_AAACGGGTCGGTTAAC Malignant
KUL01-T_AAAGATGGTATAGGGC Normal
KUL01-T_AAAGATGGTGGCCCTA Malignant
KUL01-T_AAAGCAAGTAAACACA Malignant
Name: scMalignantFinder_prediction, dtype: category
Categories (2, object): ['Normal', 'Malignant']
print(result_adata.obs["malignancy_probability"].head())
## Example output for malignancy_probability:
Index
KUL01-T_AAACCTGGTCTTTCAT 0.98578
KUL01-T_AAACGGGTCGGTTAAC 0.78968
KUL01-T_AAAGATGGTATAGGGC 0.243564
KUL01-T_AAAGATGGTGGCCCTA 0.8796
KUL01-T_AAAGCAAGTAAACACA 0.6598
Name: malignancy_probability, dtype: float64
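Because a probability is reported alongside the label, you can apply your own cutoff when a stricter or looser call is needed. The sketch below uses pandas and the five example values shown above. The helper `label_by_threshold` and the 0.5 default are assumptions: a 0.5 cutoff is consistent with the example rows, but the cutoff used internally is not stated here.

```python
import pandas as pd

# Values copied from the example output above.
probs = pd.Series(
    [0.98578, 0.78968, 0.243564, 0.8796, 0.6598],
    index=[
        "KUL01-T_AAACCTGGTCTTTCAT",
        "KUL01-T_AAACGGGTCGGTTAAC",
        "KUL01-T_AAAGATGGTATAGGGC",
        "KUL01-T_AAAGATGGTGGCCCTA",
        "KUL01-T_AAAGCAAGTAAACACA",
    ],
    name="malignancy_probability",
)


def label_by_threshold(probabilities, threshold=0.5):
    """Hypothetical helper: label cells at or above the threshold as Malignant."""
    labels = probabilities.map(lambda p: "Malignant" if p >= threshold else "Normal")
    labels = labels.astype(pd.CategoricalDtype(["Normal", "Malignant"]))
    return labels.rename("thresholded_label")


default_labels = label_by_threshold(probs)                # assumed 0.5 cutoff
strict_labels = label_by_threshold(probs, threshold=0.8)  # more conservative call
print(strict_labels.value_counts())
```

In practice, the same function could be applied to result_adata.obs["malignancy_probability"] and the result stored in a new obs column, leaving the original prediction untouched.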
If you use scMalignantFinder in your research, please cite the corresponding publication.