Dataset created by: Cyber Cop
The dataset consists of network flow data from various Android applications, including both benign and malware-infected apps.
- Data Preprocessing: The raw dataset is preprocessed to handle missing values, drop irrelevant features, and convert categorical features into numerical format.
- Feature Selection: To identify the most relevant features for the classification task, the Mann-Whitney U test is applied to numerical features and the Chi-squared test to categorical features.
- Model Selection: RandomForestClassifier, XGBClassifier (XGBoost), and LGBMClassifier (LightGBM) are chosen for their performance on imbalanced datasets and their robustness against overfitting.
- Handling Imbalanced Data: The Synthetic Minority Over-sampling Technique (SMOTE) is used to balance the dataset, providing better performance and speed than other techniques such as Tomek links and NearMiss.
- Hyperparameter Tuning: Optuna, a hyperparameter optimization framework, is used to find the best hyperparameters for the chosen models.
- Model Evaluation: The models are evaluated using the F2 score, which weights recall more heavily than precision while still taking precision into account. The precision-recall curve is also used to visualize the trade-off between precision and recall.
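The feature selection step described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the function name `screen_features`, the 0.05 significance threshold, and the toy column names (`flow_duration`, `packet_noise`, `protocol`, `flag`) are all assumptions made for the example, and the data is synthetic.

```python
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu


def screen_features(numeric, categorical, labels, alpha=0.05):
    """Return names of features whose distribution differs between the two classes."""
    labels = np.asarray(labels, dtype=int)
    selected = []

    # Numerical features: two-sided Mann-Whitney U test, benign (0) vs. malware (1).
    for name, values in numeric.items():
        values = np.asarray(values, dtype=float)
        result = mannwhitneyu(values[labels == 0], values[labels == 1],
                              alternative="two-sided")
        if result.pvalue < alpha:
            selected.append(name)

    # Categorical features: chi-squared test on the category-by-class count table.
    for name, values in categorical.items():
        _, codes = np.unique(np.asarray(values), return_inverse=True)
        codes = codes.ravel()
        table = np.zeros((codes.max() + 1, 2), dtype=int)
        np.add.at(table, (codes, labels), 1)
        p_value = chi2_contingency(table)[1]
        if p_value < alpha:
            selected.append(name)

    return selected


# Synthetic toy data (hypothetical column names): 200 benign flows, then 200 malware flows.
labels = np.repeat([0, 1], 200)
numeric = {
    "flow_duration": np.concatenate([np.arange(200), np.arange(200) + 100]),  # shifted by class
    "packet_noise": np.tile(np.arange(200), 2),  # identical in both classes
}
categorical = {
    "protocol": ["tcp"] * 150 + ["udp"] * 50 + ["tcp"] * 50 + ["udp"] * 150,  # class-dependent
    "flag": ["a", "b"] * 200,  # same mix in both classes
}
print(screen_features(numeric, categorical, labels))  # ['flow_duration', 'protocol']
```

Only the two features whose distributions actually differ between classes survive the screen. With many features, a multiple-testing correction on the p-values may be worth considering; the description does not say whether one was applied.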
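As a small illustration of the evaluation metric, the sketch below computes the F2 score and the precision-recall curve with scikit-learn. The labels and scores are made-up toy values, and the helper name `f2_at_threshold` and the 0.5 decision threshold are assumptions for the example, not details taken from the project.

```python
from sklearn.metrics import fbeta_score, precision_recall_curve

# Toy ground truth (1 = malware, 0 = benign) and made-up predicted malware probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.65, 0.55, 0.3, 0.2, 0.1, 0.05]


def f2_at_threshold(y_true, y_score, threshold=0.5):
    """F-beta score with beta=2 after turning scores into hard predictions."""
    y_pred = [int(score >= threshold) for score in y_score]
    return fbeta_score(y_true, y_pred, beta=2)


# Low threshold: all 4 malware flows are caught, at the cost of 2 false alarms.
print(round(f2_at_threshold(y_true, y_score, threshold=0.5), 4))   # 0.9091
# Higher threshold: no false alarms, but one malware flow is missed.
print(round(f2_at_threshold(y_true, y_score, threshold=0.66), 4))  # 0.7895

# Points of the precision-recall curve, one per candidate threshold (plus an end point).
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
```

The second setting has perfect precision yet scores lower, because F2 penalizes the missed malware flow more than the two false alarms. That matches the intent of favoring recall in a detection setting.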

- Further Feature Engineering: Explore additional feature engineering techniques to potentially improve model performance.
- Deep Learning: Investigate deep learning techniques, such as Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), for classifying network flows.
- Real-time Detection: Implement the model in a real-time network traffic monitoring system to detect Android malware in live network environments.
- Model Interpretability: Investigate model interpretability techniques like SHAP (SHapley Additive exPlanations) to better understand how the model makes its predictions and to build trust in the results.