Duck-m-a-n/Malware_Detection_ML

👾 Android Malware Detection Using Machine Learning 🕵️‍♂️

This project aims to develop a machine learning model for the detection of Android malware using network traffic data. The model classifies network flows as either malware or benign based on various flow features.

Dataset

Kaggle: Android Malware Detection: Detection of Android Malware using Machine Learning

Dataset created by: Cyber Cop

The dataset consists of network flow data from various Android applications, including both benign and malware-infected apps.

Methodology

  1. Data Preprocessing: The raw dataset is preprocessed to handle missing values, drop irrelevant features, and convert categorical features into numerical format.

  2. Feature Selection: To identify the most relevant features for the classification task, the Mann-Whitney U test is performed on numerical features, and the Chi-squared test is performed on categorical features.

  3. Model Selection: RandomForestClassifier (scikit-learn), XGBClassifier (XGBoost), and LGBMClassifier (LightGBM) are chosen for their performance on imbalanced datasets and their robustness against overfitting.

  4. Handling Imbalanced Data: The Synthetic Minority Over-sampling Technique (SMOTE) is used to balance the dataset. In this project it gave better performance and ran faster than under-sampling alternatives such as Tomek links and NearMiss.

  5. Hyperparameter Tuning: Optuna, a hyperparameter optimization framework, is used to find the best hyperparameters for the chosen models.

  6. Model Evaluation: The models are evaluated using the F2 score, the F-beta score with beta = 2, which weights recall more heavily than precision so that missed malware (false negatives) is penalized more than false alarms. The precision-recall curve is also used to visualize the trade-off between precision and recall.
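
As an illustration of the F2 score used in step 6, here is a minimal sketch in plain Python. It is not taken from the project's code: the function name `fbeta_score_binary` and the toy labels are made up for this example, and the label convention (1 = malware, 0 = benign) is an assumption. In practice, `sklearn.metrics.fbeta_score` with `beta=2` computes the same quantity.

```python
def fbeta_score_binary(y_true, y_pred, beta=2.0):
    """F-beta score for binary labels, assuming 1 = malware (positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malware caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged as malware
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malware missed
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    # beta > 1 shifts the weighted harmonic mean toward recall
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical toy labels: 4 malware flows, 4 benign flows
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

f1 = fbeta_score_binary(y_true, y_pred, beta=1.0)
f2 = fbeta_score_binary(y_true, y_pred, beta=2.0)
print(f"F1 = {f1:.3f}, F2 = {f2:.3f}")
```

In this toy case recall (0.75) is higher than precision (0.60), so F2 comes out higher than F1. A model with the opposite profile would score lower on F2 than on F1, which is the behavior wanted when missing malware is costlier than a false alarm.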

Results

The optimized model, a RandomForestClassifier, was able to effectively classify network flows as malware or benign. The most important features for the classification were FLOW IAT min and FLOW IAT max, which represent the minimum and maximum inter-arrival times between data packets in a network flow, respectively.
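
To make the two features concrete, the following sketch shows how a flow's minimum and maximum inter-arrival times could be derived from packet timestamps. It is an illustration only: the function name `flow_iat_min_max`, the timestamps, and the microsecond unit are assumptions, and the dataset's own feature extraction may define or scale these values differently.

```python
def flow_iat_min_max(timestamps_us):
    """Return (min, max) gap between consecutive packets of one flow.

    timestamps_us: packet arrival times in microseconds (unit is an assumption).
    Returns None when the flow has fewer than two packets, since no gap exists.
    """
    ordered = sorted(timestamps_us)
    if len(ordered) < 2:
        return None
    # Inter-arrival time = difference between each packet and the previous one
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return min(gaps), max(gaps)


# Hypothetical flow with four packets
packet_times = [0, 1500, 1700, 9000]
iat_min, iat_max = flow_iat_min_max(packet_times)
print(iat_min, iat_max)  # gaps are 1500, 200, 7300
```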

Next Steps and Improvements

  1. Further Feature Engineering: Explore additional feature engineering techniques to potentially improve model performance.

  2. Deep Learning: Investigate deep learning techniques, such as Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), for classifying network flows.

  3. Real-time Detection: Implement the model in a real-time network traffic monitoring system to detect Android malware in live network environments.

  4. Model Interpretability: Investigate model interpretability techniques like SHAP (SHapley Additive exPlanations) to better understand how the model makes its predictions and to build trust in the results.
