This README provides an overview of a machine learning project that analyzes the CICIDS2017 network traffic dataset to identify malicious traffic.
Libraries Used:
- pandas (data processing, CSV file I/O)
- seaborn (data visualization)
- matplotlib.pyplot (data visualization)
- scikit-learn, imported as sklearn (machine learning algorithms)
- numpy (numerical operations)
Data Preprocessing:
- Import Libraries: Import necessary libraries like pandas, seaborn, matplotlib, etc.
- Read Data: Read the CICIDS2017 dataset CSV files for different days (Monday, Tuesday, etc.) using pandas.read_csv().
- Clean Data:
  - Handle missing values (e.g., drop rows, fill with mean/median).
  - Remove irrelevant columns.
  - Encode categorical features (e.g., label encoding).
  - Reduce memory usage of dataframes (e.g., inspect column types with pandas.DataFrame.dtypes and downcast numeric columns to smaller types).
  - Identify and handle meaningless features that have only one unique value (e.g., drop them).
- Dimensionality Reduction (optional): Apply techniques like PCA or t-SNE to visualize the data in lower dimensions for easier analysis.
- Feature Selection: Analyze feature importance and select relevant features for model training.
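The reading and cleaning steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names load_days and clean, the label column name "Label", the whitespace stripping of column names, and the handling of infinite values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def load_days(csv_paths):
    """Read one CSV per capture day and stack them into a single frame."""
    frames = [pd.read_csv(path) for path in csv_paths]
    return pd.concat(frames, ignore_index=True)


def clean(df, label_col="Label"):
    """Apply the cleaning steps to a raw traffic frame (hypothetical helper)."""
    df = df.copy()
    # Assumption: column names may carry stray whitespace, so strip it.
    df.columns = df.columns.str.strip()
    # Assumption: infinite values are treated as missing, then incomplete rows are dropped.
    df = df.replace([np.inf, -np.inf], np.nan).dropna()
    # Drop features with a single unique value; they cannot separate classes.
    constant = [c for c in df.columns if c != label_col and df[c].nunique() <= 1]
    df = df.drop(columns=constant)
    # Downcast numeric columns to the smallest dtype that holds their values.
    # Floats go no lower than float32, which trades some precision for memory.
    for col in df.select_dtypes(include="number").columns:
        kind = "integer" if pd.api.types.is_integer_dtype(df[col]) else "float"
        df[col] = pd.to_numeric(df[col], downcast=kind)
    # Label-encode the traffic type; keep the encoder to map integers back to names.
    encoder = LabelEncoder()
    df[label_col] = encoder.fit_transform(df[label_col])
    return df.reset_index(drop=True), encoder
```

A typical call would be clean(load_days(paths)), where paths is a list of the per-day CSV files. Comparing df.memory_usage(deep=True).sum() before and after shows how much the downcasting saves.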
Exploratory Data Analysis (EDA):
- Data Distribution: Analyze the distribution of features for different traffic types (benign, DoS, etc.) using bar plots, histograms, etc.
- Correlation Analysis: Identify correlations between features and the target variable (traffic type) using correlation coefficients, visualized as a heatmap.
- Class Imbalance: Check for class imbalance in the target variable (unequal distribution of traffic types).
  - If imbalanced, apply oversampling techniques (e.g., SMOTE) to balance the data.
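The class-balance and correlation checks above might look like the sketch below. It assumes a dataframe with an integer-encoded binary label column named "Label" (0 for benign, 1 for attack); the helper names and the tiny demo frame are made up for illustration. To stay within the libraries listed in this README, the balancing step uses plain random oversampling via sklearn.utils.resample as a stand-in. It is not SMOTE, which synthesizes new minority samples and is provided by the separate imbalanced-learn package (imblearn.over_sampling.SMOTE).

```python
import pandas as pd
from sklearn.utils import resample


def class_counts(df, label_col="Label"):
    """Number of rows per traffic class, largest class first."""
    return df[label_col].value_counts()


def target_correlation(df, label_col="Label"):
    """Absolute Pearson correlation of each numeric feature with the label."""
    corr = df.select_dtypes(include="number").corr()[label_col].drop(label_col)
    return corr.abs().sort_values(ascending=False)


def oversample_minorities(df, label_col="Label", random_state=0):
    """Duplicate minority rows at random until every class matches the largest."""
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        if len(group) < target:
            group = resample(group, replace=True, n_samples=target,
                             random_state=random_state)
        parts.append(group)
    return pd.concat(parts, ignore_index=True)


# Hypothetical demo frame: six benign rows and two attack rows.
demo = pd.DataFrame({
    "pkt_len": [10, 12, 11, 13, 12, 11, 90, 95],
    "noise": [1, 2, 1, 2, 1, 2, 1, 2],
    "Label": [0, 0, 0, 0, 0, 0, 1, 1],
})
balanced = oversample_minorities(demo)
print(class_counts(demo))
print(class_counts(balanced))
print(target_correlation(demo))
```

The full correlation matrix can be passed to seaborn.heatmap for the heatmap view. Oversampling is usually applied to the training split only, so that duplicated or synthetic rows do not leak into the test set.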
Machine Learning Model Training:
- Split Data: Split the preprocessed data into training and testing sets using sklearn.model_selection.train_test_split.
- Model Selection: Choose a suitable machine learning model for classification (e.g., Random Forest Classifier).
  - Consider using the cuml library for GPU acceleration if available.
- Model Training: Train the model on the training data.
- Model Evaluation: Evaluate the model's performance on the testing data using metrics such as accuracy, the confusion matrix, and the classification report.
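The split, train, and evaluate steps above can be sketched with scikit-learn as follows. The data here is a synthetic stand-in generated with make_classification, not CICIDS2017, and the hyperparameters (test_size=0.25, n_estimators=100, stratified split) are illustrative assumptions, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the preprocessed features and encoded labels.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=2, weights=[0.8, 0.2], random_state=0)

# Stratify so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Train a Random Forest on the training split only.
model = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
model.fit(X_train, y_train)

# Evaluate on the held-out test split.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

# Impurity-based importances, one per feature, usable for feature selection.
importances = model.feature_importances_

print(f"accuracy: {accuracy:.3f}")
print(matrix)
print(report)
```

On imbalanced traffic data, accuracy alone can look good while rare attack classes are missed, so the per-class precision and recall in the classification report are worth reading alongside it. If a GPU is available, cuml offers a RandomForestClassifier with a similar fit/predict interface; whether it can be swapped in unchanged depends on the installed version.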
Results and Discussion:
- Present Results: Report the model's accuracy, confusion matrix, and classification report.
- Discuss Findings: Analyze the results, identify strengths and weaknesses of the model, discuss the impact of feature selection, etc.
- Future Work: Outline potential improvements and future work directions for the project.
Code Structure:
- The code is likely organized with functions for data preprocessing, feature engineering, model training, and evaluation.
- Comments are included to explain code sections.