Merge pull request #595 from DarshAgrawal14/main
Added model to analyze the physicochemical attributes of wine
Showing 10 changed files with 10,580 additions and 0 deletions.
@@ -0,0 +1,136 @@

# Wine Quality Prediction

This project implements various machine learning models to predict wine quality based on physicochemical properties. The models classify wines into binary categories (high quality vs. low quality) using features such as acidity, pH, and alcohol content.

## Project Overview

The project uses a dataset containing various chemical properties of wines and their quality ratings. The quality ratings are binarized into two categories (see the sketch after this list):

- 0: Lower quality (rating ≤ 5)
- 1: Higher quality (rating > 5)
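
As a concrete illustration, the binarization is a simple threshold on the `quality` column; this mirrors the `quality_binary` construction in the test script added by this commit:

```python
import pandas as pd

# Load the dataset used throughout this project
df = pd.read_csv('winequalityN.csv')

# 1 for ratings above 5, 0 otherwise
df['quality_binary'] = df['quality'].apply(lambda x: 1 if x > 5 else 0)
```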

## Features

The following features are used for prediction:

- Fixed acidity
- Volatile acidity
- Citric acid
- Residual sugar
- Chlorides
- Free sulfur dioxide
- Total sulfur dioxide
- Density
- pH
- Sulphates
- Alcohol
- Wine type (red/white)

## Models Implemented

The project implements and compares six different machine learning models:

1. Logistic Regression
2. Decision Tree Classifier
3. Random Forest Classifier
4. K-Nearest Neighbors (KNN) Classifier
5. Support Vector Classifier (SVC)
6. Gradient Boosting Classifier

## Technical Implementation

### Data Preprocessing

- Handling missing values
- Feature scaling using MinMaxScaler
- Dimensionality reduction using PCA
- One-hot encoding for categorical variables (see the sketch after this list)
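
The dataframe-level steps, handling missing values and one-hot encoding the wine type, can be sketched as follows; this mirrors the `prepare_data` helper in the test script added by this commit, while scaling and PCA are applied inside each model's pipeline:

```python
import pandas as pd

df = pd.read_csv('winequalityN.csv')

# One-hot encode the 'type' column into 'red'/'white' indicator columns
df = pd.concat([df, pd.get_dummies(df['type'])], axis=1).drop('type', axis=1)

# Drop rows with missing values
df = df.dropna()
```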

### Model Pipeline

Each model uses a consistent pipeline (sketched below) that includes:

1. Feature scaling
2. PCA transformation
3. Model training and prediction
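
A minimal, self-contained sketch of how such a pipeline could be assembled with scikit-learn is shown below. The PCA setting and the choice of RandomForestClassifier are illustrative assumptions; the committed `.pkl` files contain the actual fitted pipelines.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Prepare the data as in the Data Preprocessing section
df = pd.read_csv('winequalityN.csv')
df = pd.concat([df, pd.get_dummies(df['type'])], axis=1).drop('type', axis=1)
df = df.dropna()
df['quality_binary'] = df['quality'].apply(lambda x: 1 if x > 5 else 0)

X = df.drop(['quality', 'quality_binary'], axis=1)
y = df['quality_binary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling -> PCA -> classifier (component count and classifier settings are assumptions)
pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('pca', PCA(n_components=0.95)),
    ('model', RandomForestClassifier(random_state=42)),
])

pipeline.fit(X_train, y_train)
print(f"Held-out accuracy: {pipeline.score(X_test, y_test):.4f}")
```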

## Project Structure

```
project/
│
├── winequalityN.csv               # Input dataset
├── trained_models/                # Directory containing saved models
│   ├── logistic_regression_model.pkl
│   ├── decision_tree_model.pkl
│   ├── random_forest_model.pkl
│   ├── knn_model.pkl
│   ├── svc_model.pkl
│   └── gradient_boosting_model.pkl
└── wine_quality_prediction.py     # Main script
```

## Usage

### Loading a Saved Model

```python
import pickle

def load_model(model_name):
    """
    Load a saved model from the trained_models directory.

    Parameters:
        model_name (str): Name of the model to load (without '_model.pkl')

    Returns:
        object: The loaded model pipeline
    """
    filename = f'trained_models/{model_name}_model.pkl'
    with open(filename, 'rb') as file:
        model = pickle.load(file)
    return model
```
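
For completeness, a fitted pipeline can be stored in the same format the loader expects. This is a minimal sketch using the standard `pickle` module; `save_model` is a hypothetical helper, not part of the committed code, and it assumes the `trained_models/` directory already exists:

```python
import pickle

def save_model(pipeline, model_name):
    """Save a fitted pipeline to trained_models/<model_name>_model.pkl (hypothetical helper)."""
    filename = f'trained_models/{model_name}_model.pkl'
    with open(filename, 'wb') as file:
        pickle.dump(pipeline, file)

# Example: save_model(pipeline, 'random_forest')
```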

### Making Predictions

```python
# Example usage
model = load_model('random_forest')
predictions = model.predict(X_test)
```
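
The loaded pipelines expect the same 13 columns used during training: the eleven physicochemical features plus the one-hot `red`/`white` wine-type indicators. As an illustration, a single wine could be scored as below; the feature values are made up for the example:

```python
import pandas as pd

feature_columns = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol", "red", "white"
]

# Illustrative values only; the last two entries are the red/white indicators
sample = pd.DataFrame(
    [[7.0, 0.27, 0.36, 20.7, 0.045, 45.0, 170.0, 1.001, 3.0, 0.45, 8.8, 0, 1]],
    columns=feature_columns,
)

model = load_model('random_forest')
print(model.predict(sample))  # array of 0 (lower quality) or 1 (higher quality)
```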

## Dependencies

- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn

## Model Performance

The project includes various evaluation metrics for each model (a short example follows the list):

- Accuracy scores
- Classification reports (precision, recall, F1-score)
- Visual comparisons of model performance
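
These metrics are computed with standard scikit-learn utilities; a minimal sketch, assuming `pipeline`, `X_test`, and `y_test` from the Model Pipeline sketch above:

```python
from sklearn.metrics import accuracy_score, classification_report

# pipeline, X_test, y_test as defined in the Model Pipeline sketch
y_pred = pipeline.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))
```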

## Visualization

The project includes several visualization components:

- Feature distribution plots
- Correlation heatmaps (sketched below)
- Bivariate analysis plots
- Model performance comparison plots
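
As one example, a correlation heatmap of the features can be produced with seaborn; a minimal sketch, assuming `df` is the preprocessed DataFrame from the Data Preprocessing sketch:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap over the numeric columns of the preprocessed DataFrame
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f', cmap='Blues')
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()
```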

## Future Improvements

Potential areas for enhancement (a sketch touching on items 1 and 4 follows the list):

1. Hyperparameter tuning for each model
2. Feature selection optimization
3. Ensemble method exploration
4. Cross-validation implementation
5. Addition of more advanced models
6. API development for model deployment
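
For items 1 and 4, tuning and cross-validation can be layered onto the existing pipelines with scikit-learn's GridSearchCV. This is an illustrative sketch, not part of the committed code; the parameter grid is an assumption, and `pipeline`, `X_train`, and `y_train` come from the Model Pipeline sketch above:

```python
from sklearn.model_selection import GridSearchCV

# Hypothetical grid over the 'pca' and 'model' steps of the earlier pipeline sketch
param_grid = {
    'pca__n_components': [5, 8, 11],
    'model__n_estimators': [100, 300],
    'model__max_depth': [None, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")
```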
@@ -0,0 +1,150 @@

import os
import pickle
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt


def load_model(model_name):
    """
    Load a saved model from the trained_models directory
    """
    filename = f'trained_models/{model_name}_model.pkl'
    try:
        with open(filename, 'rb') as file:
            model = pickle.load(file)
        print(f"Successfully loaded {model_name} model")
        return model
    except FileNotFoundError:
        print(f"Error: Model file {filename} not found")
        return None
    except Exception as e:
        print(f"Error loading model: {str(e)}")
        return None


def prepare_data(data_path):
    """
    Prepare the wine quality data for testing, ensuring feature names match training data
    """
    try:
        # Read the data
        df = pd.read_csv(data_path)

        # Convert wine type into 'red'/'white' dummy variables
        df = pd.concat([df, pd.get_dummies(df['type'])], axis=1)
        df = df.drop('type', axis=1)

        # Remove null values
        df = df.dropna()

        # Create binary quality target
        df['quality_binary'] = df['quality'].apply(lambda x: 1 if x > 5 else 0)

        # Select features in the same order as during training
        feature_columns = [
            "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
            "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
            "pH", "sulphates", "alcohol", "red", "white"
        ]

        # Prepare features and target
        X = df[feature_columns]  # Only select the features used in training
        y = df['quality_binary']

        print("Data preparation successful")
        print(f"Features included: {', '.join(X.columns)}")
        return X, y

    except Exception as e:
        print(f"Error preparing data: {str(e)}")
        return None, None


def evaluate_model(model, X, y, model_name):
    """
    Evaluate a model's performance
    """
    try:
        # Make predictions
        y_pred = model.predict(X)

        # Calculate accuracy
        accuracy = accuracy_score(y, y_pred)

        # Print results
        print(f"\nResults for {model_name}:")
        print(f"Accuracy: {accuracy:.4f}")
        print("\nClassification Report:")
        print(classification_report(y, y_pred))

        # Create confusion matrix plot
        plt.figure(figsize=(8, 6))
        cm = confusion_matrix(y, y_pred)
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
        plt.title(f'Confusion Matrix - {model_name}')
        plt.ylabel('True Label')
        plt.xlabel('Predicted Label')
        plt.show()

        return accuracy, y_pred

    except Exception as e:
        print(f"Error evaluating model: {str(e)}")
        return None, None


def test_all_models(data_path):
    """
    Test all saved models and compare their performance
    """
    # List of model names
    model_names = [
        'logistic_regression',
        'decision_tree',
        'random_forest',
        'knn',
        'svc',
        'gradient_boosting'
    ]

    # Prepare data
    X, y = prepare_data(data_path)
    if X is None or y is None:
        return

    # Store results
    results = []

    # Test each model
    for model_name in model_names:
        model = load_model(model_name)
        if model is not None:
            accuracy, _ = evaluate_model(model, X, y, model_name)
            if accuracy is not None:
                results.append({'Model': model_name, 'Accuracy': accuracy * 100})

    # Create comparison plot
    if results:
        results_df = pd.DataFrame(results)
        plt.figure(figsize=(10, 6))
        sns.barplot(x='Model', y='Accuracy', data=results_df)
        plt.xticks(rotation=45)
        plt.title('Model Comparison on Test Data')
        plt.tight_layout()
        plt.show()


def main():
    """
    Main function to run the test script
    """
    print("Starting model testing...")

    # Specify the path to your test data
    data_path = 'winequalityN.csv'  # Update this path as needed

    # Test all models
    test_all_models(data_path)

    print("\nTesting completed!")


if __name__ == "__main__":
    main()
Binary files added (contents not shown):

- Prediction Models/wine_type/trained_models/gradient_boosting_model.pkl (+140 KB)
- Prediction Models/wine_type/trained_models/logistic_regression_model.pkl (+3.17 KB)
- additional binary files (not shown)