Merge pull request #595 from DarshAgrawal14/main
Added model to analyze the physicochemical attributes of wine
UppuluriKalyani authored Oct 26, 2024
2 parents b9b7c0f + d3e414b commit 644ddd8
Showing 10 changed files with 10,580 additions and 0 deletions.
136 changes: 136 additions & 0 deletions Prediction Models/wine_type/README.md
@@ -0,0 +1,136 @@
# Wine Quality Prediction

This project implements several machine learning models that predict wine quality from physicochemical properties. The models classify each wine into one of two categories (high quality vs. low quality) using features such as acidity, pH, and alcohol content.

## Project Overview

The project uses a dataset containing various chemical properties of wines and their quality ratings. The quality ratings are binarized into two categories:

- 0: Lower quality (rating ≤ 5)
- 1: Higher quality (rating > 5)
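
The threshold above can be sketched in a few lines of pandas. This is a minimal illustration on toy ratings, not the project's training code; the `quality` and `quality_binary` column names follow the test script, and the values are made up.

```python
import pandas as pd

# Toy ratings standing in for the dataset's `quality` column (illustrative values).
df = pd.DataFrame({"quality": [3, 5, 6, 8]})

# Ratings above 5 map to 1 (higher quality); ratings of 5 or below map to 0.
df["quality_binary"] = (df["quality"] > 5).astype(int)

print(df["quality_binary"].tolist())  # [0, 0, 1, 1]
```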

## Features

The following features are used for prediction:

- Fixed acidity
- Volatile acidity
- Citric acid
- Residual sugar
- Chlorides
- Free sulfur dioxide
- Total sulfur dioxide
- Density
- pH
- Sulphates
- Alcohol
- Wine type (red/white)

## Models Implemented

The project implements and compares six different machine learning models:

1. Logistic Regression
2. Decision Tree Classifier
3. Random Forest Classifier
4. K-Nearest Neighbors (KNN) Classifier
5. Support Vector Classifier (SVC)
6. Gradient Boosting Classifier

## Technical Implementation

### Data Preprocessing

- Handling missing values
- Feature scaling using MinMaxScaler
- Dimensionality reduction using PCA
- One-hot encoding for categorical variables
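
As a rough sketch of the missing-value and one-hot encoding steps, assuming the dataset has a `type` column holding `red` or `white` (as the test script does). The toy values below are illustrative only.

```python
import pandas as pd

# Toy frame with illustrative values; the real data comes from winequalityN.csv.
df = pd.DataFrame({
    "type": ["red", "white", "white"],
    "pH": [3.51, None, 3.26],
    "alcohol": [9.4, 10.1, 9.9],
})

# One-hot encode the wine type into `red` and `white` indicator columns.
df = pd.concat([df.drop(columns="type"), pd.get_dummies(df["type"])], axis=1)

# Drop any row that still contains a missing value.
df = df.dropna()

print(list(df.columns))  # ['pH', 'alcohol', 'red', 'white']
print(len(df))           # 2
```

Dropping rows is the simplest way to handle missing values; imputation would be an alternative if too many rows are lost.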

### Model Pipeline

Each model uses a consistent pipeline that includes:

1. Feature scaling
2. PCA transformation
3. Model training and prediction
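
The three steps could be chained with scikit-learn's `Pipeline`, roughly as follows. This is a sketch under assumptions: the data is synthetic, and the step names, `n_components=5`, and `max_iter=1000` are placeholders rather than the project's actual settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 100 samples, 13 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),                    # 1. feature scaling
    ("pca", PCA(n_components=5)),                  # 2. PCA (component count assumed)
    ("model", LogisticRegression(max_iter=1000)),  # 3. classifier
])

pipeline.fit(X, y)
predictions = pipeline.predict(X)
print(predictions.shape)  # (100,)
```

Swapping the final step for another classifier keeps the preprocessing identical across models, which is what makes their scores comparable.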

## Project Structure

```
project/
├── winequalityN.csv              # Input dataset
├── trained_models/               # Directory containing saved models
│   ├── logistic_regression_model.pkl
│   ├── decision_tree_model.pkl
│   ├── random_forest_model.pkl
│   ├── knn_model.pkl
│   ├── svc_model.pkl
│   └── gradient_boosting_model.pkl
└── wine_quality_prediction.py    # Main script
```

## Usage

### Loading a Saved Model

```python
import pickle


def load_model(model_name):
    """
    Load a saved model from the trained_models directory.

    Parameters:
        model_name (str): Name of the model to load (without '_model.pkl')

    Returns:
        object: The loaded model pipeline
    """
    filename = f'trained_models/{model_name}_model.pkl'
    with open(filename, 'rb') as file:
        model = pickle.load(file)
    return model
```
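
For completeness, the saving side might look like the sketch below. `save_model` is a hypothetical helper, not a function from this project; it only assumes the `trained_models/<name>_model.pkl` naming convention used by the loader.

```python
import os
import pickle


def save_model(model, model_name, directory="trained_models"):
    """Pickle `model` as <directory>/<model_name>_model.pkl and return the path.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    os.makedirs(directory, exist_ok=True)
    filename = os.path.join(directory, f"{model_name}_model.pkl")
    with open(filename, "wb") as file:
        pickle.dump(model, file)
    return filename
```

Pickle files can execute arbitrary code when loaded, so only load models from sources you trust.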

### Making Predictions

```python
# Example usage: X_test must contain the same feature columns, in the same
# order, that the model was trained on.
model = load_model('random_forest')
predictions = model.predict(X_test)
```

## Dependencies

- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn

## Model Performance

The project reports the following for each model:

- Accuracy scores
- Classification reports (precision, recall, F1-score)
- Visual comparisons of model performance

## Visualization

The project includes several visualization components:

- Feature distribution plots
- Correlation heatmaps
- Bivariate analysis plots
- Model performance comparison plots

## Future Improvements

Potential areas for enhancement:

1. Hyperparameter tuning for each model
2. Feature selection optimization
3. Ensemble method exploration
4. Cross-validation implementation
5. Addition of more advanced models
6. API development for model deployment
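
As one possible starting point for item 4, cross-validation could be applied to the whole pipeline rather than to the classifier alone. The sketch below uses synthetic data; the model choice, `n_components=5`, and `cv=5` are assumptions, not values taken from this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 200 samples, 13 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))
y = (X[:, 0] > 0).astype(int)

pipeline = make_pipeline(
    MinMaxScaler(),
    PCA(n_components=5),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

# The scaler and PCA are refit on each training fold, so no information
# from the held-out fold leaks into preprocessing.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```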
150 changes: 150 additions & 0 deletions Prediction Models/wine_type/test.py
@@ -0,0 +1,150 @@
import os
import pickle
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

def load_model(model_name):
    """
    Load a saved model from the trained_models directory
    """
    filename = f'trained_models/{model_name}_model.pkl'
    try:
        with open(filename, 'rb') as file:
            model = pickle.load(file)
        print(f"Successfully loaded {model_name} model")
        return model
    except FileNotFoundError:
        print(f"Error: Model file {filename} not found")
        return None
    except Exception as e:
        print(f"Error loading model: {str(e)}")
        return None

def prepare_data(data_path):
    """
    Prepare the wine quality data for testing, ensuring feature names match training data
    """
    try:
        # Read the data
        df = pd.read_csv(data_path)

        # One-hot encode 'type' into 'red' and 'white' columns, then drop the original column
        df = pd.concat([df, pd.get_dummies(df['type'])], axis=1)
        df = df.drop('type', axis=1)

        # Remove rows with null values
        df = df.dropna()

        # Create binary quality target (1 if rating > 5, else 0)
        df['quality_binary'] = df['quality'].apply(lambda x: 1 if x > 5 else 0)

        # Select features in the same order as during training
        feature_columns = [
            "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
            "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
            "pH", "sulphates", "alcohol", "red", "white"
        ]

        # Prepare features and target
        X = df[feature_columns]  # Only select the features used in training
        y = df['quality_binary']

        print("Data preparation successful")
        print(f"Features included: {', '.join(X.columns)}")
        return X, y

    except Exception as e:
        print(f"Error preparing data: {str(e)}")
        return None, None

def evaluate_model(model, X, y, model_name):
    """
    Evaluate a model's performance
    """
    try:
        # Make predictions
        y_pred = model.predict(X)

        # Calculate accuracy
        accuracy = accuracy_score(y, y_pred)

        # Print results
        print(f"\nResults for {model_name}:")
        print(f"Accuracy: {accuracy:.4f}")
        print("\nClassification Report:")
        print(classification_report(y, y_pred))

        # Create confusion matrix plot
        plt.figure(figsize=(8, 6))
        cm = confusion_matrix(y, y_pred)
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
        plt.title(f'Confusion Matrix - {model_name}')
        plt.ylabel('True Label')
        plt.xlabel('Predicted Label')
        plt.show()

        return accuracy, y_pred

    except Exception as e:
        print(f"Error evaluating model: {str(e)}")
        return None, None

def test_all_models(data_path):
    """
    Test all saved models and compare their performance
    """
    # List of model names
    model_names = [
        'logistic_regression',
        'decision_tree',
        'random_forest',
        'knn',
        'svc',
        'gradient_boosting'
    ]

    # Prepare data
    X, y = prepare_data(data_path)
    if X is None or y is None:
        return

    # Store results
    results = []

    # Test each model
    for model_name in model_names:
        model = load_model(model_name)
        if model is not None:
            accuracy, _ = evaluate_model(model, X, y, model_name)
            if accuracy is not None:
                results.append({'Model': model_name, 'Accuracy': accuracy * 100})

    # Create comparison plot
    if results:
        results_df = pd.DataFrame(results)
        plt.figure(figsize=(10, 6))
        sns.barplot(x='Model', y='Accuracy', data=results_df)
        plt.xticks(rotation=45)
        plt.title('Model Comparison on Test Data')
        plt.tight_layout()
        plt.show()

def main():
    """
    Main function to run the test script
    """
    print("Starting model testing...")

    # Specify the path to your test data
    data_path = 'winequalityN.csv'  # Update this path as needed

    # Test all models
    test_all_models(data_path)

    print("\nTesting completed!")

if __name__ == "__main__":
    main()
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.