Merge pull request #595 from DarshAgrawal14/main

Added model to analyze the physicochemical attributes of wine
UppuluriKalyani · Oct 26, 2024 · 644ddd8
2 parents b9b7c0f + d3e414b
commit 644ddd8
Showing 10 changed files with 10,580 additions and 0 deletions.
diff --git a/Prediction Models/wine_type/README.md b/Prediction Models/wine_type/README.md
@@ -0,0 +1,136 @@
+# Wine Quality Prediction
+
+This project trains and compares six machine learning classifiers that predict wine quality from physicochemical properties. Each wine is assigned to one of two classes (higher quality vs. lower quality) using features such as acidity, pH, and alcohol content.
+
+## Project Overview
+
+The project uses a dataset containing various chemical properties of wines and their quality ratings. The quality ratings are binarized into two categories:
+
+- 0: Lower quality (rating ≤ 5)
+- 1: Higher quality (rating > 5)
+
+## Features
+
+The following features are used for prediction:
+
+- Fixed acidity
+- Volatile acidity
+- Citric acid
+- Residual sugar
+- Chlorides
+- Free sulfur dioxide
+- Total sulfur dioxide
+- Density
+- pH
+- Sulphates
+- Alcohol
+- Wine type (red/white)
+
+## Models Implemented
+
+The project implements and compares six different machine learning models:
+
+1. Logistic Regression
+2. Decision Tree Classifier
+3. Random Forest Classifier
+4. K-Nearest Neighbors (KNN) Classifier
+5. Support Vector Classifier (SVC)
+6. Gradient Boosting Classifier
+
+## Technical Implementation
+
+### Data Preprocessing
+
+- Handling missing values
+- One-hot encoding of the categorical wine type (red/white)
+- Feature scaling using MinMaxScaler
+- Dimensionality reduction using PCA
+
+### Model Pipeline
+
+Each model uses a consistent pipeline that includes:
+
+1. Feature scaling
+2. PCA transformation
+3. Model training and prediction
+
+## Project Structure
+
+```
+project/
+│
+├── winequalityN.csv          # Input dataset
+├── trained_models/           # Directory containing saved models
+│   ├── logistic_regression_model.pkl
+│   ├── decision_tree_model.pkl
+│   ├── random_forest_model.pkl
+│   ├── knn_model.pkl
+│   ├── svc_model.pkl
+│   └── gradient_boosting_model.pkl
+└── wine_quality_prediction.py # Main script
+```
+
+## Usage
+
+### Loading a Saved Model
+
+```python
+import pickle
+
+def load_model(model_name):
+    """Load a saved model pipeline from the trained_models directory.
+
+    Parameters:
+        model_name (str): Model name without the '_model.pkl' suffix.
+
+    Returns: the loaded model pipeline.
+    """
+    filename = f'trained_models/{model_name}_model.pkl'
+    with open(filename, 'rb') as file:
+        model = pickle.load(file)
+    return model
+```
+
+### Making Predictions
+
+```python
+# Example usage; X_test must contain the same feature columns, in the same order, as the training data
+model = load_model('random_forest')
+predictions = model.predict(X_test)
+```
+
+## Dependencies
+
+- pandas
+- numpy
+- scikit-learn
+- matplotlib
+- seaborn
+
+## Model Performance
+
+The project includes various evaluation metrics for each model:
+
+- Accuracy scores
+- Classification reports (precision, recall, F1-score)
+- Visual comparisons of model performance
+
+## Visualization
+
+The project includes several visualization components:
+
+- Feature distribution plots
+- Correlation heatmaps
+- Bivariate analysis plots
+- Model performance comparison plots
+
+## Future Improvements
+
+Potential areas for enhancement:
+
+1. Hyperparameter tuning for each model
+2. Feature selection optimization
+3. Ensemble method exploration
+4. Cross-validation implementation
+5. Addition of more advanced models
+6. API development for model deployment
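Note on the README's "Model Pipeline" section: the training script is not part of the diff shown here, so the following is only a minimal sketch of how a scale, PCA, classifier pipeline of that shape could be assembled with scikit-learn. The helper name `build_pipeline`, the choice of `n_components=5`, the `LogisticRegression` settings, and the synthetic data are all assumptions for illustration and are not taken from the PR.

```python
# Sketch of the scale -> PCA -> classifier pipeline described in the README.
# Assumptions (not from the PR): synthetic data, n_components=5, classifier settings.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


def build_pipeline(n_components=5):
    """Return an unfitted pipeline: MinMaxScaler, then PCA, then a classifier."""
    return Pipeline([
        ("scaler", MinMaxScaler()),
        ("pca", PCA(n_components=n_components)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


# Synthetic stand-in for the 13 feature columns; a real run would load winequalityN.csv.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 20% of the rows so the fitted pipeline is scored on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = build_pipeline()
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```

Wrapping the scaler and PCA inside the `Pipeline` means both are fitted on the training rows only and then reapplied unchanged at prediction time, which is also why a single pickled object is enough to make predictions later.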
diff --git a/Prediction Models/wine_type/test.py b/Prediction Models/wine_type/test.py
@@ -0,0 +1,150 @@
+import os
+import pickle
+import pandas as pd
+import numpy as np
+from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
+import seaborn as sns
+import matplotlib.pyplot as plt
+
+def load_model(model_name):
+    """
+    Load a saved model from the trained_models directory
+    """
+    filename = f'trained_models/{model_name}_model.pkl'
+    try:
+        with open(filename, 'rb') as file:
+            model = pickle.load(file)
+        print(f"Successfully loaded {model_name} model")
+        return model
+    except FileNotFoundError:
+        print(f"Error: Model file {filename} not found")
+        return None
+    except Exception as e:
+        print(f"Error loading model: {str(e)}")
+        return None
+
+def prepare_data(data_path):
+    """
+    Prepare the wine quality data for testing, ensuring feature names match training data
+    """
+    try:
+        # Read the data
+        df = pd.read_csv(data_path)
+
+        # One-hot encode the wine type into 'red' and 'white' columns, then drop the original column
+        df = pd.concat([df, pd.get_dummies(df['type'])], axis=1)
+        df = df.drop('type', axis=1)
+
+        # Remove null values
+        df = df.dropna()
+
+        # Create binary quality target
+        df['quality_binary'] = (df['quality'] > 5).astype(int)
+
+        # Select features in the same order as during training
+        feature_columns = [
+            "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
+            "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
+            "pH", "sulphates", "alcohol", "red", "white"
+        ]
+
+        # Prepare features and target
+        X = df[feature_columns]  # Only select the features used in training
+        y = df['quality_binary']
+
+        print("Data preparation successful")
+        print(f"Features included: {', '.join(X.columns)}")
+        return X, y
+
+    except Exception as e:
+        print(f"Error preparing data: {str(e)}")
+        return None, None
+
+def evaluate_model(model, X, y, model_name):
+    """
+    Evaluate a model's performance
+    """
+    try:
+        # Make predictions
+        y_pred = model.predict(X)
+
+        # Calculate accuracy
+        accuracy = accuracy_score(y, y_pred)
+
+        # Print results
+        print(f"\nResults for {model_name}:")
+        print(f"Accuracy: {accuracy:.4f}")
+        print("\nClassification Report:")
+        print(classification_report(y, y_pred))
+
+        # Create confusion matrix plot
+        plt.figure(figsize=(8, 6))
+        cm = confusion_matrix(y, y_pred)
+        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
+        plt.title(f'Confusion Matrix - {model_name}')
+        plt.ylabel('True Label')
+        plt.xlabel('Predicted Label')
+        plt.show()
+
+        return accuracy, y_pred
+
+    except Exception as e:
+        print(f"Error evaluating model: {str(e)}")
+        return None, None
+
+def test_all_models(data_path):
+    """
+    Test all saved models and compare their performance
+    """
+    # List of model names
+    model_names = [
+        'logistic_regression',
+        'decision_tree',
+        'random_forest',
+        'knn',
+        'svc',
+        'gradient_boosting'
+    ]
+
+    # Prepare data
+    X, y = prepare_data(data_path)
+    if X is None or y is None:
+        return
+
+    # Store results
+    results = []
+
+    # Test each model
+    for model_name in model_names:
+        model = load_model(model_name)
+        if model is not None:
+            accuracy, _ = evaluate_model(model, X, y, model_name)
+            if accuracy is not None:
+                results.append({'Model': model_name, 'Accuracy': accuracy * 100})
+
+    # Create comparison plot
+    if results:
+        results_df = pd.DataFrame(results)
+        plt.figure(figsize=(10, 6))
+        sns.barplot(x='Model', y='Accuracy', data=results_df)
+        plt.xticks(rotation=45)
+        plt.title('Model Comparison on Test Data')
+        plt.tight_layout()
+        plt.show()
+
+def main():
+    """
+    Main function to run the test script
+    """
+    print("Starting model testing...")
+
+    # Evaluation data; use a held-out file if the models were trained on this one
+    data_path = 'winequalityN.csv'  # Update this path as needed
+
+    # Test all models
+    test_all_models(data_path)
+
+    print("\nTesting completed!")
+
+if __name__ == "__main__":
+    main()
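Note on `load_model`: the diff shows how the `.pkl` files are read but not how they were written. A minimal sketch of a matching save/load pair follows, assuming the same `<model_name>_model.pkl` naming scheme. The `save_model` helper and the `directory` parameter are hypothetical and do not appear in the PR.

```python
# Sketch of a save/load pair that follows the <model_name>_model.pkl naming
# used by test.py. save_model and the directory parameter are hypothetical.
import os
import pickle


def save_model(model, model_name, directory="trained_models"):
    """Pickle `model` to <directory>/<model_name>_model.pkl and return the path."""
    os.makedirs(directory, exist_ok=True)
    filename = os.path.join(directory, f"{model_name}_model.pkl")
    with open(filename, "wb") as file:
        pickle.dump(model, file)
    return filename


def load_model(model_name, directory="trained_models"):
    """Load a model written by save_model; raises FileNotFoundError if it is missing."""
    filename = os.path.join(directory, f"{model_name}_model.pkl")
    with open(filename, "rb") as file:
        return pickle.load(file)
```

Unpickling executes code embedded in the file, so `.pkl` files should only be loaded from trusted sources. Pickled scikit-learn estimators are also not guaranteed to load correctly under a different scikit-learn version than the one that saved them.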
diff --git a/Prediction Models/wine_type/trained_models/decision_tree_model.pkl b/Prediction Models/wine_type/trained_models/decision_tree_model.pkl
diff --git a/Prediction Models/wine_type/trained_models/gradient_boosting_model.pkl b/Prediction Models/wine_type/trained_models/gradient_boosting_model.pkl
diff --git a/Prediction Models/wine_type/trained_models/knn_model.pkl b/Prediction Models/wine_type/trained_models/knn_model.pkl
diff --git a/Prediction Models/wine_type/trained_models/logistic_regression_model.pkl b/Prediction Models/wine_type/trained_models/logistic_regression_model.pkl
diff --git a/Prediction Models/wine_type/trained_models/random_forest_model.pkl b/Prediction Models/wine_type/trained_models/random_forest_model.pkl
diff --git a/Prediction Models/wine_type/trained_models/svc_model.pkl b/Prediction Models/wine_type/trained_models/svc_model.pkl