This project was developed for a Kaggle competition focused on detecting Personally Identifiable Information (PII) in student writing. The primary objective was to build a robust model capable of identifying PII with high recall. The DeBERTa v3 transformer model was chosen for this task after comparing its performance with several other models.
- Introduction
- Data Loading & Preprocessing
- Data Augmentation
- Feature Engineering
- Model Building & Training
- Inference
- Results
- Conclusion
- Acknowledgments
The detection of Personally Identifiable Information (PII) is crucial for maintaining privacy and data security. This project uses state-of-the-art transformer models to identify PII in student writing. DeBERTa v3 was selected for its superior recall, achieving a score of 98%.
Data loading and preprocessing are critical steps in any machine learning project. The following steps were undertaken:
- Loading Data: The dataset was loaded into the environment using pandas.
- Splitting Texts: Student texts were split so that no segment exceeded 400 tokens, keeping each segment (plus special tokens) within the DeBERTa model's pre-trained context length of 512 tokens.
- Tokenization: The texts were tokenized using the DeBERTa tokenizer.
- Label Encoding: Labels for PII were encoded to match the input tokens.
The preprocessing steps improved recall by 4%.
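The sketch below illustrates these loading, splitting, tokenization, and label-encoding steps under a few assumptions: the training file and its `tokens`/`labels` fields are named as shown here (assumed, not confirmed by the source), and a fast Hugging Face tokenizer for `microsoft/deberta-v3-base` is used so word-level labels can be aligned to sub-word tokens. It is a minimal illustration, not the exact project code.

```python
import pandas as pd
from transformers import AutoTokenizer

# Assumed file and field names ("train.json", "tokens", "labels"); adjust to the actual data.
df = pd.read_json("train.json")

# A fast tokenizer is required for the word_ids() alignment below.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

MAX_WORDS = 400  # keep each segment well under the 512-token context length

def split_into_chunks(words, labels, max_words=MAX_WORDS):
    """Yield (words, labels) segments of at most max_words words."""
    for start in range(0, len(words), max_words):
        yield words[start:start + max_words], labels[start:start + max_words]

def encode_chunk(words, labels, label2id):
    """Tokenize a segment and align word-level PII labels to sub-word tokens."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True, max_length=512)
    enc["labels"] = [
        -100 if word_idx is None else label2id[labels[word_idx]]  # -100 is ignored by the loss
        for word_idx in enc.word_ids()
    ]
    return enc

label2id = {label: i for i, label in
            enumerate(sorted({l for doc in df["labels"] for l in doc}))}
encoded = [encode_chunk(w, l, label2id)
           for words, labels in zip(df["tokens"], df["labels"])
           for w, l in split_into_chunks(words, labels)]
```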
Several data augmentation techniques were explored to increase the size of the training dataset:
- Synonym Replacement: Replacing words with their synonyms.
- Random Insertion: Inserting random words at random positions in the text.
- Random Swap: Swapping the positions of two words in the text.
- Random Deletion: Deleting random words from the text.
However, these methods did not significantly impact the recall score.
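For illustration, the snippet below sketches two of these EDA-style augmentations (random swap and random deletion) on a word list. The function names and probabilities are assumptions, not the project's exact code; in a token-classification setting, any swap or deletion would also have to be mirrored on the label sequence so word-label alignment is preserved.

```python
import random

def random_swap(words, n=1):
    """Swap the positions of two randomly chosen words, n times."""
    words = list(words)
    for _ in range(n):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    """Drop each word independently with probability p, keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]
```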
Feature engineering was minimal due to the nature of transformer models, which excel at capturing contextual information from raw text. The primary features included:
- Tokenized Texts: The raw texts converted into tokens.
- Attention Masks: Masks to indicate which tokens should be attended to by the model.
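A short example of what these two inputs look like with the Hugging Face tokenizer (the sample sentence is invented purely for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

# Example sentence is invented for illustration only.
enc = tokenizer("Contact Jane Doe at jane.doe@example.edu.",
                truncation=True, max_length=512, return_tensors="pt")

print(enc["input_ids"])       # sub-word token IDs fed to the model
print(enc["attention_mask"])  # 1 for real tokens, 0 for padding
```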
The DeBERTa v3 model was selected after comparing several candidate models:
- Model Comparison:
  - RoBERTa: Recall 88%
  - SpaCy Large: Recall 84%
  - PIILO: Recall 87%
  - DeBERTa v3: Recall 98%
- Training Setup:
  - Optimizer: AdamW
  - Learning Rate: Adjusted using a learning rate scheduler.
  - Loss Function: Cross-Entropy Loss
  - Early Stopping: Implemented to prevent overfitting.
  - Batch Size: Set according to available GPU memory.
- Training Process:
  - The model was trained on the preprocessed data.
  - Early stopping was used to monitor validation loss and stop training when no improvement was observed.
The trained model was used to predict PII on unseen data:
- Text Splitting: Similar to preprocessing, texts were split into manageable segments.
- Tokenization: The segments were tokenized.
- Prediction: The model predicted the presence of PII in each segment.
- Aggregation: Results from all segments were combined to produce the final output.
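A minimal sketch of this segment-wise inference flow, assuming the same fast tokenizer and 400-word chunk size as in preprocessing; the rule of taking each word's first sub-word prediction when mapping back to word-level labels is an assumption for illustration.

```python
import torch

def predict_document(words, model, tokenizer, id2label, max_words=400):
    """Split a document into segments, predict PII labels, and aggregate per word."""
    model.eval()
    word_predictions = []
    for start in range(0, len(words), max_words):
        chunk = words[start:start + max_words]
        enc = tokenizer(chunk, is_split_into_words=True, truncation=True,
                        max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc).logits
        pred_ids = logits.argmax(dim=-1)[0].tolist()
        seen = set()
        for pos, word_idx in enumerate(enc.word_ids()):
            if word_idx is None or word_idx in seen:
                continue  # skip special tokens and repeated sub-words
            seen.add(word_idx)
            word_predictions.append((start + word_idx, id2label[pred_ids[pos]]))
    return word_predictions
```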
The DeBERTa v3 model achieved a recall score of 98%, significantly outperforming other models considered for this task. This high recall is crucial for PII detection, ensuring that most instances of PII are correctly identified.
The DeBERTa v3 transformer model proved to be highly effective in detecting PII in student writing, achieving an outstanding recall score. The project demonstrates the importance of careful preprocessing and model selection in achieving optimal results.
Competition link - https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data