This project implements the GLARE methodology, an unsupervised machine-learning system for classifying Brazilian legal documents, specifically Recursos Especiais. It combines text summarization and similarity evaluation to match legal documents to predefined themes from the Brazilian Superior Court of Justice (STJ).
The goal of this project is to automate the classification of legal documents by theme, without the need for labeled training data. The project includes:
- Text Embedding Generation: Creation of embeddings for legal documents (Recursos Especiais) and predefined themes.
- Text Summarization: Summarizing legal documents using one of four techniques: LexRank, Guided LexRank, BERTopic, or Guided BERTopic.
- Similarity Calculation: Evaluating the similarity between document summaries and themes using either BM25 (for text) or cosine similarity (for embeddings).
- Performance Metrics: Calculating various metrics to evaluate the accuracy and performance of the classification system.
The project is composed of the following scripts:
- createEmbedding.py: Generates text embeddings for legal documents and themes using Sentence-BERT.
- createTopics.py: Summarizes legal documents using one of four summarization methods:
- LexRank: A graph-based unsupervised summarization algorithm.
- Guided LexRank: LexRank guided by predefined themes.
- BERTopic: A topic modeling technique using sentence embeddings.
- Guided BERTopic: BERTopic guided by predefined themes (both BERTopic variants are sketched right after this list).
- calcSimilarity.py: Calculates the similarity between the document summary and the themes.
- Uses BM25 for text-based similarity.
- Uses cosine similarity for embedding-based similarity.
- metrics.py: Computes relevant performance metrics (e.g., accuracy, recall, precision) for the classification results.
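For orientation, the snippet below is a minimal sketch of what the plain and guided BERTopic variants do, using the bertopic package directly. The `load_corpus_segments` helper, seed keywords, and model settings are hypothetical illustrations, not the configuration used by createTopics.py.

```python
from bertopic import BERTopic

# `load_corpus_segments` is a hypothetical helper returning one string per
# sentence/segment; BERTopic needs a sizeable corpus for its UMAP/HDBSCAN steps.
docs = load_corpus_segments()

# Plain BERTopic: clusters sentence embeddings into data-driven topics.
topic_model = BERTopic(language="multilingual")
topics, probs = topic_model.fit_transform(docs)

# Guided BERTopic: seed keyword lists bias the clusters toward known themes.
seeds = [
    ["tributário", "prescrição", "imposto"],        # hypothetical theme A
    ["honorários", "sucumbência", "advocatícios"],  # hypothetical theme B
]
guided_model = BERTopic(language="multilingual", seed_topic_list=seeds)
guided_topics, guided_probs = guided_model.fit_transform(docs)

# Inspect the discovered topics and their top keywords.
print(topic_model.get_topic_info().head())
```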
- Clone the repository:
git clone https://github.com/AILAB-CEFET-RJ/r2t
cd r2t/src
- Create a virtual environment (optional but recommended):
python -m venv venv
source venv/bin/activate  # On Linux/Mac
venv\Scripts\activate  # On Windows
- Install the required dependencies:
pip install -r requirements.txt
Use createEmbedding.py to generate embeddings for both legal documents (Recursos Especiais) and themes; a conceptual sketch of the embedding step follows the examples below.
- For legal documents: python createEmbedding.py REsp_completo.csv recurso recurso --clean --begin_point cabimento -v
- For themes: python createEmbedding.py temas_repetitivos.csv tema tema --clean -v
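Conceptually, the embedding step reduces to encoding each text with a Sentence-BERT model and pickling the result. The sketch below assumes the sentence-transformers package and the default model named later in this README; the column name and output structure are assumptions, not necessarily what createEmbedding.py produces.

```python
import pickle
import pandas as pd
from sentence_transformers import SentenceTransformer

# Load the corpus; "recurso" mirrors the column argument in the example above,
# but the exact CSV layout is an assumption.
df = pd.read_csv("REsp_completo.csv")
texts = df["recurso"].tolist()

# Encode every document with the default multilingual Sentence-BERT model.
model = SentenceTransformer("distiluse-base-multilingual-cased-v1")
embeddings = model.encode(texts, show_progress_bar=True)

# Persist texts and vectors together as the .pkl expected downstream
# (hypothetical structure).
with open("corpus.pkl", "wb") as f:
    pickle.dump({"texts": texts, "embeddings": embeddings}, f)
```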
Once embeddings are generated, you can summarize the documents using createTopics.py with one of the summarization methods. python createTopics.py <corpus_embedding> <size> <type> [--verbose] [--seed_list <seed_list>] [<model>]
- corpus_embedding: Path to the corpus embeddings file (.pkl file).
- size: Number of sentences or topics to summarize.
- type: Type of topic generation:
- B: BERTopic
- G: Guided BERTopic
- L: LexRank
- X: Guided LexRank
- --verbose: Increase the verbosity of the process.
- --seed_list: Path to the seed list (required for type G or X).
- <model>: Sentence-BERT model used to generate embeddings (optional; default: distiluse-base-multilingual-cased-v1).
- Topic generation with BERTopic: python createTopics.py corpus.pkl 10 B
- Topic generation with Guided BERTopic: python createTopics.py corpus.pkl 10 G --seed_list seeds.csv
- Summary generation with LexRank: python createTopics.py corpus.pkl 5 L
- Summary generation with Guided LexRank: python createTopics.py corpus.pkl 5 X --seed_list seeds.csv
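As a reference for what the LexRank variants compute, here is a minimal, self-contained sketch of graph-based sentence ranking over precomputed sentence embeddings. The threshold, damping factor, and function name are illustrative assumptions rather than the implementation inside createTopics.py.

```python
import numpy as np

def lexrank(embeddings, num_sentences=5, threshold=0.1, damping=0.85, iters=100):
    """Return indices of the most central sentences (illustrative sketch)."""
    # Cosine similarity matrix between all sentence embeddings.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    # Keep edges above the threshold, then row-normalize into a
    # stochastic matrix describing a random walk over sentences.
    adj = np.where(sim >= threshold, sim, 0.0)
    rows = adj.sum(axis=1, keepdims=True)
    rows[rows == 0] = 1.0
    transition = adj / rows
    # PageRank-style power iteration yields a centrality score per sentence.
    n = len(embeddings)
    scores = np.ones(n) / n
    for _ in range(iters):
        scores = (1 - damping) / n + damping * transition.T @ scores
    # The highest-scoring sentences form the extractive summary.
    return np.argsort(scores)[::-1][:num_sentences]

# Example: the 5 most central sentences of a toy 40-sentence document.
print(lexrank(np.random.rand(40, 512), num_sentences=5))
```

In the guided variant, the predefined themes (via --seed_list) additionally influence which sentences rank highest.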
After summarizing the documents, use calcSimilarity.py to compute the similarity between the document summaries and the themes. python calcSimilarity.py <corpus_file> <themes_file> <rank> <type>
- <corpus_file> is the path to the corpus file in pickle format.
- <themes_file> is the path to the themes file in pickle format.
- <rank> is the number of top results to retrieve.
- <type> is the type of similarity:
- B indicates that the BM25 method should be used for similarity calculation.
- C indicates that the Cosine Similarity method should be used for similarity calculation.
- For text-based similarity (using BM25): python calcSimilarity.py <corpus_file> <themes_file> <rank> B
- For embedding-based similarity (using cosine similarity): python calcSimilarity.py <corpus_file> <themes_file> <rank> C
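Conceptually, both modes rank each theme against a document summary; only the scoring function differs. The sketch below uses the rank_bm25 package for BM25 and plain numpy for cosine similarity; the data, vector dimensionality, and variable names are hypothetical, not the script's internals.

```python
import numpy as np
from rank_bm25 import BM25Okapi

themes = ["prescrição tributária", "honorários advocatícios", "dano moral"]
summary = "discute-se a prescrição do crédito tributário"

# --- Type B: BM25 over tokens, with the summary as the query.
bm25 = BM25Okapi([t.split() for t in themes])
bm25_scores = bm25.get_scores(summary.split())

# --- Type C: cosine similarity over embeddings (placeholder vectors here;
# in practice these would be the Sentence-BERT embeddings from step 1).
theme_vecs = np.random.rand(len(themes), 512)
summary_vec = np.random.rand(512)
cos_scores = theme_vecs @ summary_vec / (
    np.linalg.norm(theme_vecs, axis=1) * np.linalg.norm(summary_vec)
)

# Keep the <rank> highest-scoring themes.
rank = 2
top = np.argsort(bm25_scores)[::-1][:rank]
print([themes[i] for i in top])
```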
The program will generate a CSV file with the similarity results. The file will be named CLASSIFIED_<corpus_name>_<method>.csv, where <method> is BM25 or COSINE, depending on the similarity method used.
Example Output
- For BM25: CLASSIFIED_TOPICS_L10CLEAN_BM25.csv
- For Cosine Similarity: CLASSIFIED_TOPICS_L10CLEAN_COSINE.csv
- Ensure the input files are in pickle format and contain the expected structure.
- The rank parameter determines how many of the top-ranked similar items are retrieved and included in the output.
Finally, use metrics.py to calculate metrics and evaluate the system’s performance. It computes metrics such as Recall, F1-Score, MAP (Mean Average Precision), NDCG (Normalized Discounted Cumulative Gain), and MRR (Mean Reciprocal Rank) based on the provided classified data.
python metrics.py CLASSIFIED_TOPICS_B10CLEAN_BM25.csv
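For intuition, the toy example below computes Recall@k and MRR by hand for two documents; the data layout is hypothetical and does not reflect the CSV format metrics.py actually parses.

```python
import numpy as np

# For each document: the themes ranked by similarity, plus the true theme.
ranked = [
    ["tema_123", "tema_45", "tema_9"],   # document 1
    ["tema_45", "tema_123", "tema_9"],   # document 2
]
truth = ["tema_123", "tema_9"]

def recall_at_k(ranked, truth, k):
    # Fraction of documents whose true theme appears among the top k.
    return float(np.mean([t in r[:k] for r, t in zip(ranked, truth)]))

def mrr(ranked, truth):
    # Mean of 1/position of the true theme (0 if it is absent).
    return float(np.mean(
        [1.0 / (r.index(t) + 1) if t in r else 0.0 for r, t in zip(ranked, truth)]
    ))

print(recall_at_k(ranked, truth, k=2))  # 0.5 -> doc 2's theme is ranked 3rd
print(mrr(ranked, truth))               # (1/1 + 1/3) / 2 ≈ 0.667
```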