Term project for Fall 2022 Natural Language Processing (4190.678) course at Seoul National University
Terms of Service (ToS) are legal agreements between a service provider and a user. However, they are long and complex, so users often accept conditions they have not fully understood. This project aims to help users understand these agreements while staying faithful to the legal context by developing an extractive summarization model tailored to ToS documents. We build on MemSum (Gu et al., 2022), a SOTA model for long document summarization on the GovReport dataset, and introduce specialized improvements to better handle the legal jargon and complex sentence structures commonly found in ToS. Our model improves legal sentence encoding in two directions: 1) using legal word2vec embeddings (MemSum-LegalEmb) and 2) leveraging a Transformer-based architecture via SBERT embeddings (MemSum-SBERT). For evaluation, we construct the ToS;DR dataset from a community-driven website in which contributors highlight salient points in various service terms. Experiments show that our enhanced models outperform the baseline MemSum model, and ablation studies indicate a further performance gain from expanding the reference summary set.
The base model is MemSum (Gu et al., 2022). Our approach is to improve the Local Sentence Encoder (LSE) module of MemSum by the following:
- integrating legal domain word2vec embedding models from SigmaLaw (Sugathadasa et al., 2017), which we call MemSum-LegalEmb
- integrating SBERT sentence embeddings provided by sentence-transformers/all-mpnet-base-v2, which we call MemSum-SBERT
For validation, we run experiments on GovReport (Huang et al., 2021), the dataset on which our baseline reports SOTA results. This tests our models' performance on a domain with characteristics similar to ToS.
For the main evaluation, we build the ToS;DR dataset by scraping ToS documents and user-highlighted text from tosdr.org.
For a single ToS document:
- The gold summary is composed of the sentences highlighted by contributors in the ToS;DR community.
- Oracle summaries are candidate summaries built greedily: at each step, we add the sentence that most improves the average ROUGE score of the sentences selected so far. The average is taken over ROUGE-1, ROUGE-2, and ROUGE-L, which measure unigram, bigram, and longest common subsequence overlap, respectively.
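The greedy oracle construction described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names (`greedy_oracle`, `avg_rouge`) are hypothetical, tokenization is plain whitespace splitting, and the ROUGE functions are simplified F1 approximations rather than the official ROUGE implementation. The stopping rule (stop when no sentence improves the score) is also an assumption.

```python
from collections import Counter
from typing import List, Tuple


def _ngrams(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap: int, cand_total: int, ref_total: int) -> float:
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    p, r = overlap / cand_total, overlap / ref_total
    return 2 * p * r / (p + r)


def rouge_n(cand: List[str], ref: List[str], n: int) -> float:
    # Simplified ROUGE-N F1 based on clipped n-gram overlap.
    c, r = _ngrams(cand, n), _ngrams(ref, n)
    return _f1(sum((c & r).values()), sum(c.values()), sum(r.values()))


def _lcs_len(a: List[str], b: List[str]) -> int:
    # Length of the longest common subsequence (row-by-row dynamic programming).
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(cand: List[str], ref: List[str]) -> float:
    # Simplified ROUGE-L F1 over the whole summary treated as one sequence.
    return _f1(_lcs_len(cand, ref), len(cand), len(ref))


def avg_rouge(cand: List[str], ref: List[str]) -> float:
    return (rouge_n(cand, ref, 1) + rouge_n(cand, ref, 2) + rouge_l(cand, ref)) / 3


def greedy_oracle(sentences: List[str], gold: str,
                  max_sentences: int) -> Tuple[List[int], float]:
    """Return (indices in selection order, final average ROUGE)."""
    ref = gold.lower().split()
    selected: List[int] = []
    best = 0.0
    while len(selected) < max_sentences:
        best_idx = None
        for i in range(len(sentences)):
            if i in selected:
                continue
            # Score the candidate subset with sentences kept in document order.
            tokens = [t for j in sorted(selected + [i])
                      for t in sentences[j].lower().split()]
            score = avg_rouge(tokens, ref)
            if score > best:
                best, best_idx = score, i
        if best_idx is None:  # assumed stopping rule: no sentence helps
            break
        selected.append(best_idx)
    return selected, best
```

For example, calling `greedy_oracle(doc_sentences, gold_text, max_sentences=13)` would return the indices of up to 13 sentences whose concatenation best matches the gold summary under this simplified score.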
Train | Valid | Test |
---|---|---|
1,611 | 202 | 201 |
tosdr.org is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0), but we are not currently releasing our ToS;DR dataset. To see the dataset format, refer to the tosdr-dataset/sample_data directory in our repo.
We train our model using the Adam optimizer with β1 = 0.9, β2 = 0.999, a fixed learning rate of 0.0001, and a weight decay of 0.000001, and choose the best checkpoint based on validation performance. For dataset-specific hyperparameters, we use the values that performed best when tuning the baseline experiments:
Dataset | max_sentence_num | max_sequence_len | p_stop_threshold | max_extracted_sentences_per_document |
---|---|---|---|---|
GovReport | 500 | 100 | 0.6 | 22 |
ToS;DR | 300 | 50 | 0.6 | 13 |
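As a rough illustration, the settings above could be collected in a small config object. The class and function names below are hypothetical and are not taken from the repo. The sketch also assumes that `max_sentence_num` caps the number of sentences kept per document and `max_sequence_len` caps the number of tokens kept per sentence; the table itself does not state this.

```python
from dataclasses import dataclass
from typing import Dict, List

# Optimizer settings reported above (shared across datasets).
OPTIMIZER = {
    "name": "Adam",
    "betas": (0.9, 0.999),
    "lr": 1e-4,
    "weight_decay": 1e-6,
}


@dataclass(frozen=True)
class DatasetConfig:
    max_sentence_num: int
    max_sequence_len: int
    p_stop_threshold: float
    max_extracted_sentences_per_document: int


# Values copied from the table above.
CONFIGS: Dict[str, DatasetConfig] = {
    "GovReport": DatasetConfig(500, 100, 0.6, 22),
    "ToS;DR": DatasetConfig(300, 50, 0.6, 13),
}


def truncate_document(sentences: List[List[str]],
                      cfg: DatasetConfig) -> List[List[str]]:
    """Assumed preprocessing: keep at most max_sentence_num sentences,
    each cut to at most max_sequence_len tokens."""
    return [s[:cfg.max_sequence_len] for s in sentences[:cfg.max_sentence_num]]
```

Keeping the dataset-specific values in one place makes it easier to rerun the same model on GovReport and ToS;DR without editing training code.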
On GovReport, MemSum-LegalEmb performs comparably to the baseline. MemSum-SBERT approaches the baseline performance even after training for only 5 epochs.
Model | Best Epoch | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|---|
MemSum | 50 | 0.5945 | 0.2851 | 0.5668 |
MemSum-LegalEmb | 40 | 0.5935 | 0.2823 | 0.5658 |
MemSum-SBERT | 5 | 0.5827 | 0.2465 | 0.5507 |
On ToS;DR, both of our models outperform the baseline on all three ROUGE metrics, with MemSum-SBERT scoring highest.
Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|
MemSum | 0.4075 | 0.2598 | 0.3937 |
MemSum-LegalEmb | 0.4141 | 0.2705 | 0.4001 |
MemSum-SBERT | 0.4244 | 0.2732 | 0.4111 |
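To put these gaps in perspective, the short sketch below computes each model's absolute and relative gain over the baseline from the scores in the table above. The helper name `gains_over_baseline` is illustrative only, and no significance testing is implied.

```python
from typing import Dict, Tuple

# ToS;DR scores copied from the table above.
BASELINE = {"ROUGE-1": 0.4075, "ROUGE-2": 0.2598, "ROUGE-L": 0.3937}
MODELS = {
    "MemSum-LegalEmb": {"ROUGE-1": 0.4141, "ROUGE-2": 0.2705, "ROUGE-L": 0.4001},
    "MemSum-SBERT": {"ROUGE-1": 0.4244, "ROUGE-2": 0.2732, "ROUGE-L": 0.4111},
}


def gains_over_baseline(scores: Dict[str, float],
                        baseline: Dict[str, float]) -> Dict[str, Tuple[float, float]]:
    """Map each metric to (absolute gain, relative gain in percent)."""
    return {
        metric: (scores[metric] - base, 100.0 * (scores[metric] - base) / base)
        for metric, base in baseline.items()
    }


if __name__ == "__main__":
    for name, scores in MODELS.items():
        for metric, (abs_gain, rel_gain) in gains_over_baseline(scores, BASELINE).items():
            print(f"{name} {metric}: +{abs_gain:.4f} ({rel_gain:.2f}%)")
```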
Using both gold and oracle summaries as reference summaries yields higher ROUGE scores than using gold summaries alone.
Model | Reference Summaries | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|---|
MemSum-SBERT | Gold | 0.4168 | 0.2597 | 0.4026 |
MemSum-SBERT | Gold + Oracle | 0.4244 | 0.2732 | 0.4111 |
For more results, please refer to our slides.
- Sue Hyun Park: crawling and curating ToS;DR dataset, baseline MemSum experiments, hyperparameter tuning
- Seungmin Han: MemSum-LegalEmb implementation and experiments
- Heekang Park: MemSum-SBERT implementation and experiments