diff --git a/blog/2023-11-14-llm-decontaminator.md b/blog/2023-11-14-llm-decontaminator.md
new file mode 100644
index 00000000..c0957a6d
--- /dev/null
+++ b/blog/2023-11-14-llm-decontaminator.md
@@ -0,0 +1,130 @@
+---
+title: "Catch me if you can! How to beat GPT-4 with a 13B model"
+author: "Shuo Yang*, Wei-Lin Chiang*, Lianmin Zheng*, Joseph E. Gonzalez, Ion Stoica"
+date: "Nov 14, 2023"
+previewImg: /images/blog/decontaminator/rephrase-score_with_border.png
+---
+
+
+Announcing Llama-rephraser: 13B models reaching GPT-4 performance in major benchmarks (MMLU/GSM-8K/HumanEval)!
+To ensure result validity, we followed OpenAI's decontamination method and found no evidence of data contamination.
+
+<img src="/images/blog/decontaminator/llama-rephraser.png" style="display:block; margin:auto; max-width:80%;">
+
+
+What's the trick behind it? Well, rephrasing the test set is all you need! We simply paraphrase a test sample or translate it into a different language. It turns out a 13B LLM is smart enough to "generalize" beyond such variations and reach remarkably high benchmark performance. So, did we just make a big breakthrough? Apparently, there is something wrong with our understanding of contamination.
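+
+For illustration, here is a minimal sketch of how a test sample could be rephrased with an LLM. The prompt wording and helper name are our own assumptions, not the exact Llama-rephraser pipeline.
+
+```python
+# Hypothetical sketch: rephrase a test sample with GPT-4. The prompt wording
+# and helper name are illustrative assumptions.
+from openai import OpenAI
+
+client = OpenAI()  # expects OPENAI_API_KEY in the environment
+
+def rephrase(test_case: str) -> str:
+    prompt = (
+        "Rephrase the following question without changing its meaning or its "
+        "answer. Vary the wording and sentence structure as much as possible.\n\n"
+        f"{test_case}"
+    )
+    resp = client.chat.completions.create(
+        model="gpt-4",
+        messages=[{"role": "user", "content": prompt}],
+    )
+    return resp.choices[0].message.content
+```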
+
+In this blog post, we point out why contamination is still poorly understood and how existing decontamination measures fail to capture such nuances. To address such risks, we propose a stronger [LLM-based decontaminator](https://github.com/lm-sys/llm-decontaminator) and apply it to real-world training datasets (e.g., the Stack, RedPajama), revealing significant test overlap with widely used benchmarks.
+For more technical details, please refer to our [paper](https://arxiv.org/pdf/2311.04850.pdf).
+
+
+## **What's wrong with existing decontamination measures?**
+
+Contamination occurs when test set information leaks into the training set, resulting in an overly optimistic estimate of the model's performance.
+Despite being recognized as a crucial issue, understanding and detecting contamination remains an open and challenging problem.
+
+The most commonly used approaches are n-gram overlap and embedding similarity search.
+N-gram overlap relies on string matching to detect contamination and is widely used in the development of leading models such as [GPT-4](https://arxiv.org/pdf/2303.08774.pdf), [PaLM](https://arxiv.org/pdf/2204.02311.pdf), and [Llama-2](https://arxiv.org/pdf/2307.09288.pdf).
+Embedding similarity search uses the embeddings of pre-trained models (e.g., BERT) to find similar and potentially contaminated examples.
+
+However, we show that simple variations of the test data (e.g., paraphrasing, translation) can easily bypass these detection methods.
+We refer to such variations of test cases as _Rephrased Samples_.
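+
+To make this concrete, below is a minimal sketch of an n-gram overlap check; the 13-gram size and the example strings are our own illustrative choices (production systems differ in tokenization and gram size). A light paraphrase shares no long n-gram with the original, so the check reports no contamination.
+
+```python
+# A minimal sketch of n-gram overlap detection over whitespace tokens.
+# The gram size (13) and the example strings are illustrative assumptions.
+
+def ngrams(text: str, n: int = 13) -> set:
+    tokens = text.lower().split()
+    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
+
+def ngram_overlap(train_doc: str, test_case: str, n: int = 13) -> bool:
+    # Flag contamination if any n-gram of the test case appears verbatim
+    # in the training document.
+    return bool(ngrams(test_case, n) & ngrams(train_doc, n))
+
+test = ("Which of the following statements best describes the role of the "
+        "Electoral College in United States presidential elections?")
+rephrased = ("In US presidential elections, which statement most accurately "
+             "captures what the Electoral College does?")
+
+print(ngram_overlap(test, test))       # True: an exact copy is caught
+print(ngram_overlap(rephrased, test))  # False: the paraphrase slips through
+```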
+
+Below we demonstrate a rephrased sample from the MMLU benchmark. We show that if such samples are included in the training set, a 13B model can reach remarkably high performance (85.9 on MMLU).
+Unfortunately, existing detection methods (e.g., n-gram overlap, embedding similarity) fail to detect such contamination. For instance, the embedding similarity approach struggles to distinguish the rephrased question from other questions in the same subject (high school US history).
+
+
+<img src="/images/blog/decontaminator/llama-Frank.png" style="display:block; margin:auto; max-width:80%;">
+
+<img src="/images/blog/decontaminator/MMLU-us-f1score.png" style="display:block; margin:auto; max-width:80%;">
+
+With similar rephrasing techniques, we observe consistent results in widely used coding and math benchmarks such as HumanEval and GSM-8K (shown in the cover figure). Therefore, being able to detect such rephrased samples becomes critical.
+
+
+
+## **Stronger Detection Method: LLM Decontaminator**
+
+To address the risk of possible contamination, we propose a new contamination detection method, the "LLM decontaminator".
+
+The LLM decontaminator involves two steps (a minimal code sketch is shown after the list):
+
+ 1. For each test case, the LLM decontaminator identifies the top-k training items with the highest similarity using embedding similarity search.
+ 2. From these items, the LLM decontaminator generates k potential rephrased pairs, and an advanced LLM, such as GPT-4, evaluates whether each pair is a rephrasing of the same sample.
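+
+Below is a minimal sketch of this two-step pipeline, assuming the sentence-transformers and openai packages; the prompt wording, function names, and choice of embedding model are illustrative, not the exact open-source implementation.
+
+```python
+# A minimal sketch of the two-step detection pipeline. Prompt wording and
+# function names are illustrative assumptions.
+from openai import OpenAI
+from sentence_transformers import SentenceTransformer, util
+
+embed_model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")  # one possible choice
+client = OpenAI()  # expects OPENAI_API_KEY in the environment
+
+def top_k_similar(test_case: str, train_set: list[str], k: int = 4) -> list[str]:
+    # Step 1: embedding similarity search over the training set.
+    test_emb = embed_model.encode(test_case, convert_to_tensor=True)
+    train_emb = embed_model.encode(train_set, convert_to_tensor=True)
+    hits = util.semantic_search(test_emb, train_emb, top_k=k)[0]
+    return [train_set[hit["corpus_id"]] for hit in hits]
+
+def is_rephrased(test_case: str, train_case: str) -> bool:
+    # Step 2: ask a strong LLM whether the pair is a rephrasing.
+    prompt = (
+        "Determine whether the two texts below are rephrasings of the same "
+        "sample. Answer only True or False.\n\n"
+        f"Text 1: {test_case}\n\nText 2: {train_case}"
+    )
+    resp = client.chat.completions.create(
+        model="gpt-4", messages=[{"role": "user", "content": prompt}]
+    )
+    return "true" in resp.choices[0].message.content.lower()
+
+def is_contaminated(test_case: str, train_set: list[str], k: int = 4) -> bool:
+    candidates = top_k_similar(test_case, train_set, k)
+    return any(is_rephrased(test_case, c) for c in candidates)
+```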
+
+Results show that our proposed LLM decontaminator works significantly better than existing methods at detecting rephrased samples.
+
+### **Evaluating Different Detection Methods**
+
+To compare different detection methods, we use the MMLU benchmark to construct 200 prompt pairs from the original and rephrased test sets: 100 random pairs and 100 rephrased pairs.
+The F1 score on these pairs measures a detection method's ability to detect contamination, with higher values indicating more precise detection.
+As shown in the following table, all detection methods except the LLM decontaminator introduce some false positives. Both rephrased and translated samples evade n-gram overlap detection, and with multi-qa BERT embeddings, embedding similarity search proves ineffective against translated samples. Our proposed LLM decontaminator is more robust in all cases, with the highest F1 scores.
+
+<img src="/images/blog/decontaminator/MMLU-f1score.png" style="display:block; margin:auto; max-width:80%;">
+
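+For reference, here is a minimal sketch of how such an evaluation could be scored, assuming each detector returns a boolean per pair; the variable names are illustrative.
+
+```python
+# Hypothetical scoring loop for the 200 labeled pairs
+# (label 1 = rephrased pair, label 0 = random pair).
+from sklearn.metrics import f1_score
+
+def evaluate(detector, pairs, labels):
+    # detector(train_case, test_case) -> True if the pair is flagged.
+    preds = [detector(train_case, test_case) for train_case, test_case in pairs]
+    return f1_score(labels, preds)
+
+# e.g., with the n-gram check sketched earlier:
+# print(evaluate(ngram_overlap, pairs, labels))
+```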
+
+
+## **Contamination in Real-World Dataset**
+
+We apply the LLM decontaminator to widely used real-world datasets (e.g., the Stack, RedPajama) and identify a substantial amount of rephrased samples. The table below displays the contamination percentage of different benchmarks in each training dataset.
+
+
+<img src="/images/blog/decontaminator/real-world-rephrase.png" style="display:block; margin:auto; max-width:80%;">
+
+Below we show some detected samples.
+
+[CodeAlpaca](https://github.com/sahil280114/codealpaca) contains 20K synthetic instruction-following examples generated by GPT and is widely used for instruction fine-tuning (e.g., [Tulu](https://huggingface.co/TheBloke/tulu-30B-fp16)).
+
+A rephrased example in CodeAlpaca is shown below.
+
+<img src="/images/blog/decontaminator/codealpaca-rephrase.png" style="display:block; margin:auto; max-width:80%;">
+
+This suggests contamination may be subtly present in synthetic data generated by LLMs. The Phi-1 [report](https://arxiv.org/pdf/2306.11644.pdf) also discovers semantically similar test samples that are undetectable by n-gram overlap.
+
+
+[MATH](https://github.com/hendrycks/math) is a widely recognized math training dataset that spans various mathematical domains, including algebra, geometry, and number theory.
+Surprisingly, we even find contamination across the train-test split of the MATH benchmark, as shown below.
+
+
+<img src="/images/blog/decontaminator/MATH-rephrase.png" style="display:block; margin:auto; max-width:80%;">
+
+[StarCoder-Data](https://huggingface.co/datasets/bigcode/starcoderdata) is used for training StarCoder and StarCoderBase, and it contains 783GB of code in 86 programming languages. In the StarCoder [paper](https://arxiv.org/pdf/2305.06161.pdf), the code training data was decontaminated by removing files that contained docstrings or solutions from HumanEval. However, the LLM decontaminator still detects some rephrased samples, as shown below.
+
+<img src="/images/blog/decontaminator/starcoder-rephrase.png" style="display:block; margin:auto; max-width:80%;">
+
+## **Use LLM Decontaminator to Scan Your Data**
+
+Based on the above study, we suggest the community adopt a stronger decontamination method when using any public benchmarks. Our proposed LLM decontaminator is open-sourced on GitHub.
+Here we show how to remove rephrased samples from training data using the LLM decontaminator tool. The following example can be found [here](https://github.com/lm-sys/llm-decontaminator#detect).
+
+[Pre-process](https://github.com/lm-sys/llm-decontaminator#pre-process) training data and test data.
+The LLM decontaminator accepts the dataset in jsonl format, with each line corresponding to a `{"text": data}` entry.
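+
+As a minimal sketch, a list of strings can be converted into this format as follows; the file names and example strings are placeholders.
+
+```python
+# Write each item as a {"text": ...} JSON object, one per line.
+import json
+
+def to_jsonl(items: list[str], path: str) -> None:
+    with open(path, "w") as f:
+        for text in items:
+            f.write(json.dumps({"text": text}) + "\n")
+
+# Placeholder data and paths; substitute your own training and test sets.
+to_jsonl(["training example 1", "training example 2"], "train.jsonl")
+to_jsonl(["test example 1"], "test.jsonl")
+```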
+
+Run [End2End](https://github.com/lm-sys/llm-decontaminator#end2end) detection.
+The following command builds a database of top-k similar pairs using Sentence-BERT embeddings and then uses GPT-4 to check, one by one, whether each pair is a rephrased sample. You can select the embedding model and detection model by modifying the parameters.
+
+
+<img src="/images/blog/decontaminator/run-e2e.png" style="display:block; margin:auto; max-width:80%;">
+
+## **Conclusion**
+
+In this blog post, we show that contamination is still poorly understood. With our proposed decontamination method, we reveal significant, previously unknown test overlap in real-world datasets. We encourage the community to rethink benchmarks and contamination in the context of LLMs, and to adopt stronger decontamination tools when evaluating LLMs on public benchmarks.
+
+
+## **Acknowledgment**
+
+We would like to express our gratitude to Ying Sheng for the early discussion on rephrased samples.
+We also extend our thanks to Dacheng Li, Erran Li, Hao Liu, Jacob Steinhardt, Hao Zhang, and Siyuan Zhuang for providing insightful feedback.
+
+
+## **Citation**
+
+```
+@misc{yang2023rethinking,
+ title={Rethinking Benchmark and Contamination for Language Models with Rephrased Samples},
+ author={Shuo Yang and Wei-Lin Chiang and Lianmin Zheng and Joseph E. Gonzalez and Ion Stoica},
+ year={2023},
+ eprint={2311.04850},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+}
+```
\ No newline at end of file
diff --git a/public/images/blog/decontaminator/MATH-rephrase.png b/public/images/blog/decontaminator/MATH-rephrase.png
new file mode 100644
index 00000000..3d0ef697
Binary files /dev/null and b/public/images/blog/decontaminator/MATH-rephrase.png differ
diff --git a/public/images/blog/decontaminator/MMLU-f1score.png b/public/images/blog/decontaminator/MMLU-f1score.png
new file mode 100644
index 00000000..3cc83934
Binary files /dev/null and b/public/images/blog/decontaminator/MMLU-f1score.png differ
diff --git a/public/images/blog/decontaminator/MMLU-us-f1score.png b/public/images/blog/decontaminator/MMLU-us-f1score.png
new file mode 100644
index 00000000..292a4d2d
Binary files /dev/null and b/public/images/blog/decontaminator/MMLU-us-f1score.png differ
diff --git a/public/images/blog/decontaminator/codealpaca-rephrase.png b/public/images/blog/decontaminator/codealpaca-rephrase.png
new file mode 100644
index 00000000..15b4304a
Binary files /dev/null and b/public/images/blog/decontaminator/codealpaca-rephrase.png differ
diff --git a/public/images/blog/decontaminator/gsm-8k-rephrase.png b/public/images/blog/decontaminator/gsm-8k-rephrase.png
new file mode 100644
index 00000000..c7e87659
Binary files /dev/null and b/public/images/blog/decontaminator/gsm-8k-rephrase.png differ
diff --git a/public/images/blog/decontaminator/llama-Frank.png b/public/images/blog/decontaminator/llama-Frank.png
new file mode 100644
index 00000000..ebddc41b
Binary files /dev/null and b/public/images/blog/decontaminator/llama-Frank.png differ
diff --git a/public/images/blog/decontaminator/llama-rephraser.png b/public/images/blog/decontaminator/llama-rephraser.png
new file mode 100644
index 00000000..439a4b9a
Binary files /dev/null and b/public/images/blog/decontaminator/llama-rephraser.png differ
diff --git a/public/images/blog/decontaminator/overview.png b/public/images/blog/decontaminator/overview.png
new file mode 100644
index 00000000..3c4a51a8
Binary files /dev/null and b/public/images/blog/decontaminator/overview.png differ
diff --git a/public/images/blog/decontaminator/real-world-rephrase.png b/public/images/blog/decontaminator/real-world-rephrase.png
new file mode 100644
index 00000000..f36ad235
Binary files /dev/null and b/public/images/blog/decontaminator/real-world-rephrase.png differ
diff --git a/public/images/blog/decontaminator/rephrase-score_with_border.png b/public/images/blog/decontaminator/rephrase-score_with_border.png
new file mode 100644
index 00000000..21923fb8
Binary files /dev/null and b/public/images/blog/decontaminator/rephrase-score_with_border.png differ
diff --git a/public/images/blog/decontaminator/run-e2e.png b/public/images/blog/decontaminator/run-e2e.png
new file mode 100644
index 00000000..13a378d6
Binary files /dev/null and b/public/images/blog/decontaminator/run-e2e.png differ
diff --git a/public/images/blog/decontaminator/starcoder-rephrase.png b/public/images/blog/decontaminator/starcoder-rephrase.png
new file mode 100644
index 00000000..6b97766b
Binary files /dev/null and b/public/images/blog/decontaminator/starcoder-rephrase.png differ