The official release of our project Conic10K, a large-scale dataset for closed-vocabulary math problem understanding and reasoning. The paper "CONIC10K: A Challenging Math Problem Understanding and Reasoning Dataset" was accepted to EMNLP 2023 Findings.
To run the codes, you need to install the requirements:
conda create -n conic10k python=3.8
pip install torch==1.12.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
Our dataset is located in folder conic10k
.
You can also get our dataset in huggingface datasets.
from datasets import load_dataset
dataset = load_dataset("WenyangHui/Conic10K")
train_dataset = dataset["train"]
print(train_dataset[1])
# {'text': '已知双曲线$\\frac{x^{2}}{4}-\\frac{y^{2}}{m^{2}}=1(m>0)$的一条渐近线方程是$5 x-2 y=0$,则$m$=?', 'answer_expressions': '5', 'fact_expressions': 'G: Hyperbola;m: Number;m>0;Expression(G) = (x^2/4 - y^2/m^2 = 1);Expression(OneOf(Asymptote(G))) = (5*x - 2*y = 0)', 'query_expressions': 'm', 'fact_spans': '[[[2, 49]], [[71, 74]], [[5, 49]], [[2, 49]], [[2, 69]]]', 'query_spans': '[[[71, 76]]]', 'process': '双曲线\\frac{x^{2}}{4}-\\frac{y^{2}}{m2}=1(m>0)的渐近线方程为y=\\pm\\frac{m}{2}x直线5x-2y=0的方程可化为y=\\frac{5}{2}x,所以,m=5.'}
Each sample in our dataset contains the following attributes.
Attribute | Description |
---|---|
text | Question text in natural language with math formulas in latex. |
fact_expressions | Formal representation of the facts in the question. |
query_expressions | Formal representation of the queries in the question. |
answer_expressions | Answer to the question |
fact_spans | Text span corresponding to each expression in fact_expressions. |
query_spans | Text span corresponding to each expression in query_expressions. |
process | Rationale |
For more information about the annotation of this dataset, please refer to the folder docs
in this repo.
Run the following script to train a model.
# Train a causal language model
sh scripts/train_clm.sh
# Train a encoder decoder model
sh scripts/train_encoder_decoder.sh
Run the following script to generate with a model.
python src/generate.py \
--task semantic_parsing \
--model_name_or_path llama-7b \
--output_file outputs/semantic_parsing_llama_7b_lora.json \
--lora_path llama-7b-semantic-parsing-lora
Run the following script to automatically evaluate the generation results in semantic parsing.
python src/semantic_evaluate.py \
--prediction_file outputs/semantic_parsing_llama_7b_lora.json \
--split test \
--report_file outputs/semantic_parsing_llama_7b_lora_report.json
This project is MIT licensed.