Skip to content

Latest commit

 

History

History
41 lines (33 loc) · 2.44 KB

README.md

File metadata and controls

41 lines (33 loc) · 2.44 KB

TKGT

This is the official code and data repository for the paper TKGT.

Prepare

Before starting, make sure that you have :

  • Installed vllm in a v-env called vllm and download below interface for fast and parallel inference,
cd /TKGT/Hybird_RAG/predict
git clone [email protected]:jiangpw41/LLMInferenceServer.git
  • Installed llama-factory in a v-env called llama_factory for fine-tuning.
  • Installed nltk toolkit in a v-env called tkgt for first-stage and download nltk package under root dir
  • Installed env2.txt in a v-env called tkgt2 for second stage.
  • Downloaded a LLM (like chatglm3/4-6b) for Q&A
  • Downloaded embedding models for embedding and retreving
    • "roberta-large" under TKGT/Hybird_RAG/evaluation for bert-score evaluation.
    • "sentence-transformers/all-MiniLM-L6-v2" and "cross-encoder/ms-marco-MiniLM-L-6-v2" under TKGT/Hybird_RAG/retriever

Dataset

Our project refers to a total of five datasets, four of which are from previous text-to-table tasks (Rotowire, e2e, wikibio, wikitabletext) and the rest one is proposed by us (the CPL). In addition, the data will be stored in two folders, "raw" and "data", respectively, under the root directory of the project. The former contains four copies of the original data (Rotowire, e2e, wikibio, wikitabletext), the latter includes three processed datasets (CPL, Rotowire, e2e) that can be directly used in this project.

"raw"

We use the processed four datasets by https://github.com/shirley-wu/text_to_table , which can be download at https://drive.google.com/file/d/1zTfDFCl1nf_giX7IniY5WbXi9tAuEHDn/view. You need to download this .zip file, unzip it under project root, and rename it as "raw".

As for the CPL dataset, as the original data involves privacy, we only provide processed dataset in "data".

"data"

There are three datasets in table form stored in the "data" directory from total five, and the remaining two (wikibio, wikitabletext) have no limited table fields. For details, please refer to the readme file in the "data" directory.

Code example

Mixed_IE

The method of TKGT's first stage, which include paradigms of regulation, statistics, and DL.

KGs

The semi-automatically constructed Knowledge Graph of datasets by human experts with the help of Mixed_IE. Based on KGs schema, prepare prompt list and ft data in batch.

Hybird-RAG

Using

  • rule-based (prior structural knowledge and regular matching) method and
  • statistics-based (word frequency)
  • semantic-based (similarity)