Transparency at the Source

Repo for the EMNLP 2023 Findings paper Transparency at the Source: Evaluating and Interpreting Language Models With Access to the True Distribution.

Our pipeline consists of a mix of Java and Python code: grammar induction uses the original Java code of Petrov et al. (2006), and language model training uses the transformers library in Python.

Grammar induction

java -Xmx32g -cp CustomBerkeley.jar edu.berkeley.nlp.PCFGLA.GrammarTrainer -path $path -out $save_dir/stage -treebank SINGLEFILE -mergingPercentage 0.5 -filter 1.0e-8 -SMcycles 5

Masked token PCFG probabilities

$grammarfile should point to the grammar archive file that can be found in the Google Drive resources (500k).

java -cp CustomBerkeley.jar edu.berkeley.nlp.PCFGLA.BerkeleyParser -gr $grammarfile -inputFile $inputfile
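
Once the parser has produced gold masked-token distributions, they can be compared directly against a language model's predictions. A minimal sketch of such a comparison via KL divergence, assuming both distributions are represented as token-to-probability dicts over the same vocabulary (the parser's actual output format may differ):

import math

def kl_divergence(p_gold: dict[str, float], q_model: dict[str, float],
                  eps: float = 1e-12) -> float:
    # KL(P || Q) for one masked position: how far the model's
    # distribution q_model is from the PCFG's gold distribution p_gold.
    return sum(p * math.log(p / max(q_model.get(tok, 0.0), eps))
               for tok, p in p_gold.items() if p > 0.0)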

EarleyX causal PCFG probabilities

java -Xms32768M -classpath "earleyx_fast.jar:lib/*" parser.Main -in data/eval_subset_100.txt -grammar grammars/earleyx.grammar -out results -verbose 1 -thread 1
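
The causal probabilities can be summarised as surprisal or perplexity. A minimal sketch, assuming the per-token probabilities P(w_i | w_1 .. w_{i-1}) have already been extracted from the EarleyX output (whose exact format is not shown here):

import math

def perplexity(token_probs: list[float]) -> float:
    # Exponentiated average negative log-probability over the tokens.
    nll = -sum(math.log(p) for p in token_probs)
    return math.exp(nll / len(token_probs))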

Language model training

python3 main_multi.py \
  --model.model_type microsoft/deberta-base \
  --model.is_mlm \
  --tokenizer.path tokenizers/added_tokens.json \
  --data.data_dir corpora \
  --data.train_file train.txt \
  --trainer.output_dir $save_dir
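
After training, the model's masked-token distributions can be read back with standard transformers calls and compared against the PCFG probabilities above. A minimal sketch, assuming the checkpoint and tokenizer were both saved under $save_dir (the example sentence is purely illustrative; the real corpus consists of PCFG-generated strings):

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

save_dir = "path/to/save_dir"  # the --trainer.output_dir from the command above
tokenizer = AutoTokenizer.from_pretrained(save_dir)
model = AutoModelForMaskedLM.from_pretrained(save_dir).eval()

sentence = f"the {tokenizer.mask_token} sat"  # illustrative input only
inputs = tokenizer(sentence, return_tensors="pt")
# Position of the [MASK] token in the input sequence.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits
probs = logits[0, mask_pos].softmax(dim=-1)  # distribution over the vocabulary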
