Syntax Encoding with Application in Authorship Attribution

Dataset and code for the paper: Syntax Encoding with Application in Authorship Attribution. Richong Zhang, Zhiyuan Hu, Hongyu Guo, Yongyi Mao. EMNLP 2018 [pdf]

You can get the datasets used in the paper here.

Overview

We propose a novel strategy to encode the syntax parse tree of a sentence into a learnable distributed representation. The proposed syntax encoding scheme is provably information lossless. Specifically, an embedding vector is constructed for each word in the sentence, encoding the path in the syntax tree corresponding to the word. The one-to-one correspondence between these “syntax-embedding” vectors and the words (hence their embedding vectors) in the sentence makes it easy to integrate such a representation with all word-level NLP models. We empirically show the benefits of the syntax embeddings in the Authorship Attribution domain, where our approach improves upon the prior art and achieves new performance records on five benchmark data sets.
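To make the "path in the syntax tree corresponding to the word" concrete, here is a minimal sketch in plain Python. It is an illustration, not the code in this repository: the nested-tuple tree format and the function name `word_paths` are hypothetical, and it only extracts the sequence of labels from the root down to each word. It does not reproduce the learnable embedding or the lossless encoding described in the paper.

```python
# Hypothetical tree format (an assumption for this sketch):
# a node is a tuple (label, child, child, ...); a leaf is a plain string (the word).

def word_paths(tree, prefix=()):
    """Return [(word, path)], where path is the tuple of labels from the root to the word."""
    label, *children = tree
    path = prefix + (label,)
    result = []
    for child in children:
        if isinstance(child, str):
            # Reached a word: record the labels collected on the way down.
            result.append((child, path))
        else:
            # Internal node: recurse, carrying the path so far.
            result.extend(word_paths(child, path))
    return result


# Toy parse of "the cat sat".
tree = ("S",
        ("NP", ("DT", "the"), ("NN", "cat")),
        ("VP", ("VBD", "sat")))

paths = word_paths(tree)
for word, path in paths:
    print(word, "->", "/".join(path))
# the -> S/NP/DT
# cat -> S/NP/NN
# sat -> S/VP/VBD
```

Each word gets exactly one path, which mirrors the one-to-one correspondence between words and syntax embeddings described above.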

Requirements

python 3.7
pytorch 1.2.0
tensorflow 1.14.0
numpy 1.16.2
nltk 3.4

Usage

python main.py --dataset <dataset_name> --num_authors <num_of_authors>

The dataset parameter must be one of [blogs, CCAT, imdb].
The num_authors parameter depends on the dataset you choose: for blogs or CCAT it must be 10 or 50, while for imdb the maximum number of authors is 62.
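The parameter constraints above could be checked before training starts. The following is a hedged sketch using the standard `argparse` module; it is not taken from this repository's `main.py`. The function name `parse_args` is hypothetical, and treating any value from 1 to 62 as valid for imdb is an assumption, since the text only states the maximum.

```python
import argparse

# Constraints from the usage notes. Accepting 1..62 for imdb is an assumption:
# the notes only give the maximum number of authors.
FIXED_CHOICES = {"blogs": (10, 50), "CCAT": (10, 50)}
IMDB_MAX_AUTHORS = 62


def parse_args(argv=None):
    """Parse and validate --dataset and --num_authors (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True, choices=["blogs", "CCAT", "imdb"])
    parser.add_argument("--num_authors", required=True, type=int)
    args = parser.parse_args(argv)

    if args.dataset == "imdb":
        if not 1 <= args.num_authors <= IMDB_MAX_AUTHORS:
            parser.error(f"imdb supports at most {IMDB_MAX_AUTHORS} authors")
    elif args.num_authors not in FIXED_CHOICES[args.dataset]:
        parser.error(f"{args.dataset} requires --num_authors 10 or 50")
    return args


if __name__ == "__main__":
    args = parse_args()
    print(args.dataset, args.num_authors)
```

An invalid combination, such as `--dataset blogs --num_authors 62`, exits with a usage error instead of failing later during data loading.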

This paper proposes a syntax feature encoding method. It was accepted at EMNLP 2018.

Citation

@inproceedings{zhang2018syntax,
  title={Syntax encoding with application in authorship attribution},
  author={Zhang, Richong and Hu, Zhiyuan and Guo, Hongyu and Mao, Yongyi},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  pages={2742--2753},
  year={2018}
}

License

MIT
