This repo contains the code for the paper Location Prediction for Tweets (https://www.frontiersin.org/articles/10.3389/fdata.2019.00005/full).
Please follow the instructions from the WNUT 2016 Geo Shared Task to acquire the data.
WNUT 2016 Geo-Shared-Task: https://noisy-text.github.io/2016/geo-shared-task.html
Before running the code, please modify the path setting in the src/config.py file.
root_dir = "PATH_TO_THE_GIT_REPO"
Python 3.6 + TensorFlow 1.12.0
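If needed, the dependency can typically be installed with pip inside a Python 3.6 environment (this command is a suggestion, not part of the repo; use tensorflow-gpu instead for GPU support):

$ pip install tensorflow==1.12.0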
Take the downloaded tweets and the label file, extract the needed data fields, tokenize the tweets, and aggregate the ground truth.
$ python preprocessing.py --tweet_path ../data/train_tweet.json --label_path ../data/train.label.json --output_path ../data/train.parquet
$ python preprocessing.py --tweet_path ../data/valid_tweet.json --label_path ../data/valid.label.json --output_path ../data/valid.parquet
$ python preprocessing.py --tweet_path ../data/test_tweet.json --label_path ../data/test.label.json --output_path ../data/test.parquet
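For reference, here is a minimal sketch of what this step could look like. The newline-delimited JSON layout and the field names (id, text, hashed_tweet_id, city, country) are assumptions for illustration; the actual schema comes from the WNUT 2016 data release and the repo's own preprocessing.py.

```python
# Minimal preprocessing sketch -- NOT the repo's actual implementation.
# Assumed (hypothetical) schema: tweets are newline-delimited Twitter JSON
# with "id" and "text"; labels are newline-delimited JSON with
# "hashed_tweet_id", "city", and "country".
import argparse
import json

import pandas as pd  # to_parquet requires pyarrow or fastparquet


def preprocess(tweet_path, label_path, output_path):
    # Aggregate the ground truth: map tweet id -> label record.
    labels = {}
    with open(label_path) as f:
        for line in f:
            record = json.loads(line)
            labels[record["hashed_tweet_id"]] = record

    # Extract the needed fields and tokenize (naive whitespace split here,
    # as a stand-in for the repo's real tokenization).
    rows = []
    with open(tweet_path) as f:
        for line in f:
            tweet = json.loads(line)
            label = labels.get(tweet["id"])
            if label is None:  # skip tweets without ground truth
                continue
            rows.append({
                "text": " ".join(tweet["text"].split()),
                "city": label["city"],
                "country": label["country"],
            })

    pd.DataFrame(rows).to_parquet(output_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--tweet_path", required=True)
    parser.add_argument("--label_path", required=True)
    parser.add_argument("--output_path", required=True)
    args = parser.parse_args()
    preprocess(args.tweet_path, args.label_path, args.output_path)
```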
This script trains the model from scratch. The default hyperparameters are the ones used in the paper, so you can start training with the following command.
$ python train.py
If you would like to change the hyperparameters, please refer to the following argument settings.
$ python train.py [-h] [--max_len MAX_LEN] [--max_char_len MAX_CHAR_LEN]
[--minfreq MINFREQ] [--emb_dim EMB_DIM]
[--hidden_dim HIDDEN_DIM] [--num_head NUM_HEAD]
[--layer_num LAYER_NUM] [--char_dim CHAR_DIM]
[--char_hidden_dim CHAR_HIDDEN_DIM]
[--char_num_head CHAR_NUM_HEAD]
[--char_layer_num CHAR_LAYER_NUM] [--filter FILTER_LIST]
[--dropout_rate DROPOUT_RATE] [--learning_rate LEARNING_RATE]
[--batch_size BATCH_SIZE] [--epochs EPOCHS] [--reg REG]
[--reg_weight REG_WEIGHT] [--data_redo DATA_REDO]
[--note NOTE] [--gpu GPU] [--train_data TRAIN_DATA]
Argument | Value | Information |
---|---|---|
-h, --help | | show this help message and exit
--max_len | INT (Default: 30) | maximum token sequence length
--max_char_len | INT (Default: 140) | maximum character sequence length
--minfreq | INT (Default: 10) | minimum frequency for the word and character vocabularies
--emb_dim | INT (Default: 200) | word embedding dimension |
--hidden_dim | INT (Default: 200) | hidden dimension |
--num_head | INT (Default: 10) | number of attention heads in the transformer
--layer_num | INT (Default: 2) | number of layers in the transformer
--char_dim | INT (Default: 100) | character embedding dimension |
--char_hidden_dim | INT (Default: 100) | character hidden dimension |
--char_num_head | INT (Default: 8) | number of attention heads in the character transformer
--char_layer_num | INT (Default: 2) | number of layers in the character transformer
--filter | STRING (Default: 3:64-4:64-5:64-6:64-7:64) | filter configuration of the character CNN, e.g., 3:64-4:64
--dropout_rate | FLOAT (Default: 0.3) | dropout rate across the model |
--learning_rate | FLOAT (Default: 1e-4) | learning rate |
--batch_size | INT (Default: 128) | batch size |
--epochs | INT (Default: 30) | number of epochs for training |
--reg | BOOL (Default: False) | whether to use the regularizer
--reg_weight | FLOAT (Default: 1e-4) | weighting for regularizer |
--data_redo | BOOL (Default: False) | force re-processing of the data
--note | STRING (Default: "") | note for the model name |
--gpu | STRING (Default: "0") | GPU device to use
--train_data | STRING (Default: "train") | filename of the training data |
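For example, to train with a larger hidden size on a different GPU (illustrative values, not the paper's settings):

$ python train.py --hidden_dim 400 --num_head 8 --epochs 50 --note larger_model --gpu 1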
To evaluate a trained model on the test set, use test.py.
$ python test.py [-h] --model_folder MODEL_FOLDER --target_epoch TARGET_EPOCH
[--gpu GPU]
Argument | Value | Information |
---|---|---|
-h, --help | | show this help message and exit
--model_folder | STRING | the path to the target model's folder |
--target_epoch | STRING | the epoch of the checkpoint to test
--gpu | STRING (Default: "0") | GPU device to use
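For example, to evaluate the released model described below (assuming it has been extracted to ../model/release):

$ python test.py --model_folder ../model/release --target_epoch 1 --gpu 0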
To infer the location (city & country) using a trained model, you first need to convert the input file into the required format, where each line represents one sample. The output will be stored as a CSV file with three columns: text, city, and country.
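For instance, the input text file simply contains one raw tweet per line (these two lines are made-up examples):

just landed, time to explore the city!
anyone up for coffee near the station?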
$ python inference.py [-h] --model_folder MODEL_FOLDER --target_epoch
TARGET_EPOCH --text_file TEXT_FILE --output_file
OUTPUT_FILE [--gpu GPU]
Argument | Value | Information |
---|---|---|
-h, --help | | show this help message and exit
--model_folder | STRING | the path to the target model's folder |
--target_epoch | STRING | the epoch of the checkpoint to load
--text_file | STRING | the path to the input text file
--output_file | STRING | the path to the output CSV file
--gpu | STRING (Default: "0") | GPU device to use
Please find the released trained model here:
https://drive.google.com/file/d/1M8AxKuVmwRM3jEVk3iYH0BEmKOKsr_zP/view?usp=sharing
The performance of the released model is as follows.
Model | City Acc | Country Acc
---|---|---
Released Model | 0.2163 | 0.6110
Uncompress the .tar file after downloading the model.
$ tar xvf release.tar
You can then run the inference script with the following command.
$ python inference.py --model_folder ../model/release --target_epoch 1 --text_file ../sample_text/sample.txt --output_file ../sample_text/output.csv --gpu 6
Here, my folders are organized in the following structure.
root_dir (LocationPrediction)
| - src
| - model
| | - release
| - sample_text
If you have any questions, please send me an email at [email protected]
Please cite the following paper if you use this repo for testing, auto geo-labeling, or comparison.
@ARTICLE{10.3389/fdata.2019.00005,
AUTHOR={Huang, Chieh-Yang and Tong, Hanghang and He, Jingrui and Maciejewski, Ross},
TITLE={Location Prediction for Tweets},
JOURNAL={Frontiers in Big Data},
VOLUME={2},
PAGES={5},
YEAR={2019},
URL={https://www.frontiersin.org/article/10.3389/fdata.2019.00005},
DOI={10.3389/fdata.2019.00005},
ISSN={2624-909X},
ABSTRACT={Geographic information provides an important insight into many data mining and social media systems. However, users are reluctant to provide such information due to various concerns, such as inconvenience, privacy, etc. In this paper, we aim to develop a deep learning based solution to predict geographic information for tweets. The current approaches bear two major limitations, including (a) hard to model the long term information and (b) hard to explain to the end users what the model learns. To address these issues, our proposed model embraces three key ideas. First, we introduce a multi-head self-attention model for text representation. Second, to further improve the result on informal language, we treat subword as a feature in our model. Lastly, the model is trained jointly with the city and country to incorporate the information coming from different labels. The experiment performed on W-NUT 2016 Geo-tagging shared task shows our proposed model is competitive with the state-of-the-art systems when using accuracy measurement, and in the meanwhile, leading to a better distance measure over the existing approaches.}
}