Data from the Stack Exchange Data Dump
- Feature Extraction
- Learning-to-Rank (a sketch of how these features could be assembled for ranking follows the list below)
- User Features
  - user_age [1 numerical feature]
  - user_badge [categorical features]
  - user_reputation [1 numerical feature]
  - user_views [1 numerical feature]
  - user_votes [1 numerical feature]
- User-User Features
  - user-user interactions [1 numerical feature]
- Post Features
  - comment_cnt [1 numerical feature]
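As a concrete illustration of how the features above might feed the learning-to-rank stage, here is a minimal sketch that collects the numerical features of one answer into a flat vector. The JSON schema, the `train.json` file name, and the function name are assumptions made for illustration, not the repository's actual interface.

```python
# A minimal sketch, assuming each answer is stored as a JSON object with
# one key per extracted feature. Not the repository's actual schema.
import json

NUMERICAL_FEATURES = [
    "user_age", "user_reputation", "user_views", "user_votes",
    "user_user_interactions", "comment_cnt",
]

def build_feature_vector(answer):
    """Flatten one answer's numerical features into a list for the ranker.

    user_badge is categorical and would need an encoding (e.g., one-hot)
    before it could be appended to the numerical features.
    """
    return [float(answer.get(name, 0.0)) for name in NUMERICAL_FEATURES]

if __name__ == "__main__":
    # Hypothetical file layout: one JSON list of answer objects per split.
    with open("data/StackOverflow/train.json") as f:
        answers = json.load(f)
    vectors = [build_feature_vector(a) for a in answers]
    print(f"built {len(vectors)} feature vectors of length {len(NUMERICAL_FEATURES)}")
```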
- Download the data from the Stack Exchange Data Dump
- Unzip the downloaded archives into the raw/ directory
```
cd src/preprocess
./preprocess.py [name of dataset]
```
- Convert the format from XML to JSON
- Convert HTML-like contents into plaintext
- Link each question to the corresponding answers
- See data/[name of dataset]/question_answer_mapping.json after preprocessing
- Split the whole set into training and testing sets
- See data/[name of dataset]/train.* and data/[name of dataset]/test.*
- Questions without a best answer (the ground truth), as well as those with fewer than two answers, are removed (a sketch of these preprocessing steps follows this list)
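To make the steps above concrete, the sketch below parses Posts.xml, strips the HTML-like bodies to plaintext, and links each question to its answers. It assumes the standard Stack Exchange dump schema (row elements with Id, PostTypeId, ParentId, and Body attributes); it is an illustration, not the repository's actual preprocess.py, the HTML stripping is deliberately crude, and the questions.json file name is hypothetical.

```python
# A minimal sketch of the preprocessing steps, assuming the standard
# Stack Exchange dump schema. Not the repository's actual preprocess.py.
import html
import json
import re
import sys
import xml.etree.ElementTree as ET

def strip_html(body):
    """Crudely convert HTML-like content to plaintext: drop tags, unescape entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", body)).strip()

def preprocess(dataset):
    questions = {}      # question id -> plaintext body
    qa_mapping = {}     # question id -> list of answer ids
    for _, row in ET.iterparse(f"raw/{dataset}/Posts.xml"):
        if row.tag != "row":
            continue
        post_id = row.get("Id")
        if row.get("PostTypeId") == "1":        # 1 = question
            questions[post_id] = strip_html(row.get("Body", ""))
            qa_mapping.setdefault(post_id, [])
        elif row.get("PostTypeId") == "2":      # 2 = answer
            qa_mapping.setdefault(row.get("ParentId"), []).append(post_id)
        row.clear()  # free processed rows to keep memory bounded on large dumps
    with open(f"data/{dataset}/question_answer_mapping.json", "w") as f:
        json.dump(qa_mapping, f)
    with open(f"data/{dataset}/questions.json", "w") as f:  # hypothetical name
        json.dump(questions, f)

if __name__ == "__main__":
    preprocess(sys.argv[1])  # e.g., ./preprocess.py StackOverflow
```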
Extract the user_age feature:
```
cd src/feature_extraction
./user_age.py [name of dataset]
```
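For reference, here is a sketch of what such a feature script might do. Whether user_age means the dump's self-reported Age attribute or the age of the account is an assumption; this sketch derives it from CreationDate in Users.xml, and the output file name is hypothetical. It is not the actual contents of src/feature_extraction/user_age.py.

```python
# A minimal sketch in the spirit of user_age.py, deriving one numerical
# feature per user from the account's creation date. Assumptions: the
# CreationDate attribute exists (standard in Stack Exchange dumps) and
# the output path/name, which are not specified by the repository.
import datetime
import json
import sys
import xml.etree.ElementTree as ET

def extract_user_age(dataset):
    now = datetime.datetime.utcnow()
    ages = {}
    for _, row in ET.iterparse(f"raw/{dataset}/Users.xml"):
        if row.tag != "row":
            continue
        created = row.get("CreationDate")
        if created:
            # CreationDate looks like "2010-07-28T16:38:27.683"
            dt = datetime.datetime.fromisoformat(created)
            ages[row.get("Id")] = (now - dt).days  # account age in days
        row.clear()
    with open(f"data/{dataset}/user_age.json", "w") as f:  # hypothetical name
        json.dump(ages, f)

if __name__ == "__main__":
    extract_user_age(sys.argv[1])  # e.g., ./user_age.py StackOverflow
```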
The following descriptions briefly explain the purpose of each directory.
- raw/: The directory for the raw data (e.g., Posts.xml, Users.xml)
- raw/[name of dataset]/: The corresponding raw data for a given dataset (e.g., StackOverflow)
- Note that the file names should not be modified.
- data/: The directory for the preprocessed data
- data/[name of dataset]/: The corresponding preprocessed data for a given dataset (e.g., StackOverflow)
- src/: The directory for all source code
- src/preprocess/: Code for preprocessing the raw data
- model/: The directory for models trained on a large English corpus