data_exploration_cleaning.py

Explores the original train.csv and test.csv.

Performs naive (basic) data cleaning.

exploration&cleaning.ipynb is the IPython (Jupyter) notebook version of this script.
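As a rough illustration of what "naive data cleaning" can mean, here is a minimal pandas sketch. It is not taken from data_exploration_cleaning.py: the column names, the inline sample data, and the helper naive_clean are all hypothetical, and the real script may apply different rules.

```python
import io

import pandas as pd

# Hypothetical stand-in for the original train.csv; the real columns are not
# described in this README, so these names are assumptions.
RAW_CSV = """id,age,city,label
1,34,Paris,1
2,,Berlin,0
2,,Berlin,0
3,29,,1
4,41,Rome,
"""


def naive_clean(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Drop duplicate and unlabeled rows, then fill the remaining gaps."""
    df = df.drop_duplicates()
    # Rows without a target value cannot be used for training.
    df = df.dropna(subset=[target]).copy()
    for col in df.columns:
        if col == target:
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric gaps: fill with the column median.
            df[col] = df[col].fillna(df[col].median())
        else:
            # Non-numeric gaps: fill with a placeholder category.
            df[col] = df[col].fillna("unknown")
    return df.reset_index(drop=True)


raw = pd.read_csv(io.StringIO(RAW_CSV))
print(raw.isna().sum())  # quick exploration: missing values per column
cleaned = naive_clean(raw, target="label")
print(cleaned)
```

Median and placeholder filling are only one simple choice; they keep every labeled row at the cost of some bias in the filled columns.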

preprocess.py

Performs feature engineering with PySpark and generates train_pyspark.csv and test_pyspark.csv.

Selects models, applies PCA and similar preprocessing techniques, tunes hyperparameters, compares performance, and writes output.csv.
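The model selection step described above could look roughly like the following scikit-learn sketch. It is an assumption, not the project's code: the data is synthetic (standing in for train_pyspark.csv), and the choice of LogisticRegression, the parameter grid, and the accuracy metric are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the engineered features (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale, reduce dimensionality with PCA, then classify.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: number of PCA components and regularization strength.
param_grid = {
    "pca__n_components": [5, 10, 15],
    "clf__C": [0.1, 1.0, 10.0],
}

# 5-fold cross-validated grid search over all 9 combinations.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("cv accuracy:", round(search.best_score_, 3))
print("held-out accuracy:", round(search.score(X_test, y_test), 3))

# Predictions from the best pipeline; these are what would be written out.
predictions = search.predict(X_test)
```

Putting PCA inside the pipeline means it is refit on each training fold, so the cross-validation scores are not contaminated by the validation data. The per-combination scores in search.cv_results_ are the kind of information a tuning log would record.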

temp_results is a log of hyperparameter tuning.

Attention: the test.csv in the root directory is our output file, NOT the test.csv in the data folder!