To reproduce our project results, follow these steps:
Set up a cloud environment with more than 16 GB of memory and ample CPU capacity, as our dataset is 9 GB (3 million LinkedIn profiles). Launch Jupyter Notebook.
Upload the notebook that contains all our project code, named 242Proj_final.
Download the dataset from the following link and upload it to the same directory as the notebook uploaded in the previous step. https://drive.google.com/file/d/1fXcY3pQc32sqdaOPft9EeKFMpqbYNTbN/view?usp=sharing
Run the Jupyter notebook cells in sequence. Every code block in the notebook has comments that describe its purpose and logic.
All project results are printed or displayed in the notebook.
Please do not run the last three code blocks, as they take significant computation time. They are included only to show that our problem can also be solved with alternative approaches.
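Before running the notebook, it may help to confirm that the environment matches the steps above. The following is a minimal sketch, not part of our project code: it checks that the machine reports more than 16 GB of memory and that the notebook and dataset sit in the same directory. The function names, the `.ipynb` extension, and the dataset file name `dataset.csv` are assumptions for illustration; substitute the actual name of the downloaded file. The memory lookup relies on `os.sysconf`, which is available on Linux and most Unix systems but not on Windows, in which case the check reports that memory could not be determined.

```python
import os
from pathlib import Path

# The steps above call for more than 16 GB of memory.
REQUIRED_BYTES = 16 * 1024**3


def total_memory_bytes():
    """Return total physical memory in bytes, or None if it cannot be determined."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing (e.g. Windows) or the names are unsupported.
        return None


def preflight(directory, notebook_name, dataset_name, mem_bytes=None):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []

    if mem_bytes is None:
        mem_bytes = total_memory_bytes()
    if mem_bytes is None:
        problems.append("could not determine total memory")
    elif mem_bytes <= REQUIRED_BYTES:
        problems.append("memory is not larger than 16 GB")

    # The notebook and the dataset must be in the same directory.
    directory = Path(directory)
    for name in (notebook_name, dataset_name):
        if not (directory / name).is_file():
            problems.append(f"missing file: {name}")

    return problems


if __name__ == "__main__":
    # Hypothetical file names: adjust to match the uploaded files.
    issues = preflight(".", "242Proj_final.ipynb", "dataset.csv")
    print("OK" if not issues else "\n".join(issues))
```

Running this from the directory that holds the notebook prints OK when all checks pass, or one line per problem otherwise.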