A project using PySpark/Databricks to replicate and extend a research paper on an application of machine learning.
In this project, we replicated and extended the research paper "Categorizing the Content of GitHub README Files" using PySpark/Databricks. The original code by the paper's authors is available here.
- data_dumps - data dumps from the SQLite database
- manual_work - files used for manual work
- new_input_readmes - new README files used as input for the ML model
- notebooks - contains the code from the research paper adapted for Databricks, along with additional code (see the sketch after this list)
- other - contains the research paper, instructions, and the report template
- presentations - contains presentations and the report
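
For orientation, below is a minimal sketch of the kind of Spark ML text-classification pipeline used in the notebooks. It is not the authors' exact pipeline; the input path, column names, and choice of classifier are assumptions for illustration only.

```python
# Minimal, illustrative sketch of classifying README section text into
# content categories with Spark ML. Paths, column names, and the use of
# logistic regression are assumptions, not the paper's exact setup.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, HashingTF, IDF, StringIndexer
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("readme-categorization").getOrCreate()

# Hypothetical input: one row per README section, with its text and a manual label.
sections = spark.read.csv("new_input_readmes/sections.csv", header=True)  # columns: text, category

pipeline = Pipeline(stages=[
    RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+"),   # split on non-word chars
    StopWordsRemover(inputCol="tokens", outputCol="filtered"),             # drop common stop words
    HashingTF(inputCol="filtered", outputCol="tf"),                        # term-frequency vectors
    IDF(inputCol="tf", outputCol="features"),                              # reweight by inverse document frequency
    StringIndexer(inputCol="category", outputCol="label"),                 # map category names to numeric labels
    LogisticRegression(maxIter=20),                                        # simple baseline classifier
])

model = pipeline.fit(sections)
predictions = model.transform(sections)
predictions.select("text", "category", "prediction").show(5, truncate=60)
```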