Update README.md #1

Open
wants to merge 1 commit into
base: main
36 changes: 36 additions & 0 deletions README.md
@@ -25,3 +25,39 @@ to run the spark job locally.
Confirm that you see "Hello Spark" in the output.

If all the tests passed locally and "Hello Spark" was in the output, then your environment is set up and ready for a TW Data Engineering coding interview.


### Wordcount
* Sample data is available in the src/test/wordcount/data directory

This application counts the occurrences of each word in a text file. By default it reads from the words.txt file and writes to the target folder. To use different files, pass the input path and the output directory to the spark-submit command below.

```
spark-submit --master local[2] --py-files thoughtworks.zip job_runner.py WordCount $(INPUT_LOCATION) $(OUTPUT_LOCATION)
```

Currently this application is a skeleton with ignored tests. Please unignore the tests and build the wordcount application.
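
As a starting point for un-ignoring the tests, the counting logic can be sketched in plain Python. This is a minimal sketch and not the Spark implementation the exercise asks for: the function name `count_words`, the lowercasing, and the tokenization rule (a word is a run of letters, digits, or apostrophes) are all assumptions, so check them against the expectations in the test suite. A reference like this can help produce expected counts for small inputs.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, digits, or apostrophes.
# The real tests may define word boundaries differently.
_WORD = re.compile(r"[A-Za-z0-9']+")


def count_words(text):
    """Return a Counter mapping each lowercased word to its occurrence count."""
    return Counter(match.group(0).lower() for match in _WORD.finditer(text))


if __name__ == "__main__":
    sample = "Hello Spark, hello world"
    # Print one "word,count" line per distinct word, sorted alphabetically.
    for word, count in sorted(count_words(sample).items()):
        print(f"{word},{count}")
```

In the Spark job, the same idea would be expressed as a split of each line into words followed by a group-and-count, rather than an in-memory `Counter`.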

### Citibike multi-step pipeline
* Sample data is available in the src/test/citibike/data directory

This application takes bike trip information and calculates the "as the crow flies" distance traveled for each trip.
The application runs in two steps:
* First, the data is ingested from a source and transformed to Parquet format.
* Then, the application reads the Parquet files and applies the appropriate transformations.


* To ingest data from an external source into the data lake:
```
spark-submit --master local[2] --py-files thoughtworks.zip job_runner.py DailyDriver $(INPUT_LOCATION) $(OUTPUT_LOCATION)
```

* To transform Citibike data:
```
spark-submit --master local[2] --py-files thoughtworks.zip job_runner.py CitiBikeTransformer $(INPUT_LOCATION) $(OUTPUT_LOCATION)
```

Currently this application is a skeleton with ignored tests. Please unignore the tests and build the Citibike transformation application.

#### Tips
- For the distance calculation, consider using the [**Haversine formula**](https://en.wikipedia.org/wiki/Haversine_formula) as an option.
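
The Haversine formula mentioned in the tip can be sketched in plain Python as follows. This is an illustrative reference only: the function name `haversine_km`, the choice of kilometres as the unit, and the mean Earth radius of 6371 km are assumptions, and the tests in this repository may expect a different unit or rounding. In the Spark job the same arithmetic would be applied to the start and end coordinate columns of each trip.

```python
import math

# Assumption: mean Earth radius in kilometres. Use a different constant
# if the tests expect miles or metres.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)

    # Haversine of the central angle between the two points.
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)

    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


if __name__ == "__main__":
    # One degree of latitude along a meridian.
    print(haversine_km(0.0, 0.0, 1.0, 0.0))
```

Writing the formula as a small pure function first makes it easy to unit test with known points before porting it to Spark column expressions.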