PySpark examples showing how to load data from different file formats into Spark DataFrames.
Python 3.x should be available on the OS; create a virtual environment in the $HOME directory ($HOME/venv3x).
Ensure JAVA_HOME is set in the environment.
Setup
$:~/pyspark_example$ source ~/venv3x/bin/activate
$:~/pyspark_example$ pip install -r requirements.txt
Add the src folder to PYTHONPATH
$:~/pyspark_example$ export PYTHONPATH=$PYTHONPATH:$PWD/src
Run a module
$:~/pyspark_example$ python csv_2_dataframe.py
Contributing
- Fork it (https://github.com/afzals2000/pyspark_example)
- Create your feature branch (git checkout -b feature/fooBar)
- Commit your changes (git commit -am 'Add some fooBar')
- Push to the branch (git push origin feature/fooBar)
- Create a new Pull Request