This section includes an initial notebook that will teach you how to use Apache Spark (pyspark) to pre-process the ECommerce Data dataset from Kaggle. Thanks to Michael Kechinov for providing the great dataset.
The `-sm.csv` file is shipped with this repository. If you want to just dip your toes into your data journey, this smaller dataset contains only 64 transactions, but it can still be used to learn how to use Delta Lake and, more importantly, how to work with streaming Delta Lake tables.
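For a first taste, here is a minimal sketch that reads the small CSV and writes it back out as a Delta table that a streaming read can then consume. It assumes the `pyspark` and `delta-spark` packages are installed; the file and output paths (`data/ecommerce-sm.csv`, `delta/ecommerce-sm`) are placeholders, not the actual names used in this repository.

```python
# Minimal sketch: CSV -> Delta table -> streaming read.
# Assumes `pip install pyspark delta-spark`; paths are placeholders.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("ecommerce-delta-intro")
    # Register the Delta Lake extensions so "delta" is a known format.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Only 64 rows, so schema inference costs next to nothing here.
df = spark.read.csv("data/ecommerce-sm.csv", header=True, inferSchema=True)

# Persist as a Delta table; this is what the streaming examples build on.
df.write.format("delta").mode("overwrite").save("delta/ecommerce-sm")

# Treat the Delta table as a streaming source and echo new rows to stdout.
query = (
    spark.readStream.format("delta").load("delta/ecommerce-sm")
    .writeStream.format("console").outputMode("append").start()
)
query.awaitTermination(timeout=30)  # let the first micro-batch print
```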
Note: The two full datasets (Oct, Nov) weigh in at around 14 GB uncompressed. Using the complete dataset gives you exposure to working with data that doesn't fit into memory (depending on your setup):
- `2019-Oct.csv` contains 42,448,764 records
- `2019-Nov.csv` contains 67,501,979 records
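When you move up to the full files, a couple of habits help Spark stay within memory. The sketch below (file paths and the `data/` layout are assumptions; the column list follows the Kaggle dataset description) declares the schema up front so Spark doesn't need an extra pass over ~14 GB of CSV just to infer types, and leans on Spark's partitioned, lazy evaluation so the data never has to fit into memory in one piece.

```python
# Sketch for the full-size files; paths and directory layout are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, LongType, DoubleType,
)

spark = SparkSession.builder.appName("ecommerce-preprocess").getOrCreate()

# An explicit schema avoids a full inference pass over ~14 GB of CSV.
# Column names follow the Kaggle dataset description.
schema = StructType([
    StructField("event_time", StringType()),  # kept as string; parse later
    StructField("event_type", StringType()),
    StructField("product_id", LongType()),
    StructField("category_id", LongType()),
    StructField("category_code", StringType()),
    StructField("brand", StringType()),
    StructField("price", DoubleType()),
    StructField("user_id", LongType()),
    StructField("user_session", StringType()),
])

# Spark processes the files partition by partition, so the dataset never
# has to fit into memory all at once.
df = spark.read.csv("data/2019-*.csv", header=True, schema=schema)
df.groupBy("event_type").count().show()
```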
Note: You may see the warning

`WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory Scaling row group sizes to 95.00% for 8 writers`

while working with the data. This is a safeguard that keeps the Spark application from crashing; the driver memory is 1 GB by default. You can increase the amount of memory allocated to the Jupyter process to get rid of these warnings, as sketched below.
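One way to do that, assuming a fresh kernel: `spark.driver.memory` only takes effect if it is set before the JVM is launched, i.e. before the first SparkSession exists in the notebook. The `4g` value is just an example, not a recommendation.

```python
# Raise the driver memory before the first SparkSession is created.
# If a session already exists in this kernel, restart the kernel first.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ecommerce-preprocess")
    .config("spark.driver.memory", "4g")  # example value
    .getOrCreate()
)

print(spark.conf.get("spark.driver.memory"))

# Alternative: set the option via the environment before starting Jupyter:
#   PYSPARK_SUBMIT_ARGS="--driver-memory 4g pyspark-shell" jupyter lab
```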
Note: This is a continuous work in progress. Please open an Issue if you find things that seem out of place. Just be nice.