This project leverages Apache Spark to analyze data from the Ethereum blockchain spanning August 2015 to January 2019. It focuses on mining activities and transaction patterns, providing insights into the operational dynamics of the Ethereum network.
- Data Management: Load and manage large datasets using PySpark.
- Aggregation Analysis: Identify top miners by the total block size mined.
- Time Series Analysis: Convert UNIX timestamps and analyze blockchain activities over time.
- Data Integration: Merge transaction and block data to enhance analytical depth.
- Focused Monthly Analysis: Examine specific months for detailed insights into blockchain activity and transaction fees.
- PySpark DataFrames: Utilized for robust data processing and handling.
- Aggregation and Sorting: Applied to compute key mining statistics.
- Date Transformation: Used to facilitate easier analysis of temporal data patterns.
- Inner Joins: Employed to merge datasets for a holistic view.
- Data Visualization: Implemented using Matplotlib to create histograms showcasing data trends.
- Miner Centralization: A small number of miners were found to dominate block production.
- Fluctuating Activity Levels: Detailed analysis of September and October 2015 revealed significant variations in daily blockchain activities.
- Economic Insights: October 2015 data highlighted the cost dynamics of transactions, providing a snapshot of economic factors influencing blockchain operations.
- Strategic Insights: Offers valuable information for stakeholders in the Ethereum ecosystem regarding mining and transaction strategies.
- Technical Contributions: Demonstrates the effectiveness of PySpark in blockchain data analysis, setting a benchmark for similar analytics projects.
- Operational Recommendations: Insights into transaction fees and mining activities can guide adjustments in blockchain design to enhance fairness and decentralization.
The dataset comprises two main CSV files:
- `blocks.csv`: Contains data about the blocks on the Ethereum blockchain.
- `transactions.csv`: Contains details of transactions within those blocks.
- `src/`: Contains all source code used for the analysis.
- `data/`: Instructions on how to access the blockchain data (actual data not included due to size and privacy concerns).
- `docs/`: Additional documentation and images.
- `README.md`: This file, providing an overview of the project and setup instructions.
- Objective: Load `blocks.csv` and `transactions.csv` using PySpark.
- Solution: Utilized PySpark's `read.csv` method with headers and inferred schema, as sketched below.
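A minimal sketch of this loading step (the file paths, application name, and local `data/` layout are illustrative assumptions, not taken from the project):

```python
from pyspark.sql import SparkSession

# Illustrative session setup; the application name is an assumption.
spark = SparkSession.builder.appName("ethereum-analysis").getOrCreate()

# Read both files with a header row and let Spark infer the column types.
blocks = spark.read.csv("data/blocks.csv", header=True, inferSchema=True)
transactions = spark.read.csv("data/transactions.csv", header=True, inferSchema=True)

blocks.printSchema()
transactions.printSchema()
```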
- Objective: Determine the top 10 miners by the total size of blocks mined.
- Solution: Performed aggregation and sorting in PySpark to identify the top miners (see the sketch below).
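A hedged sketch of the aggregation, assuming the loaded blocks DataFrame has `miner` and `size` columns (these column names follow common Ethereum block exports and are assumptions here):

```python
from pyspark.sql import functions as F

# Sum block sizes per miner and keep the ten largest totals.
top_miners = (
    blocks.groupBy("miner")
          .agg(F.sum("size").alias("total_block_size"))
          .orderBy(F.col("total_block_size").desc())
          .limit(10)
)
top_miners.show(truncate=False)
```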
- Objective: Convert UNIX timestamps in `blocks.csv` to a readable date format.
- Solution: Used the PySpark functions `from_unixtime` and `to_date` to format timestamps, as sketched below.
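A sketch of the conversion, assuming the block timestamp column is named `timestamp` (an assumption):

```python
from pyspark.sql import functions as F

# UNIX epoch seconds -> timestamp string -> calendar date.
blocks_dated = blocks.withColumn(
    "block_date", F.to_date(F.from_unixtime(F.col("timestamp")))
)
blocks_dated.select("timestamp", "block_date").show(5)
```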
- Objective: Join `transactions.csv` and `blocks.csv` by their hash fields.
- Solution: Handled field name ambiguities by specifying dataset origins in the join operation, as in the sketch below.
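A sketch of the join, reusing `blocks_dated` from the previous step and assuming the transactions reference their parent block through a `block_hash` column matching the block's `hash` column (both column names are assumptions). Aliasing each DataFrame resolves the naming ambiguity:

```python
from pyspark.sql import functions as F

# Alias each side so identically named columns can be referenced unambiguously.
joined = transactions.alias("tx").join(
    blocks_dated.alias("blk"),
    F.col("tx.block_hash") == F.col("blk.hash"),
    "inner",
)
print(joined.count())
```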
- Objective: Analyze block production and unique senders for September 2015.
- Solution: Filtered and aggregated data to produce histograms of daily activities, as sketched below.
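A sketch of the monthly filter and daily aggregation for block production, reusing the `block_date` column added earlier; the same pattern with `countDistinct` over a sender address column (assumed to exist on the joined transactions) gives unique senders per day:

```python
import matplotlib.pyplot as plt
from pyspark.sql import functions as F

# Daily block counts for September 2015.
sept_blocks = (
    blocks_dated
    .filter((F.col("block_date") >= "2015-09-01") & (F.col("block_date") <= "2015-09-30"))
    .groupBy("block_date")
    .agg(F.count("*").alias("blocks_per_day"))
    .orderBy("block_date")
    .toPandas()
)

# Bar chart of daily block production.
plt.bar(sept_blocks["block_date"].astype(str), sept_blocks["blocks_per_day"])
plt.xticks(rotation=90)
plt.xlabel("Date")
plt.ylabel("Blocks mined")
plt.title("Daily block production, September 2015")
plt.tight_layout()
plt.show()
```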
- Objective: Calculate total transaction fees for October 2015.
- Solution: Used the `gas` and `gas_price` fields to compute fees and visualized the results with histograms (see the sketch below).
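A sketch of the fee computation, assuming `gas` and `gas_price` columns on the joined transactions (fee per transaction = gas used times gas price, in wei) and reusing `block_date` for the October 2015 filter; plotting the daily totals follows the same Matplotlib pattern as above:

```python
from pyspark.sql import functions as F

# Per-transaction fee in wei, summed per day for October 2015.
daily_fees = (
    joined
    .withColumn("fee_wei", F.col("gas") * F.col("gas_price"))
    .filter((F.col("block_date") >= "2015-10-01") & (F.col("block_date") <= "2015-10-31"))
    .groupBy("block_date")
    .agg(F.sum("fee_wei").alias("total_fees_wei"))
    .orderBy("block_date")
)
daily_fees.show(31, truncate=False)
```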
- Apache Spark: Main platform for data processing.
- Python: Used for scripting and additional data manipulation.
- Matplotlib: For creating visualizations of the analysis results.