IPL_-Sports_Magazine

Challenge #10: Analyse historical IPL data and provide insights on IPL 2024 for a Sports Magazine (Codebasics).
Project Presentation PDF: Linkedin Post.

Problem Statement

"Sports Basics" is a sports blog company that entered space recently. They wanted to get more traffic to their website by releasing a special edition magazine on IPL 2024. This magazine aims to provide interesting insights and facts for fans, analysts and teams based on the last 3 years' data. The chief editor Tony Sharma oversees this publication, and he believes in data analytics. He reached out to Peter Pandey, a journalist in his team who is a data-savvy cricket enthusiast.

Data Loading

MySQL

First I created a database called "ipl" using the following command

CREATE DATABASE IF NOT EXISTS "ipl";
I used Mysql Workbench Table Data Import Wizard to import

Similarly, I added all other dimensions and fact tables into the Mysql Workbench:

The table names have been changed slightly using Mysql Workbench using the following bunch of code:

USE ipl;

ALTER TABLE dim_match_summary
RENAME TO  dim_match;

ALTER TABLE fact_bating_summary
RENAME TO  fact_bating;

ALTER TABLE fact_bowling_summary
RENAME TO  fact_bowling;

Connecting Mysql Database with Power BI:

Data Exploration

Mysql

-- Data Exploration

	-- dim_match
		SELECT * FROM dim_match LIMIT 5; 
        DESC dim_match; 
		SELECT COUNT(match_id) as No_of_Matches,  
		COUNT(DISTINCT(team1)) as No_of_Teams, 
		COUNT(DISTINCT(RIGHT(matchDate,4))) as `Year` 
		FROM dim_match; 
        -- Therefore 10 teams participated in 206 matches
	
	-- Let's have a look into the margin column
			select DISTINCT(
			CASE 
				WHEN margin LIKE "%run%" THEN RIGHT(margin,4)
				WHEN margin LIKE "%wicket%" THEN (RIGHT(margin,7)) END) AS Margin
			FROM dim_match;
            -- that is We have 4 different in the form of singular and plural

	-- dim_players
		SELECT * FROM dim_players LIMIT 5; 
        SELECT COUNT(DISTINCT(name)) as Player_Name, 
		COUNT(DISTINCT(team)) as No_of_Teams 
		FROM dim_players; 
        -- Data validation: Total Teams = 10, Total Number of Players 292

	-- fact_bating
		SELECT * FROM fact_bating LIMIT 5; 
        SELECT COUNT(DISTINCT(match_id)) as No_of_Matches, 
		COUNT(DISTINCT(batsmanName)) as Batsman_Count 
		FROM fact_bating;
        -- Data Consistency: Total Matches played  = 206, Total Number of Batsman is 262
		
	-- fact_bowling
		SELECT * FROM fact_bowling LIMIT 5; -- 5 Rows of `fact_bowling` table
		
        SELECT COUNT(DISTINCT(match_id)) as No_of_Matches, 
		COUNT(DISTINCT(bowlerName)) as Bowler_Count 
		FROM fact_bowling; 
        -- AGAIN Corraboration of Total Matches played  = 206
        -- Total Number of Bowlers 202

For more detailed documentation Click Here

Data Cleaning

Mysql

There were some issues with "May 28-29, 2023" entry in the matchYear field while changing the datatype of matchDate column. Hence the following SQL command has been employed to manupulate the data a bit:
```
  set sql_safe_updates =0;
  UPDATE dim_match
  SET matchDate = 'May 29, 2023'
  WHERE matchDate = 'May 28-29, 2023';
  set sql_safe_updates =1;
```

Power BI

"matchDate" datatype has been changed from text into Date:

 = Table.TransformColumnTypes(ipl_dim_match,{{"matchDate", type date}})

"matchYear" column added:

= Table.AddColumn(#"Changed Type", "matchYear", each Date.Year([matchDate]), Int64.Type)

Primary Data Analysis:

Mysql (Validated by Power BI Dashboar in the background)

Top 10 batsmen based on past 3 years total runs scored

	-- Primary Analysis
	-- ** Caution  Edit-> Preferences-> SQL Editor -> read-timeout interval to be at least 60 seconds **

	USE ipl;  -- (ASSIGNING DEFAULT TABLE) MUST RUN CODE ELSE REST WILL NOT WORK
	SELECT  DENSE_RANK() OVER(ORDER BY SUM(runs) DESC) as Player_Rank, 
    	batsmanName as Player_Name, SUM(runs)as Total_Run 
	FROM fact_bating
	GROUP BY batsmanName 
	ORDER BY Total_Run DESC 
	LIMIT 10;

Top 10 batsmen based on past 3 years batting average. (min 60 balls faced in each season)

	-- [Bating Average is the total number of runs they have scored divided by the number of times they have been out]
    
    	WITH CTE1 AS(SELECT batsmanName, RIGHT(matchDate,4) AS Season 
	FROM fact_bating f
	INNER JOIN dim_match d
	ON f.match_id = d.match_id
	GROUP BY 1, 2
	HAVING SUM(balls)>60)

            
	SELECT DENSE_RANK() OVER(ORDER BY ROUND(SUM(runs)/SUM(CASE WHEN `out/not_out` = "out" THEN 1 ELSE 0 END),2) DESC) as Player_Rank, 
	batsmanName as Player_Name,
     	ROUND(SUM(runs)/COUNT(CASE WHEN `out/not_out` = "out" THEN 1 ELSE NULL END),2) AS Bating_Average
	FROM fact_bating F 
	INNER JOIN dim_match D   -- `dim_match` and `fact_bating` has been Inner Joined using `match_id`
	ON F.match_id = D.match_id 
	WHERE batsmanName IN( 
			SELECT batsmanName
			FROM CTE1
			GROUP BY 1
            		HAVING count(Season) = 3)  
	GROUP BY batsmanName 
	HAVING COUNT(DISTINCT(RIGHT(matchDate,4)))=3 
	ORDER BY Bating_Average DESC -- ordered in decsending order
	LIMIT 10;

Top 10 batsmen based on past 3 years strike rate (min 60 balls faced in each season)

	SELECT DENSE_RANK() OVER(ORDER BY AVG(SR) DESC)as Player_Rank, batsmanName as Player_Name,
	ROUND(avg(SR),2) as Avg_SR 
	FROM fact_bating F 
	INNER JOIN dim_match D
	ON F.match_id = D.match_id 
    	WHERE batsmanName IN( 
			SELECT batsmanName
			FROM fact_bating f
			INNER JOIN dim_match d
           		 ON f.match_id = d.match_id
			GROUP BY batsmanName , RIGHT(matchDate,4)
            		HAVING SUM(balls)>60)
	GROUP BY batsmanName
	ORDER BY Avg_SR DESC 
	LIMIT 10;

Top 5 batsmen based on past 3 years boundary % (min 60 balls faced in each season)

-- Boundary %  is the percentage of total run that comes from 4s and 6s (min 60 balls faced in each season)
    
    	WITH CTE1 AS(SELECT batsmanName, RIGHT(matchDate,4) AS Season 
	FROM fact_bating f
	INNER JOIN dim_match d
	ON f.match_id = d.match_id
	GROUP BY 1, 2
	HAVING SUM(balls)>60)
    
    
	SELECT batsmanName AS Player_Name,
	ROUND((100*(4*SUM(`4s`)+6*SUM(`6s`))/SUM(runs)),2) as 'Boundary%'
	FROM fact_bating AS f
    	WHERE batsmanName IN( 
			SELECT batsmanName
			FROM CTE1
			GROUP BY 1
            		HAVING count(Season) = 3 ) 
	GROUP BY batsmanName 
	ORDER BY `Boundary%` DESC
	LIMIT 5;

Top 10 bowlers based on past 3 years total wickets taken

	SELECT DENSE_RANK() OVER(ORDER BY SUM(wickets) DESC) AS Player_Rank,
    	bowlerName AS Name, 
	SUM(wickets) AS Total_Wickets 
	FROM fact_bowling F  
	INNER JOIN dim_match D
	ON F.match_id = D.match_id
	GROUP BY `Name` 
	ORDER BY Total_Wickets DESC 
	LIMIT 10;

Top 10 bowlers based on past 3 years bowling average. (min 60 balls bowled in each season)

	    
	WITH CTE2 AS(SELECT bowlerName, RIGHT(matchDate,4) AS Season -- CTE Filtered by Atleast 60 balls delivered
	FROM fact_bowling f
	INNER JOIN dim_match d
	ON f.match_id = d.match_id
	GROUP BY 1, 2
	HAVING SUM(overs)*6 > 60)
    
    	SELECT DENSE_RANK() OVER(ORDER BY (SUM(runs)/SUM(wickets)) ASC) as Player_Rank, -- Ranking
	bowlerName as Player_Name,
    	ROUND((SUM(runs)/SUM(wickets)),2) as Bowling_Avg 
    	FROM fact_bowling F
	INNER JOIN dim_match D
	ON F.match_id = D.match_id
    	WHERE bowlerName IN( 
			SELECT bowlerName
			FROM CTE2
			GROUP BY 1
			HAVING count(Season) = 3 )
	GROUP BY bowlerName 
    	HAVING Bowling_Avg IS NOT NULL
	ORDER BY Bowling_Avg ASC 
    	LIMIT 10;

Top 10 bowlers based on past 3 years economy rate. (min 60 balls bowled in each season)

	
    	WITH CTE2 AS(SELECT bowlerName, RIGHT(matchDate,4) AS Season -- CTE Filtered by Atleast 60 balls delivered
	FROM fact_bowling f
	INNER JOIN dim_match d
	ON f.match_id = d.match_id
	GROUP BY 1, 2
	HAVING SUM(overs)*6 > 60)
	
    	SELECT  RANK() OVER(ORDER BY (ROUND(avg(economy),2)) ASC ) as Player_Rank,
	bowlerName, ROUND(avg(economy),3) as Avg_Economy -- Rounded upto 2 decimal place
	FROM fact_bowling f
	INNER JOIN dim_match d
	ON f.match_id =  d.match_id -- `fact_bowling` and `dim_match` has been  Inner Joined on `match_id`
    	WHERE bowlerName IN ( 
			SELECT bowlerName
			FROM CTE2
			GROUP BY 1
            		HAVING count(Season) = 3 )  -- All of the Three Seasons Bowled 
	GROUP BY bowlerName
	ORDER BY Avg_Economy ASC
	LIMIT 10;

Top 10 bowlers based on past 3 years dot ball % (min 60 balls bowled in each season)

    -- dot ball % i.e. percentage of dot balls to total ball
    
    	WITH CTE2 AS(SELECT bowlerName, RIGHT(matchDate,4) AS Season 
	FROM fact_bowling f
	INNER JOIN dim_match d
	ON f.match_id = d.match_id
	GROUP BY 1, 2
	HAVING SUM(overs)*6 > 60)
	
    	SELECT bowlerName AS Player_Name,
	ROUND(100*((SUM(`0s`))/(SUM(overs)*6)),2) as 'Dotball%' 
	FROM fact_bowling AS f
    	WHERE bowlerName IN ( 
			SELECT bowlerName
			FROM CTE2
			GROUP BY 1
            		HAVING count(Season) = 3 ) 
	GROUP BY bowlerName 
	ORDER BY `Dotball%` DESC 
	LIMIT 5;

Top 4 teams based on past 3 years winning %

-- (the fraction of games or matches a team or individual has won)
-- I employed Common Table Expression to execute the query
    
    	WITH CTE3 AS(SELECT team1 AS team,
	COUNT(match_id) as First_bat
	FROM dim_match
	GROUP BY 1
	),
    
    
    	CTE4 AS(SELECT team2 AS team,
    	COUNT(match_id) as First_ball
	FROM dim_match
    	GROUP BY 1
	),
    
    
	task AS(SELECT winner AS team,
    	COUNT(*) AS `#Win`
    	FROM dim_match
    	GROUP BY 1
	)
	
    	SELECT team AS Team,
    	(`First_ball`+`First_bat`) AS Total_play,
	ROUND((100*`#Win`/(`First_ball`+`First_bat`)),2) AS Win_Prc
	FROM CTE3
	JOIN CTE4 USING (team)
	JOIN task USING (team)
	ORDER BY 3
	LIMIT 4;

Top 2 teams with the highest number of wins achieved by chasing targets over the past 3 years


   

	WITH temp AS(SELECT winner,
	ROUND(COUNT(CAST(SUBSTRING(margin,1,LOCATE(" ",margin)-1)AS SIGNED)),2) AS Avg_Lead -- Substring `Lead_Number` absolute value of field like `26 wickets` '26' is Lead_Number and 'wickets' is filed 
	FROM dim_match
     	WHERE SUBSTRING(margin,LOCATE(" ",margin)+1,length(margin)) = "wickets" -- Those who chases run and wins must be won by wickets
     	GROUP BY winner , SUBSTRING(margin,LOCATE(" ",margin)+1,length(margin)) -- substring() gives ou "wickets"/'runs'
	 ) 

	SELECT DENSE_RANK() OVER(ORDER BY Avg_Lead DESC) AS Team_Rank,
	winner, Avg_Lead -- Average lead
	FROM temp
	
	ORDER BY Avg_Lead DESC
	LIMIT 2;

In our first social update, we noticed anomalies in some SQL query outputs that require rectification. You may refer to Old Primary SQL Analysis Docstring or Linked Post.

The Rectified SQL Queries are as follows: You may refer to the docstring - (Updated SQL Query)

Power BI (Cleaning and Basic Analysis)

Power Query (M Language)
DAX Measures

Both can be found in (Final_Dashboard.pbix)

The Orange and Purple Cap Prediction

Here in this section, we will deal with various machine learning models and Statistical analysis to optimize those models mostly deployed through R and Python as a part of our "Predictive Analytics".

Orange Cap:

Objective:

To predict the player having maximum total run in the IPL season 2024.

Methodologies:

Assumptions like Independence of errors, Normality of errors, etc. have been checked using various plots and measures.
Primary model has been made through Stepwise Regression method.
Multicollinearity of independent variable has been identified by Variance Inflation Factor
Multicollinearity of the independent variable has been addressed by Principal Component Analysis
Test of Significance has been considered while model building
Reliability has been reconfirmed by R-Squared and Adjusted R-Squared

Purple Cap:

Objective:

To predict the player with the most wickets in the IPL season 2024.

Methodologies:

The approach mentioned earlier for "Orange Cap Prediction" has been imitated.

Resources

Explanation: Linked Post
Resources: Datasets are available Here

The Ultimate Result:

My Team 11

Power BI

Data Analysis
- I Classified the pool of players into various homogeneous groups (like Opener, Middle Order, Lower Order, etc.)
- To classify some criteria are needed. I got it from Analysis using Power BI.
- Illustration: Say we want to make criteria for openers then:
  1. Filter the pool using position 1,2. We can't allow middle-order players to play the role of opener.
  2. Then Line Charts would depict 90 percentile points which may help us to put a cap for further filtering like Strike Rate, Average Ball Faced, etc.
  3. Ordering players based on Standard Deviation of runs/Strike Rate/anything else may help choose the right ones.
Actionable Insights
** Explanation Video
The Ultimate Result:

Thank You So Much!! See You Soon, Until then Goodbye 👋👋

Name		Name	Last commit message	Last commit date
Latest commit History 140 Commits
Dataset		Dataset
Predictive Analytics		Predictive Analytics
Wrong SQL		Wrong SQL
Bating_Analysis.R		Bating_Analysis.R
Bowling_Analysis.R		Bowling_Analysis.R
Data_Exploration.sql		Data_Exploration.sql
Final_Dashboard.pbix		Final_Dashboard.pbix
Orange_Cap_Prediction.ipynb		Orange_Cap_Prediction.ipynb
Primary_Analysis (Rectified).sql		Primary_Analysis (Rectified).sql
Primary_Analysis.sql		Primary_Analysis.sql
Purple_Cap_Prediction.ipynb		Purple_Cap_Prediction.ipynb
README.md		README.md
Transformation_Bating.R		Transformation_Bating.R
Transformation_Bowling.R		Transformation_Bowling.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

IPL_-Sports_Magazine

Table of Contents:

Problem Statement

Data Loading

Data Exploration

Data Cleaning

Primary Data Analysis:

The Orange and Purple Cap Prediction

Orange Cap:

Objective:

Methodologies:

Purple Cap:

Objective:

Methodologies:

Resources

My Team 11

About

Releases

Packages

Languages

pb319/IPL_Sports_Magazine

Folders and files

Latest commit

History

Repository files navigation

IPL_-Sports_Magazine

Table of Contents:

Problem Statement

Data Loading

Data Exploration

Data Cleaning

Primary Data Analysis:

The Orange and Purple Cap Prediction

Orange Cap:

Objective:

Methodologies:

Purple Cap:

Objective:

Methodologies:

Resources

My Team 11

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages