Skip to content

Challenge #10: Analyse historical IPL data and provide insights on IPL 2024 for a Sports Magazine (Codebasics)

Notifications You must be signed in to change notification settings

pb319/IPL_Sports_Magazine

Repository files navigation

IPL_-Sports_Magazine

  • Challenge #10: Analyse historical IPL data and provide insights on IPL 2024 for a Sports Magazine (Codebasics).
  • Project Presentation PDF: Linkedin Post.

1


Table of Contents:

  1. Project Link
  2. Problem Statement
  3. Data Modeling
  4. Primary Data Analysis
  5. Orange Cap and Purple Cap Prediction
  6. My Team 11

Problem Statement

"Sports Basics" is a sports blog company that entered space recently. They wanted to get more traffic to their website by releasing a special edition magazine on IPL 2024. This magazine aims to provide interesting insights and facts for fans, analysts and teams based on the last 3 years' data. The chief editor Tony Sharma oversees this publication, and he believes in data analytics. He reached out to Peter Pandey, a journalist in his team who is a data-savvy cricket enthusiast.

Data Loading

  1. MySQL

01

Similarly, I added all other dimensions and fact tables into the Mysql Workbench:

02

The table names have been changed slightly using Mysql Workbench using the following bunch of code:

USE ipl;

ALTER TABLE dim_match_summary
RENAME TO  dim_match;

ALTER TABLE fact_bating_summary
RENAME TO  fact_bating;

ALTER TABLE fact_bowling_summary
RENAME TO  fact_bowling;

Connecting Mysql Database with Power BI:

04

Data Exploration

  1. Mysql
-- Data Exploration

	-- dim_match
		SELECT * FROM dim_match LIMIT 5; 
        DESC dim_match; 
		SELECT COUNT(match_id) as No_of_Matches,  
		COUNT(DISTINCT(team1)) as No_of_Teams, 
		COUNT(DISTINCT(RIGHT(matchDate,4))) as `Year` 
		FROM dim_match; 
        -- Therefore 10 teams participated in 206 matches
	
	-- Let's have a look into the margin column
			select DISTINCT(
			CASE 
				WHEN margin LIKE "%run%" THEN RIGHT(margin,4)
				WHEN margin LIKE "%wicket%" THEN (RIGHT(margin,7)) END) AS Margin
			FROM dim_match;
            -- that is We have 4 different in the form of singular and plural

	-- dim_players
		SELECT * FROM dim_players LIMIT 5; 
        SELECT COUNT(DISTINCT(name)) as Player_Name, 
		COUNT(DISTINCT(team)) as No_of_Teams 
		FROM dim_players; 
        -- Data validation: Total Teams = 10, Total Number of Players 292

	-- fact_bating
		SELECT * FROM fact_bating LIMIT 5; 
        SELECT COUNT(DISTINCT(match_id)) as No_of_Matches, 
		COUNT(DISTINCT(batsmanName)) as Batsman_Count 
		FROM fact_bating;
        -- Data Consistency: Total Matches played  = 206, Total Number of Batsman is 262
		
	-- fact_bowling
		SELECT * FROM fact_bowling LIMIT 5; -- 5 Rows of `fact_bowling` table
		
        SELECT COUNT(DISTINCT(match_id)) as No_of_Matches, 
		COUNT(DISTINCT(bowlerName)) as Bowler_Count 
		FROM fact_bowling; 
        -- AGAIN Corraboration of Total Matches played  = 206
        -- Total Number of Bowlers 202 

For more detailed documentation Click Here

Data Cleaning

  1. Mysql
  • There were some issues with "May 28-29, 2023" entry in the matchYear field while changing the datatype of matchDate column. Hence the following SQL command has been employed to manupulate the data a bit:
      set sql_safe_updates =0;
      UPDATE dim_match
      SET matchDate = 'May 29, 2023'
      WHERE matchDate = 'May 28-29, 2023';
      set sql_safe_updates =1;
    
  1. Power BI
  • "matchDate" datatype has been changed from text into Date:
     = Table.TransformColumnTypes(ipl_dim_match,{{"matchDate", type date}})
    
  • "matchYear" column added:
    = Table.AddColumn(#"Changed Type", "matchYear", each Date.Year([matchDate]), Int64.Type)
    

Primary Data Analysis:

  1. Mysql (Validated by Power BI Dashboar in the background)
  • Top 10 batsmen based on past 3 years total runs scored
	-- Primary Analysis
	-- ** Caution  Edit-> Preferences-> SQL Editor -> read-timeout interval to be at least 60 seconds **

	USE ipl;  -- (ASSIGNING DEFAULT TABLE) MUST RUN CODE ELSE REST WILL NOT WORK
	SELECT  DENSE_RANK() OVER(ORDER BY SUM(runs) DESC) as Player_Rank, 
    	batsmanName as Player_Name, SUM(runs)as Total_Run 
	FROM fact_bating
	GROUP BY batsmanName 
	ORDER BY Total_Run DESC 
	LIMIT 10;     

Untitled design (18)

  • Top 10 batsmen based on past 3 years batting average. (min 60 balls faced in each season)
	-- [Bating Average is the total number of runs they have scored divided by the number of times they have been out]
    
    	WITH CTE1 AS(SELECT batsmanName, RIGHT(matchDate,4) AS Season 
	FROM fact_bating f
	INNER JOIN dim_match d
	ON f.match_id = d.match_id
	GROUP BY 1, 2
	HAVING SUM(balls)>60)

            
	SELECT DENSE_RANK() OVER(ORDER BY ROUND(SUM(runs)/SUM(CASE WHEN `out/not_out` = "out" THEN 1 ELSE 0 END),2) DESC) as Player_Rank, 
	batsmanName as Player_Name,
     	ROUND(SUM(runs)/COUNT(CASE WHEN `out/not_out` = "out" THEN 1 ELSE NULL END),2) AS Bating_Average
	FROM fact_bating F 
	INNER JOIN dim_match D   -- `dim_match` and `fact_bating` has been Inner Joined using `match_id`
	ON F.match_id = D.match_id 
	WHERE batsmanName IN( 
			SELECT batsmanName
			FROM CTE1
			GROUP BY 1
            		HAVING count(Season) = 3)  
	GROUP BY batsmanName 
	HAVING COUNT(DISTINCT(RIGHT(matchDate,4)))=3 
	ORDER BY Bating_Average DESC -- ordered in decsending order
	LIMIT 10;


Untitled design (19)

  • Top 10 batsmen based on past 3 years strike rate (min 60 balls faced in each season)
	SELECT DENSE_RANK() OVER(ORDER BY AVG(SR) DESC)as Player_Rank, batsmanName as Player_Name,
	ROUND(avg(SR),2) as Avg_SR 
	FROM fact_bating F 
	INNER JOIN dim_match D
	ON F.match_id = D.match_id 
    	WHERE batsmanName IN( 
			SELECT batsmanName
			FROM fact_bating f
			INNER JOIN dim_match d
           		 ON f.match_id = d.match_id
			GROUP BY batsmanName , RIGHT(matchDate,4)
            		HAVING SUM(balls)>60)
	GROUP BY batsmanName
	ORDER BY Avg_SR DESC 
	LIMIT 10;

Untitled design (20)

  • Top 5 batsmen based on past 3 years boundary % (min 60 balls faced in each season)
-- Boundary %  is the percentage of total run that comes from 4s and 6s (min 60 balls faced in each season)
    
    	WITH CTE1 AS(SELECT batsmanName, RIGHT(matchDate,4) AS Season 
	FROM fact_bating f
	INNER JOIN dim_match d
	ON f.match_id = d.match_id
	GROUP BY 1, 2
	HAVING SUM(balls)>60)
    
    
	SELECT batsmanName AS Player_Name,
	ROUND((100*(4*SUM(`4s`)+6*SUM(`6s`))/SUM(runs)),2) as 'Boundary%'
	FROM fact_bating AS f
    	WHERE batsmanName IN( 
			SELECT batsmanName
			FROM CTE1
			GROUP BY 1
            		HAVING count(Season) = 3 ) 
	GROUP BY batsmanName 
	ORDER BY `Boundary%` DESC
	LIMIT 5;
    

Untitled design (28)

  • Top 10 bowlers based on past 3 years total wickets taken
	SELECT DENSE_RANK() OVER(ORDER BY SUM(wickets) DESC) AS Player_Rank,
    	bowlerName AS Name, 
	SUM(wickets) AS Total_Wickets 
	FROM fact_bowling F  
	INNER JOIN dim_match D
	ON F.match_id = D.match_id
	GROUP BY `Name` 
	ORDER BY Total_Wickets DESC 
	LIMIT 10;
	
	

Untitled design (22)

  • Top 10 bowlers based on past 3 years bowling average. (min 60 balls bowled in each season)
	    
	WITH CTE2 AS(SELECT bowlerName, RIGHT(matchDate,4) AS Season -- CTE Filtered by Atleast 60 balls delivered
	FROM fact_bowling f
	INNER JOIN dim_match d
	ON f.match_id = d.match_id
	GROUP BY 1, 2
	HAVING SUM(overs)*6 > 60)
    
    	SELECT DENSE_RANK() OVER(ORDER BY (SUM(runs)/SUM(wickets)) ASC) as Player_Rank, -- Ranking
	bowlerName as Player_Name,
    	ROUND((SUM(runs)/SUM(wickets)),2) as Bowling_Avg 
    	FROM fact_bowling F
	INNER JOIN dim_match D
	ON F.match_id = D.match_id
    	WHERE bowlerName IN( 
			SELECT bowlerName
			FROM CTE2
			GROUP BY 1
			HAVING count(Season) = 3 )
	GROUP BY bowlerName 
    	HAVING Bowling_Avg IS NOT NULL
	ORDER BY Bowling_Avg ASC 
    	LIMIT 10; 


Untitled design (25)

  • Top 10 bowlers based on past 3 years economy rate. (min 60 balls bowled in each season)
	
    	WITH CTE2 AS(SELECT bowlerName, RIGHT(matchDate,4) AS Season -- CTE Filtered by Atleast 60 balls delivered
	FROM fact_bowling f
	INNER JOIN dim_match d
	ON f.match_id = d.match_id
	GROUP BY 1, 2
	HAVING SUM(overs)*6 > 60)
	
    	SELECT  RANK() OVER(ORDER BY (ROUND(avg(economy),2)) ASC ) as Player_Rank,
	bowlerName, ROUND(avg(economy),3) as Avg_Economy -- Rounded upto 2 decimal place
	FROM fact_bowling f
	INNER JOIN dim_match d
	ON f.match_id =  d.match_id -- `fact_bowling` and `dim_match` has been  Inner Joined on `match_id`
    	WHERE bowlerName IN ( 
			SELECT bowlerName
			FROM CTE2
			GROUP BY 1
            		HAVING count(Season) = 3 )  -- All of the Three Seasons Bowled 
	GROUP BY bowlerName
	ORDER BY Avg_Economy ASC
	LIMIT 10;

Untitled design (26)

  • Top 10 bowlers based on past 3 years dot ball % (min 60 balls bowled in each season)
    -- dot ball % i.e. percentage of dot balls to total ball
    
    	WITH CTE2 AS(SELECT bowlerName, RIGHT(matchDate,4) AS Season 
	FROM fact_bowling f
	INNER JOIN dim_match d
	ON f.match_id = d.match_id
	GROUP BY 1, 2
	HAVING SUM(overs)*6 > 60)
	
    	SELECT bowlerName AS Player_Name,
	ROUND(100*((SUM(`0s`))/(SUM(overs)*6)),2) as 'Dotball%' 
	FROM fact_bowling AS f
    	WHERE bowlerName IN ( 
			SELECT bowlerName
			FROM CTE2
			GROUP BY 1
            		HAVING count(Season) = 3 ) 
	GROUP BY bowlerName 
	ORDER BY `Dotball%` DESC 
	LIMIT 5;

Untitled design (27)

  • Top 4 teams based on past 3 years winning %
-- (the fraction of games or matches a team or individual has won)
-- I employed Common Table Expression to execute the query
    
    	WITH CTE3 AS(SELECT team1 AS team,
	COUNT(match_id) as First_bat
	FROM dim_match
	GROUP BY 1
	),
    
    
    	CTE4 AS(SELECT team2 AS team,
    	COUNT(match_id) as First_ball
	FROM dim_match
    	GROUP BY 1
	),
    
    
	task AS(SELECT winner AS team,
    	COUNT(*) AS `#Win`
    	FROM dim_match
    	GROUP BY 1
	)
	
    	SELECT team AS Team,
    	(`First_ball`+`First_bat`) AS Total_play,
	ROUND((100*`#Win`/(`First_ball`+`First_bat`)),2) AS Win_Prc
	FROM CTE3
	JOIN CTE4 USING (team)
	JOIN task USING (team)
	ORDER BY 3
	LIMIT 4;
    

Untitled design (29)

  • Top 2 teams with the highest number of wins achieved by chasing targets over the past 3 years

   

	WITH temp AS(SELECT winner,
	ROUND(COUNT(CAST(SUBSTRING(margin,1,LOCATE(" ",margin)-1)AS SIGNED)),2) AS Avg_Lead -- Substring `Lead_Number` absolute value of field like `26 wickets` '26' is Lead_Number and 'wickets' is filed 
	FROM dim_match
     	WHERE SUBSTRING(margin,LOCATE(" ",margin)+1,length(margin)) = "wickets" -- Those who chases run and wins must be won by wickets
     	GROUP BY winner , SUBSTRING(margin,LOCATE(" ",margin)+1,length(margin)) -- substring() gives ou "wickets"/'runs'
	 ) 

	SELECT DENSE_RANK() OVER(ORDER BY Avg_Lead DESC) AS Team_Rank,
	winner, Avg_Lead -- Average lead
	FROM temp
	
	ORDER BY Avg_Lead DESC
	LIMIT 2;


Untitled design (31)

In our first social update, we noticed anomalies in some SQL query outputs that require rectification. You may refer to Old Primary SQL Analysis Docstring or Linked Post.

The Rectified SQL Queries are as follows: You may refer to the docstring - (Updated SQL Query)


  1. Power BI (Cleaning and Basic Analysis)
  • Power Query (M Language)
  • DAX Measures

Both can be found in (Final_Dashboard.pbix)

The Orange and Purple Cap Prediction

Here in this section, we will deal with various machine learning models and Statistical analysis to optimize those models mostly deployed through R and Python as a part of our "Predictive Analytics". 2

Orange Cap:

Objective:

  • To predict the player having maximum total run in the IPL season 2024.

Methodologies:

  • Assumptions like Independence of errors, Normality of errors, etc. have been checked using various plots and measures.

  • Primary model has been made through Stepwise Regression method.

  • Multicollinearity of independent variable has been identified by Variance Inflation Factor

  • Multicollinearity of the independent variable has been addressed by Principal Component Analysis

  • Test of Significance has been considered while model building

  • Reliability has been reconfirmed by R-Squared and Adjusted R-Squared

    image

Purple Cap:

Objective:

  • To predict the player with the most wickets in the IPL season 2024.

Methodologies:

  • The approach mentioned earlier for "Orange Cap Prediction" has been imitated.

Resources


  • The Ultimate Result:

IPL_Sports_Magazine

My Team 11

  1. Power BI
  • Data Analysis
    • I Classified the pool of players into various homogeneous groups (like Opener, Middle Order, Lower Order, etc.)
    • To classify some criteria are needed. I got it from Analysis using Power BI.
    • Illustration: Say we want to make criteria for openers then:
      1. Filter the pool using position 1,2. We can't allow middle-order players to play the role of opener.
      2. Then Line Charts would depict 90 percentile points which may help us to put a cap for further filtering like Strike Rate, Average Ball Faced, etc.
      3. Ordering players based on Standard Deviation of runs/Strike Rate/anything else may help choose the right ones.
  • Actionable Insights
  • ** Explanation Video
  • The Ultimate Result:

IPL_Sports_Magazine (1)

Thank You So Much!! See You Soon, Until then Goodbye πŸ‘‹πŸ‘‹