This repository contains a project that I completed as part of my Ph.D. coursework in Economics at PIMES/UFPE in Recife, Brazil. The project was submitted at the end of July 2021 and is quite lengthy, so on April 5th, 2023, I decided to break it down into smaller parts. This repository holds the first, original version of the final project. The project is a descriptive analysis that combines Covid-19 data with a range of economic data. It begins with web scraping and ends with machine learning applications, and it includes many data visualizations built after several merges between the datasets.
In this Data Science Covid-19 Project, I conducted a comprehensive analysis of the pandemic's impact on various countries using a wide range of data sources and statistical methods. The project can be outlined as follows:
- Collected Covid-19 data from Worldometer using web scraping in Python, and acquired vaccine data to correlate with Covid-19 case and death rates.
- Merged the Covid-19 data and vaccine data with country indicators such as the Human Development Index, Gross Domestic Product (GDP), GDP per capita, unemployment rate, inflation rate, and annual GDP growth.
- To represent the data visually, I created scatter plots, bar graphs, pie charts, tree maps, and trendlines, as well as world maps built from shapefiles and latitude and longitude data. I used libraries such as Plotly, Matplotlib, and GeoPandas, among others.
- Statistical tests were used to assess whether the relationships between variables are statistically significant.
- For the econometric analysis, I used a diverse set of models and tests: linear and multivariate regression estimated by Ordinary Least Squares (OLS), endogeneity tests including the Hausman test, two-stage least squares (2SLS), Regression Discontinuity Design (RDD), and Difference-in-Differences (DiD). I also used Logit regression and quantile regression. These models explore correlations and possible causal relationships between variables, such as GDP per capita and Covid-19 cases or fatalities, and the potential impact of unemployment rates on the number of Covid-19 cases. Many other variables were also tested.
- Spatial Econometrics was used to analyze spatial autocorrelation, both global and local, to understand how Covid-19 cases and fatalities might affect neighboring countries.
- Machine learning techniques, including Support Vector Classifier (SVC), Logistic Regression, Decision Trees, and Random Forests, were applied for prediction purposes. For instance, I predicted Covid-19 mortality rates based on whether a country falls into the high income, upper middle income, lower middle income, or low income group.
- The project is divided into separate sections, each focusing on specific analyses and methodologies. I have made the data and relevant links available for download or web scraping for further exploration.
- With this analysis, I aimed to gain insight into the correlations and impacts of factors related to Covid-19, contributing to a deeper understanding of the pandemic's effects on different countries and populations.
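
To illustrate the global spatial autocorrelation step listed above, here is a minimal sketch of Moran's I, a commonly used statistic for that purpose. It is not the project's own code. The function name `morans_i`, the four-region layout, the binary contiguity weights, and the case values are all hypothetical and chosen only to keep the arithmetic easy to follow. The sketch uses only the Python standard library.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weights matrix.

    I = (n / S0) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2
    where S0 is the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]           # deviations from the mean
    s0 = sum(sum(row) for row in weights)      # total weight
    num = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)


# Hypothetical example: four regions in a row; neighbors share a border.
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

# Hypothetical case rates: similar values sit next to each other.
clustered = [10.0, 12.0, 30.0, 32.0]
# Hypothetical case rates: high and low values alternate.
alternating = [10.0, 30.0, 10.0, 30.0]

print("clustered:  ", morans_i(clustered, weights))    # positive value
print("alternating:", morans_i(alternating, weights))  # negative value
```

A positive value suggests that neighboring regions tend to have similar values, while a negative value suggests that neighbors tend to differ. Assessing significance (for example through permutation tests) and computing local indicators would require additional steps not shown here.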