This document is the README for my Data Science 871 Practical. It details my approach to each question; where anything here is unclear, more detail is given in the READMEs in the respective question folders (e.g. Question_1).
To start, I created Texevier HTML templates for all the questions using the standard code:
`Texevier::create_template_html(directory = "", template_name = "")`
For this question, there was a lot of missing data and the dataset was confusing, so the first thing I did was look at the unique values in each column to try to figure out what was going on. The location column contains continents and income designations as well as individual countries. Since our first task was to compare Africa to the rest of the world, my first idea was to use the continent-level rows in the location column, but those rows had no data in the other columns, so this approach was futile. I still wanted a measure of total cases per million per continent to show the comparison. Instead, I reduced the location column to countries only and then averaged the values over all the countries in each continent. This is what the function “barplotcases” does.
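The averaging step can be sketched roughly as follows. This is a minimal illustration, not the actual body of “barplotcases”: the column names `location`, `continent` and `total_cases_per_million`, the one-row-per-country layout, and the toy values are all assumptions made for the example.

```r
# Toy data, invented for illustration; column names are assumptions.
covid <- data.frame(
  location  = c("Africa", "Kenya", "Ghana", "Europe", "France", "Spain", "Low income"),
  continent = c(NA, "Africa", "Africa", NA, "Europe", "Europe", NA),
  total_cases_per_million = c(NA, 6000, 5000, NA, 400000, 300000, NA)
)

# Keep only country rows: in this toy layout the aggregate rows
# (continents, income groups) have no continent of their own.
countries <- covid[!is.na(covid$continent), ]

# Average over the countries within each continent. The formula
# interface drops rows with missing values by default.
continent_avg <- aggregate(
  total_cases_per_million ~ continent,
  data = countries,
  FUN  = mean
)
print(continent_avg)
```

With real data there are many dated rows per country, so each country would first need to be reduced to a single value (for example its latest observation) before averaging.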
One might assume that Africa would be hardest hit by the Covid-19 pandemic given the high levels of poverty and inequality present in many countries in the region. To investigate this, one can look at a number of metrics. The left pane of the figure below shows the average total number of cases per million inhabitants for each continent; the average excludes countries for which no data on total cases was available. At first glance this tells a surprising story: Africa experienced the fewest Covid-19 cases per million inhabitants of all the continents. This is less surprising when you delve a little deeper. The right pane shows that Africa also has the lowest average total tests per thousand inhabitants. This suggests that Africa’s low Covid-19 numbers might not reflect a lack of actual cases, but rather a failure to test adequately for the virus.
Another stark illustration of the problems Africa has experienced in collecting data on the pandemic is the case of Tanzania. This is not a case of a lack of resources, as it is for testing, but rather a lack of leadership. The president declared the pandemic to be over early in 2020 and thereafter stopped recording cases, which is why the total cases for Tanzania remain close to zero for much of 2020 and 2021. It was only when the new president came into power that cases began to be recorded again, hence the jump in late 2021.
Both these figures illustrate a need for caution in making blanket statements about how certain countries or regions fared during Covid-19. Without adequate understanding of the limitations of the dataset provided, one might certainly be inclined to conclude that Africa got off relatively lightly with regard to Covid-19 or that Tanzania had barely any Covid cases for the first two years of the pandemic. This could be an interesting topic to discuss, with more examples of where data might be confusing to readers at face value.
This failure of the data to capture the true impact of the pandemic extends beyond Africa. Testing has tended to be more limited in lower-income countries as well. From this dataset we can obtain average case numbers and deaths per income level, which are displayed in the figure below. It once again shows the counter-intuitive result that lower-income countries experienced fewer cases and fewer deaths. Although it is plausible that lower rates of travel in lower-income countries suppressed cases initially, it is unlikely that this effect would have persisted throughout the pandemic. Thus, the graphs below illustrate the problems with data collection in resource-light nations.
There is clearly a positive relationship between average life expectancy and total cases. This could be due to correlations with levels of development and accordingly resources available for testing, or due to the abundance of elderly people who were more at risk of Covid-19.
One can see from this figure that limited data is available on hospitalizations and ICU admissions across regions. For the regions with data, hospital beds per thousand in Asia and South America do not appear to vary much. Europe does appear to pick up hospital beds following the first wave, but there is relatively little movement after that.
The key thing to get across here is that London is in fact cold and wet. I think the best way to show that it is wet is to calculate the frequency of days on which it rained versus days on which it didn’t, so that I can convince her that it rains a lot of the time. For the cold part of the equation, some kind of temperature plot is the best option. There is a lot of data, so I restricted it to the most recent year and plotted the average temperatures for that year. Some sort of comparison is needed here, so I obtained the average annual temperature for South Africa (World Bank, 2022). I also provided a plot of the snow depth for some humour.
There are so many reasons not to move to London, but in this document I will name but a few. First, London is cold; second, London is rainy; and third, there is a zero percent chance of running into me. It rains nearly 50% of the time! The amount of rain in London basically guarantees you’ll be stomping in puddles and going to work with wet socks. Do you really like wet socks?
Weather | Frequency | Percentage |
---|---|---|
No Rain | 7963 | 51.9 |
Rainy | 7372 | 48.1 |
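A table like this can be built from daily precipitation along the following lines. This is only a sketch: the column name `precipitation`, the rule that any value above zero counts as a rainy day, and the toy values are assumptions, not taken from the actual dataset.

```r
# Toy daily precipitation in mm, invented for illustration.
weather <- data.frame(precipitation = c(0, 2.4, 0, 0.6, 0, 0, 5.1, 0))

# Label each day, count the days in each group, and convert the
# counts to percentages of all days.
weather$rain <- ifelse(weather$precipitation > 0, "Rainy", "No Rain")
counts <- table(weather$rain)

rain_table <- data.frame(
  Weather    = names(counts),
  Frequency  = as.vector(counts),
  Percentage = round(100 * as.vector(counts) / sum(counts), 1)
)
print(rain_table)
```

The percentages are shares of all days, so the column should sum to 100 (up to rounding).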
Not only is London super wet, it is also super cold. Where the average annual temperature in London is a chilly 12.7 degrees, in sunny South Africa it is 17.5 degrees. From the graph below it is clear that many days in London are spent below the average.
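The “days below the average” claim can be checked numerically with something like the sketch below. The vector `mean_temp` and its values are invented for illustration; only the 17.5 degree South African figure comes from the text above.

```r
# Toy mean temperatures in degrees, invented for illustration.
mean_temp <- c(3.1, 5.4, 8.0, 11.2, 14.9, 18.3, 20.1, 19.4, 15.7, 11.8, 7.2, 4.0)

# Average of the London series itself.
london_avg <- mean(mean_temp, na.rm = TRUE)

# South African annual average quoted in the text (World Bank, 2022).
sa_avg <- 17.5

# Share of observations colder than each benchmark.
share_below_london <- mean(mean_temp < london_avg, na.rm = TRUE)
share_below_sa     <- mean(mean_temp < sa_avg, na.rm = TRUE)
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which avoids counting and dividing by hand.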
The maximum snow depth in London is deeper than a kitten. This is an illustration of what a kitten stuck in that snow might look like. Doesn’t that make you sad?
This dataset repeatedly crashed my computer, so I was unable to read the data in using my function. Instead, I had to run all the elements of my function manually so that I could complete the rest of the analysis. As such, the figures that I created won’t display properly through code here, but you can run them as individual functions or check their code in the Question_3_correct folder. I will include screenshots of the output here.
In terms of points, none of the top players are particularly familiar. Neither Djokovic nor Nadal appears among the top 10 players, as seen in Table 1 below.
The top players in terms of rank are more familiar. Unfortunately my favourite, Nadal, only comes in at 10th, far behind Djokovic, who holds the first position.
The scatter plot below shows limited correlation between a player’s height and their ranking, although all the top players appear to be taller than 170cm. The average South African male is 169cm, so the average South African man is definitely not going to be a pro tennis player.
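The strength of the height and ranking relationship can also be put into a number. The sketch below uses Spearman’s rank correlation, which is my choice for this illustration because ranking is ordinal; the data frame `players`, its column names and its values are invented.

```r
# Toy player data, invented for illustration.
players <- data.frame(
  height_cm = c(185, 188, 198, 178, 193, 175, 183, 196),
  rank      = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Spearman's rho compares the orderings of the two variables rather
# than their raw values; values near 0 indicate a weak relationship.
rho <- cor(players$height_cm, players$rank, method = "spearman")

# Shortest player in the sample, to compare against the 170cm mark.
min_height <- min(players$height_cm)
```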
Lefties seem to be relatively good at tennis, making up 13% of the top tennis players but only 10% of the population.
A good place to start in explaining what works and what doesn’t work for a streaming platform is to look at which shows and movies do well on the platform. The performance metric used here is TMDb popularity, as it aggregates important metrics like the number of votes and views for a movie or show. The top 10 movies and shows are shown in the tables below. The movie table offers some possible explanations for a movie’s performance: longer movies may perform a bit better, romances and thrillers seem to be popular, and the TMDb score seems to be relatively important, although that is expected given that it is an input into popularity. For series, the most popular titles appear to run slightly longer than the average series run time of 40 minutes. Drama appears to be the most popular primary genre among top series.
Title | Runtime | IMDb Score | TMDb Score | Primary genre | Popularity |
---|---|---|---|---|---|
365 Days: This Day | 111 | 2.5 | 5.6 | romance | 1823 |
Yaksha: Ruthless Operations | 125 | 6.2 | 6.2 | action | 1275 |
Black Crab | 114 | 5.6 | 6.2 | war | 944 |
The Adam Project | 106 | 6.7 | 7.0 | drama | 920 |
Fistful of Vengeance | 94 | 4.5 | 5.4 | fantasy | 829 |
Honeymoon With My Mother | 110 | 5.8 | 6.3 | comedy | 642 |
Battle: Freestyle | 88 | 4.3 | 4.9 | romance | 571 |
Last Man Down | 87 | 5.0 | 6.3 | thriller | 506 |
Restless | 95 | 5.8 | 6.0 | thriller | 494 |
Texas Chainsaw Massacre | 83 | 4.8 | 5.1 | thriller | 438 |
Title | Runtime | IMDb Score | TMDb Score | Primary genre | Popularity | Seasons |
---|---|---|---|---|---|---|
The Marked Heart | 44 | 6.3 | 6.8 | thriller | 1455 | 2 |
Wheel of Fortune | 26 | 6.7 | 6.7 | family | 1441 | 39 |
Grey’s Anatomy | 49 | 7.6 | 8.3 | drama | 1215 | 18 |
Peaky Blinders | 58 | 8.8 | 8.6 | drama | 972 | 6 |
Heartstopper | 28 | 8.9 | 8.9 | drama | 926 | 1 |
The Walking Dead | 46 | 8.2 | 8.1 | action | 773 | 11 |
Lucifer | 47 | 8.1 | 8.5 | scifi | 761 | 6 |
The Flash | 42 | 7.6 | 7.8 | scifi | 751 | 8 |
All of Us Are Dead | 61 | 7.5 | 8.5 | action | 679 | 1 |
Conversations with a Killer: The John Wayne Gacy Tapes | 61 | 7.2 | 7.8 | crime | 535 | 1 |
Now that we understand some of the broad factors that feed into a series being popular, it is necessary to conduct a more granular analysis, delving into specific factors.
People’s entertainment choices are often driven by a preference for a particular genre of content. Understanding which genres are most popular would help in deciding which content to prioritise on the streaming platform. The graph below shows the 10 most popular genres by median TMDb popularity. The median was chosen to minimise the effect of outliers on the results. It is clear from the graph that animation, crime and sport are the most popular genres, and thus these should be prioritised when creating a new streaming platform.
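The genre ranking can be sketched as below. The data frame `titles`, the column names `primary_genre` and `tmdb_popularity`, and the values are assumptions for illustration. The drama group includes one extreme value to show why the median was preferred over the mean.

```r
# Toy title data, invented for illustration; note the outlier (900).
titles <- data.frame(
  primary_genre   = c("drama", "drama", "drama", "crime", "crime",
                      "animation", "animation"),
  tmdb_popularity = c(10, 12, 900, 30, 40, 50, 70)
)

# Median popularity per genre.
genre_median <- aggregate(tmdb_popularity ~ primary_genre,
                          data = titles, FUN = median)

# Sort from most to least popular and keep at most the top 10.
genre_median <- genre_median[order(-genre_median$tmdb_popularity), ]
top_genres   <- head(genre_median, 10)
print(top_genres)
```

In this toy data the mean for drama would be pulled above 300 by a single title, while its median stays at 12.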
Correlations between the variables can help us understand what matters for a show’s popularity. The correlation plot below shows a strong positive correlation between the IMDb score and the TMDb score, as well as a relatively strong negative correlation between the TMDb score and the run time. This suggests that shows with a shorter run time might score better overall. Ideally, though, this should be plotted separately for movies and series, as these have very different run times and combining them in one plot might confuse the relationship between run time and the other variables.
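Splitting the correlations by title type, as suggested above, could look something like the following. This is a sketch of the suggested follow-up rather than something done in the analysis; the `type` labels, column names and values are invented for illustration.

```r
# Toy title data, invented for illustration.
titles <- data.frame(
  type       = c("MOVIE", "MOVIE", "MOVIE", "MOVIE",
                 "SHOW", "SHOW", "SHOW", "SHOW"),
  runtime    = c(90, 100, 110, 120, 25, 40, 45, 60),
  imdb_score = c(5.0, 5.5, 6.1, 6.4, 8.0, 7.5, 7.2, 6.9),
  tmdb_score = c(5.2, 5.6, 6.0, 6.6, 8.1, 7.7, 7.0, 6.8)
)

num_cols <- c("runtime", "imdb_score", "tmdb_score")

# Split the numeric columns by type and compute one correlation
# matrix per group, skipping missing values pairwise.
cor_by_type <- lapply(
  split(titles[num_cols], titles$type),
  cor,
  use = "pairwise.complete.obs"
)
print(cor_by_type)
```

In this toy data run time is positively correlated with the TMDb score for movies and negatively for series, which is exactly the kind of difference a pooled plot would hide.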
What is clear from the above analysis is that the company must carefully consider genres as well as a title’s IMDb and TMDb ratings when deciding to load titles onto the streaming platform. Variables like run time should also be considered although what is optimal is less clear. Further analysis is necessary before we commence the buying of titles.