-
Notifications
You must be signed in to change notification settings - Fork 0
/
Social_Media_Bias_Project.Rmd
152 lines (100 loc) · 15.9 KB
/
Social_Media_Bias_Project.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
---
title: "Final Project Report"
authors: Sameer Khan, Luke Schreiber, Nick Waddups
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
Social media has become extremely common among university students. With its increasing popularity, many are concerned about its harmful consequences. Several studies have been conducted by health experts, and many point to social media having many negative affects, such as increased anxiety and depression, addiction, body dysmorphia, inability to focus on school or work, and more. Thus begins this study on the affects of social media. The motive of this case study is to find out if University of Utah students are negatively affected by the use of Instagram.
We decided to focus our efforts on studying Instagram over other social media apps because of its popularity, and because the activity of users on Instagram can potentially provide evidence suggesting the existence of body shaming on the platform. Individuals who post pictures of themselves on their accounts may receive more or less attention in the form of followers depending on their body types. To take this further, an individual's body type may even be a deciding factor in determining whether or not they have pictures of themselves on Instagram accounts in the first place.
Whether Instagram affects users' abilities to focus on work or school is another important topic to study, since general awareness about the issue can result in people taking responsibility for their content consumption, as well as encouraging further allocation of resources to provide services to help those who struggle with social media. In the case of U of U students, we all have a common and somewhat standard metric of determining performance: GPA. Thus, we will study how time spent on Instagram affects U of U students' GPA's. To take this further, whether or not a student even has an Instagram may have an effect on their GPA, so this will be studied as well.
To summarize, this case study will attempt to answer the following questions with regards to University of Utah students:
1) Does the BMI of an individual who has pictures of themselves on their Instagram account have an effect on the number of followers on their account?
2) Does BMI affect whether or not users post pictures of themselves on their Instagram?
3) Does BMI affect whether or not users have an Instagram account?
4) Does the daily time spent on Instagram affect an individual's GPA?
5) Does having an Instagram account affect an individual's GPA?
## Data Collection
This study is primarily focused on U of U students' usage of Instagram, their grades, and their body types. To help answer the questions posed in the introduction, we can gather data about students' usage of Instagram by asking whether or not respondents post pictures of themselves on their Instagram account, how many followers they have, and how many minutes per day they spend on Instagram. Quantifying respondents grades is very straightforward, since we can simply ask them for their cumulative GPA.
As for respondents' body types, this is much less simple. Asking respondents to describe their body types is impractical, and perhaps inappropriate. Asking them how they feel about their bodies wouldn't do us much good either, since a person's happiness with how they look depends on several factors other than their own body. Ideally, we want qualitative data here. Thus, we decided to collect each respondent's Body Mass Index (BMI) by asking them for their height and weight. However, this approach isn't without its flaws. BMI is a measure of body fat based entirely on an individual's height and weight--it does not account for other factors which play a large role in an individual's body type, such as their level of physical activity. An athlete with a significant amount of muscle mass could have a higher BMI than someone who leads a more sedentary lifestyle, but this doesn't mean that the athlete is overweight or less healthy than the individual with a lower BMI and more sedentary lifestyle. Body fat percentage would be a more accurate quantitative variable to describe an individual's body type. However decided against collecting this, since measuring respondents' body fat percentages comes with a plethora of impracticalities (we can't exactly hand surveyees body fat calipers over discord).
We also included some questions that may not necessarily be used to answer the five questions posed in the introduction, but could provide some insight into our sample and the topic being studied. These questions asked about gender, sleep, and whether respondents felt like they spent too much time on Instagram. The survey questions are listed below:
1) What is your cumulative GPA?
2) is your height in feet and inches?
3) What is your gender?
4) How much sleep do you get per night on average?
5) What is your weight in pounds?
The following questions were only posed to surveyees who had an Instagram account:
6) Do you post pictures of yourself on your Instagram account?
7) How many followers do you have on your Instagram account?
8) On average, how many minutes do you spend on Instagram per day?
9) Do you feel like you spend too much time on Instagram?
To gather the data, we created a Google survey with the questions listed below. The survey was posted on the class discord, as well as other U of U affiliated discord servers, such as Data Science UofU, U Physics & Astronomy, and Crimson Gaming. It was also posted to the U's subreddit, r/uofu. Then, we surveyed students on campus by ...
<span style="color: red;">
Talk about methods of collecting data-- how we collected the data, and why we collected the data this way. Talk about any flaws in the way we intend to gather data, and why we had to do it this way if it's less than ideal.
</span>
## Data Analysis
After receiving (N) survey responses, the csv file from the Google form was downloaded and read into R. A non-trivial amount of cleanup was required before all of the data could be used. For example, some survey respondents incorrectly formatted their heights. This was difficult to fix in R, so the csv file was altered manually. Examples of faulty data that was cleansed in R includes non-numeric answers for GPA, time spent on Instagram, or follower count-- the responses of these surveyees was dropped by using the "is.na" function in conjunction with other parsing functions. Some of the incorrectly formatted data that was easy to locate and correct in the csv file was done directly in the csv file itself, such as surveyees including units for weight.
Next, the BMI of each respondent needed to be calculated. To do this, the "Height" column was converted from feet and inches to inches. Then, a column for BMI was added, which was found by multiplying weight by 703 and dividing by the height (in inches) squared. From here, the data could be separated into multiple different tables to represent the different groups of this study. There was a table for all respondents, respondents who used Instagram, respondents who used Instagram who posted pictures of themselves on their accounts, and respondents who used Instagram who didn't post pictures of themselves on their accounts. Now, data analysis can be performed with respect to the questions posed in the introduction.
```{r, echo=FALSE}
# Some data cleansing is done here. Some of it must be done manually every time the data set is downloaded. In particular, cleansing incorrectly formatted height data is difficult to do it in R, so just do it manually.
# Read the data, then rename the columns for better readability
raw_data = read.csv("Survey Data2.csv")
names(raw_data) <- c('Timestamp', 'GPA', 'Height', 'Gender', 'Sleep', 'Weight',
'PostsPics', 'Followers', 'MinsOnIG', 'TooLongOnIG')
raw_data <- raw_data[!is.na(as.numeric(as.character(raw_data$GPA))),] # Purge faulty data from people being ape brain
# Convert height in feet and inches to just inches
raw_data['Inches'] <- as.numeric(substr(raw_data$Height, 1, 1))*12 + as.numeric(substr(raw_data$Height, 3, nchar(raw_data$Height)))
# Calculate BMI
raw_data['BMI'] <- raw_data$Weight * 703 / raw_data$Inches^2
# Get all Instagram users
ig_users <- raw_data[raw_data$PostsPics != '',] # Maybe do == 'Yes' || == 'No' instead
ig_users <- ig_users[!is.na(ig_users$Followers),] # Some people answered this question even though they don't have IG
# Get Instagram users who post pictures of themselves
ig_posters <- ig_users[ig_users$PostsPics == 'Yes',]
# Get Instagram users who don't post pictures of themselves
ig_NonPosters <- ig_users[ig_users$PostsPics == 'No',]
# Get the non-Instagram users
nonIg_users <- raw_data[raw_data$PostsPics == '',]
```
#### 1) Does the BMI of an individual who has pictures of themselves on their Instagram account have an effect on the number of followers on their account?
To determine whether BMI plays a roll in the number of followers an individual has on their account, given that they post pictures of themselves on their account, linear regression can be used to determine if there's a trend in the data. Below, a plot of BMI vs Follower Count is given for the sampled group of Instagram users who post pictures of themselves on their account. This is followed by the results of fitting a linear model to this relationship, where the independent variable is BMI, and the dependent variable is follower count.
```{r, echo=FALSE}
# 1) Does the BMI of an individual who has pictures of themselves on their Instagram account have an effect on the number of followers on said account?
# Relationship between BMI and number of followers for those who post pictures of themselves on IG
plot(ig_posters$BMI, ig_posters$Followers, xlab = "BMI", ylab = "Follower Count", main = "BMI vs Follower Count of those visible on their IG account")
BMI_vs_Followers_Model <- lm(ig_posters$Followers ~ ig_posters$BMI, data = ig_posters)
summary(BMI_vs_Followers_Model) # High R-squared means strong correlation, low R-squared means we don't have evidence for a strong correlation
```
#### 2) Does BMI affect whether or not users post pictures of themselves on their Instagram?
We want to answer whether the mean BMI of individuals who don't post pictures of themselves on their Instagram accounts is different from the mean BMI of individuals who do post pictures of themselves on their Instagram accounts. It's important to aknowledge that body shaming can go both ways--it is something that people can unfortunately experience at high BMI's as well as low BMI's. Thus, this question will be answered by running a two-sample t-test for the difference of means with a 90% confidence level. To account for body shaming going both ways, this will be a two-sided hypothesis test. In other words, the null hypothesis is that the true difference in means between the two groups is equal to zero, and the alternative hypothesis is that the true difference is not equal to 0 (not necessarily just less than 0 or just greater than 0). We chose a 90% as our confidence level over a higher value like 95% or 99% because type 1 errors aren't a concern--it wouldn't be a huge problem if we claimed to have evidence of body shaming even though there is no body shaming in reality. At least, it wouldn't be nearly as big of an issue compared to a type 2 error: claiming that there is no evidence of body shaming, when in reality, body shaming is present. In other words, we chose a 90% confidence level because having a lower confidence level reduces the probability of a type 2 error, which would be more detrimental in this case. The results of the t-test are shown below.
```{r, echo=FALSE}
# 2) Does BMI affect whether or not users post pictures of themselves on their Instagram?
t.test(ig_NonPosters$BMI, ig_posters$BMI, alternative = "two.sided", conf.level = 0.90)
```
#### 3) Does BMI affect whether or not users have an Instagram account?
This will be determined in a fashion very similar to question 2 above. We want to answer whether the mean BMI of individuals who don't use Instagram is different from the mean BMI of individuals who do use Instagram. Thus, this question will be answered by running a two-sample t-test for the difference of means with a 90% confidence level. Once again, the t-test will be two-sided, since body shaming goes both ways, and a lower confidence level of 90% is used because a type 2 error is more detrimental than a type 1 error for the purposes of this study. The results of the t-test are shown below.
```{r, echo=FALSE}
# 3) Does BMI affect whether or not users have an Instagram account?
# See if there's a difference in the BMI of people who have IG vs those who don't have IG
t.test(nonIg_users$BMI, ig_users$BMI, alternative = "two.sided", conf.level = 0.90)
```
#### 4) Does the daily time spent on Instagram affect an individual's GPA?
We want to answer whether an increase in daily time spent on Instagram causes a significant increase or decrease in GPA. To determine this, linear regression can once again be used to see if there's a trend in the collected data. Below, a plot of daily time spent on Instagram in minutes vs GPA is given for the sampled group of all Instagram users. This is followed by the results of fitting a linear model to this relationship, where the independent variable is daily time spent on Instagram, and the dependent variable is GPA.
```{r, echo=FALSE}
# 4) Does the daily time spent on Instagram affect an individual's GPA?
plot(ig_users$MinsOnIG, ig_users$GPA, xlab = "Minutes spent on Instagram per day", ylab = "GPA", main = "Time spent daily on Instagram vs GPA")
TimeOnIG_vs_GPA_Model <- lm(ig_users$GPA ~ ig_users$MinsOnIG, data = ig_users)
summary(TimeOnIG_vs_GPA_Model) # High R-squared means strong correlation, low R-squared means we don't have evidence for a strong correlation
```
#### 5) Does having an Instagram account affect an individual's GPA?
Determining the answer to this question will involve another confidence interval. We want to determine whether the mean GPA of U of U students who have an Instagram is less than the mean GPA of those who do not have an Instagram. Thus, a two-sample t-test for the difference of means will be conducted. Unlike the previous two t-tests, this one will be a one sided test. We are specifically looking for evidence that the mean GPA of Instagram users is less than the mean GPA of those who don't user Instagram. In the extreme case, we'd expect someone who spends almost every waking hour on Instagram to perform poorly in school. So, the null hypothesis is that the difference in mean GPA of U of U students who use Instagram and those who don't is greater than or equal to 0, and the alternative hypothesis is that this difference is less than 0. Also, for this test, we chose a confidence level of 95%--this is a higher confidence level than the other two t-tests conducted. This is because the difference in severity for type 1 and type 2 errors isn't as prominent here. We don't want to fail to reject the null hypothesis when it is false, but on the other hand, we don't wan't to tell students to further limit their Instagram usage if it doesn't actually affect their GPA. Especially with COVID, Instagram has the ability to let students socialize with people that they may not be able to see in-person. This is not something we want to limit for no good reason. The results of the t-test are shown below.
```{r, echo=FALSE}
# 5) Does having an Instagram account affect an individual's GPA?
t.test(as.numeric(ig_users$GPA), as.numeric(nonIg_users$GPA), alternative = "less", conf.level = 0.95)
```
## Conclusions
<span style="color: red;">
What conclusions can be made. Confounding variables, shortcomings of this study. Mostly STEM majors--asked people from class, discord, reddit, idk. See if we can actually answer the question we posed, strength of evidence, etc etc. Didn't have equal numbers of non IG users and IG users, or equal number of non IG pic posters and IG pic posters.
</span>