From e69861675634a6e7bc7a45e67874d5fd49ebd049 Mon Sep 17 00:00:00 2001 From: Adriana Solis <60237877+solisa986@users.noreply.github.com> Date: Thu, 4 May 2023 08:46:05 -0400 Subject: [PATCH] actual final draft --- config.yaml | 2 +- thesis.md | 79 +++++++++++++++++++++++++++++------------------------ 2 files changed, 44 insertions(+), 37 deletions(-) diff --git a/config.yaml b/config.yaml index a19173f..c43e57d 100644 --- a/config.yaml +++ b/config.yaml @@ -4,7 +4,7 @@ # Project-specific values title: 'Binge On!: A Machine Learning Analysis Tool to Examine the Success Rate of Movies' author: 'Adriana Solis' -date: '01 April 2023' +date: '05 May 2023' firstreader: 'Oliver Bonham Carter' secondreader: 'Russell Ormiston' logo: 'images/logo' diff --git a/thesis.md b/thesis.md index 49fab8b..22259e2 100644 --- a/thesis.md +++ b/thesis.md @@ -8,20 +8,29 @@ How do I say thank you to such an inspirational woman other than to credit half To **my siblings, Guadalupe Solis, Vicente Solis, America Rubi Nunez, Lucca Victoria, Jan Carlo Victoria, and Marshell Victoria**: I love you all so much, words cannot express how much your love and support for me throughout this year has helped me to finish this project. Especially Lupita, as without her weekly FaceTime's, I don't think I would have been able to stay sane. Thank you so much for providing me with comedic relief and a sense of home for whenever I needed a break from all of the academic stress. Thank you for always being spontaneous and for coming with me to run mundane tasks whenever I come home to visit. Though you were not able to physically provide me support, just the thought of our inside jokes and banter kept me happy whenever I needed it. +To **my beloved cats, Nena and Frankie**: +Meow meow meow meow. Meow meow, meow meow meow meow meow meow meow. 
Meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow, meow meow meow meow, meow meow meow meow meow, meow meow meow meow meow meow meow meow. Meow meow meow meow meow. Meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow. Meow meow meow meow meow meow, meow meow meow meow meow meow, meow meow meow meow meow meow! + To **my extended family**: -Thank you for being the first group of people that believed in me, especially when I gave you guys no reason to be so trusting in my future. Without your continued support throughout the years, I doubt I would have been able to make it past the first year of college, let alone through this extensive research project. I have always looked forward to all of the varying conversations y'all have in the family group chat, as it allowed me to disconnect from academic reality and tune into life outside of college. Without your baby pictures, pet pictures, and words of affirmation, I would have gone crazy for sure. +Thank you for being the first group of people that believed in me, especially when I gave you guys no reason to be so trusting in my future. Without your continued support throughout the years, I doubt I would have been able to make it past the first year of college, let alone through this extensive research project. I have always looked forward to all of the varying conversations y'all have in the family group chat, as it allowed me to disconnect from academic reality and tune into how my life will be outside of college. Without your baby pictures, pet pictures, and words of affirmation, I would have gone crazy for sure. Thank you for being my rocks and for never failing to make me feel loved and supported throughout all of my endeavors. 
To **my best friends, Favour Ojo, Shira Haus, Kyrie Doniz, Hanna Nguyen, Laura Guo, Jasmin Noor Meyer Jaafari, Daniel Sanchez, Gabriel Schwartz, London Dejarnette, and Lilian Fogland**: -Literally how do I start this. I have never felt such a deep connection with a group of friends as I do with you guys. We have truly gone through all of the ups and downs, and I would not trade any of it for the world. Thank you for being my sounding board, my shoulder to cry on, my source of unimaginably funny jokes, my drinking buddies. Thank you for always letting me hog the TV and choose all of the movies we watch, even if you guys complain about my movie choices. Thank you for never failing to provide me with peace and comfort whenever I needed it most. I love you guys so much and cannot wait to keep experiencing life to the fullest with you all! <3 +Literally how do I start this. I have never felt such a deep connection with a group of friends as I do with you guys. We have truly gone through all of the ups and downs, and I would not trade any of it for the world. Thank you for being my sounding board, my shoulder to cry on, my source of unimaginably funny jokes, my drinking buddies, and my favorite comfort place. Thank you for always letting me hog the TV and choose all of the movies we watch, even if you guys complain about my movie choices. Thank you for never failing to provide me with peace and support whenever I needed it most. I love you guys so much and cannot wait to keep experiencing life to the fullest with you all! <3 To **Maricarmen Cervantes, my fellow stats survivor**: -Thank you for providing me with a good laugh as we struggled through Stats 1 and 2 together. Even though we may not have retained as much about stats as we should have, I still feel like I benefitted from those classes because it meant starting an amazing friendship with you. 
Thank you for coming with me to get 'study' margaritas and for always (somewhat) being down to watch whatever crazy movie I want to see. Your support throughout this project has been invaluable! +Thank you for providing me with a good laugh as we struggled through Stats 1 and 2 together. Even though we may not have retained as much about stats as we should have, I still feel like I benefitted from those classes because it meant starting an amazing friendship with you. Thank you for coming with me to get 'study' margaritas and for always being down to go on spontaneous beach trips and parties. Your support throughout this project has been invaluable! To **my advisors, Oliver Bonham Carter and Russell Ormiston**: -Thank you for all of your guidance and support throughout the course of this project. I know I wasn't always the most coherent thoughts whenever I would come to you guys for help, but I was always able to leave with my questions answered and reassurance about the progress of my senior thesis. I have never failed to laugh every time I went to your office hours, so thank you for being a comedic relief whenever I needed it. +Thank you for all of your guidance and support throughout the course of this project. I know I wasn't always the most coherent with my thoughts and questions whenever I would come to you guys for help, but I was always able to leave with my questions answered and reassurance about the progress of my senior thesis. I have never failed to laugh every time I went to your office hours, so thank you for being a comedic relief whenever I needed it. + +To **bestie + other friends I probably missed (you know who you are)**: +Thank you for providing me with overwhelming support. I could not imagine doing my last year without you by my side. I cannot wait to see all of the great and amazing things you will accomplish over the years. This one's for you! 
+ +To **Mango Monsters, Dunkin's "the Charlie" Cold Brews, and Iced Lavender & Pistachio Matcha Lattes**: +Thank you for existing. Without you I definitely would not have had the motivation to do this research project. Here's to many more drinks! -To **my beloved cat, Nena**: -Meow meow meow meow. Meow meow, meow meow meow meow meow meow meow. Meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow, meow meow meow meow, meow meow meow meow meow, meow meow meow meow meow meow meow meow. Meow meow meow meow meow. Meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow meow. meow meow meow meow meow meow, meow meow meow meow meow meow, meow meow meow meow meow meow! +To **Laura Jordan, Lisa Gathright, and Jennifer White, my former high school teachers**: +Though this paper has absolutely nothing to do with anything you taught me in class, I still feel as if I would not have been able to do this without you. College was such a scary concept to me, as I had never fathomed how difficult being away from home would be. And yet, you all left me amazing advice, heartfelt notes, and touching words of encouragement. Without knowing that you all were rooting for me in my corner, I would not have been able to survive my undergraduate studies, and especially not this senior comprehensive project. Thank you for believing in me from the very beginning and for providing me with the building blocks I needed to excel. # Introduction @@ -29,12 +38,12 @@ This chapter aims to describe the motivation, the current state of research, the ## Motivation -Over the past decade, there has been significant growth in the number of movies being produced, specifically with the rise of movie production within streaming platforms. 
With this leap in movie production also comes a drastic change in the overall determinants of movie success, as the preferences for movies are ever changing to fit the demand of consumers. Movie success for the purposes of this research is the likelihood that a movie will generate revenue, given certain factors of the movie as well as the production budget of the movie. Through the preliminary analysis of recent literature about determinants of movie production, it can be inferred that past research has not focused on how the overall features of movies that can influence movie success can be used towards a movie recommendation system. Most of the analysis focuses on predetermined factors of movie success, such as whether an actors/actresses star power can influence the profitability of a movie. In order to provide an in-depth research of what exactly will determine the success of a movie for the purposed of the creation of a movie recommendation system, this research performs the following steps to determine the factors of movie success: take all current and known features of movie success into account, determine which features are positively correlated to movie success through a preliminary logistic regression, and then use the determined features to perform machine learning using different algorithms. Once the machine learning model is properly trained with the given data, the end product is an interactive Streamlit-hosted application where a user is able to be given movie recommendations based on the correlation between the movie that the user chose. +Over the past decade, there has been significant growth in the number of movies being produced, specifically with the rise of movie production within streaming platforms. With this leap in movie production also comes a drastic change in the overall determinants of movie success, as the preferences for movies are ever changing to fit the demand of consumers. 
Movie success for the purposes of this research is the likelihood that a movie will generate revenue, given certain factors of the movie as well as the production budget of the movie. Through the preliminary analysis of recent literature about determinants of movie production, it can be inferred that past research has mainly focused on which specific features of movies can influence movie success. Most of the analysis focuses on predetermined factors of movie success, such as whether an actors/actresses star power can influence the profitability of a movie. However, none of the previous research uses their results to create a movie recommendation system for users. In order to provide an in-depth research of what exactly will determine the success of a movie, this research performs the following steps to determine the factors of movie success: take all current and known features of movie success into account, determine which features are positively correlated to movie success through a preliminary logistic regression, and then use the determined features to perform machine learning using different algorithms. Once the machine learning model is properly trained with the given data, the end product is an interactive Streamlit-hosted application where a user is able to be given movie recommendations based on the correlation between the movie that the user chose. -As the act of watching movies has been converted into a more lax and cost-effective activity thanks to the rise of streaming platforms, then it can be inferred that movies are an integral part of the way that people spend their leisure time. Therefore, continuing on from past research is necessary in order to propel the movie industry to be more in line with the rise in interest for personalized recommendation systems. The overall motivation for this area of research lies within the lack of reputable studies on how movie recommendation systems can influence the overall success of a movie. 
In order to provide insight into how the research gap in movie recommendation systems influences movie success, this paper does the following: +As the act of watching movies has been converted into a more lax and cost-effective activity thanks to the rise of streaming platforms, then it can be inferred that movies are an integral part of the way that people spend their leisure time. Therefore, continuing on from past research is necessary in order to propel the movie industry to be more in line with the rise in interest for personalized recommendation systems. In order to extend previous research done over the determinants of movie success through the creation of a movie recommendation system, this paper does the following: -1. Use a machine learning model to predict the overall success of a movie and -2. Use the given results to provide users with a list of movie recommendations based on the preference a user has on the given movie factors +1. Use a machine learning model to predict the overall success of a movie given statistically significant movie variables and +2. Use the given results to create a movie recommendation application that would provide users with a list of movie recommendations based on different determinants of movie success ## Current State of the Art @@ -42,9 +51,9 @@ This section of the research paper is an in-depth analysis of the key points of ### Past Areas of Research and Knowledge Gap -Notable areas of research that the movie industry focuses on is the prediction of movie success. For instance, experiments conducted by other data scientists featured a tool that would be able to predict movie success, with [@movie_success_1]'s tool ultimately being used to "predict the gross box office revenue to the nearest ten’s of million" and to "predict if the movie would make money, not by a specific amount, just if the budget was smaller than the revenue from ticket sales". 
[@movie_success_1]’s tool is then used to evaluate the specific percentage of the accuracy of their machine learning model when compared with their data on successful movies. When comparing the results of different experiments, the accuracy of the machine learning model hovered at around 60-64.7% range for accuracy of their models, and oftentimes are building off of previous research's machine learning models. As most of the previous experiments conducted in this area are built off of previous research and contain only data-fueled results (as in there is nothing interactive for users to do with the results), then it is essential for further research to be done in order to further understand what affects the overall success of a movie. +Notable areas of research that the movie industry focuses on is the prediction of movie success. For instance, experiments conducted by other data scientists featured a tool that would be able to predict movie success, with [@movie_success_1]'s tool ultimately being used to "predict the gross box office revenue to the nearest ten’s of million" and to "predict if the movie would make money, not by a specific amount, just if the budget was smaller than the revenue from ticket sales". [@movie_success_1]’s tool is then used to evaluate the specific accuracy percentage of their machine learning model when compared with their data on successful movies. When comparing the results of different experiments, the accuracy of the machine learning model hovered at around the 60-64.7% accuracy range of their models, and oftentimes are building off of previous research's machine learning models. 
As most of the previous experiments conducted in this area are built off of previous research and contain only data-fueled results (as in there is nothing interactive for users to do with the results), then it is essential for the research in this paper to be done in order to further understand what affects the overall success of a movie and extend the results found in [@movie_success_1]. -With this, the proposed area of research focuses on essentially the same methodology, where the machine learning model is used as a predicator of movie success. This project will extend [@movie_succes_1]'s findings by analyzing a bigger sample of movie data (around 8,000+) and by aiming for a higher accuracy percentage of the machine learning model. As previously stated, the knowledge gap for this area of research is by extending previous research to include an interactive application, where users can utilize the pre-computed results from this research to display the list of recommended movies. This widens the chosen audience for the results of this paper to include all public users, movie industry personnel, and other data researchers. This ensures that the increasing demand for personalized movie recommendations is taken into account within this area of research. +With this, the proposed area of research focuses on essentially the same methodology, where the machine learning model is used as a predictor of movie success. This project will extend [@movie_success_1]'s findings by analyzing a bigger sample of movie data (around 8,800), by aiming for a higher accuracy percentage of the machine learning model, and by extending the results of the model into the creation of a movie recommendation system. As previously stated, the knowledge gap for this area of research is addressed by extending previous research to include an interactive application, where users can utilize the pre-computed results from this research to display the list of recommended movies. 
This widens the chosen audience for the results of this paper to include all public users, movie industry personnel, and other data researchers. This ensures that the increasing demand for personalized movie recommendations is taken into account within this area of research. ### Proposed Solution to Knowledge Gap @@ -218,13 +227,13 @@ The second section of the Streamlit API contains a web page of the different sec Some of the challenges associated with this area of research deals with the amount of data that is being analyzed, as well as the overall accuracy of the machine learning model being used. Given the accuracy of the machine learning models in previous research, then the model being used for this research introduces some variability to the results. Additionally, the number of data points being analyzed can also influence the variability of the results, as this research runs the risk of giving results on antiquated and/or biased data. For example, a movie that is proven successful in 2008 may not contain all of the features of a successful movie in 2023 given the changes in consumerism in the movie industry. As a way to combat this loss of usefulness of the model/data, this research is using continuously updated data and data collection. For example, the data being used from IMDB is updated daily and the data from Kaggle is updated monthly. However, this runs the risk of the model becoming too slow to function, as it has to process and go through more movies in order to get it's results. Therefore, the datasets are stored in Streamlit's cache decorator instead of the computer's memory, so that the cached results and pathways can be reset with each refresh of the dataset. This way, the model stays as relevant as possible and is not too slow once more users are able to use the API. -Another important challenge to note is with the weakness in using statistical analysis for the data of this project. 
As with any statistical analysis, there is always the liklihood that the analytical results of a sample do not align with the actual results for the population data. How this project overcomes this weakness is through the analysis of the p-value of all of the independent variables. As the p-value is a predicator for whether the given variable accurately explains the sample model, then it can also be used as an indicator of whether the null hypothesis would be retained if given another sample. With low p-values, then the model is able to be retained for accuracy. +Another important challenge to note is with the weakness in using statistical analysis for the data of this project. As with any statistical analysis, there is always the likelihood that the analytical results of a sample do not align with the actual results for the population data. How this project overcomes this weakness is by taking into account the analysis of the p-value for all of the independent variables being examined. As the p-value is a predictor for whether the given variable accurately explains the sample model, then it can also be used as an indicator of whether the null hypothesis would be retained if given another sample. By only keeping the independent variables that have low sample p-values, the analytical model would provide a more accurate explanation of what is occurring with the population data for movies. ## Goals of the Project -As most of the project is dedicated towards predicting movie success and providing a more unique movie experience, then the main goal is to create a simplified application that users can navigate to for either result that they desire. Therefore, the creation of an application was required, which is where Streamlit comes into play. Streamlit is an open-source Python library where data scientists can create custom web apps for machine learning and data science. 
Using Streamlit to build the interface for the users streamlines the process of deploying the application, which allows for a faster runtime and less bugs during future uses of the application. +As most of the project is dedicated towards predicting movie success and providing a more unique movie experience, then the main goal is to create a simplified application that users can navigate to for either result that they desire, whether it is the analytical results or the movie recommendation system dashboard. Therefore, the creation of an application was required, which is where Streamlit comes into play. Streamlit is an open-source Python library where data scientists can create custom web apps for machine learning and data science. Using Streamlit to build the interface for the users streamlines the process of deploying the application, which allows for a faster runtime and less bugs during future uses of the application. -Additionally, this research aims to improve the accuracy rate of the machine learning model beyond previous research, where research has hovered at around 64.7% for the accuracy of the model. As most of the other research has analyzed a small sample of movie data, this project aims for a better accuracy rate by continuously evaluating a bigger sample size. The chosen sample size is set at 20,000-30,000 movies, which allows for a more accurate model prediction, as previous research had hovered at around 100-200 movies. Since previous research used dataset from certain time periods and not continuously updated datasets (such as the IMDB, Netflix, Hulu, Disney+, and Amazon Prime datasets, which are updated either daily or monthly), then their accuracy rate would only refer to the accuracy of their model for *the chosen time period*. This limits the scope of their results to a certain timeframe of the movie industry, instead of allowing for variation of the features that make a movie successful. 
+Additionally, this research aims to improve the accuracy rate of the machine learning model beyond previous research, where research has hovered at around 64.7% for the accuracy of the model. As most of the other research has analyzed a small sample of movie data, this project aims for a better accuracy rate by continuously evaluating a bigger sample size. The chosen sample size is set at 8,800 movies, which allows for a more accurate model prediction, as previous research had hovered at around 100-200 movies. Since previous research used datasets from certain time periods and not continuously updated datasets (such as the IMDB, Netflix, Hulu, Disney+, and Amazon Prime datasets, which are updated either daily or monthly), then their accuracy rate would only refer to the accuracy of their model for *the chosen time period*. This limits the scope of their results to a certain timeframe of the movie industry, instead of allowing for variation of the features that make a movie successful. This project combats this by using continuously updated movie data throughout the duration of the project. ## Ethical Implications @@ -315,7 +324,7 @@ The Numbers is a free website that offers its resources to industry professional As most of the information in this dataset is numerical, then it was decided that obtaining more textual data was necessary to improve the predictive nature of the proposed model. Therefore, Kaggle was used to obtain data on the following streaming services: Netflix, Hulu, Disney +, and Amazon Prime. Kaggle is a free website that offers over 50,000 public datasets and 400,000 public notebooks that are used to perform data analysis tasks and data science work [@kaggle]. The specific datasets that were used contained information directly from Netflix, Hulu, Disney +, and Amazon Prime through webscraping done by other Kaggle users. 
This involves automating a program to visit a website, locating information through the website's HTML tags, and saving the data in a dataset. The chosen Kaggle dataset for Netflix contained all of the movies and TV shows (over 8,000) that were available on Netflix as of October 18th, 2021, with new updates to the data occuring Quarterly. The chosen Kaggle dataset for Hulu contains all of the movies and TV shows available to users, with the data also being updated Quarterly. The chosen Kaggle dataset for Amazon Prime contains information on close to 10000 movies or TV shows available on their platform, with the data being updated Monthly. The chosen Kaggle dataset for Disney+ contains information on 1300 movies or TV shows, with the update to the data occuring quarterly. All of these datasets contain information on the following features of a movie: show ID, type, title, director, cast, country, date added, release year, rating, and duration, listed in (genre), and description. -In order to display relevant information about movies for the predictive recommendation application, then the OMDB API was used. OMDB API is a free web service that obtains movie information to be used for data analysis and visualization. All content and images saved on the API are contributed and mainted by the users [@api]. This API provides information on the following features of a movie: title, year, rated, release date, runtime, genre, director, writer, actors, plot, language, country, awards, poster (image), ratings, metascore, IMDB rating, IMDB ID, type, DVD sales, Box Office earnings, production, and website. +In order to display relevant information about movies for the predictive recommendation application, the OMDB API was used. OMDB API is a free web service that obtains movie information to be used for data analysis and visualization. All content and images saved on the API are contributed and mainted by the users [@api]. 
This API provides information on the following features of a movie: title, year, rated, release date, runtime, genre, director, writer, actors, plot, language, country, awards, poster (image), ratings, metascore, IMDB rating, IMDB ID, type, DVD sales, Box Office earnings, production, and website. As part of the training of the machine learning model, new data has to be introduced to the model for predictive analysis. IMDB is considered the world's most popular and authoritative source for movies and TV shows, therefore their free public datasets were chosen for implementation [@imdb]. There are a variety of different datasets that are available from IMDB, this paper decided to focus on one particular dataset that contained preliminary information about movies, which is the Title Basics dataset. This dataset contains information on the following features of a movie: title type, primary title, original title, isAdult, start year, end year, runtime (in minutes), and genre. Since the OMDB API only needs the movie title to perform it's searching functions, then the primaryTitle column is the only variable used in this dataset for analysis. @@ -442,11 +451,7 @@ Which can then be condensed into: **p-value = probability(Z score < sample Z score) * (number of tails in hypothesis testing)** -Once the sample Z score is calculated, then the probability of the population Z score being less than the sample Z score can be determined through the use of the Z score table. This probability score becomes the P-value. The interpretation of this p-value is as follows: - -If the dependent variable 'x' has no effect on the success of a movie, then "(p-value * 100)"% of studies will obtain the effect described in the null hypothesis in their sample because of random sample error. Essentially, the p-value will determine the liklihood that the null hypothesis was incorrectly rejected as an explanation for the population data. 
Since this determines the statistical significance of the regression, then p-values are considered significant to the distribution of the model if it is .05 or lower. Anything higher than .05 signifies that the effect of X on the sample does not provide enough significance to the effect of X on the population. - -Using this explanation of the significance of p-values, the following shows the p-values of the different columns in the dataset as they were regressed on the column 'movie success' for statistical significance: +Once the sample Z score is calculated, the probability of the population Z score being less than the sample Z score can be determined through the use of the Z score table. This probability score becomes the p-value. Using this explanation of what a p-value is, the following shows the sample p-values of the different columns in the dataset as they were regressed on the column 'movie success' for statistical significance: Table: Results of Stepwise Linear Regression on 'movie_success' @@ -473,23 +478,15 @@ Table: Results of Stepwise Linear Regression on 'movie_success' | genre_Western | -0.2397 | 0.109 | 0.028 | -Through running a simple regression, the following features were found to be statistically significant, as they had a p-value that was less than .05: production year, production budget, domestic box office, international box office, rating, sequel, running time, genre_Adventure, genre_Drama, genre_Western. - -Below are examples of the logistic regressions ran for some of the features being analyzed against movie success. 
-
-![Logistic Regression of Production Year on Movie Success](images/prod_year_analysis.png){width="500" height="400" style="display: block; margin: 0 auto" }
-
-![Logistic Regression of Production Budget on Movie Success](images/prod_budg_analysis.png){width="500" height="400" style="display: block; margin: 0 auto"}
-
-
-![Logistic Regression of Adventure Genre on Production Budget](images/dom_office_analysis.png){width="500" height="400" style="display: block; margin: 0 auto"}
+Assuming that the alternative hypothesis is that movie_success will be a 1 given the different independent variables and that the null hypothesis is that movie_success will be a 0 given the different independent variables, the interpretation of this p-value using the column 'genre_Romantic_Comedy' is as follows:
 
+If the independent variable 'genre_Romantic_Comedy' has no effect on the success of a movie (movie_success = 0), then 2.8% of studies would still obtain an effect at least as large as the one observed in the sample because of random sampling error. Essentially, if the null hypothesis were true, a result this extreme would arise by chance only 2.8% of the time. Since this determines the statistical significance of the regression, a p-value is considered significant if it is .05 or lower. Anything higher than .05 signifies that the effect of X observed in the sample does not provide enough evidence of an effect of X in the population. Since the p-value is lower than .05, the null hypothesis can be rejected at the 5% significance level, supporting the conclusion that the genre 'Romantic Comedy' affects movie success (movie_success = 1) in the population data as well as the sample data.
 
-Since the column of movie success was created using the domestic box office and international box office figures, then these variables were ultimately dropped from the rest of the processing. Given the results from previous research, the significant variables identified with the logistic regression fit with the results from previous research. Essentially, the results are sensible with the subject matter being studied.
+Through running a simple regression, the following features were found to be statistically significant, as they had a p-value that was less than .05: production year, production budget, domestic box office, international box office, rating, sequel, running time, genre_Adventure, genre_Drama, genre_Western. Since the column of movie success was created using the domestic box office and international box office figures, these variables were ultimately dropped from the rest of the processing.
 
 As the regressions ran were multiple single regressions with one independent 'x' variable being regressed against the 'y' variable 'movie_success', then there is the probability of the results containing omitted variable bias. As the independent variables were not all analyzed alongside each other when being regressed against 'movie_success', then there is the liklihood that the coefficients do not take into account the influence that the other independent variables have on the 'movie_success' variable. This could lead to a bias in the estimates of the coefficients of the rest of the independent variables, as well as the hypothesis tests about the coefficients of the independent variables, making the predicted values of p not as reliable with the model. This creates concerns about the validity of the model, as it mainly relies on the statistical significance of the stepwise linear regression results.
 
-In an effort to combat the omitted variable bias, a multiple logistic regression was ran alongside the aforementioned stepwise linear regressions to confirm the statistically significant variables. The results of running the multiple logistic is shown below:
+In an effort to combat the omitted variable bias, a multiple logistic regression was run alongside the aforementioned stepwise linear regressions to confirm the statistically significant variables. The results of running the multiple logistic regression are shown below:
 
 Table: Results of Multiple Logistic Regression on 'movie_success'
 
@@ -508,7 +505,7 @@ Table: Results of Multiple Logistic Regression on 'movie_success'
 | genre_Drama | 0.6735 | 0.5680 | 0.23570 |
 | genre_Adventure | 1.036 | 0.6151 | 0.09201 |
 | genre_Black_Comedy | 1.368 | 0.7608 | 0.07222 |
-| genre_Concert/Performance | 1.599 | 1.257 | 0.20313 |
+| genre_Concert/Per... | 1.599 | 1.257 | 0.20313 |
 | genre_Documentary | 0.1045 | 1.172 | 0.92896 |
 | genre_Horror | 1.389 | 0.6212 | 0.02532 |
 | genre_Musical | 0.5013 | 0.7771 | 0.51888 |
@@ -516,6 +513,7 @@ Table: Results of Multiple Logistic Regression on 'movie_success'
 | genre_Thriller/Suspense | 1.317 | 0.5802 | 0.02319 |
 | genre_Western | NA | NA | NA |
 
+
 For the results of the multiple logistic regression, the following columns were deemed statistically significant as their P-value is .05 or less:
 
 1. production_budget
@@ -529,7 +527,16 @@ For the results of the multiple logistic regression, the following columns were
 9. genre_Romantic Comedy
 10. genre_Thriller/Suspense
 
-When comparing the results of the stepwise linear regressions and the multiple logistic regressions, there are some columns that were deemed statistically significant by both models, whereas other columns were deemed statistically significant by one model but not the other. This discrepancy is due to: the difference in the number of independent variables being regressed on the y variable 'movie_success' and the difference in regression types (linear and logistic). As the variable being regressed is binary and multiple independent variables decrease the amount of omitted variable bias, then the results from the multiple logisitic regressions are used as the final regression results for the purposes of this paper. However, given the limited number of observations in the data, it was decided that all of the genres should remain statistically significant towards the creation of the machine learning model.
+Below are examples of the logistic regressions run for some of the features being analyzed against movie success.
+
+![Logistic Regression of Production Year on Movie Success](images/prod_year_analysis.png){width="500" height="400" style="display: block; margin: 0 auto" }
+
+![Logistic Regression of Production Budget on Movie Success](images/prod_budg_analysis.png){width="500" height="400" style="display: block; margin: 0 auto"}
+
+
+![Logistic Regression of Adventure Genre on Production Budget](images/dom_office_analysis.png){width="500" height="400" style="display: block; margin: 0 auto"}
+
+The significant variables identified with the logistic regression fit with the results from previous research. Essentially, the results are sensible for the subject matter being studied. When comparing the results of the stepwise linear regressions and the multiple logistic regressions, there are some columns that were deemed statistically significant by both models, whereas other columns were deemed statistically significant by one model but not the other. This discrepancy is due to the difference in the number of independent variables being regressed on the y variable 'movie_success' and the difference in regression types (linear and logistic). As the variable being regressed is binary and including multiple independent variables decreases the amount of omitted variable bias, the results from the multiple logistic regressions are used as the final regression results for the purposes of this paper. However, given the limited number of observations in the data, it was decided that all of the genres should be retained as features for the creation of the machine learning model.
 
 After doing the preliminary data analysis to find the statistically significant features, it was then time to do the machine learning analysis of the data. Machine learning is the process of 'making systems that learn and improve by themselves, by being specifically programmed' [@banoula]. For machine learning all of the statistically significant features were put as the x value, with the y-value being the movie success column. The data was randomized to make sure that it is evenly distributed and that the ordering does not affect the learning process. From there, the data is split into training and testing data. The training data (which is 30% of the total data) is what the machine learning model learns from, where it takes in all of the features included in x and then attempt to guess the possibility of y based on these features. The testing data (which is 70% of the total data) is used to check the accuracy of the model after training.
 
@@ -954,7 +961,7 @@ This section describes the experiments that are performed to test the validity o
 
 ## Purpose of Project
 
-In today's hectic society, recommendation systems are becoming increasingly vital. Individuals are constantly pressed for time as a result of their increased workload, leaving little time for determining which movies would fit their personal preferences. As a result, the recommendation systems are crucial since they enable people to make the best decisions without using up their cognitive resources. A recommendation system basically seeks out content that a particular person might find interesting. Additionally, it takes into account a variety of variables to develop tailored lists of fascinating and helpful information that are unique to each user. Artificial intelligence-based algorithms used in recommendation systems scan through all available options to compile a unique list of options that are interesting and pertinent to a particular user. These outcomes are usually determined by what other people with comparable characteristics and demographics are watching.
+In today's hectic society, recommendation systems are becoming increasingly vital. Individuals are constantly pressed for time as a result of their increased workload, leaving little time for determining which movies would fit their personal preferences. As a result, recommendation systems are crucial since they enable people to make the best decisions without using up their cognitive resources. A recommendation system basically seeks out content that a particular person might find interesting. Additionally, it takes into account a variety of variables to develop tailored lists of fascinating and helpful information that are unique to each movie. Artificial intelligence-based algorithms used in recommendation systems scan through all available options to compile a unique list of options that are interesting and pertinent to a particular user. These outcomes are usually determined by what other people with comparable characteristics and demographics are watching.
 
 For the purposes of this project, the knowledge gap in current recommendation systems are that they only analyze either the description of a movie or user ratings for a movie when making their recommendations. This does not take into account other features that may be equally as important to other users. Given the survey of recent literature, the actors/actresses were hypothesized to be one of the main determinants of movie success, as well as the production budget. These features, along with other ones, were found to have a positive correlation with the overall success of a movie. Once these determinants were found (through data analysis of the variable movie success), they are used in a predictive machine learning model to determine which movies are closely related to each other by the following features: Title, Plot, Genre, Actors, Director, Writer, and Rating. All of these features were grouped together into a singular column called 'tags' for the data processing, lemmatization, stemming, and vectorization. Therefore, this paper solves for the knowledge gap by allowing for more features to be used in the recommendation system.
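
As an illustration only, the tag-based matching described above can be sketched in a few lines of Python. The sketch assumes a bag-of-words count vector and cosine similarity, neither of which is stated in the paragraph above, and it omits lemmatization and stemming. The names `build_tags`, `vectorize`, `cosine_similarity`, `recommend`, and the toy `MOVIES` records are hypothetical and were invented for this example.

```python
import math
import re
from collections import Counter

# Fields merged into the single 'tags' string (taken from the feature list above).
TAG_FIELDS = ["title", "plot", "genre", "actors", "director", "writer", "rating"]


def build_tags(movie):
    """Join the descriptive fields of one movie into a lowercase 'tags' string."""
    return " ".join(str(movie[field]) for field in TAG_FIELDS).lower()


def vectorize(text):
    """Bag-of-words term counts (assumption: no stemming or lemmatization)."""
    return Counter(re.findall(r"[a-z0-9]+", text))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(movies, title, top_n=2):
    """Return the titles most similar to `title`, ranked by tag similarity."""
    vectors = {m["title"]: vectorize(build_tags(m)) for m in movies}
    target = vectors[title]
    scores = [
        (cosine_similarity(target, vec), other)
        for other, vec in vectors.items()
        if other != title
    ]
    scores.sort(reverse=True)
    return [other for _, other in scores[:top_n]]


# Hypothetical toy records, invented purely for illustration.
MOVIES = [
    {"title": "Space Raiders", "plot": "a crew fights aliens in space",
     "genre": "Adventure", "actors": "Ann Lee", "director": "Bo Cruz",
     "writer": "Cy Dean", "rating": "PG-13"},
    {"title": "Star Voyage", "plot": "a crew explores space and meets aliens",
     "genre": "Adventure", "actors": "Dee Fox", "director": "Bo Cruz",
     "writer": "Eli Gray", "rating": "PG-13"},
    {"title": "Quiet Town", "plot": "a family drama in a small town",
     "genre": "Drama", "actors": "Flo Hart", "director": "Gus Ives",
     "writer": "Hal Jones", "rating": "R"},
]

if __name__ == "__main__":
    print(recommend(MOVIES, "Space Raiders", top_n=1))
```

With these toy records, the other space adventure is ranked first for "Space Raiders" because it shares the most tag tokens (plot words, genre, director, and rating) with the query. A different vectorizer or similarity measure could be swapped in without changing the overall structure.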