Final Report - Peer Review (kc594) #94

Open
kc594 opened this issue Dec 10, 2017 · 0 comments

This project is about predicting the box office performance of movies on their opening weekend. The group collects data from four different movie sites and aims to help marketers decide how many theaters to release a movie in.

Some highlights of the project:

  1. Data examples and features are pulled from many sources and aggregated, and the process is well outlined and explained as it progresses. The reasoning is included, as are definitions of the values that make up each feature.
  2. The process of cleaning the data and extracting examples and features from the websites is very methodical, and each decision to remove information is well explained to the reader.
  3. The visualizations for each model make it easy to follow how each model performs and to compare their accuracy.

Some room for future improvements:

  1. The number of sources the team pulls data from is impressive, and I understand that data scraping is time consuming, but I am not convinced that the number of samples used to train the model is sufficient to make accurate predictions going forward. Using more than one year of movies would also likely capture trends better.
  2. You mention using the column mean to fill missing values, but given that 84 and 41 values are missing out of only 165, I wonder whether the mean is really the best choice for, say, all 84 missing values. Did you try other methods such as matrix completion, or removing the column altogether to see how it affects the accuracy of the predictions? You later say director_gross does not seem to be highly correlated with open_gross, and I wonder whether this is due to the number of values imputed with the mean of the other 81/165 values in the column.
  3. The page of visualizations is a little hard to follow. Including only the graphs most indicative of the important features might have made it easier to read. As a reader, I am not sure what to focus on on this page.
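
To illustrate the concern in point 2, here is a minimal sketch on synthetic data (not the group's dataset). It assumes a single feature that is genuinely correlated with the target, hides 84 of 165 values at random, fills them with the column mean, and compares the resulting correlation with the one computed on the observed rows only. The function names `pearson` and `simulate`, the Gaussian data, and the random missingness pattern are all my own assumptions for illustration.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return cov / (ss_x * ss_y) ** 0.5


def simulate(n=165, n_missing=84, seed=0):
    """Compare correlations before and after mean imputation (synthetic data)."""
    rng = random.Random(seed)
    # Hypothetical feature and a target that depends on it plus noise.
    feature = [rng.gauss(0, 1) for _ in range(n)]
    target = [f + rng.gauss(0, 1) for f in feature]

    # Hide n_missing feature values completely at random.
    missing = set(rng.sample(range(n), n_missing))
    observed = [i for i in range(n) if i not in missing]

    # Fill the hidden values with the mean of the observed values.
    col_mean = mean(feature[i] for i in observed)
    imputed = [col_mean if i in missing else feature[i] for i in range(n)]

    r_full = pearson(feature, target)        # no missing data at all
    r_imputed = pearson(imputed, target)     # after mean imputation
    r_observed = pearson(                    # observed rows only
        [feature[i] for i in observed],
        [target[i] for i in observed],
    )
    return r_full, r_imputed, r_observed


if __name__ == "__main__":
    r_full, r_imputed, r_observed = simulate()
    print(f"full data:        {r_full:.3f}")
    print(f"mean-imputed:     {r_imputed:.3f}")
    print(f"observed rows:    {r_observed:.3f}")
```

In this setup the mean-imputed correlation can never exceed the observed-rows correlation in magnitude, because the imputed rows add spread to the target without adding any spread to the feature. That is why a weak director_gross correlation may partly reflect the imputation rather than the feature itself, and why comparing against the observed rows only (or dropping the column) would be a useful check.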

Overall, I can tell a lot of work has been put into this project, and it was definitely one of the more interesting ones to read. Great work!
