To-Do:
* show source in notebook code
Questions:
* Define the concepts, purpose
- What makes a good movie. Get indicators for a quality movie as indicated by ratings, budget, gross, ...
- See what kind of information is correlated (budget-votes, ...)
- Get a generator for ratings/gross if possible
* Define workflows (import, process, visualize, ...)
- Import data, combine data
- Filter Data
- analyze (statistical, ...)
- for statistical tools ask
- google
- visualize
- document
* Code I need to write, libraries I can use.
- IMDb API libraries are mostly out of date and no longer usable -> write my own
- maybe requests to get the data.
- matplotlib to visualize
* How to present?
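A minimal sketch of the import / filter / analyze workflow listed above, using only the standard library. The column names (title, rating, votes), the sample rows and the min_votes threshold are made up for illustration; real IMDb data will look different.

```python
import csv
import io
from statistics import mean

# Hypothetical sample data; real input would come from a file or an HTTP response.
SAMPLE_CSV = """title,rating,votes
Movie A,8.1,120000
Movie B,6.4,300
Movie C,7.2,45000
"""


def load_movies(text):
    """Import: parse CSV text into a list of dicts with typed fields."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "title": row["title"],
            "rating": float(row["rating"]),
            "votes": int(row["votes"]),
        })
    return rows


def filter_by_votes(movies, min_votes):
    """Filter: keep only movies with at least min_votes votes."""
    return [m for m in movies if m["votes"] >= min_votes]


def mean_rating(movies):
    """Analyze: average rating of the given movies."""
    return mean(m["rating"] for m in movies)


movies = load_movies(SAMPLE_CSV)
popular = filter_by_votes(movies, min_votes=1000)
print(len(popular), round(mean_rating(popular), 2))
```

Keeping each workflow step in its own function makes it easy to swap the CSV import for a requests-based download later without touching the filter or analysis code.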
Steps:
1. Clean Data
2. Combine data in meaningful way
Limit to movies with x reviews?
3a. Look for correlations. e.g. rating with others, budget, ...
3b. Produce some visualizations
- most rated movies
- rating distribution
- genre vs rating
- timeline of certain movies (current top100?) rating, ...
- duration vs rating
- best directors? director and rating?
- best actors? actor and rating?
- release year vs rating
- budget vs rating
- budget vs gross
- number of reviews/votes vs rating
- number and budget of movies per year
- number of reviews correlation to number of votes
- gross vs reviews
- correlation of keywords to rating/gross/number of votes -> find keywords that every good movie has
- draw links between movies, link database
4. Predict rating of new movies
- with keywords, genre, budget, technicals, director, language, duration, location
- Personalized score. Predict how much I will like a movie based on movies I rated/liked before.
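As a possible starting point for step 4, here is a sketch of the simplest baseline: a least-squares line that predicts rating from one numeric feature. The choice of duration as the feature and all numbers are invented for illustration; a real model would use several of the features listed above and be checked against held-out movies.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predicted rating for a feature value x."""
    return slope * x + intercept


# Made-up training data: duration in minutes vs. rating.
durations = [90, 100, 120, 150]
ratings = [6.0, 6.5, 7.5, 9.0]

slope, intercept = fit_line(durations, ratings)
print(predict(110, slope, intercept))
```

A baseline like this is mainly useful as a yardstick: any fancier predictor should beat it before it is worth presenting.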
Questions
What kind of movies have the highest gross-to-budget ratio?
What movies make the most money?
What makes a good movie?
Can we predict the features/themes of a good selling movie (like keywords)?
Can we predict the rating/revenue for a given movie genre, actors, director, etc?
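The gross-to-budget question could be approached roughly like this. The records, field names (genre, budget, gross) and the rule of skipping movies without a positive budget are assumptions for the sketch, not properties of the real dataset.

```python
from collections import defaultdict

# Made-up records for illustration only.
movies = [
    {"genre": "Horror", "budget": 1_000_000, "gross": 10_000_000},
    {"genre": "Horror", "budget": 2_000_000, "gross": 6_000_000},
    {"genre": "Action", "budget": 100_000_000, "gross": 250_000_000},
    {"genre": "Drama", "budget": 0, "gross": 5_000_000},  # no usable budget
]


def ratio_by_genre(records):
    """Return {genre: total gross / total budget}.

    Records without a positive budget are skipped, since the ratio
    is undefined for them.
    """
    totals = defaultdict(lambda: [0, 0])  # genre -> [gross sum, budget sum]
    for rec in records:
        if rec["budget"] <= 0:
            continue
        totals[rec["genre"]][0] += rec["gross"]
        totals[rec["genre"]][1] += rec["budget"]
    return {genre: gross / budget for genre, (gross, budget) in totals.items()}


ratios = ratio_by_genre(movies)
# Sort genres from highest to lowest ratio.
ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)
for genre, ratio in ranked:
    print(genre, round(ratio, 2))
```

Summing gross and budget per genre before dividing weights big productions more heavily; averaging per-movie ratios instead would be an equally valid choice and answers a slightly different question.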
Possible further ideas:
* Heatmap, PCA, Correlation: https://www.kaggle.com/arthurtok/principal-component-analysis-with-kmeans-visuals
* https://www.r-bloggers.com/predicting-movie-ratings-with-imdb-data-and-r/
* https://rpubs.com/yash91sharma/dw_project_ys
* https://blog.nycdatascience.com/student-works/machine-learning/movie-rating-prediction/
* https://github.com/Poyuli/sentiment.analysis
Goals:
* Learn Python
* Learn about different libraries to work with datasets, visualization, ...
* Learn about statistical significance, correlation, variance in a dataset
* Introduction to Machine Learning and basic categorization and prediction
Libraries:
* urllib2 (deprecated) => requests
Further ideas:
* Use references.list to make a graph of connections. Evaluate connections, look for bias.
Regarding the presentation:
Background
Libraries used
Methods
What are the results
What did I learn
scatterplot = scatter(x, y)
fig = pyplot.figure(), multiple graphs in one figure
fig.add_subplot returns an Axes object
pcolor
hist
figsize as argument for fig
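The plotting notes above fit together roughly as follows. This is only a sketch: the numbers are invented, the Agg backend is chosen so it runs without a display, and pcolor is left out.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the script also runs without a display
import matplotlib.pyplot as plt

# Made-up data for illustration: budget in million USD and rating.
budgets = [10, 25, 40, 80, 150]
ratings = [6.1, 7.4, 5.9, 7.0, 6.6]

# One figure holding two graphs side by side; figsize is given in inches.
fig = plt.figure(figsize=(10, 4))

# add_subplot(rows, columns, index) returns an Axes object to draw on.
ax_scatter = fig.add_subplot(1, 2, 1)
ax_scatter.scatter(budgets, ratings)
ax_scatter.set_xlabel("budget (million USD)")
ax_scatter.set_ylabel("rating")

ax_hist = fig.add_subplot(1, 2, 2)
ax_hist.hist(ratings, bins=5)
ax_hist.set_xlabel("rating")
ax_hist.set_ylabel("number of movies")

fig.tight_layout()
# fig.savefig("overview.png") would write the figure to disk.
```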
after that quantify?
no correlation? What doesn't work? IMPORTANT!
pearsonr
spearmanr
ttest
single test
chi-square test
scipy optimize
curve_fit
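To make clear what the two correlation measures compute, here is a standard-library sketch of both. In practice scipy.stats.pearsonr and scipy.stats.spearmanr do the same job and also return a p-value, which this sketch does not. The sample numbers are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient: strength of the linear relationship."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up data: votes grow by factors of ten while ratings rise slowly.
votes = [100, 1000, 10000, 100000]
ratings = [5.0, 6.0, 6.5, 9.0]
print(pearson(votes, ratings), spearman(votes, ratings))
```

The two values differ on data like this because the relationship is monotonic but not linear, which is a good reason to look at both before concluding there is "no correlation".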
first clustering with scikit-learn
10min
Background
Problem formulation
Implementation
Demonstration
What I learned
More about the application than details of the implementation