Movie Reviews & Descriptive Statistics¶

The dataset was put together to help detect bias in movie review sites. Each site has two types of score: a user score, which aggregates user reviews, and a critic score, which aggregates professional critics' reviews of the movie.

The dataset contains information on most movies from 2014 and 2015 and was used to help the team at FiveThirtyEight explore Fandango's suspiciously high ratings.

The goal is to use descriptive statistics to identify suspiciously high ratings.

In [1]:

import pandas as pd
movies = pd.read_csv("fandango_score_comparison.csv")

In [2]:

movies.head()

Out[2]:

	FILM	RottenTomatoes	RottenTomatoes_User	Metacritic	Metacritic_User	IMDB	Fandango_Stars	Fandango_Ratingvalue	RT_norm	RT_user_norm	...	IMDB_norm	RT_norm_round	RT_user_norm_round	Metacritic_norm_round	Metacritic_user_norm_round	IMDB_norm_round	Metacritic_user_vote_count	IMDB_user_vote_count	Fandango_votes	Fandango_Difference
0	Avengers: Age of Ultron (2015)	74	86	66	7.1	7.8	5.0	4.5	3.70	4.3	...	3.90	3.5	4.5	3.5	3.5	4.0	1330	271107	14846	0.5
1	Cinderella (2015)	85	80	67	7.5	7.1	5.0	4.5	4.25	4.0	...	3.55	4.5	4.0	3.5	4.0	3.5	249	65709	12640	0.5
2	Ant-Man (2015)	80	90	64	8.1	7.8	5.0	4.5	4.00	4.5	...	3.90	4.0	4.5	3.0	4.0	4.0	627	103660	12055	0.5
3	Do You Believe? (2015)	18	84	22	4.7	5.4	5.0	4.5	0.90	4.2	...	2.70	1.0	4.0	1.0	2.5	2.5	31	3136	1793	0.5
4	Hot Tub Time Machine 2 (2015)	14	28	29	3.4	5.1	3.5	3.0	0.70	1.4	...	2.55	0.5	1.5	1.5	1.5	2.5	88	19560	1021	0.5

5 rows × 22 columns

Histograms¶

In [39]:

%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(movies['Fandango_Stars'])
plt.show()
plt.hist(movies['Metacritic_norm_round'])
plt.show()

Ratings in the 'Metacritic_norm_round' column are spread across the scale, whereas 'Fandango_Stars' ratings all fall between 3.0 and 5.0.

Mean, Median, And Standard Deviation¶

In [40]:

mean = movies[['Fandango_Stars','Metacritic_norm_round']].mean()
median = movies[['Fandango_Stars','Metacritic_norm_round']].median()
std = movies[['Fandango_Stars','Metacritic_norm_round']].std()
print(mean, median, std)

Fandango_Stars           4.089041
Metacritic_norm_round    2.972603
dtype: float64 Fandango_Stars           4.0
Metacritic_norm_round    3.0
dtype: float64 Fandango_Stars           0.540386
Metacritic_norm_round    0.990961
dtype: float64

Fandango vs Metacritic Methodology¶

Fandango appears to inflate ratings and isn't transparent about how it calculates and aggregates them. Metacritic publishes each individual critic rating and is transparent about how it aggregates them into a final rating.

Fandango vs Metacritic differences¶

The median Metacritic score (3.0) is slightly higher than the mean (2.97), which suggests that a few very low reviews "drag down" the mean. The median Fandango score (4.0) is lower than the mean (4.09), which suggests that a few very high ratings "drag up" the mean.
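The effect of a lopsided tail on the mean can be checked on a toy example. The sketch below uses made-up ratings (not values from the dataset) and Python's standard `statistics` module; it only illustrates the direction of the effect.

```python
from statistics import mean, median

# Made-up ratings on a 0-5 scale, for illustration only (not from the dataset).
symmetric = [2.0, 2.5, 3.0, 3.5, 4.0]
low_tail = [0.5, 2.5, 3.0, 3.5, 4.0]   # one very low rating
high_tail = [2.0, 2.5, 3.0, 3.5, 5.0]  # one very high rating

for name, ratings in [("symmetric", symmetric),
                      ("low_tail", low_tail),
                      ("high_tail", high_tail)]:
    # The median stays at the middle value; the mean follows the tail.
    print(name, "mean =", round(mean(ratings), 2), "median =", median(ratings))
```

In all three lists the median is the same middle value, while the mean moves toward whichever side has the extreme rating.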

Fandango ratings are clustered between 3 and 5, so their standard deviation (0.54) is smaller than that of the Metacritic ratings (0.99), which are spread over most of the 0 to 5 scale.

In general, Fandango ratings appear to be higher than Metacritic ratings.

Fandango's main business is selling movie tickets, so it has an incentive to bias its ratings to sell more tickets. That could also explain why it calculates its ratings in a hidden way.

Scatter Plots: detecting outlier ratings¶

In [41]:

plt.scatter(movies['Fandango_Stars'],movies['Metacritic_norm_round'])
plt.show()

In [42]:

import numpy as np

movies['fm_diff'] = movies['Fandango_Stars'] - movies['Metacritic_norm_round']
movies['fm_diff'] = np.absolute(movies['fm_diff'])

In [56]:

movies_sorted = movies.sort_values(by="fm_diff", ascending = False)
movies_sorted[['FILM','Fandango_Stars','Metacritic_norm_round']].head()

Out[56]:

	FILM	Fandango_Stars	Metacritic_norm_round
3	Do You Believe? (2015)	5.0	1.0
85	Little Boy (2015)	4.5	1.5
47	Annie (2014)	4.5	1.5
19	Pixels (2015)	4.5	1.5
134	The Longest Ride (2015)	4.5	1.5

We computed the rating differences between Fandango and Metacritic, took their absolute values, and sorted them in descending order to find the largest outliers.
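Taking the absolute value discards the direction of each gap. As a small sketch of the same ranking step, the snippet below uses plain Python (not pandas) on the five rows shown in the `movies.head()` output above, ranks them by absolute difference, and prints the signed difference so the direction stays visible. The tuple layout is an assumption made for this illustration.

```python
# (film, Fandango_Stars, Metacritic_norm_round), copied from the head() output.
rows = [
    ("Avengers: Age of Ultron (2015)", 5.0, 3.5),
    ("Cinderella (2015)", 5.0, 3.5),
    ("Ant-Man (2015)", 5.0, 3.0),
    ("Do You Believe? (2015)", 5.0, 1.0),
    ("Hot Tub Time Machine 2 (2015)", 3.5, 1.5),
]

# Rank by the size of the gap, largest first (ties keep their original order).
ranked = sorted(rows, key=lambda row: abs(row[1] - row[2]), reverse=True)

for film, fandango, metacritic in ranked:
    # A positive signed difference means Fandango rated the film higher.
    print(film, fandango - metacritic)
```

For these five rows every signed difference is positive, meaning Fandango rated each film higher than Metacritic did.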

Correlations between Fandango & Metacritic Ratings¶

In [47]:

# Import from the public scipy.stats namespace; scipy.stats.stats is deprecated.
from scipy.stats import pearsonr

r, p_value = pearsonr(movies['Fandango_Stars'], movies['Metacritic_norm_round'])
print(r)

0.178449190739

The low correlation (r ≈ 0.18) between Fandango and Metacritic scores indicates that Fandango scores aren't just uniformly inflated versions of Metacritic scores; they are fundamentally different. Fandango may be adjusting ratings according to criteria of its own.
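The reasoning here is that adding the same amount to every score leaves the Pearson correlation unchanged, so uniform inflation alone could not produce a low r. The sketch below writes the coefficient out from its definition (a hand-rolled helper for illustration, not SciPy's implementation) and applies it to made-up scores.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient, written out from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up critic scores on a 0-5 scale (illustrative, not from the dataset).
critic = [1.0, 2.0, 2.5, 3.0, 4.0]

# Case 1: every score inflated by the same amount.
uniformly_inflated = [c + 1.0 for c in critic]

# Case 2: scores bunched near the top regardless of the critic score.
bunched = [4.5, 4.0, 4.5, 4.0, 4.5]

r_uniform = pearson_r(critic, uniformly_inflated)
r_bunched = pearson_r(critic, bunched)
print(round(r_uniform, 3), round(r_bunched, 3))
```

The uniformly shifted scores stay perfectly correlated with the originals, while the bunched scores do not, which is the pattern the Fandango data resembles.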

Linear Regression based on Metacritic Score¶

In [48]:

from scipy.stats import linregress

slope, intercept, r_value, p_value, stderr_slope = linregress(movies['Metacritic_norm_round'], movies['Fandango_Stars'])

predicted_y_fandango = slope * 3 + intercept

print(predicted_y_fandango)

4.09170715282

According to the fitted line, a movie rated 3 on Metacritic would be predicted to get about 4.1 stars on Fandango.
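This prediction is close to the mean Fandango rating (4.09) because 3 is close to the mean Metacritic rating (2.97), and an ordinary least-squares line always passes through the point (mean x, mean y). A minimal sketch on made-up data, using a hand-written fit rather than `linregress`, illustrates that property; the names `fit_line`, `toy_x`, and `toy_y` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up ratings (illustrative only, not from the dataset).
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [3.5, 4.0, 4.0, 4.5, 4.5]

toy_slope, toy_intercept = fit_line(toy_x, toy_y)
toy_mean_x = sum(toy_x) / len(toy_x)
toy_mean_y = sum(toy_y) / len(toy_y)

# Predicting at the mean of x gives back the mean of y.
prediction_at_mean = toy_slope * toy_mean_x + toy_intercept
print(prediction_at_mean, toy_mean_y)
```

A shallow slope, as in the Fandango fit, also means predictions at other Metacritic scores stay close to that mean.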

Linear Regression and Scatter plot¶

To better visualize how the line relates to the existing data points, we draw it over the scatter plot.

In [54]:

predicted_y_1 = slope * 1.0 + intercept
predicted_y_5 = slope * 5.0 + intercept

plt.scatter(movies["Metacritic_norm_round"], movies["Fandango_Stars"])
plt.plot([1,5],[predicted_y_1,predicted_y_5])
plt.xlim(1,5)

plt.show()