Analyzing Movie Reviews
Posted on Sun 23 September 2018 in Data Analysis
Movie Reviews & Descriptive Statistics¶
The dataset was put together to help detect bias in movie review sites. Each of these sites has two types of score: user scores, which aggregate user reviews, and critic scores, which aggregate professional critics' reviews of the movie.
The dataset contains information on most movies from 2014 and 2015 and was used to help the team at FiveThirtyEight explore Fandango's suspiciously high ratings.
The goal is to identify suspiciously high ratings using descriptive statistics.
import pandas as pd
movies = pd.read_csv("fandango_score_comparison.csv")
movies.head()
Histograms¶
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist(movies['Fandango_Stars'])
plt.show()
plt.hist(movies['Metacritic_norm_round'])
plt.show()
In the 'Metacritic_norm_round' column, ratings are spread across the scale, whereas 'Fandango_Stars' ratings all fall between 3.0 and 5.0.
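The spread visible in the histograms can also be checked numerically. The sketch below uses a small made-up DataFrame (the values are illustrative and not taken from the dataset) that reuses the two column names, and reports the range of each column along with the frequency of each Fandango rating.

```python
import pandas as pd

# Hypothetical ratings for illustration only; not the real dataset.
toy = pd.DataFrame({
    "Fandango_Stars": [3.0, 3.5, 4.0, 4.0, 4.5, 5.0],
    "Metacritic_norm_round": [0.5, 1.5, 2.5, 3.0, 4.0, 4.5],
})

# Range of each column: a narrow range means the histogram is clustered.
ranges = toy.max() - toy.min()
print(ranges)

# Frequency of each distinct rating, ordered by rating value.
fandango_counts = toy["Fandango_Stars"].value_counts().sort_index()
print(fandango_counts)
```

The same two calls can be run on the `movies` DataFrame to confirm what the histograms show.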
Mean, Median, And Standard Deviation¶
mean = movies[['Fandango_Stars','Metacritic_norm_round']].mean()
median = movies[['Fandango_Stars','Metacritic_norm_round']].median()
std = movies[['Fandango_Stars','Metacritic_norm_round']].std()
print(mean, median, std)
Fandango vs Metacritic Methodology¶
Fandango appears to inflate ratings and isn't transparent about how it calculates and aggregates them. Metacritic publishes each individual critic rating and is transparent about how it aggregates them into a final rating.
Fandango vs Metacritic differences¶
The median Metacritic score is higher than the mean Metacritic score because a few very low ratings "drag down" the mean. The median Fandango score is lower than the mean Fandango score because a few very high ratings "drag up" the mean.
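To see how a handful of extreme values pulls the mean away from the median, here is a minimal sketch with two made-up lists of scores (hypothetical values, not drawn from the dataset). The median only depends on the middle value, so it barely reacts to the extremes, while the mean shifts toward them.

```python
import statistics

# Hypothetical scores: mostly around 3, plus two very low values.
low_tail = [0.5, 1.0, 3.0, 3.0, 3.5, 3.5, 4.0]
# Hypothetical scores: mostly around 4, plus two very high values.
high_tail = [3.5, 3.5, 4.0, 4.0, 4.0, 5.0, 5.0]

low_mean, low_median = statistics.mean(low_tail), statistics.median(low_tail)
high_mean, high_median = statistics.mean(high_tail), statistics.median(high_tail)

# Low outliers drag the mean below the median.
print(low_mean < low_median)
# High outliers drag the mean above the median.
print(high_mean > high_median)
```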
Fandango ratings appear clustered between 3 and 5, so their standard deviation is smaller than that of Metacritic ratings, which range from 0 to 5.
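The link between clustering and standard deviation can be illustrated with two made-up series (hypothetical values, not the real ratings). Note that pandas' `std()` computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical ratings confined to the 3-5 band.
clustered = pd.Series([3.0, 3.5, 4.0, 4.5, 5.0])
# Hypothetical ratings covering the full 0-5 scale.
spread = pd.Series([0.0, 1.0, 2.5, 4.0, 5.0])

clustered_std = clustered.std()
spread_std = spread.std()

# The series with the wider range has the larger standard deviation.
print(clustered_std, spread_std)
```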
Fandango ratings in general appear to be higher than Metacritic ratings.
Fandango's main business is selling movie tickets, so it has an incentive to bias its ratings to sell more tickets. This could also explain why it calculates its ratings in an opaque way.
Scatter Plots: Detecting Outlier Ratings¶
plt.scatter(movies['Fandango_Stars'],movies['Metacritic_norm_round'])
plt.show()
import numpy as np
movies['fm_diff'] = movies['Fandango_Stars'] - movies['Metacritic_norm_round']
movies['fm_diff'] = np.absolute(movies['fm_diff'])
movies_sorted = movies.sort_values(by="fm_diff", ascending = False)
movies_sorted[['FILM','Fandango_Stars','Metacritic_norm_round']].head()
We computed the rating differences between Fandango and Metacritic, took their absolute values, and sorted them in descending order to select the largest outliers.
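As an alternative sketch of the same idea, pandas offers `Series.abs()` and `DataFrame.nlargest()`, which avoid the separate NumPy call and the full sort. The film names and ratings below are made up for illustration.

```python
import pandas as pd

# Hypothetical films and ratings; only the column names match the dataset.
toy = pd.DataFrame({
    "FILM": ["Film A", "Film B", "Film C", "Film D"],
    "Fandango_Stars": [4.5, 4.0, 5.0, 3.5],
    "Metacritic_norm_round": [4.0, 1.0, 3.5, 3.5],
})

# Absolute gap between the two ratings for each film.
toy["fm_diff"] = (toy["Fandango_Stars"] - toy["Metacritic_norm_round"]).abs()

# Keep only the two films with the largest gap.
top = toy.nlargest(2, "fm_diff")
print(top[["FILM", "fm_diff"]])
```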
Correlations between Fandango & Metacritic Ratings¶
from scipy.stats import pearsonr
r, p_value = pearsonr(movies['Fandango_Stars'], movies['Metacritic_norm_round'])
print(r)
The low correlation between Fandango and Metacritic scores suggests that Fandango scores aren't simply inflated versions of Metacritic scores, but are fundamentally different. Fandango may be adjusting its ratings according to criteria of its own.
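For readers who want to see what `pearsonr` measures, the sketch below computes the Pearson coefficient by hand (the sum of products of deviations, divided by the square root of the product of the summed squared deviations) and compares it with SciPy's result. The two arrays are made-up values, not the real ratings.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical paired ratings for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.0, 3.5, 4.5, 4.0, 4.5])

# Deviations of each value from its own mean.
x_dev = x - x.mean()
y_dev = y - y.mean()

# Pearson r: co-variation scaled by the spread of both variables.
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

r_scipy, _ = pearsonr(x, y)
print(r_manual, r_scipy)
```

A value near 0 means the two ratings barely move together; values near 1 or -1 mean they move together almost linearly.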
Linear Regression based on Metacritic Score¶
from scipy.stats import linregress
slope, intercept, r_value, p_value, stderr_slope = linregress(movies['Metacritic_norm_round'], movies['Fandango_Stars'])
predicted_y_fandango = slope * 3 + intercept
print(predicted_y_fandango)
A movie rated 3 on Metacritic would be predicted to get a rating of about 4.1 on Fandango.
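The prediction step can be wrapped in a small helper so that any Metacritic score can be plugged in. This is a minimal sketch on made-up data; the `predict` function is a hypothetical helper, not part of SciPy.

```python
from scipy.stats import linregress

# Hypothetical paired ratings for illustration only.
meta = [1.0, 2.0, 3.0, 4.0, 5.0]
fandango = [3.5, 4.0, 4.0, 4.5, 4.5]

# Fit fandango = slope * meta + intercept by least squares.
result = linregress(meta, fandango)

def predict(meta_score):
    """Predicted Fandango rating for a given Metacritic score."""
    return result.slope * meta_score + result.intercept

print(predict(3.0))
```

A least-squares line always passes through the point formed by the two means, so predicting at the mean Metacritic score returns the mean Fandango score.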
Linear Regression and Scatter plot¶
To better visualize how the regression line relates to the existing data points, we draw it on top of the scatter plot.
predicted_y_1 = slope * 1.0 + intercept
predicted_y_5 = slope * 5.0 + intercept
plt.scatter(movies["Metacritic_norm_round"], movies["Fandango_Stars"])
plt.plot([1,5],[predicted_y_1,predicted_y_5])
plt.xlim(1,5)
plt.show()