Attempting to predict the next Blockbuster!

What if we could predict the next Blockbuster? We could hire the actors and director most likely to bring in moviegoers, set our budget to maximize return, release the movie during the prime time of year, make a movie that is part of an existing IP, and more! We can attempt to design the perfect film by building a model that finds the most important features, using the movies dataset to train, validate, and test it.

After merging the datasets to include all factors in one dataframe, we first remove any rows without reported revenue. Then, we adjust revenue and budget for inflation so that films from different years are comparable. Once the amounts are on a common scale, we can preview the highest and lowest revenue films to see if any of our outliers are due to improper data entry.
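As a rough illustration of these cleaning steps, here is a minimal pandas sketch. The table layout, the column names (`id`, `release_year`, `budget`, `revenue`), the price-index values, and the choice to treat a revenue of zero as "not reported" are all assumptions made for the example, not details taken from the actual dataset.

```python
import pandas as pd

# Hypothetical toy tables standing in for the real movies data.
movies = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": ["A", "B", "C", "D"],
    "release_year": [1990, 2000, 2010, 2010],
    "budget": [10e6, 20e6, 30e6, 5e6],
    "revenue": [50e6, 0.0, 90e6, None],
})
credits = pd.DataFrame({"id": [1, 2, 3, 4], "director": ["W", "X", "Y", "Z"]})

# Illustrative price-index values only (not real CPI figures).
price_index = {1990: 50.0, 2000: 75.0, 2010: 100.0}
BASE_YEAR = 2010

# Merge everything into one dataframe.
df = movies.merge(credits, on="id", how="inner")

# Assumption: missing or zero revenue means "not reported"; drop those rows.
df = df[df["revenue"].notna() & (df["revenue"] > 0)].copy()

# Scale each amount into base-year dollars.
factor = price_index[BASE_YEAR] / df["release_year"].map(price_index)
df["budget_adj"] = df["budget"] * factor
df["revenue_adj"] = df["revenue"] * factor

# Preview the extremes to spot suspicious entries.
print(df.nlargest(2, "revenue_adj")[["title", "revenue_adj"]])
print(df.nsmallest(2, "revenue_adj")[["title", "revenue_adj"]])
```

Keeping the adjusted values in separate columns leaves the raw figures available for checking any outlier that looks like a data-entry error.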

Then, we can convert the set into a time series to check for seasonality in the data. We do find some correlation between summer or holiday releases and higher revenue, though the pattern is not always consistent.
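One way such a seasonality check could look is sketched below. The column names (`release_date`, `revenue_adj`) and the toy values are hypothetical, and the median per calendar month is just one possible summary.

```python
import pandas as pd

# Hypothetical toy data; revenue is in arbitrary units.
films = pd.DataFrame({
    "release_date": pd.to_datetime([
        "2015-01-16", "2015-06-12", "2015-07-17", "2015-12-18",
        "2016-02-12", "2016-06-17", "2016-09-09", "2016-12-16",
    ]),
    "revenue_adj": [40.0, 300.0, 250.0, 400.0, 60.0, 280.0, 50.0, 350.0],
})

# Time series: total revenue per calendar month, indexed by month start.
monthly = films.set_index("release_date")["revenue_adj"].resample("MS").sum()

# Seasonal profile: median revenue by month of year, pooled across years.
by_month = films.groupby(films["release_date"].dt.month)["revenue_adj"].median()
print(by_month)
```

Plotting `monthly` or comparing the entries of `by_month` shows whether some months stand out, and how consistent that is from year to year.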

Then, after creating an additional variable to signal whether or not the movie was released during a peak season, we can run Ridge and Lasso regression to find our most predictive variables.
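A minimal scikit-learn sketch of this step follows. The set of peak months, the feature names, the regularization strengths, and the synthetic data are assumptions for illustration only; the real analysis would use the cleaned movie features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Assumption: summer and holiday months count as peak season.
PEAK_MONTHS = [5, 6, 7, 11, 12]

# Synthetic stand-in data, generated only to make the example runnable.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "budget_adj": rng.normal(size=n),
    "runtime": rng.normal(size=n),
    "release_month": rng.integers(1, 13, size=n),
    "noise": rng.normal(size=n),
})

# Peak-season indicator: 1 if released in a peak month, else 0.
X["peak_season"] = X["release_month"].isin(PEAK_MONTHS).astype(int)

# Synthetic target driven by budget and the peak-season flag.
y = 3.0 * X["budget_adj"] + 1.0 * X["peak_season"] + rng.normal(scale=0.5, size=n)

# Standardize so the penalty treats every feature on the same scale.
features = ["budget_adj", "runtime", "peak_season", "noise"]
X_scaled = StandardScaler().fit_transform(X[features])

ridge = Ridge(alpha=1.0).fit(X_scaled, y)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

# Compare coefficients; larger magnitude suggests a more predictive feature.
coefs = pd.DataFrame({"ridge": ridge.coef_, "lasso": lasso.coef_}, index=features)
order = coefs["lasso"].abs().sort_values(ascending=False).index
print(coefs.reindex(order))
```

Ridge shrinks all coefficients toward zero, while Lasso can set weak ones exactly to zero, which is why the Lasso column is often the easier one to read as a feature ranking.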

Next, we separate out the actor variables and run PCA to look for the best combination of actors. Unfortunately, our top principal components showed only a negative correlation.
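The sketch below shows one way the actor step might be set up: one 0/1 column per actor, PCA on that matrix, then a correlation between each component and revenue. The film names, actor names, revenue values, and the use of revenue as the comparison variable are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical cast lists and revenues (arbitrary units).
casts = {
    "Film A": ["Actor 1", "Actor 2"],
    "Film B": ["Actor 1", "Actor 3"],
    "Film C": ["Actor 2", "Actor 3"],
    "Film D": ["Actor 4"],
    "Film E": ["Actor 1", "Actor 2", "Actor 4"],
}
revenue = pd.Series([100.0, 80.0, 60.0, 20.0, 120.0], index=list(casts))

# One indicator column per actor: 1 if the actor appears in the film.
actors = sorted({a for cast in casts.values() for a in cast})
X = pd.DataFrame(
    [[int(a in cast) for a in actors] for cast in casts.values()],
    index=list(casts),
    columns=actors,
)

# Project the films onto the top two principal components.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# Loadings show which actors move together on each component.
loadings = pd.DataFrame(pca.components_.T, index=actors, columns=["PC1", "PC2"])
print(loadings)
print(pca.explained_variance_ratio_)

# Correlate each component's scores with revenue.
for i in range(2):
    r = np.corrcoef(scores[:, i], revenue)[0, 1]
    print(f"PC{i + 1} vs revenue: r = {r:.2f}")
```

The sign of a principal component is arbitrary, so the direction of a correlation with a component is best read together with that component's loadings.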

In the end, the model explained only 30% of the variability in the data, but it suggests that a comedy/drama with a large budget and a short runtime, that is part of an existing IP and is released in the summer or holiday season, should have high revenue.