Predicting a Hit: Analysis of the Film Industry

Jeff Spagnola
Dec 4, 2020

Note: The project outlined in this blog was the first of several projects from my time in the Data Science Bootcamp at Flatiron School. If some of what you’re about to read below sounds a bit amateur…well, you’re right. This was around 3 weeks into my Data Science training, and I was certainly a bit green. If nothing else, enjoy the show.

When it comes to the film industry, what makes a hit? Are there certain features of top grossing films that can nearly guarantee that a movie performs well at the box office? This is what we’ve explored in a fun data analysis project.

To put a cool spin on this project, we decided to give it a little bit of a backstory. In this alternate data analysis dimension, Microsoft has recognized that other large tech corporations have found ample success in the film industry by producing their own original content. The new plan: Microsoft Studios. This new venture could create an entirely new revenue stream for the company as well as provide many opportunities to synergize with existing Microsoft products. Sounds like a great idea, right?

However, with a high-dollar point of entry and Microsoft’s sterling reputation on the line, this can also prove to be risky. How can we minimize the risk and maximize the return on investment by using a data-driven production system?

Over the course of this project, we wanted to answer the following questions:

  • Is there a correlation between a film’s budget and its performance at the box office?
  • Do films in certain genres perform better at the box office? If so, which genres?
  • Is there a correlation between a film’s MPAA rating and its ROI?
  • Does a film’s runtime have any effect on worldwide gross?

The Data

The data was obtained by combining several datasets containing relevant info about films over the last decade, including:

IMDB Title Basics: This CSV contains basic information pertaining to films listed on IMDB, including title, genre, year released, and runtime. Most importantly, it contains the ‘tconst’ column, which is a unique code used by IMDB to organize their films.

IMDB Title Ratings: This CSV contains the user rating and vote numbers for each film listed. Like the Title Basics file, this CSV includes the ‘tconst’ column, which will be used in a later merge.

Scraped Data from IMDB: This scrape was a group effort with the core P2P study group team. Each person took on a section of IMDB ‘tconst’ codes to scrape useful monetary and rating data. The combined computing power allowed us to cut scrape time from 11 days down to around 9 hours.

Next, we were ready to merge all these datasets after cleaning them of null values, missing values, and random inputs.

imdb_full_df = pd.merge(left=imdb_money_ratings, right=imdb_df, how='left', left_on='tconst', right_on='tconst')
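For anyone following along, here is a minimal sketch of how a left merge on ‘tconst’ behaves. The rows are made up, and the `validate` and `indicator` arguments are my additions for sanity checking, not something the original project necessarily used.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped money/ratings data and the IMDB basics data.
imdb_money_ratings = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003"],
    "budget": [1_000_000, 50_000_000, 200_000_000],
    "ww_gross": [3_000_000, 40_000_000, 900_000_000],
})
imdb_df = pd.DataFrame({
    "tconst": ["tt001", "tt003", "tt004"],
    "genres": ["Drama", "Adventure,Sci-Fi", "Comedy"],
    "runtime_minutes": [95, 120, 101],
})

merged = pd.merge(
    imdb_money_ratings,
    imdb_df,
    how="left",             # keep every row from the left (money/ratings) frame
    on="tconst",
    validate="one_to_one",  # raise if either side has duplicate tconst codes
    indicator=True,         # add a _merge column showing where each row matched
)

# Rows that found no partner in imdb_df keep NaN for genres and runtime.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["tconst"].tolist())
```

A left merge keeps every scraped film even when the basics file has no matching row, so checking the unmatched rows is a quick way to see how much genre and runtime data is missing.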

At this point, we also wanted to create a net revenue column by subtracting the ‘budget’ from the ‘ww_gross’.

imdb_full_df['Net_Revenue'] = imdb_full_df['ww_gross'] - imdb_full_df['budget']

Also, while we’re at it, let’s create an ROI column.

imdb_full_df['ROI'] = (imdb_full_df['Net_Revenue'] / imdb_full_df['budget']) * 100

Since we’re looking to create a film studio that can compete with other large film companies, we must assume that movies will not be made with budgets under 1 million dollars, which is the Hollywood standard for a low-budget movie. (Most major studios consider $30 million to be low budget!)

imdb_full_df = imdb_full_df[imdb_full_df['budget'] >= 1000000]
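To make the arithmetic concrete, here is a small sketch of the three steps above (net revenue, ROI, and the budget floor) run on toy data. The titles and dollar figures are invented for illustration; only the column names come from the project.

```python
import pandas as pd

# Toy rows with made-up numbers; column names follow the post.
films = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [500_000, 100_000_000, 200_000_000],
    "ww_gross": [2_000_000, 350_000_000, 150_000_000],
})

# Net revenue is worldwide gross minus budget.
films["Net_Revenue"] = films["ww_gross"] - films["budget"]

# ROI is net revenue as a percentage of budget.
films["ROI"] = films["Net_Revenue"] / films["budget"] * 100

# Keep only films at or above the $1 million budget floor.
films = films[films["budget"] >= 1_000_000].reset_index(drop=True)

print(films[["title", "Net_Revenue", "ROI"]])
```

Film B earns $250 million on a $100 million budget, so its ROI is 250%. Film C loses $50 million on $200 million, so its ROI is -25%. Film A is dropped by the budget filter.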

Analysis

Q1. Is there a correlation between a film’s budget and its performance at the box office?

In the figure below, the goal is to see if there exists a “sweet spot” in terms of a production budget where we can maximize the rate of return. The top 100 highest grossing films are represented and the larger the plot point, the higher the net profit. As we can see, outside of a handful of “home runs”, the area of the most consistent rate of return is where the production budget falls between 150 and 200 million dollars.

After figuring out that the budget “sweet spot” is between 150 and 200 million dollars, I wanted to see if I could narrow that range down a bit and figure out a more exact number. To do that, I compared the ROI of films with budgets between 150 and 200 million dollars. The single budget with the highest ROI within this range is $183 million.
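One way to run that kind of comparison is sketched below: filter to the budget band, group by exact budget, and take the median ROI of each group. The numbers are invented, and using the median is my assumption here, since the post doesn’t say which statistic was used at this step.

```python
import pandas as pd

# Made-up budgets and ROI values, purely for illustration.
films = pd.DataFrame({
    "budget": [150_000_000, 150_000_000, 175_000_000,
               175_000_000, 200_000_000, 90_000_000],
    "ROI": [100.0, 200.0, 300.0, 500.0, 250.0, 900.0],
})

# Series.between is inclusive on both ends by default.
in_band = films[films["budget"].between(150_000_000, 200_000_000)]

# Median ROI for each distinct budget inside the band.
median_roi_by_budget = in_band.groupby("budget")["ROI"].median()

# The budget whose films have the highest median ROI.
best_budget = median_roi_by_budget.idxmax()
print(best_budget, median_roi_by_budget[best_budget])
```

Keep in mind that an exact budget figure may be backed by only one or two films, so a single “best” number from this kind of grouping is easy to over-read.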

We’ll talk more about genre in the next section, but the figure below shows the median budget by genre of films produced over the last 10 years. We can see that Adventure, Animation, and Sci-Fi are among the most expensive films to make.
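A median-by-genre table like this can be sketched as follows. I’m assuming the genres live in a single comma-separated column named `genres`, so a film tagged with several genres counts once toward each of them. The rows are made up.

```python
import pandas as pd

# Hypothetical films; "genres" is assumed to be a comma-separated string.
films = pd.DataFrame({
    "genres": ["Adventure,Sci-Fi", "Animation,Adventure", "Drama", "Drama,Romance"],
    "budget": [200_000_000, 150_000_000, 20_000_000, 30_000_000],
})

# Split the genre string into a list, then give each genre its own row.
per_genre = films.assign(genre=films["genres"].str.split(",")).explode("genre")

# Median budget per genre, most expensive first.
median_budget = (
    per_genre.groupby("genre")["budget"]
    .median()
    .sort_values(ascending=False)
)
print(median_budget)
```

Because multi-genre films are counted under every genre they list, the per-genre numbers overlap and shouldn’t be summed across genres.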

Q2. Do movies in certain genres perform better at the box office?

The figure below represents the net revenue for each major genre over the past 10 years. We can see that the genres with the highest median net revenue are Sci-Fi, Adventure, Fantasy, Musical, and Animation.

After seeing that the top 5 genres by net revenue are Adventure, Sci-Fi, Animation, Musical, and Fantasy, I wanted to check the ROI across these 5 genres. Surprisingly, of these genres, Animation has the highest ROI.

Q3. What is the relationship between a film’s MPAA rating and its performance at the box office?

The figure below represents the median net revenue of movies with particular MPAA ratings over the last 10 years. Films that are rated PG make up a large portion of the total profits of the entire film industry.

In terms of return on investment, there is little difference in average ROI among G, PG, PG-13, and R rated movies, with G rated movies narrowly edging out the others. Because that gap is small and PG films account for such a large share of industry profits, we can still say that making a PG rated movie is the safest bet.
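Since this conclusion leans on two different metrics, it can help to put them side by side. Here’s a rough sketch that does so with a single groupby. The column name `mpaa_rating` and all of the values are hypothetical.

```python
import pandas as pd

# Invented rows; "mpaa_rating" is an assumed column name.
films = pd.DataFrame({
    "mpaa_rating": ["G", "PG", "PG", "PG-13", "R", "R"],
    "Net_Revenue": [80_000_000, 300_000_000, 500_000_000,
                    250_000_000, 50_000_000, 70_000_000],
    "ROI": [210.0, 180.0, 220.0, 190.0, 150.0, 250.0],
})

# Named aggregation: one output column per (input column, statistic) pair.
summary = films.groupby("mpaa_rating").agg(
    median_net_revenue=("Net_Revenue", "median"),
    mean_roi=("ROI", "mean"),
    n_films=("ROI", "size"),
)
print(summary)
```

Including the film count per rating makes it easier to judge whether a small edge in average ROI rests on enough films to mean anything.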

Q4. Does the length of a film affect the worldwide gross?

The histogram below shows the range of the runtime of all the movies in the dataset. We can see that the runtime of most films falls between 80 minutes and 140 minutes with a mean of 107 minutes.

The figure below represents the relationship between the length of the top 250 highest grossing movies and each movie’s ROI. Taking a look at the results, we see that, while the spread is fairly even, the densest area of the scatterplot falls between 100 and 120 minutes in length.

To further narrow this down to find the perfect intersection of runtime and ROI, the figure below compares the median ROI of films at each specific runtime between 100 and 120 minutes. As we can see, the highest ROI occurs when a film is 115 minutes long.
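A per-minute comparison like this can be sketched as below. The data is made up, `runtime_minutes` is an assumed column name, and the minimum-film threshold is my own addition to keep a single outlier from winning. It isn’t part of the original analysis.

```python
import pandas as pd

# Invented runtimes and ROI values for illustration only.
films = pd.DataFrame({
    "runtime_minutes": [100, 100, 104, 108, 108, 108, 120, 95],
    "ROI": [150.0, 250.0, 900.0, 300.0, 400.0, 500.0, 100.0, 800.0],
})

MIN_FILMS = 2  # hypothetical threshold: ignore runtimes backed by fewer films

# Restrict to the 100-120 minute window (inclusive on both ends).
in_range = films[films["runtime_minutes"].between(100, 120)]

# Median ROI and number of films for each exact runtime.
by_runtime = in_range.groupby("runtime_minutes")["ROI"].agg(["median", "count"])

# Drop runtimes with too few films, then pick the highest median ROI.
reliable = by_runtime[by_runtime["count"] >= MIN_FILMS]
best_runtime = reliable["median"].idxmax()
print(by_runtime)
print(best_runtime)
```

In this toy data the 104-minute film has the highest ROI, but it is the only film at that length, so the threshold steers the answer to 108 minutes instead.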

Conclusion

The fictitious creation of Microsoft Studios offers the opportunity for the company to create an entirely new revenue stream by getting in on the content creation game and producing feature films. Since the Microsoft reputation is on the line, we obviously want to make hits. This task isn’t without risk, so we decided to analyze data directly from the film industry to see if there were any ways we can hedge our bet by using a data-driven production system.

So what have we learned?

  • The most consistent returns come from production budgets between $150 and $200 million, with ROI peaking at $183 million.
  • Of the five highest-earning genres, animation has the highest ROI.
  • The most profitable films tend to be rated PG.
  • Films should aim to be roughly 115 minutes in length.

Recommendations

Based on the above analysis, we would be able to provide Microsoft Studios with the following recommendations:

  • The film should be made with a production budget of $183 million.
  • The genre of the film should be animation.
  • The film should be rated PG.
  • The film should be 115 minutes in length.

Future Work

There are a variety of other aspects that I would like to tackle in the future. These include:

  • Are established film franchises more of a ‘sure thing’?
  • Do major award nominations affect overall earnings?
  • Is it worth exploring creating and producing content exclusively for streaming platforms?
