Movie Analytics 2.0: Predicting the next 100 crore club movie

This blog is the second part of a two-part article by Praxis Business Analytics alumnus Ritesh Mohan Srivastava that appeared in Analytics India Magazine on October 02, 2014. For more articles on analytics, visit Analytics India Magazine, a leading information portal in the analytics domain.

Last October I wrote an article on how analytics can transform the film industry (Movie Analytics in India: Are we underestimating, rather completely ignoring the business potential). I received an overwhelming response: people appreciated the idea, and some of them saw it as a full-fledged business opportunity, but most of them had a question: “Is this possible or just another pretty idea on the table?” So I decided to come up with a probable solution, or way forward, for doing the analysis. Let’s look at some of the methods that can be used to predict the next 100 crore movie.

Social media sentiment analysis

Like many other people, I don’t go by movie reviews; instead, I rely on people who tell me whether a particular movie is 100 crore club material or a waste of money.

Now, let’s talk about “Happy New Year”, the upcoming SRK flick releasing on Diwali. Using R (the twitteR package, which is freely available), we can extract the tweets for a given time period.

Once we gather the tweets, it’s time to run Sentiment Analysis on them, and answer several questions highlighted below:

  1. Number of tweets mentioning the film’s name or related to the film:
  • The number of tweets when the first look of the movie was launched
  • The number of tweets when its first song was launched
  • The number of tweets when SRK’s 8-pack look was launched, and finally
  • The number of tweets on Friday and Saturday (to know the final buzz)

There will be accounts that tweet repeatedly, and we will have to sanitize the data to account for them. There will also be tweets with good sentiments and bad sentiments (“Can’t wait for #HNY” or “SRK looks so old in HNY #HNYsucks”), so we will give each sanitized tweet a sentiment score on a scale of -10 to 10. After comparing the good and the bad sentiment scores, we will get an S-score for the movie.
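The sanitizing and scoring steps could be sketched roughly as follows. This is a minimal illustration in Python rather than R, and nearly everything in it is an assumption: the tiny word lists, the helper names `score_tweet`, `sanitize` and `s_score`, the rule of keeping one tweet per account, and the choice to define the S-score as the mean tweet score are all hypothetical, since the article does not fix a lexicon or a formula.

```python
import re

# Hypothetical mini-lexicon; a real analysis would need a far larger one.
POSITIVE = {"love", "awesome", "excited", "great", "superb"}
NEGATIVE = {"sucks", "boring", "bad", "old", "worst"}


def score_tweet(text):
    """Score one tweet on a -10..10 scale: +/-2 per matched word, clipped."""
    words = re.findall(r"[a-z']+", text.lower())
    raw = 2 * sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    return max(-10, min(10, raw))


def sanitize(tweets):
    """Keep only the first tweet per account so repeat posters count once."""
    seen, kept = set(), []
    for user, text in tweets:
        if user not in seen:
            seen.add(user)
            kept.append(text)
    return kept


def s_score(tweets):
    """Assumed definition: mean score of the sanitized tweets."""
    scores = [score_tweet(t) for t in sanitize(tweets)]
    return sum(scores) / len(scores) if scores else 0.0


# Invented example tweets as (account, text) pairs.
tweets = [
    ("fan1", "So excited for HNY, love the trailer"),
    ("fan1", "So excited for HNY, love the trailer"),  # repeat account
    ("critic7", "SRK looks so old, this sucks"),
    ("viewer3", "Great songs"),
]
result = s_score(tweets)
```

A positive `result` would suggest that good sentiment outweighs bad sentiment among the sanitized tweets; other definitions of the S-score (for example, a ratio of good to bad totals) would work equally well in this sketch.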

Text Analysis of the tweets

We will analyse the text of the good and bad sentiment tweets to learn which words people use most prominently. We may find, for example, that SRK, Shahrukh and Deepika are the prominent words in good tweets, whereas Abhishek, JuniorB and Farah are common words in bad tweets. This will give us an idea of the general perception of the movie.
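As a rough sketch of this step, the prominent words in each group of tweets can be found with a simple frequency count. The example tweets, the stopword list and the helper name `top_words` below are invented for illustration and are not results from real data.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "is", "in", "a", "so", "and", "this", "for", "was"}


def top_words(tweets, n=3):
    """Return the n most frequent non-stopword tokens across the tweets."""
    counts = Counter()
    for text in tweets:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]


# Invented tweets, already split by sentiment.
good_tweets = [
    "SRK is awesome in HNY",
    "Deepika and SRK look great",
    "SRK dance was superb",
]
bad_tweets = [
    "Abhishek overacting again",
    "Farah and Abhishek ruined this",
]

top_good = top_words(good_tweets, 1)
top_bad = top_words(bad_tweets, 1)
```

Running the same count separately on good and bad tweets is what makes the comparison possible: a name that dominates only one group hints at what is driving that sentiment.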

  • If we have to predict whether or not a movie will make it to the 100 crore club, YouTube is perhaps one of the most important places to look.
  • If a movie is slated to release on Diwali, its first look will be launched on YouTube 2-3 months earlier, and the “YouTube API v2.0 – Retrieving Insight Data” can provide some very relevant data, such as the time, day and date of viewing, the device on which viewers watched the video, their location, the ratings they gave, when they paused the video and commented, their browsing and scrolling behaviour, and when and where they rewound, fast-forwarded or rewatched the video.

Every movie, small or big, will have a dedicated Wikipedia page, and research has shown how important those pages are for movie prediction: the correlation between the activity of people visiting, reading and editing a movie’s page and the success of the movie is very strong, with an R-squared value of 0.94. It was 0.925 a month before the movie came out. Twitter hit 0.98 at its peak, but that peak arrives sharply on the release date, and the value hovers around 0 a month before release.

Several variables can be considered when using Wikipedia, such as the number of edits and the number of views; these can be obtained by using the categorization function in Wikipedia.
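To make the R-squared figures concrete, here is a minimal sketch of how such a value could be computed for one Wikipedia variable against box office revenue, using a simple one-variable least squares fit. The numbers are invented purely for illustration and are not taken from the research mentioned above; the function name `r_squared` is likewise an assumption.

```python
def r_squared(x, y):
    """R-squared of a one-variable least squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares around the fitted line and the mean.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Invented data: page views (thousands) vs. opening revenue (crore).
page_views = [120, 340, 560, 800, 950]
opening_revenue = [15, 38, 61, 92, 104]

fit_quality = r_squared(page_views, opening_revenue)
```

A value close to 1 means the Wikipedia variable explains most of the variation in revenue; a value near 0 means it explains almost none.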

Extracting the data and comparing the results
Comparing the results : We compare the results with films that made it into the 100 crore club in the past, and also with films that were expected to be part of the club but couldn’t make it, to learn whether the prediction based on our model is accurate and, if so, to what extent.
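One simple way to quantify "to what extent" is to count how often the model's predictions agree with what actually happened. The sketch below assumes each past film has a yes/no prediction and a yes/no actual outcome for 100 crore club membership; the values are invented and the helper name `evaluate` is hypothetical.

```python
def evaluate(predicted, actual):
    """Return accuracy, precision and recall for yes/no predictions."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # predicted hit, was a hit
    fp = sum(1 for p, a in pairs if p and not a)      # predicted hit, was not
    fn = sum(1 for p, a in pairs if not p and a)      # missed a real hit
    tn = sum(1 for p, a in pairs if not p and not a)  # correctly ruled out
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Invented outcomes for six past films (True = made the 100 crore club).
predicted = [True, True, False, True, False, False]
actual = [True, False, False, True, True, False]

accuracy, precision, recall = evaluate(predicted, actual)
```

Precision covers the films that were supposed to join the club but did not, while recall covers the hits that the model failed to anticipate.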

Extracting the data : Besides using R packages like twitteR and the YouTube API, it is easier to extract the social media data by using services such as Radian or Ubervu (both are paid services) and to analyse it in any statistical software, such as SAS or R.

Neural network and other statistical techniques

  • To predict the success of a movie before release, neural networks and other models can be useful. If we say that 9 is the most successful and 1 is the least, we can place an upcoming movie on this scale to predict its success. For this we can take the movie data of the last 5 years (boxofficeindia).
  • Now take independent variables such as the rating of the movie (e.g. 18+), the star value of the cast, the genre of the movie, whether or not it is a sequel, the number of screens on which the film will be shown, the budget and the director.
  • We can use a multilayer perceptron neural network model for this particular case, with two layers. We can also use a Box-Jenkins ARIMA model as well as multiple linear regression. In essence, one has to identify the patterns in past data, combine them with the current data points that are freely available, and then come to a conclusion.
  • First we will quantify the value of the cast and crew of the film. This can be accomplished by applying graph theory to the IMDb database (Boxofficeindia in the case of Bollywood). The nodes of the graph will be the actors, directors, producers, story writers, studios and music directors in the database, and we can connect these nodes using films as the graph’s edges. That will result in millions of connections between the nodes.
  • Now we can apply Google’s PageRank algorithm to this graph and assign each film’s opening weekend box office revenue as the value of each connection. That will allow us to come up with a score for each cast member and approximate his or her relative contribution to the movie market economy.
  • Next we will weigh the impact of the cast by role (actor, director, producer, studio, writer etc.) by using least squares linear regression. Least squares linear regression fits a plane in space that minimizes the sum of the squares of the distances of all the data points from that plane, ultimately providing us with coefficients that approximate the relative value of each role.
  • In addition, applying ANOVA can help us separate the factors that are statistically significant in predicting revenue from those that are not.
  • We can also quantify the impact of genre on a film’s financial success: we take a sample of past films from each genre (Action, Romance, Family drama) along with their revenue figures and measure the deviation from the mean by genre category. However, although the neural network is an effective technique, it is difficult to balance the training and test data and to get the right number of weights and starting nodes.
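The graph and PageRank steps above could be sketched as follows. This is only an illustration under stated assumptions: the people and revenue figures are invented placeholders, the helper names `build_graph` and `pagerank` are hypothetical, revenue is summed when two people share several films, and the damping factor of 0.85 is the conventional default rather than anything the article specifies.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(films):
    """films: list of (opening_revenue, [people]). Returns weighted adjacency."""
    graph = defaultdict(lambda: defaultdict(float))
    for revenue, people in films:
        # Every pair of people on the same film is linked by that film,
        # with the film's opening revenue as the weight of the connection.
        for a, b in combinations(people, 2):
            graph[a][b] += revenue
            graph[b][a] += revenue
    return graph


def pagerank(graph, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration on an undirected graph."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    total_weight = {v: sum(graph[v].values()) for v in nodes}
    for _ in range(iterations):
        new_rank = {}
        for v in nodes:
            # Each neighbour passes on rank in proportion to edge weight.
            incoming = sum(
                rank[u] * graph[u][v] / total_weight[u] for u in graph[v]
            )
            new_rank[v] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# Invented placeholder data: (opening weekend revenue in crore, people).
films = [
    (100.0, ["actor_a", "director_x"]),
    (80.0, ["actor_a", "director_y"]),
    (10.0, ["actor_b", "director_y"]),
]
scores = pagerank(build_graph(films))
best = max(scores, key=scores.get)
```

The resulting scores could then serve as the "star value" inputs for the regression and neural network models described in the list, with one score per actor, director, producer and so on.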

There are several methods of predicting the success or failure of a movie at the box office, and some of them were covered here. Given adequate time and resources, we can build a model that predicts, with a high degree of accuracy, a movie’s box office success and its chances of making it to the 100 crore club.
