Box Office Statistical Analysis

Predicting Success of Film Adaptations from Book Data

Lindsay Tracy, Holly Baldacci, and Lauren Rogge

         Many different factors affect the lifetime box office gross of film adaptations of novels. We examined the number of lifetime theatres, the natural log of the year the book was published, the natural log of the number of books the author has written, the critic rating on Rotten Tomatoes, the audience rating on Rotten Tomatoes, and whether the adaptation belongs to the action genre. This study is interesting because in 2015 there were at least 20 book-to-film adaptations in the United States alone. As adaptations become more popular every year, a film’s source material becomes a more prominent factor in potentially predicting its success. According to “Do Popular Books Always Make for Popular Movies?” by Xinran Lilly Liu, book sales have “a significant influence on the box office gross of a film adaptation,” but there is also “a tradeoff between book sales and production budget: a six-percent increase in book sales would achieve the same increase in box office gross as a one-percent increase in production budget.” Furthermore, “Adaptation of Novels into Film—a Comprehensive New Framework for Media Consumers and Those Who Serve Them” by Neil Hollands studies whether ratings from one media form (book or film) can predict the success of the other, and finds “a significant correlation between the two ratings, but a limited ability of that correlation to explain the full difference between the scores.” Together, these studies suggest that higher production budgets are associated with greater lifetime box office gross, and that book ratings and film ratings are related.

[Figure: scatter plot of lifetime gross against audience rating]

         We predicted that book rating and whether or not a book is part of a series would be the strongest predictors, and that the number of pages and the number of books the author has written would play a less important role. Our reasoning was that more highly rated books are more likely to be adapted into films, and that film studios may prefer books that belong to a series because, if the first film is successful, the rest of the series can be adapted as well. Previous analyses generally look at box office sales, book sales, and production cost, so our study is unique in that it attempts to predict box office sales from more qualities of the books themselves, in addition to the usual movie data.

         For our study, the dependent variable is the lifetime gross of the film adaptation. The independent variables we considered are opening theater gross, number of opening theaters, number of lifetime theaters, critic and audience film ratings, number of pages in the book, the total number of books the author has written, the year the book was published, whether or not the book is part of a series, the genre of the story, and the book rating.

[Figure: scatter plot of lifetime gross against critic rating]

         We found the number of theatres in which the film is shown over its lifetime to be the most significant variable, with a p-value of 4.048E-12. However, this variable is closely linked to lifetime gross, because theatres adjust how long and how widely they show a film based on its performance, so the significance was expected. On average, each additional theatre is associated with an additional $0.047 million (about $47,000) in lifetime gross, holding the other variables constant.

         In addition, whether or not a film is an action adaptation is a highly significant variable, with a p-value of 0.00896. On average, action movies grossed $77.027 million more than adaptations of other genres, holding the other variables constant. Though the size of this effect was unanticipated, its direction was not, as many film studios allocate large production budgets to action book-to-movie adaptations.

[Figure: scatter plot of lifetime gross against number of books the author has written]

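         The most surprising significant variable was the number of books the author has written, which had a p-value of 0.0348. Because this variable enters the model as a natural log, its coefficient of -14.527 means that a 1% increase in the number of books the author has written is associated, on average, with a lifetime gross roughly $0.145 million lower. Considering the serialized nature of many book-to-movie adaptations, the negative direction of this relationship was unexpected, but it is statistically significant.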
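
         Because the model uses the natural log of the book count, the coefficient applies to a one-unit change in ln(books), not to one more book. The sketch below works through that arithmetic using the reported coefficient; the function name and the example book counts are illustrative assumptions, not part of the study.

```python
import math

# Coefficient on ln(number of books the author has written), as reported
# in the fitted model ($ millions per one-unit change in ln(books)).
COEF_LN_BOOKS = -14.5267


def gross_change(books_before, books_after):
    """Predicted change in lifetime gross ($M) when the author's book count
    changes, holding every other variable in the model fixed."""
    return COEF_LN_BOOKS * (math.log(books_after) - math.log(books_before))


# Illustrative book counts (hypothetical, not taken from the data set).
print(round(gross_change(1, 2), 2))    # going from one book to two
print(round(gross_change(10, 11), 2))  # going from ten books to eleven
```

         The same one-book increase matters far less for a prolific author than for a debut author, which is the practical consequence of the log transformation.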

         Other variables, namely Rotten Tomatoes critic rating, Rotten Tomatoes audience rating, and year published, are not statistically significant at conventional levels, with p-values above 0.1. However, critic ratings were, as expected, positively related to lifetime gross, with higher ratings associated with higher gross.

         To obtain our data, we found lifetime box office gross and number of lifetime theaters for each film on, a database of many films. Information about how many books each author has written, the number of pages in each book, whether or not each book is part of a series, the year of book publication, genre, and book rating out of 5 was found on Goodreads, the world’s largest site for readers and book recommendations. Critic and audience ratings out of 100 were found on Rotten Tomatoes.

[Figure: scatter plot, number of books written]

         Since not all of this data was available from a single source, we compiled it from multiple sources. The data is publicly available and cross-checked, and since most film data was found on and most book data was found on Goodreads, these data sources should be consistent. Though Wikipedia is not generally considered a reliable source, we used it to get a general idea of what each book and its film adaptation were about. Because there are far too many films based on books to include them all, we limited the sample to films produced since 2013 for which data was available, which is a limitation of our study. This still allowed us to produce a sample that includes blockbusters as well as smaller independent films, so there is a lot of variety within the sample. Though we have a large amount of data, some supplemental information would have been useful. Ideally, our study would also examine the total number of books sold over each book’s lifetime, first-run copies sold, highest rank on book bestseller lists, number of awards each book won, and the number of weeks each book spent on the bestseller list. Unfortunately, not much of this information is publicly available for every film and book pair in our sample. Production cost, which would have been another beneficial variable to include, was also not readily available for every film.

         We defined “Number of Theaters Lifetime” as the total number of theaters in which a movie played throughout its lifetime. “Critic Rating” was the rating between 0 and 100 percent given by critics on Rotten Tomatoes, an online movie review website that draws on a wide variety of film critics. “Audience Rating” was the rating between 0 and 100 percent on Rotten Tomatoes, similar to the Critic Rating but based on ordinary moviegoers instead of professional critics. “Number of Books an Author Has Written” was the natural log of the total number of books an author has written throughout their literary career. “Year Published” was the natural log of the book’s publication year. “Action” was an indicator variable equal to 1 when the film was an action movie and 0 otherwise, with Drama as the base category.
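
         The variable definitions above can be sketched as a small function that turns the raw values for one film into the six model inputs. This is a minimal illustration, assuming the raw genre is stored as a text label; the function and key names are hypothetical and do not come from the study, which was carried out in SPSS.

```python
import math


def build_features(theaters_lifetime, critic_rating, audience_rating,
                   books_written, year_published, genre):
    """Map the raw values for one film to the six inputs of the model."""
    if books_written <= 0 or year_published <= 0:
        # The natural log is only defined for positive values.
        raise ValueError("books_written and year_published must be positive")
    return {
        "theaters_lifetime": theaters_lifetime,
        "critic_rating": critic_rating,      # 0 to 100, Rotten Tomatoes
        "audience_rating": audience_rating,  # 0 to 100, Rotten Tomatoes
        "ln_books_written": math.log(books_written),
        "ln_year_published": math.log(year_published),
        # Indicator variable: 1 for action films, 0 for every other genre.
        "action": 1 if genre.strip().lower() == "action" else 0,
    }


# Hypothetical film, for illustration only.
row = build_features(3000, 60, 70, books_written=5, year_published=2005,
                     genre="Action")
print(row["action"], round(row["ln_books_written"], 3))
```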

[Figure: scatter plot of lifetime gross against number of theatres lifetime]

         We believe it would have been beneficial to include the highest rank each novel reached on the New York Times bestseller list, as well as the number of weeks it spent on the list. However, not every book was on the bestseller list. Furthermore, some of the older books were quite popular but never appeared on the New York Times bestseller list because they were published before it existed.

         Additionally, including the total number of books sold as well as the number of first-run copies sold for each book would have been beneficial. However, some of this information was not publicly available. Furthermore, sales are hard to measure for works such as Shakespeare’s: they can be printed by anyone and can even be found on the internet for free, because no publisher holds exclusive rights to Shakespeare’s writings.

         The descriptive statistics, which can be found in the appendix, are plausible. They reflect the fact that we included both indie films and popular films, and the averages for each variable are reasonable. Our scatter plots generally show a positive correlation with lifetime gross for every variable except the natural log of the number of books the author has written. The majority of the films in our sample were in the drama genre. One interesting result is that the mean critic rating on Rotten Tomatoes was 51.1%, indicating that there wasn’t a significant positive or negative critic bias and that the sample is varied enough to include both critically acclaimed and critically spurned films. The mean of the log of year published was 7.59, which corresponds to approximately 1985 when converted back to a year. This is interesting because all of the films were from recent years, indicating that older books continue to be adapted into films as time passes. However, the median publication year is 2005, indicating that, aside from a few outliers published long ago, there was generally about a decade of lag between book publication and the premiere of the film adaptation.
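
         The back-conversion of the mean log year can be checked with a short calculation. The sketch below assumes the reported 7.59 is rounded to two decimal places, which matters because the exponential is very sensitive at this scale; it also shows that 1985 falls inside the resulting range. Note that exponentiating a mean of logs yields a geometric mean, which sits at or below the arithmetic mean of the years.

```python
import math

REPORTED_MEAN_LOG_YEAR = 7.59  # as reported, assumed rounded to two decimals

# Range of years consistent with a value that rounds to 7.59.
low = math.exp(REPORTED_MEAN_LOG_YEAR - 0.005)
high = math.exp(REPORTED_MEAN_LOG_YEAR + 0.005)
print(round(low), round(high))

# The natural log of 1985 for comparison with the reported mean.
print(round(math.log(1985), 4))
```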

[Figure: scatter plot of lifetime gross against year published]

         We used a first-order multiple linear regression model with logarithmic transformations of two variables: we took the natural log of Number of Books an Author Has Written and of Year Published. We took the natural log because the scatter plots of the untransformed variables looked skewed, probably because the sample includes small independent films as well as large blockbusters. Using these methods, we found our model to be:

         Lifetime Gross ($M) = 1862.51 + 0.047002(Number of Theaters Lifetime) + 0.62839(Critic Rating on Rotten Tomatoes) + 0.98815(Audience Rating on Rotten Tomatoes) – 14.5267(ln(number of books the author has written)) – 225.44701(ln(year published)) + 77.0269(Action)

         The coefficient of determination (R^2) is the proportion of variation in the dependent variable explained by the regression model. Our R^2 value of 0.59997 is fairly low; however, because there is substantial variation in the types of books and films we looked at, a value of this size is reasonable. The standard error as a fraction of the average lifetime gross is 0.9093, a large value compared with the usual cutoff of 0.2, indicating that forecasting with this model is not accurate. This is not too surprising, since lifetime gross varies widely across independent studio films and Hollywood studio films, and there are many factors for which we could not find data. The standard error could potentially be improved by increasing the sample size or by segmenting the data into small, independent studio productions and large Hollywood studio productions. The null hypothesis that none of the independent variables is linearly related to lifetime gross (that is, that all slope coefficients are zero) is rejected because the F-statistic (24.247) is much larger than the critical F value (2.1935) at an alpha level of .05.
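
         The fitted equation can be written as a small prediction function, which makes the marginal effects easy to read off. The sketch below copies the reported coefficients; the function names and the example film are hypothetical. It also uses the standard identity F = (R^2 / k) / ((1 - R^2) / df) to back out the residual degrees of freedom implied by the reported R^2 and F-statistic with k = 6 predictors. The sample size is not stated in this section, so that figure is only an implied value.

```python
import math

# Coefficients as reported for the fitted model (lifetime gross in $ millions).
INTERCEPT = 1862.51
B_THEATERS = 0.047002
B_CRITIC = 0.62839
B_AUDIENCE = 0.98815
B_LN_BOOKS = -14.5267
B_LN_YEAR = -225.44701
B_ACTION = 77.0269


def predict_gross(theaters, critic, audience, books, year, is_action):
    """Predicted lifetime gross ($M) for one film under the reported model."""
    return (INTERCEPT
            + B_THEATERS * theaters
            + B_CRITIC * critic
            + B_AUDIENCE * audience
            + B_LN_BOOKS * math.log(books)
            + B_LN_YEAR * math.log(year)
            + B_ACTION * (1 if is_action else 0))


def residual_df_from_f(r_squared, f_stat, k):
    """Solve F = (R^2 / k) / ((1 - R^2) / df) for the residual df."""
    return f_stat * k * (1 - r_squared) / r_squared


# Hypothetical film (not from the study's data), used to show marginal effects.
base = dict(theaters=3000, critic=60, audience=70, books=5, year=2005)
drama = predict_gross(**base, is_action=False)
action = predict_gross(**base, is_action=True)
wider = predict_gross(**{**base, "theaters": 4000}, is_action=False)

print(round(action - drama, 3))  # effect of the action indicator
print(round(wider - drama, 3))   # effect of 1,000 additional theaters
print(round(residual_df_from_f(0.59997, 24.247, k=6), 1))
```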

         We used the model best suited to this data, as other models likely would not have been as accurate or made as much sense. We didn’t need a second-order model because none of our variables produced quadratic patterns in the scatter plots, and a linear form seemed most appropriate. We also didn’t use a time series multiple linear regression because our observations were not taken at even intervals over a specific time period. Several variables were removed from the final analysis due to collinearity. For example, we kept “Number of Theaters Lifetime” and removed “Number of Theaters Opening” because the two variables were strongly linearly related. We also removed “Opening Theater Gross” because it was strongly related to our dependent variable, lifetime gross. Additionally, using backwards elimination, we removed “Pages in Book,” “Goodreads Rating (out of five),” and “Series” (whether or not the book was part of a series), as well as the indicator variables for genres other than action. Finally, we removed four films from our data: The Martian, World War Z, and Fifty Shades of Grey, which had studentized residuals above 2.0, 2.0, and 3.0, respectively, and The Hobbit: Battle of the Five Armies, which had studentized and standardized residuals above 5.0. The remaining outliers made up less than 5% of our total data.
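
         The collinearity screening described above can be illustrated with a pairwise Pearson correlation between two candidate predictors. This is only a sketch: the theater counts are made-up values for illustration, not the study’s data, and the 0.9 cutoff is an assumed rule of thumb rather than the threshold the authors applied in SPSS.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical theater counts for five films (illustrative only).
theaters_opening = [2500, 3100, 600, 3800, 1200]
theaters_lifetime = [2700, 3300, 900, 4000, 1500]

r = pearson_r(theaters_opening, theaters_lifetime)
ASSUMED_CUTOFF = 0.9  # assumed rule of thumb, not a value from the study
print(r > ASSUMED_CUTOFF)  # True here suggests keeping only one of the pair
```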

         The main finding of our research is that very few factors affect both small, independently made book-to-film adaptations and large box office hits. Our data represented an extremely diverse selection of adaptations, and therefore only a few variables remained significant throughout. Film studios could certainly benefit from this analysis. The fact, for instance, that lifetime gross tends to be lower for authors who have written more books could help studios pick adaptations by authors with fewer novels or limited series. Studios should also note the viability of action adaptations, as they carry less risk in terms of box office returns. We also learned many lessons from working with this particular data set, including how to work with skewed variables. Taking the natural logs of two variables made them more appropriate for our model. Additionally, by using regression diagnostics in SPSS, we were able to identify large outliers and leverage points and remove them to improve the fit of our model. Finally, through the process of backwards elimination, we were able to narrow down the number of independent variables in our model and lower our standard error.