Wednesday 3 May 2017

Film Database - Technical Details

You can find my database here. It consists of a raw data tab (DB), a pivot table that makes it easy to look at the data, and some back up tabs.

To build it, I first went to boxofficemojo.com and downloaded copy-pasted the top 10 film details for every year from 2017 to 1980 onto the Excel file (e.g. here's the link for 2017). Since this only goes back to 1980, I then went to the All Time Domestic box office site, sorted the data by year, and copy-pasted all films going up to 1980.

(This creates a few issues, because it becomes harder to do like-for-like comparisons before the 80s. For instance, 1972 only had 7 films that made it to the all time box office records, whereas 1974 had 17; this means that it's harder to compare the 70s to the 80s, which only feature top 10 films).

Thankfully, copy-pasting from boxoffice.com made it easy to extract the URL for each site - this I put in column B. A formula gets the search link for the film's site in IMDB.com (columns E&F). This is where things get tricky: I needed to be able to extract some basic information about the films. To do this, I used importXML formulas on Google Sheets. What these formulas do is they go to the specified part of a webpage and get the data that's located there. The database at the link above has the results copy-pasted as values, but you can see formulas in the tab "Import Formulas", in case you want to reapply them. The same formulas are used to get IMDB ratings and a few other bits of data.

A few more formulas convert the information thus extracted to an analysis-friendly format. These are relatively simple if you're already familiar with Excel, so I won't go through them.

I adjusted all financials to 2017 $ values to account for inflation. To do this, I used the adjustment table found here. This was missing a few years, so I added those myself by either keeping the same price as the previous year or taking the average between the previous and subsequent year (I did this when these two prices differed significantly).

BoxOffice.com had 99 genre classification combinations, because some films belong to more than one genre (e.g. Action Thriller, Action Comedy &c). I mapped these to 8 genres for analysis purposes. In general, I tried to do the mapping by deciding which single genre best defines a multi-genre film. For instance, Lethal Weapon is correctly classified as an Action Comedy by BoxOffice.com; but if I had to choose one genre only, I'd go for Action. You can find the mappings in the "Genres" tab, though note that I did manually adjust the classification of a few films (so, for instance, Alvin and the Chipmunks' full genre is Family Comedy, which would normally map to Comedy, but I manually changed it to Animation). Feel free to download the db and change the mapping if you disagree with it.

I also had to classify films into one of 5 types - Original, Adaptation, Sequel, Spinoff or Remake. I did this manually (this was the one part I did not manage to automate). Some films could be considered to be more than one of these (e.g. is the Dark Knight Rises a sequel or an adaptation?). In general, the first of a sequence of films adapted from a novel or play (or any other source) was marked an adaptation, and subsequent films in the series sequels; if a film could be considered both an adaptation or a remake, the classification depended on how similar the new film was compared to the previous adaptation. For instance, the Amazing Spiderman was tagged as an adaptation, whereas Ben Hur as a remake (though to be fair, I've only ever watched the Charlton Heston version - but since the film is usually called a remake, I went with that).

I think this pretty much covers it - but feel free to ask any questions in the comments.

No comments:

Post a Comment