This repository contains data, analytic code, and findings supporting BuzzFeed News's analysis of diversity in the dialogue of Best Picture–nominated films, published March 2, 2018. Please read that article, which contains important context and details, before proceeding.
This analysis relies on two data files, both found in data/.
data/actor-metrics.csv lists each actor in our analysis, and contains these columns:
- year: Year the film was released
- film: The film's name (shortened in some cases)
- actor: Actor name as found on IMDB or Variety Insights
- characters: Character names/references, as found in screenplay
- imdb: IMDB link for reference
- gender: Actor gender, as found on Variety Insights
- race: Actor race, as found on Variety Insights and supplemented with additional reporting
- race_simple: Simplified actor race, either White or POC (person of color)
- words: Number of words spoken
- sentences: Number of sentences spoken (not used in published article, but provided for reference and context)
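As a rough illustration of how these columns can be used, the sketch below totals the words column by race_simple and by gender using only Python's standard library. The sample rows, film and actor names, and the gender labels "Female" and "Male" are invented for illustration and are not taken from the data file; the helper name words_by_group is hypothetical as well.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows that reuse column names from data/actor-metrics.csv.
# Every value below is invented for illustration only.
SAMPLE_CSV = """year,film,actor,gender,race_simple,words
2017,film-a,Actor One,Female,White,1200
2017,film-a,Actor Two,Male,POC,300
2017,film-b,Actor Three,Male,White,500
"""


def words_by_group(rows, column):
    """Sum the `words` column for each distinct value of `column`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[column]] += int(row["words"])
    return dict(totals)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
by_race = words_by_group(rows, "race_simple")
by_gender = words_by_group(rows, "gender")
print(by_race)
print(by_gender)
```

To try this on the real file, the same function should work on rows read with csv.DictReader from data/actor-metrics.csv, assuming the words column parses as an integer.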
data/character-word-counts.csv contains a count of each character + word combination (excluding "stop words"; see below for details), for each actor in our analysis. It contains these columns:
- year: Year the film was released
- film: The film's name (shortened in some cases)
- character: Character name/reference, as found in screenplay
- actor: Actor name as found on IMDB or Variety Insights
- word: Word this character spoke in this film
- count: Number of times this word is uttered by this character
The analyses in this repository use, as their main source material, the scripts of the 22 films nominated for Best Picture for the 1990, 2016, and 2018 Academy Awards. (Those films were released in 1989, 2015, and 2017, respectively.)
For two films, Mad Max and My Left Foot, we could not locate a script, so we instead relied on film transcripts, which we then checked against the final film. We then entered these transcripts into the Writer Duet scriptwriting program, and exported the results as XML (in the same format that we used for other screenplays).
The list of nominated films came from the Oscars Awards Database and the Oscars website.
The character names and dialogue were extracted from film scripts, which were found on public websites (such as Script Slug and The Internet Movie Script Database) and on the websites of various film distributors.
It is important to note:
- Movie scripts are often imperfectly structured and can contain errors. Irregularities in writing styles and syntax can make it difficult to correctly attribute dialogue to the associated characters. We have attempted to standardize the data where we could.
- Movie scripts also sometimes differ from what ultimately appears on screen.
The official names for each script's characters were drawn from Variety Insights and IMDB.
The source for each actor's gender and race/ethnicity was primarily Variety Insights. In cases where an actor's gender or race/ethnicity could not be confirmed in Variety Insights, we sometimes made a judgment call based on photos, biographies and other information. In cases where an actor's ethnicity or gender was at all in question, we confirmed the facts with their representative.
In some cases, names could not be matched to actors either because the character's part was not included in the finished film, or because the actor was not credited. These names were removed from the analysis.
First, we converted PDFs of the movie scripts into XML files, using Writer Duet or Story Writer. Then, we used Python's Beautiful Soup, TextBlob, and ftfy libraries to extract the character names and dialogue from the XML files, clean them up, and "tokenize" the dialogue into sentences and words. Then, we exported each character's lines and total word and sentence counts to a CSV file.
Using that CSV file, we manually assigned each character we could to an actor, using the sources listed above. Then, we removed characters who fit any of the following criteria:
- Characters who could not be matched to an actor (for example, because they were not clearly credited)
- Characters who appeared in the script but ultimately not in the film
- Characters who spoke fewer than 100 words — largely because such characters are often unnamed (e.g., "PATIENT 1") and difficult to match to actors
Ultimately, we removed 11 characters who did speak at least 100 words:
- "RADIO," from born-on-the-fourth-of-july (176 words)
- "MAN," from bridge-of-spies (269)
- "ANNOUNCER," from bridge-of-spies (120)
- "NARRATOR," from call-me-by-your-name (317)
- "SOLDIER," from dunkirk (112)
- "BLACK," from the-big-short (154)
- "MODERN TRADER," from the-big-short (387)
- "YOUNG BANKER," from the-big-short (108)
- "BRAZEAU," from the-revenant (155)
- "PIG," from the-revenant (323)
- "RUNTY MAN," from the-revenant (210)
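The removal criteria above can be sketched as a simple filter. This is a minimal illustration, not the code used in the analysis: the record layout, the field names actor and in_film, and all sample characters are hypothetical. Only the 100-word threshold comes from the text.

```python
MIN_WORDS = 100  # threshold stated in the removal criteria

# Hypothetical records, one per screenplay character after manual matching.
# `actor` is None when no actor could be matched; `in_film` is False when the
# character appears in the script but not in the finished film.
characters = [
    {"character": "LEAD", "actor": "Actor One", "in_film": True, "words": 2400},
    {"character": "PATIENT 1", "actor": None, "in_film": True, "words": 12},
    {"character": "CUT ROLE", "actor": "Actor Two", "in_film": False, "words": 350},
    {"character": "VOICE", "actor": None, "in_film": True, "words": 180},
    {"character": "CLERK", "actor": "Actor Three", "in_film": True, "words": 40},
]


def keep(record):
    """Return True only if the character passes all three criteria."""
    return (
        record["actor"] is not None
        and record["in_film"]
        and record["words"] >= MIN_WORDS
    )


kept = [c["character"] for c in characters if keep(c)]

# Characters dropped even though they met the word threshold, analogous to
# the 11 characters listed above.
removed_despite_word_count = [
    c["character"]
    for c in characters
    if not keep(c) and c["words"] >= MIN_WORDS
]

print(kept)
print(removed_despite_word_count)
```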
To generate the character-word-counts.csv file, we took the following steps:
- Extracted the dialogue from each script, per the above. In all, the TextBlob library identified 181,547 total words.
- Removed so-called "stop words" (common words in the English language) and "words" that contained no alphabetic characters (e.g., "12"). That reduced the word count to 56,023. For this analysis, the list of exclusions contains approximately 600 stop words, such as I, you and me. These stop words were drawn from the Natural Language Tool Kit (NLTK), MySQL, and editorial judgment based on very commonly occurring words in the scripts. For reference, you can find the official NLTK stopwords in data/stopwords-nltk.txt and the additional custom and MySQL words in data/stopwords-other.txt.
- Counted the number of times each character said each of the remaining words.
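The steps above can be sketched roughly as follows. This is an approximation under stated assumptions: it uses a naive regular-expression tokenizer from the standard library in place of TextBlob, a tiny stand-in stop-word set in place of the roughly 600-word list, and invented dialogue. The function names tokenize and is_countable are hypothetical.

```python
import re
from collections import Counter

# A tiny stand-in stop-word set; the real list has roughly 600 entries.
STOP_WORDS = {"i", "you", "me", "the", "a", "to", "is", "it"}

# Hypothetical (character, line of dialogue) pairs.
dialogue = [
    ("ALPHA", "I told you the plan is simple."),
    ("ALPHA", "The plan needs 12 people."),
    ("BETA", "You call it a plan?"),
]


def tokenize(text):
    """Naive lowercase tokenizer (an assumption; the analysis used TextBlob)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def is_countable(word):
    """Keep words that are not stop words and contain at least one letter."""
    return word not in STOP_WORDS and any(ch.isalpha() for ch in word)


# Count each (character, word) combination among the remaining words.
counts = Counter()
for character, line in dialogue:
    for word in tokenize(line):
        if is_countable(word):
            counts[(character, word)] += 1

for (character, word), count in sorted(counts.items()):
    print(character, word, count)
```

Because the tokenizer differs from TextBlob's, totals produced this way would not be expected to match the 181,547 and 56,023 figures reported above.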
This repository uses Python code and Jupyter notebooks to process the data. That code can be found here:
- casting_stats.ipynb: This notebook loads the data/actor-metrics.csv file and uses it to compute descriptive statistics on character, word, and sentence counts by year, film, gender, and race/ethnicity.
- calculate_word_frequencies.ipynb: This notebook loads data from data/character-word-counts.csv and data/actor-metrics.csv, and uses them to create output/top-words-by-gender.csv and top-words-by-race.csv.
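For readers who want a feel for the second notebook's general approach without opening it, here is a minimal sketch of combining the two files to rank words by gender. It is not the notebook's code: the join key (film, actor), the sample rows, the gender labels, and the choice to rank by raw counts are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical rows shaped like a subset of data/actor-metrics.csv.
actor_metrics = [
    {"film": "film-a", "actor": "Actor One", "gender": "Female"},
    {"film": "film-a", "actor": "Actor Two", "gender": "Male"},
]

# Hypothetical rows shaped like a subset of data/character-word-counts.csv.
word_counts = [
    {"film": "film-a", "actor": "Actor One", "word": "plan", "count": 3},
    {"film": "film-a", "actor": "Actor One", "word": "home", "count": 5},
    {"film": "film-a", "actor": "Actor Two", "word": "plan", "count": 4},
    {"film": "film-a", "actor": "Actor Two", "word": "war", "count": 2},
]

# Assumed join key: (film, actor). The notebook may join differently.
gender_of = {(r["film"], r["actor"]): r["gender"] for r in actor_metrics}

# Accumulate word counts per gender, skipping rows with no matching actor.
totals = defaultdict(Counter)
for row in word_counts:
    gender = gender_of.get((row["film"], row["actor"]))
    if gender is None:
        continue
    totals[gender][row["word"]] += row["count"]

# Rank by raw count; the two most frequent words per gender.
top_words = {gender: counter.most_common(2) for gender, counter in totals.items()}
print(top_words)
```

The same pattern would apply to race_simple by swapping the column used to build the lookup.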
Contact Lam Thuy Vo at [email protected] and Scott Pham at [email protected].
Looking for more from BuzzFeed News? Click here for a list of our open-sourced projects, data, and code.