This repository contains data, code, and methodology supporting BuzzFeed News' analysis of comments submitted to three Federal Communications Commission (FCC) dockets, published October 3, 2019:
- 17-108 ("Restoring Internet Freedom")
- 16-42 ("Expanding Consumers' Video Navigation Choices")
- 14-28 ("Protecting and Promoting the Open Internet")
Please see below for further details.
The data in this repository comes from several sources:
The ECFS is the FCC's public portal for searching and accessing comments submitted to the commission's dockets. BuzzFeed News used the website to download each individually listed comment for two of the dockets: 14-28 and 16-42. Note: Not all comments submitted to the FCC are individually listed; in some cases, an organization will submit a consolidated set of comments as a PDF, with signatures and/or commenters' information listed in that PDF. Because of the extraordinary variety and inconsistency of those files, BuzzFeed News did not disaggregate those comments.
On November 7, 2017, the FCC released a "complete set of [Docket 17-108] filings submitted as of November 3, 2017"; BuzzFeed News used this download to examine docket-wide trends.
In response to two FOIA requests, the FCC provided to BuzzFeed News the files submitted to the agency's bulk-upload system for Docket 17-108, plus associated metadata indicating the uploader's Box.com account and the time of the upload. According to the FCC, it provided all such files submitted. Although the agency provided a template for the uploads, some of the files (typically the smallest ones, containing just one comment each) do not conform to it and could not be incorporated easily. Those comments, which represent an exceedingly small percentage of all bulk-uploaded comments, have not been included in this repository's data; in many cases, the corresponding comments also appear not to have been added to the FCC's public comment portal. In certain other cases, the upload files use non-standard column names. Where the intention appeared to be clear, BuzzFeed News fixed the column names and included the data.
Have I Been Pwned is a website and service that identifies whether any given email address has been exposed in any of hundreds of major data breaches. BuzzFeed News used HIBP's application programming interface to determine the most common breaches associated with various groups of email addresses.
Because it appears that many of the comments in the data above were submitted without the consent of the named commenters, we have taken the following steps:
- Removing all raw personal-information columns (name, physical address, etc.).
- Replacing each distinct email address with a randomly-assigned unique identifier. (Specifically, a version 4 UUID.)
- Replacing each distinct email domain with a similar randomly-assigned unique identifier, except for very common domains. (Specifically, the 36 domains that are associated with 10,000 or more unique email addresses in the Docket 17-108 comments.)
- Replacing each distinct combination of name + location (first line of street address, city, state, ZIP code) with another UUID. Before converting to UUIDs, ZIP codes are converted to zero-padded five-digit representations, and all strings are lowercased. For instance, "John Doe, 123 Smith Street, New York, NY 01111" will receive the same UUID as "john doe, 123 SMITH STREET, New York, ny 1111", but neither will match submissions that put him at "123 Smith St." (with the abbreviation).
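The pseudonymization steps above can be sketched roughly as follows. This is an illustrative sketch only, not the code BuzzFeed News ran: the names `normalize_zip`, `name_location_key`, `UUIDAssigner`, and `pseudonymize_domains` are hypothetical, and details such as whitespace stripping and lowercasing email addresses before counting are assumptions.

```python
import uuid
from collections import Counter


def normalize_zip(zip_code):
    """Zero-pad a ZIP code to five digits, e.g. "1111" becomes "01111"."""
    return str(zip_code).strip().zfill(5)


def name_location_key(name, street, city, state, zip_code):
    """Build a case-insensitive key from name + location fields."""
    # Assumption: surrounding whitespace is stripped before lowercasing.
    parts = [p.strip().lower() for p in (name, street, city, state)]
    parts.append(normalize_zip(zip_code))
    return tuple(parts)


class UUIDAssigner:
    """Hand out one random (version 4) UUID per distinct key."""

    def __init__(self):
        self._ids = {}

    def get(self, key):
        if key not in self._ids:
            self._ids[key] = str(uuid.uuid4())
        return self._ids[key]


def pseudonymize_domains(addresses, threshold=10_000):
    """Map each domain to itself if it is common, else to a UUID.

    A domain counts as common when at least `threshold` unique
    addresses use it. Assumption: addresses are lowercased first.
    """
    unique_addresses = {a.strip().lower() for a in addresses}
    counts = Counter(a.rsplit("@", 1)[-1] for a in unique_addresses)
    assigner = UUIDAssigner()
    return {
        domain: domain if n >= threshold else assigner.get(domain)
        for domain, n in counts.items()
    }
```

Because the UUIDs are random rather than derived from the input, the mapping cannot be recomputed from a name or address alone; the same assigner must be reused across a dataset for identical keys to receive identical identifiers.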
The process above produces the files listed below. Several are too large to host on GitHub, so BuzzFeed News has uploaded them here.
These files contain selected fields from the comment data listed above:
- bulk-uploads-17-108-with-uuids.csv: Docket 17-108 bulk uploads, via FOIA
- comments-17-108-with-uuids.csv: Docket 17-108, via FCC official download
- comments-14-28-with-uuids.csv: Docket 14-28, via FCC online portal
- comments-16-42-with-uuids.csv: Docket 16-42, via FCC online portal
They contain the following columns:
- date: The date of submission.
- id_submission: The ID the FCC has assigned to the comment. Note: Not available in bulk-uploads-17-108-with-uuids.csv, because the FCC assigns the IDs after the comments are uploaded.
- comments: The text of the comment. (Note: This is sometimes modified by the FCC, for example by adding a filename or, as appears to be the case for some Docket 14-28 comments, removing boilerplate language.) Note: Not included in comments-17-108-with-uuids.csv for file-size considerations, because this file is mainly used for domain counts.
- name_and_location: The UUID (see above) corresponding to the name and address information provided with the comment. Note: Not included in comments-17-108-with-uuids.csv.
- email_address: The UUID (see above) corresponding to the email address provided with the comment. Note: In the FCC's commenting system, you don't have to control an email address to list it as the author of a comment.
- email_address_nonstandard: If the email address contains nonstandard characters (such as %) or formatting (such as lacking an @ symbol), this value will be 1; otherwise, it will be 0. This is used to filter out likely-invalid addresses before checking them on Have I Been Pwned.
- email_domain: The domain of the email address, as a UUID unless it is one of the 36 domains described above.
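The email_address_nonstandard flag could be computed along the following lines. The exact rules BuzzFeed News applied are not spelled out here, so the pattern below is an assumption: it treats an address as standard only if it has a single @ symbol, a local part drawn from a conservative character set, and a dotted domain. The names `STANDARD_EMAIL` and `email_address_nonstandard` are hypothetical.

```python
import re

# Assumption: a deliberately conservative notion of a "standard" address.
# Real-world email syntax is more permissive than this pattern.
STANDARD_EMAIL = re.compile(
    r"[A-Za-z0-9._+\-]+"        # local part: letters, digits, . _ + -
    r"@"                        # exactly one @ separator
    r"[A-Za-z0-9\-]+"           # first domain label
    r"(\.[A-Za-z0-9\-]+)+"      # one or more further dotted labels
)


def email_address_nonstandard(address):
    """Return 1 if the address looks nonstandard, otherwise 0."""
    return 0 if STANDARD_EMAIL.fullmatch(address.strip()) else 1
```

Under this sketch, an address containing % or lacking an @ symbol is flagged with 1 and would be skipped before any breach lookup.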
Additionally, bulk-uploads-17-108-with-uuids.csv contains the following columns:
- file: The name of the file in which the comment was uploaded.
- uploader: The email address associated with the Box.com account that uploaded the file.
These files list the breaches, per Have I Been Pwned, for email addresses in randomized samples of the comments bulk-uploaded to Docket 17-108:
- breaches-17-108-bulk-uploads-sample.csv: A 1,000-address sample from each of the eight bulk uploaders whose Docket 17-108 uploads contained at least 10,000 unique email addresses.
- breaches-17-108-mb-sample.csv: A 10,000-address sample of Media Bridge's Docket 17-108 bulk uploads.
They contain the following columns:
- email_address: The UUID (see above) corresponding to the email address examined.
- breach: The name of the breach, as returned by Have I Been Pwned.
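As a rough illustration of how a file with these two columns can be summarized, the sketch below counts how many distinct email-address UUIDs appear in each breach. It is not taken from the repository's notebooks; the function name `most_common_breaches` and the sample rows (including the breach names) are made up for the example.

```python
import csv
import io
from collections import Counter


def most_common_breaches(csv_file, n=10):
    """Return the n breaches that affect the most distinct addresses.

    `csv_file` is any open text file (or file-like object) with
    `email_address` and `breach` columns.
    """
    rows = csv.DictReader(csv_file)
    # De-duplicate (address, breach) pairs so repeats are counted once.
    pairs = {(row["email_address"], row["breach"]) for row in rows}
    counts = Counter(breach for _, breach in pairs)
    return counts.most_common(n)


# Hypothetical sample data; the UUIDs and breach names are placeholders.
sample = io.StringIO(
    "email_address,breach\n"
    "uuid-1,ExampleBreachA\n"
    "uuid-2,ExampleBreachA\n"
    "uuid-2,ExampleBreachB\n"
)
top_breaches = most_common_breaches(sample)
```

With a real file, the same function can be called on the result of `open(path, newline="")`.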
The analyze-fcc-comments notebook examines comments submitted to the three FCC dockets described above, the language used in them, and the timing of their submission. For Docket 17-108, the notebook also examines the email domains associated with the comments, as well as the rates at which the email addresses in the bulk uploads overlap with those exposed in major data breaches. The notebook also examines the overlap between the contact information in Docket 16-42 and Docket 17-108.
The analyze-mb-comment-structure notebook examines the phrasing of the comments that Media Bridge submitted to Docket 17-108, and attempts to reverse-engineer the comments that use randomly-generated text.
The code running the analysis is written in Python 3, and requires the following Python libraries:
If you would like to reuse the code for fetching data from Have I Been Pwned's API, you will also need these Python libraries:
- requests for HTTP requests
- requests-cache for caching HTTP requests
- tqdm for progress bars
If you use Pipenv, you can install all required libraries with pipenv install.
As noted above, you will need to download the source data separately. Save the folder as this repository's /data directory.
Execute the notebooks in the notebooks/ directory to reproduce the findings.
All code in this repository is available under the MIT License.
Contact Jeremy Singer-Vine at [email protected].
Looking for more from BuzzFeed News? Click here for a list of our open-sourced projects, data, and code.