greglu edited this page Jun 22, 2011 · 25 revisions

General Information

The datasets listed below are all preloaded into the HackReduce Hadoop clusters and ready for immediate use at the event. The [datasets/*] notice next to each title indicates the path where the dataset is located, depending on where you want to access it:

  • Hadoop HDFS: Can be found at /datasets/*
  • Namenode local filesystem: Can be found at /mnt/datasets/*
  • HackReduce Github project: Samples found in the datasets/* folder of the project
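
As a rough illustration of the mapping above, the sketch below builds the path for a dataset at each of the three locations. The helper name `dataset_path` and the location keys (`hdfs`, `namenode`, `github`) are made up for this example; only the path prefixes come from the list above.

```python
from posixpath import join

# Path prefixes taken from the list above; the dictionary keys are
# hypothetical labels chosen for this sketch.
PREFIXES = {
    "hdfs": "/datasets",          # Hadoop HDFS
    "namenode": "/mnt/datasets",  # Namenode local filesystem
    "github": "datasets",         # folder inside the HackReduce Github project
}


def dataset_path(name, location="hdfs"):
    """Return the path of dataset `name` (e.g. "nasdaq") at `location`."""
    if location not in PREFIXES:
        raise ValueError(f"unknown location: {location!r}")
    return join(PREFIXES[location], name)


print(dataset_path("nasdaq"))           # /datasets/nasdaq
print(dataset_path("msd", "namenode"))  # /mnt/datasets/msd
print(dataset_path("bixi", "github"))   # datasets/bixi
```

The dataset name is the part after `datasets/` in each title below, so `[datasets/nasdaq]` corresponds to `dataset_path("nasdaq")`.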

There's also the possibility of loading new data at the event, but this process could take a few hours. Please see one of the Hopper event organizers (probably Greg) about loading new data into your clusters.

Million Song Dataset [datasets/msd]

Special thanks to Echo Nest for converting the whole dataset (200+ GB in HDF5 format) to TSV for us.

NASDAQ daily prices and dividends [datasets/nasdaq]

NYSE daily prices and dividends [datasets/nyse]

Wikipedia XML dump [datasets/wikipedia]

Google Ngram [datasets/ngrams]

Geonames [datasets/geonames]

Reddit voting data [datasets/reddit]

Bixi Montreal [datasets/bixi]

  • XML dump of all the bike station information queried every minute over a couple of months.
  • Provided by Fabrice

DNS dataset [datasets/dns]

  • Contains the root file with all the domain names and their associated nameservers for the "com" TLD.

LDEO Surface Ocean CO2 Climatology data [datasets/ldeo]

Twitter dataset [datasets/twitter]

Flight dataset [datasets/flights]

  • Limited set of flight data containing origin, destination, departure time, return time, price and date. Only includes flights originating from SEA.
  • Provided by Hopper

Amazon dataset [datasets/amazon]

IMDB dataset [datasets/imdb]

Taylor Tweets [datasets/taylor_tweets]

  • Collected around the time of Elizabeth Taylor's death in late March 2011, this dataset consists of the results of a search for all tweets containing the word "taylor".
  • JSON format
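
The exact layout of the JSON is not described here. As a minimal sketch, assuming one JSON object per line and a `text` field holding the tweet body (both are assumptions, not confirmed for this dataset), the tweets could be read like this. The function name `iter_tweet_texts` and the sample lines are invented for illustration.

```python
import json


def iter_tweet_texts(lines, field="text"):
    """Yield the `field` value of every line that parses as a JSON object.

    Assumes one JSON object per line and a "text" field; both are guesses
    about the dataset layout, so adjust them to the real files.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(tweet, dict) and field in tweet:
            yield tweet[field]


# Made-up sample lines, not taken from the dataset.
sample = [
    '{"text": "thinking of taylor today"}',
    "",
    "not json",
    '{"id": 1}',
]
print(list(iter_tweet_texts(sample)))  # ['thinking of taylor today']
```

For the real data, pass an open file handle instead of `sample`, since iterating over a file yields its lines.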