Datasets
The datasets listed below are all preloaded into the HackReduce Hadoop clusters and ready for immediate use at the event. The [datasets/*] notice next to each title indicates the path where it's located, depending on where you want to access it:
- Hadoop HDFS: Can be found at /datasets/*
- Namenode local filesystem: Can be found at /mnt/datasets/*
- HackReduce GitHub project: Samples found in the datasets/* folder of the project
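The mapping between a [datasets/*] notice and the three locations can be sketched in Python as below. This is only an illustration: the helper name `dataset_path` and the `ROOTS` table are hypothetical, and the roots are taken directly from the list above.

```python
from pathlib import PurePosixPath

# Roots copied from the list above; the keys are hypothetical labels.
ROOTS = {
    "hdfs": "/datasets",          # Hadoop HDFS
    "namenode": "/mnt/datasets",  # Namenode local filesystem
    "github": "datasets",         # folder in the HackReduce GitHub project
}

def dataset_path(notice, where="hdfs"):
    """Translate a 'datasets/...' notice into the path for one location."""
    rel = PurePosixPath(notice)
    # Accept both "datasets/foo" (as written in the notices) and plain "foo".
    if rel.parts and rel.parts[0] == "datasets":
        rel = PurePosixPath(*rel.parts[1:])
    return str(PurePosixPath(ROOTS[where]) / rel)

if __name__ == "__main__":
    for where in ROOTS:
        print(where, dataset_path("datasets/citation-networks/hep-ph/graph", where))
```

For example, the notice `datasets/citation-networks/hep-ph/graph` resolves to `/mnt/datasets/citation-networks/hep-ph/graph` on the namenode's local filesystem.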
There's also the possibility of loading new data at the event, but this process could take a few hours. Please see one of the Hopper event organizers (probably Greg) about loading new data into your clusters.
Special thanks to Echo Nest for converting the entire 200+ GB dataset from HDF5 to TSV for us.
- Only the 1-gram and 2-gram datasets are available
- http://ngrams.googlelabs.com/datasets
- XML dump of all the bike station information queried every minute over a couple of months.
- Provided by Fabrice
- Contains the root file with all the domain names and their associated nameservers for the "com" TLD.
- Data of the social graph, user id to names, and selected celebrity profiles. This does not contain actual tweets because of Twitter policies.
- http://an.kaist.ac.kr/traces/WWW2010.html
- Limited set of flight data containing origin, destination, departure time, return time, price and date. Only includes flights originating from SEA.
- Provided by Hopper
- Description of data formats: http://131.193.40.52/data/README.txt
- Data listing: http://131.193.40.52/data/
- Taken around the time of Elizabeth Taylor's death in late March 2011, this dataset is the result of a search for all tweets containing the word "taylor".
- JSON format
- Arxiv HEP-PH (high energy physics phenomenology) [datasets/citation-networks/hep-ph/{dates,graph}]: http://snap.stanford.edu/data/cit-HepPh.html
- Arxiv HEP-TH (high energy physics theory) [datasets/citation-networks/hep-th/{dates,graph}]: http://snap.stanford.edu/data/cit-HepTh.html
- U.S. patent dataset: http://snap.stanford.edu/data/cit-Patents.html
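
As a starting point for the citation networks, here is a minimal Python sketch that counts how often each paper is cited. It assumes the graph files keep the SNAP edge-list layout (comment lines starting with `#`, then one whitespace-separated "citing cited" pair per line); check the copies on the cluster before relying on this. The function name `citation_counts` and the sample data are hypothetical.

```python
from collections import Counter
import io

def citation_counts(lines):
    """Count incoming citations per paper from an edge list.

    Assumes each data line is "<citing id> <cited id>" and that
    lines starting with '#' are comments (SNAP layout, an assumption).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        _citing, cited = line.split()[:2]
        counts[cited] += 1
    return counts

if __name__ == "__main__":
    # Tiny made-up sample standing in for a real graph file.
    sample = io.StringIO("# FromNodeId\tToNodeId\n1\t2\n3\t2\n2\t4\n")
    for paper, n in citation_counts(sample).most_common():
        print(paper, n)
```

Because the function accepts any iterable of lines, the same code works on an open file handle, for instance one opened on a graph file under /mnt/datasets on the namenode.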