Datasets
greglu edited this page Jun 22, 2011 · 25 revisions
The datasets listed below are all preloaded into the HackReduce Hadoop clusters and ready for immediate use at the event. The [datasets/*] notice next to each title indicates the dataset's path, which depends on where you access it from:
- Hadoop HDFS: Can be found at /datasets/*
- Namenode local filesystem: Can be found at /mnt/datasets/*
- HackReduce Github project: Samples found in the datasets/* folder of the project
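As a rough illustration of the path convention above, here is a minimal Python sketch that builds a dataset's path for each of the three locations. The function name `dataset_path`, the location keys, and the dataset name `example` are hypothetical and chosen only for illustration; only the three root paths come from this page.

```python
# Root paths as listed on this page; the dictionary keys are illustrative labels.
DATASET_ROOTS = {
    "hdfs": "/datasets",          # Hadoop HDFS
    "namenode": "/mnt/datasets",  # Namenode local filesystem
    "github": "datasets",         # samples folder in the HackReduce Github project
}


def dataset_path(location, name):
    """Return the path of dataset `name` for the given access location."""
    root = DATASET_ROOTS[location]
    return f"{root}/{name}"


if __name__ == "__main__":
    # "example" is a placeholder, not the name of a real dataset.
    for location in DATASET_ROOTS:
        print(location, dataset_path(location, "example"))
```

Substitute the name from the [datasets/*] notice of the dataset you want in place of the placeholder.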
There's also the possibility of loading new data at the event, but this process could take a few hours. Please see one of the Hopper event organizers (probably Greg) about loading new data into your clusters.
Special thanks to Echo Nest for converting the entire dataset (200+ GB in HDF5 format) to TSV for us.
- Only the 1-gram and 2-gram datasets are available
- http://ngrams.googlelabs.com/datasets
- XML dump of all the bike station information queried every minute over a couple of months.
- Provided by Fabrice
- Contains the root file with all the domain names and their associated nameservers for the "com" TLD.
- Data of the social graph, user id to names, and selected celebrity profiles. This does not contain actual tweets because of Twitter policies.
- http://an.kaist.ac.kr/traces/WWW2010.html
- Limited set of flight data containing origin, destination, departure time, return time, price and date. Only includes flights originating from SEA.
- Provided by Hopper
- Description of data formats: http://131.193.40.52/data/README.txt
- Data listing: http://131.193.40.52/data/
- Collected around the time of Elizabeth Taylor's death in late March 2011, this dataset consists of all tweets returned by a search for the word "taylor".
- JSON format
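A minimal Python sketch of how the JSON tweets might be read is shown here. It assumes one JSON object per line and a `text` field holding the tweet body; neither assumption is confirmed by this page, so check a sample of the file first. The function name `iter_matching_tweets` is hypothetical.

```python
import json


def iter_matching_tweets(lines, keyword="taylor"):
    """Yield parsed tweets whose text contains `keyword` (case-insensitive).

    Assumes one JSON object per line with a "text" field (an assumption,
    not something this page states).
    """
    needle = keyword.lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(tweet, dict):
            continue  # skip JSON values that are not objects
        text = tweet.get("text")
        if isinstance(text, str) and needle in text.lower():
            yield tweet


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        '{"text": "RIP Elizabeth Taylor"}',
        '{"text": "nothing relevant here"}',
    ]
    for tweet in iter_matching_tweets(sample):
        print(tweet["text"])
```

The generator works on any iterable of lines, so it can be fed an open file handle or standard input without loading the whole dataset into memory.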