Framework

"Data is the new oil"

Ways to acquire data (typical data sources)

Downloaded from an internal system
Obtained from a client or other third party
Extracted from a web-based API
Scraped from a website
Extracted from a PDF file
Gathered manually and recorded

Data Formats

Flat files (e.g. csv)
Excel files
Databases (e.g. MySQL)
JSON
HDFS (Hadoop)
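
As a rough illustration of how the same records look when read from three of the formats above, here is a minimal Python sketch using only the standard library. The sample data, the table name `prices`, and the helper names are all hypothetical and invented for this example. SQLite stands in for a database server such as MySQL, which would need a separate driver and a running server.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data, invented for illustration only.
CSV_TEXT = "city,price\nPune,120\nDelhi,95\n"
JSON_TEXT = '[{"city": "Pune", "price": 120}, {"city": "Delhi", "price": 95}]'


def rows_from_csv(text):
    """Parse CSV text into a list of dicts; every value stays a string."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_json(text):
    """Parse a JSON array of objects; numbers keep their JSON types."""
    return json.loads(text)


def rows_from_database(records):
    """Load records into an in-memory SQLite table and query them back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.row_factory = sqlite3.Row  # lets each row convert to a dict
        conn.execute("CREATE TABLE prices (city TEXT, price INTEGER)")
        conn.executemany(
            "INSERT INTO prices (city, price) VALUES (:city, :price)", records
        )
        cursor = conn.execute("SELECT city, price FROM prices ORDER BY rowid")
        return [dict(row) for row in cursor.fetchall()]
    finally:
        conn.close()
```

One thing the sketch makes visible: a flat file carries no type information, so `price` comes back as text from the CSV reader, while JSON and the database return it as a number.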

In Search of Data

"Data is an abstraction of reality."

What assumptions have been made in this entire data collection process?
Are we aware of the assumptions in this process?
How do we ensure that the data is accurate and representative for the question we are trying to answer?
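
Part of the accuracy question can be checked mechanically. The sketch below is one possible starting point, not a complete answer: it counts rows with missing required fields and exact duplicate rows. The function name `summarize_quality` and its parameters are hypothetical, and it assumes each record is a flat dict with hashable values. Whether the data is representative still has to be judged against how it was collected.

```python
def summarize_quality(rows, required_fields):
    """Count rows, rows with a missing required field, and exact duplicates."""
    seen = set()
    missing = 0
    duplicates = 0
    for row in rows:
        # A field counts as missing when it is absent, None, or empty text.
        if any(row.get(field) in (None, "") for field in required_fields):
            missing += 1
        # Sorting the items makes the key independent of field order.
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"rows": len(rows), "missing": missing, "duplicates": duplicates}
```

A high count in either column is a prompt to go back to the source and ask how the records were produced, rather than a reason to silently drop them.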