"Data is the new oil"
Ways to acquire data (typical data sources)
- Downloaded from an internal system
- Obtained from a client or other third party
- Extracted from a web-based API
- Scraped from a website
- Extracted from a PDF file
- Gathered manually and recorded
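As a rough illustration of two of the sources above (a web-based API and a scraped website), here is a minimal sketch that uses only the Python standard library. The payload `API_RESPONSE`, the page fragment `HTML_PAGE`, and every function and class name are hypothetical stand-ins; in practice the text would come from an HTTP request rather than a hard-coded string.

```python
import json
from html.parser import HTMLParser

# Hypothetical API response body (assumption: the API returns JSON with a "results" list).
API_RESPONSE = '{"results": [{"city": "Pune", "temp_c": 31}, {"city": "Delhi", "temp_c": 38}]}'

# Hypothetical fragment of a downloaded web page containing the same data as a table.
HTML_PAGE = """
<table>
  <tr><td>Pune</td><td>31</td></tr>
  <tr><td>Delhi</td><td>38</td></tr>
</table>
"""


def records_from_api(body):
    """Parse a JSON response body into a list of dict records."""
    return json.loads(body)["results"]


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell and self.rows:
            self.rows[-1].append(data.strip())


def rows_from_html(page):
    """Scrape table rows out of an HTML string."""
    parser = TableCellParser()
    parser.feed(page)
    return parser.rows


api_records = records_from_api(API_RESPONSE)
scraped_rows = rows_from_html(HTML_PAGE)
print(api_records)
print(scraped_rows)
```

Note that the API returns typed values (the temperature is a number), while the scraped cells are all strings. The same fact can arrive in different shapes depending on how it was acquired.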
Data Formats
- Flat files (e.g. CSV)
- Excel files
- Databases (e.g. MySQL)
- JSON
- HDFS (Hadoop)
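To make the formats above a little more concrete, the sketch below loads the same two records from CSV text, from JSON text, and from a database table. It is only an illustration: the sample data and the function names are hypothetical, and an in-memory SQLite database stands in for MySQL so that the example runs without a server.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data, written inline instead of read from files.
CSV_TEXT = "name,age\nAsha,34\nRavi,29\n"
JSON_TEXT = '[{"name": "Asha", "age": 34}, {"name": "Ravi", "age": 29}]'


def load_csv(text):
    """Read CSV text; every CSV value is a string, so convert types explicitly."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"name": row["name"], "age": int(row["age"])} for row in reader]


def load_json(text):
    """Read JSON text; numbers keep their type."""
    return json.loads(text)


def load_from_database(rows):
    """Store rows in an in-memory SQLite table (a stand-in for MySQL) and query them back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
        conn.executemany(
            "INSERT INTO people VALUES (?, ?)",
            [(r["name"], r["age"]) for r in rows],
        )
        cursor = conn.execute("SELECT name, age FROM people ORDER BY rowid")
        return [{"name": name, "age": age} for name, age in cursor.fetchall()]
    finally:
        conn.close()


from_csv = load_csv(CSV_TEXT)
from_json = load_json(JSON_TEXT)
from_db = load_from_database(from_json)
print(from_csv == from_json == from_db)
```

The explicit `int(...)` conversion in `load_csv` is the main design point: flat files carry no type information, so the loader has to supply it.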
"Data is an abstraction of reality."
- What assumptions have been made in this entire data collection process?
- Are we aware of the assumptions in this process?
- How do we ensure that the data is accurate and representative for the question we are trying to answer?
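No code can answer these questions outright, but a few basic checks can surface hidden assumptions early. The sketch below is one possible starting point, assuming records with hypothetical `id`, `age`, and `region` fields and an assumed plausible age range of 0 to 120. It counts missing values, finds duplicate ids and out-of-range ages, and reports each region's share of the sample so it can be compared with what we expect of the population.

```python
from collections import Counter

# Hypothetical records with deliberate problems: a missing age,
# a duplicated id, and an implausible age.
RECORDS = [
    {"id": 1, "age": 34, "region": "north"},
    {"id": 2, "age": None, "region": "north"},
    {"id": 3, "age": 29, "region": "south"},
    {"id": 3, "age": 29, "region": "south"},
    {"id": 4, "age": 212, "region": "north"},
]


def quality_report(records, age_range=(0, 120)):
    """Summarise basic accuracy problems; age_range is an assumed plausible range."""
    low, high = age_range
    id_counts = Counter(r["id"] for r in records)
    return {
        "missing_age": sum(1 for r in records if r["age"] is None),
        "duplicate_ids": sorted(i for i, n in id_counts.items() if n > 1),
        "age_out_of_range": [
            r["id"]
            for r in records
            if r["age"] is not None and not low <= r["age"] <= high
        ],
    }


def region_shares(records):
    """Fraction of records per region, to compare against the expected population mix."""
    counts = Counter(r["region"] for r in records)
    total = sum(counts.values())
    return {region: n / total for region, n in counts.items()}


report = quality_report(RECORDS)
shares = region_shares(RECORDS)
print(report)
print(shares)
```

Checks like these only flag symptoms. Deciding whether a skewed region mix or an odd value actually matters still depends on the question being asked.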