Strong mathematical background, computing skills, and data.
- Statistics (traditional analysis)
- Data Munging (parsing, scraping, and formatting data)
- Visualization (graphs, tools, etc.)
Data science is about building data products, not just answering questions.
- Data-driven apps, like a spellchecker or a machine translator
- Interactive visualizations, like Google Flu Trends or the Global Burden of Disease
- Online databases, like an enterprise data warehouse or the Sloan Digital Sky Survey
Closely tied to data warehouses and database-oriented technology. It is about building permanent infrastructure for others to use, while data science is more about using the data products to answer questions and communicate the results.
Statistical methods are the core of data science, but data science typically deals with much larger data sets.
This focuses on the relational data model (rows and columns).
Probably closest to data science. But choosing the right method is sometimes a minor part of data science; preparing, manipulating, cleaning, and wrangling the data takes much more time.
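To remind myself why the preparation step dominates, here is a minimal sketch in Python. The raw lines, the `clean_row` helper, and the three-field layout are all made up for illustration; real inputs would be messier. Note how the "method" (a mean) is one line, while everything before it is wrangling.

```python
# Hypothetical raw records, as they might arrive from scraping or ad hoc files.
raw_rows = [
    "  Alice , 34, seattle",
    "BOB,, Portland ",
    "carol,29,SEATTLE",
    "",
]

def clean_row(line):
    """Parse one comma-separated line into a normalized record, or None if unusable."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3 or not parts[0]:
        return None  # wrong number of fields or missing name
    name, age, city = parts
    return {
        "name": name.title(),
        "age": int(age) if age.isdigit() else None,  # keep the row, mark age as missing
        "city": city.title(),
    }

# Wrangling: parse, normalize, and drop unusable lines.
cleaned = [r for r in (clean_row(line) for line in raw_rows) if r is not None]

# The actual "analysis" is a single line once the data is clean.
ages = [r["age"] for r in cleaned if r["age"] is not None]
mean_age = sum(ages) / len(ages)
```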
How am I going to get into the field?
I might need to catch up on statistical modeling and how to communicate results, which means I am going to audit some stat courses next year. :D
tools vs. abstractions
glm(…) in R,
For this course: lean towards abstractions
- It is all about math.
- Everything is a relation.
the distinction between structural manipulation of data and statistical manipulation of data
ad hoc files
For this course: lean towards structures
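A rough sketch of the structural vs. statistical distinction, assuming a relation can be modeled as a list of rows (dicts). The `patients` and `visits` tables and the `select`, `project`, and `join` helpers are hypothetical names of mine, not from the course. Lists keep duplicate rows, so this is bag semantics rather than strict set-based relational algebra.

```python
from statistics import mean

# Hypothetical relations: each is a list of rows with the same keys.
patients = [
    {"patient_id": 1, "clinic": "north"},
    {"patient_id": 2, "clinic": "south"},
    {"patient_id": 3, "clinic": "north"},
]
visits = [
    {"patient_id": 1, "cost": 120.0},
    {"patient_id": 1, "cost": 80.0},
    {"patient_id": 2, "cost": 200.0},
    {"patient_id": 3, "cost": 50.0},
]

def select(relation, predicate):
    """Keep only the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, columns):
    """Keep only the named columns of every row."""
    return [{c: row[c] for c in columns} for row in relation]

def join(left, right, key):
    """Pair up rows from both relations that agree on the key column."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

# Structural manipulation: reshape and combine the data, no statistics involved.
north_visits = select(join(patients, visits, "patient_id"),
                      lambda row: row["clinic"] == "north")
north_costs = project(north_visits, ["cost"])

# Statistical manipulation: summarize the values once the structure is right.
avg_north_cost = mean(row["cost"] for row in north_costs)
```

Most of the code is structural; the statistical part is the last line.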
great barrier: the variety of incompatible data formats, non-aligned data structures, and inconsistent data semantics.
main memory on a single machine vs. cloud
S3, Azure Storage
For this course: lean towards cloud
assume proficiency in Python, Java, R
assume little or no programming
For this course: lean towards analysts
Science: Empirical -> Theoretical -> Computational -> eScience
- Volume: number of rows
- Variety: number of columns
- Velocity: number of rows per unit time
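
A small sketch of the three definitions above on a toy event log. The `events` data and its field names are invented for illustration, and treating variety as the count of distinct columns is the simplification from these notes, not a general definition.

```python
# Hypothetical event log: (seconds since start, record).
events = [
    (0, {"user": "a", "page": "/home"}),
    (1, {"user": "b", "page": "/home", "referrer": "search"}),
    (3, {"user": "a", "page": "/cart"}),
    (4, {"user": "c", "page": "/home"}),
]

# Volume: number of rows.
volume = len(events)

# Variety: number of distinct columns seen across all rows.
variety = len({column for _, record in events for column in record})

# Velocity: rows per unit time (here, rows per second over the observed span).
elapsed_seconds = events[-1][0] - events[0][0]
velocity = volume / elapsed_seconds
```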