Week 1 Notes

Examples and the Diversity of Data Science

(empty)

Working Definitions of Data Science

Characteristics of DS

strong mathematical background, computing skills, + data.

Mike Driscoll’s three sexy skills of data geeks
  • Statistics (traditional analysis)
  • Data Munging (parsing, scraping, and formatting data)
  • Visualization (graphs, tools, etc.)
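To make "data munging" concrete, here is a minimal sketch of parsing and formatting messy text records. The record layout (name, age, city) and the helper name parse_record are made up for illustration and do not come from the lecture.

```python
import re

def parse_record(line):
    """Parse a loosely formatted 'name, age, city' line into a clean dict.

    Hypothetical helper: the three-field layout is an assumption.
    """
    # Split on commas or semicolons, then trim stray whitespace.
    parts = [p.strip() for p in re.split(r"[,;]", line)]
    name, age, city = parts
    # Normalize capitalization and convert the age to an integer.
    return {"name": name.title(), "age": int(age), "city": city.title()}

# Two records with inconsistent separators, casing, and spacing.
raw_lines = ["  alice ; 30 ;seattle", "BOB, 25, portland  "]
records = [parse_record(line) for line in raw_lines]
print(records)
```

Real munging work usually also has to cope with missing fields and bad values, which this sketch ignores.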

Interesting Vocabulary

data wrangling, data jujitsu, data munging

Data Products

Data science is about building data products, not just answering questions.

Data-driven apps, like a spellchecker or a machine translator

Interactive visualizations, like the Google flu application or Global Burden of Disease

Online databases, like an enterprise data warehouse or the Sloan Digital Sky Survey

Business Intelligence

Closely tied to data warehouses and database-oriented technology. BI is about building permanent infrastructure for others to use, while DS is more about using the data products to answer questions and communicate the results.

Statistics

Statistical methods are at the core of data science, but data science typically deals with much larger data sets.

Data(base) Management

This focuses on the relational data model (rows and columns).

Visualization
Machine Learning

Probably closest to data science. But choosing the right method is sometimes a minor part of DS; preparing, manipulating, cleaning, and wrangling the data takes much more time.

Four Dimensions of Data Science

How am I going to get into the field?

I might need to catch up on statistical modeling and how to communicate results, which means I am going to audit some stat courses next year. :D

ds

Breadth

tools vs. abstractions

tools (hands-on)

Hadoop, PostgreSQL, glm(…) in R, Tableau

abstractions (sophisticated) *

MapReduce, Relational Algebra, Logistic Regression, InfoVis

For this course: lean towards abstractions

* It is all about math. * Everything is a relation
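A small sketch of "everything is a relation": relations as lists of dicts, with three relational algebra operators written by hand. The tables (users, orders) and column names are invented for illustration; a real system such as PostgreSQL does this far more efficiently.

```python
def select(relation, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, columns):
    """Projection: keep only the named columns and drop duplicate rows."""
    result = []
    for row in relation:
        reduced = {c: row[c] for c in columns}
        if reduced not in result:
            result.append(reduced)
    return result

def join(left, right, key):
    """Equi-join: pair up rows from both relations that agree on `key`."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

# Hypothetical example relations.
users = [{"user_id": 1, "name": "Ann"}, {"user_id": 2, "name": "Raj"}]
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 1, "item": "pen"},
    {"user_id": 2, "item": "lamp"},
]

# "Which items did Ann order?" expressed as join -> select -> project.
ann_items = project(
    select(join(users, orders, "user_id"), lambda r: r["name"] == "Ann"),
    ["item"],
)
print(ann_items)
```

Each operator takes relations in and gives a relation back, which is why they compose so easily.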

Depth

the distinction between structural manipulation of data and statistical manipulation of data

structures *

Management, Relational Algebra, Standards

statistics

Analysis, Linear Algebra, ad hoc files

For this course: lean towards structures

great barrier: the variety of incompatible data formats, non-aligned data structures, and inconsistent data semantics.

Scale

main memory on a single machine vs. cloud

Desktop

main memory, R, local files

Cloud *

distributed, Hadoop, S3, Azure Storage

For this course: lean towards cloud
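To connect the cloud side with the MapReduce abstraction mentioned under Breadth, here is a single-machine sketch of a word count in map/shuffle/reduce style. It only imitates the programming model; it is not how Hadoop is implemented, and the function names are my own.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)

# Hypothetical input: each string stands in for a file on a different machine.
documents = ["big data big ideas", "data science"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

On a cluster, the map calls would run in parallel on different machines and the shuffle would move data over the network; the logic per key stays the same.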

Target
hackers

assume proficiency in Python, Java, R

analysts

assume little or no programming

For this course: lean towards analysts

Science: Empirical -> Theoretical -> Computational -> eScience

3V’s of Big Data

  • Volume: number of rows
  • Variety: number of columns
  • Velocity: number of rows per unit time
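A toy sketch of the three V's under the rows/columns reading above, measured on a tiny batch of records. The batch contents, the 2-second window, and the helper name describe_batch are all assumptions for illustration.

```python
def describe_batch(rows, seconds_elapsed):
    """Summarize a batch of dict records along the three V's."""
    # Volume: how many rows arrived.
    volume = len(rows)
    # Variety: how many distinct columns appear across all rows.
    variety = len({column for row in rows for column in row})
    # Velocity: rows per unit time (here, per second).
    velocity = volume / seconds_elapsed
    return {"volume": volume, "variety": variety, "velocity": velocity}

# Hypothetical batch collected over a 2-second window.
batch = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 5, "country": "US"},
    {"user": "c", "clicks": 1},
    {"user": "d", "clicks": 2},
]

summary = describe_batch(batch, seconds_elapsed=2.0)
print(summary)
```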