### Examples and the Diversity of Data Science

(empty)

### Working Definitions of Data Science

#### Characteristics of DS

Strong mathematical background, computing skills, and data.

##### Mike Driscoll’s three sexy skills of data geeks

- Statistics (traditional analysis)
- Data Munging (parsing, scraping, and formatting data)
- Visualization (graphs, tools, etc.)

Interesting vocabulary: `data wrangling`, `data jujitsu`, `data munging`

##### Data Products

Data science is about building data products, not just answering questions.

- `Data-driven apps`: e.g. a spellchecker or a machine translator
- `Interactive visualizations`: e.g. the Google flu application, Global Burden of Disease
- `Online Databases`: e.g. an enterprise data warehouse, the Sloan Digital Sky Survey

#### Distinguishing DS from related topics

##### Business Intelligence

Closely tied to data warehouses and database-oriented technology. It is about building permanent infrastructure for others to use, while DS is more about using the data products to answer questions and communicate the results.

##### Statistics

Statistical methods are the core of data science, but data science typically deals with much larger data sets.

##### Data(base) Management

Focuses on the relational data model (rows and columns).
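To make the "rows and columns" view concrete, here is a minimal sketch of three relational operators in plain Python. The tables, column names, and helper functions (`select`, `project`, `natural_join`) are made up for illustration and are not taken from the course or from any database library.

```python
# A relation modeled as a list of rows; each row is a dict keyed by column name.
# All data below is made-up illustration data.
employees = [
    {"name": "Ada", "dept": "eng", "salary": 120},
    {"name": "Bob", "dept": "ops", "salary": 90},
    {"name": "Cy", "dept": "eng", "salary": 100},
]
departments = [
    {"dept": "eng", "floor": 3},
    {"dept": "ops", "floor": 1},
]


def select(relation, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, columns):
    """Projection: keep only the named columns, dropping duplicate rows."""
    result = []
    for row in relation:
        reduced = {c: row[c] for c in columns}
        if reduced not in result:
            result.append(reduced)
    return result


def natural_join(left, right):
    """Natural join: combine rows that agree on every shared column."""
    result = []
    for l_row in left:
        for r_row in right:
            shared = set(l_row) & set(r_row)
            if all(l_row[c] == r_row[c] for c in shared):
                result.append({**l_row, **r_row})
    return result


# Names of everyone in the "eng" department.
eng_names = project(select(employees, lambda r: r["dept"] == "eng"), ["name"])

# Each employee row extended with the floor of their department.
joined = natural_join(employees, departments)
```

The operators compose because each one takes relations in and returns a relation, which is the property that SQL engines rely on when they rewrite queries.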

##### Visualization

##### Machine Learning

Probably closest to data science. But choosing the right method is sometimes a minor part of DS; preparing, manipulating, cleaning, and wrangling the data takes much more time.

#### Four Dimensions of Data Science

How am I going to get into the field?

I might need to catch up on statistical modeling and how to communicate results, which means I am going to audit some stat courses next year. :D

##### Breadth

tools vs. abstractions

###### tools (hands-on)

`Hadoop`, `PostgreSQL`, `glm(…) in R`, `Tableau`

###### abstractions (sophisticated) *

`MapReduce`, `Relational Algebra`, `Logistic Regression`, `InfoVis`

__For this course:__ lean towards abstractions

- It is all about math.
- Everything is a relation.
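As an illustration of the tool/abstraction split, the MapReduce abstraction can be sketched in a few lines of single-machine Python, with no Hadoop involved. The function names (`map_phase`, `shuffle`, `reduce_phase`) and the word-count task are assumptions chosen for the example; a real framework would run the same three phases distributed across many machines.

```python
from collections import defaultdict


def map_phase(documents, mapper):
    """Apply the mapper to every input and collect the (key, value) pairs."""
    pairs = []
    for doc in documents:
        pairs.extend(mapper(doc))
    return pairs


def shuffle(pairs):
    """Group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(doc):
    """Emit (word, 1) for every word in the document."""
    return [(word.lower(), 1) for word in doc.split()]


def count_reducer(key, values):
    """Sum the counts emitted for one word."""
    return sum(values)


# Made-up input documents.
docs = ["data science is about data", "Science needs data"]
counts = reduce_phase(shuffle(map_phase(docs, word_mapper)), count_reducer)
```

Only `word_mapper` and `count_reducer` are specific to word counting; the three phase functions stay the same for other tasks, which is what makes MapReduce an abstraction rather than a tool.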

##### Depth

the distinction between structural manipulation of data and statistical manipulation of data

###### structures *

`Management`, `Relational Algebra`, `Standards`

###### statistics

`Analysis`, `Linear Algebra`, `ad hoc files`

__For this course:__ lean towards structures

The great barrier: the variety of incompatible data formats, non-aligned data structures, and inconsistent data semantics.
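A small sketch of what that barrier looks like in practice: two hypothetical sources report the same kind of measurement with different file formats, field names, date conventions, and units. Everything here (the sources, field names, and the target schema) is invented for illustration.

```python
import csv
import io
import json
from datetime import datetime

# Source 1: CSV, ISO dates, temperature in Fahrenheit (made-up data).
csv_source = "station,date,temp_f\nA,2013-05-01,68.0\n"

# Source 2: JSON, day/month/year dates, temperature in Celsius (made-up data).
json_source = '[{"site": "B", "day": "02/05/2013", "temp_c": 21.5}]'


def from_csv(text):
    """Map CSV records onto the common schema: station, date, temp_c."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "station": rec["station"],
            "date": datetime.strptime(rec["date"], "%Y-%m-%d").date(),
            # Different semantics: convert Fahrenheit to Celsius.
            "temp_c": round((float(rec["temp_f"]) - 32) * 5 / 9, 1),
        })
    return rows


def from_json(text):
    """Map JSON records onto the same common schema."""
    rows = []
    for rec in json.loads(text):
        rows.append({
            "station": rec["site"],  # different field name, same meaning
            "date": datetime.strptime(rec["day"], "%d/%m/%Y").date(),
            "temp_c": float(rec["temp_c"]),
        })
    return rows


# Once both sources share one schema, they can be analyzed together.
unified = from_csv(csv_source) + from_json(json_source)
```

Note that no statistics happen here at all; every line is structural manipulation, which is the side of the distinction these notes say the course leans towards.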

##### Scale

main memory on a single machine vs. cloud

###### Desktop

`main memory`, `R`, `local files`

###### Cloud *

`distributed`, `Hadoop`, `S3, Azure Storage`

__For this course:__ lean towards cloud

##### Target

###### hackers

assume proficiency in Python, Java, R

###### analysts

assume little or no programming

__For this course:__ lean towards analysts

### Related Topics

Science: Empirical -> Theoretical -> Computational -> eScience

#### 3V’s of Big Data

- **Volume**: number of rows
- **Variety**: number of columns
- **Velocity**: number of rows per unit time
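Taking this simplified reading of the 3 V's literally, all three can be measured on a small table of timestamped records. The sketch below uses made-up event data and a hypothetical helper, `three_vs`; it assumes each row carries an ISO-format timestamp in a `ts` field.

```python
from datetime import datetime

# Made-up event log; the third row has an extra column.
events = [
    {"ts": "2013-05-01T00:00:00", "user": "a", "action": "click"},
    {"ts": "2013-05-01T00:00:20", "user": "b", "action": "view"},
    {"ts": "2013-05-01T00:00:40", "user": "a", "action": "view", "ref": "ad"},
    {"ts": "2013-05-01T00:01:00", "user": "c", "action": "click"},
]


def three_vs(rows, ts_field="ts"):
    """Return (volume, variety, velocity) under the rows/columns reading."""
    # Volume: number of rows.
    volume = len(rows)
    # Variety: number of distinct columns seen across all rows.
    variety = len({col for row in rows for col in row})
    # Velocity: rows per second over the observed time span.
    times = [datetime.fromisoformat(row[ts_field]) for row in rows]
    span_seconds = (max(times) - min(times)).total_seconds()
    velocity = volume / span_seconds if span_seconds > 0 else float("nan")
    return volume, variety, velocity


volume, variety, velocity = three_vs(events)
```

Counting columns is only a rough proxy for variety; in practice the term also covers differences in format and meaning between sources, which a column count does not capture.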