The Data Analysis Life Cycle

Data scientists must be knowledgeable about every aspect of the data life cycle, from acquisition to archival storage.
The main phases are:

Experimental Design
Data Acquisition
Cleaning and Organization
Model selection and/or design
Running the data analysis or training the model
Evaluating the results, drawing conclusions
Communicating the findings to the community
Archiving, storing, and making data, code, etc. available for the future

Cleaning and Organization

Common data quality issues

Non-standard storage format
Missing values
Duplicate samples
Duplicate variables
Incorrect values
Inconsistent value encoding
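
As a rough illustration of how several of these issues might be detected and handled, here is a minimal sketch using only the Python standard library. The data, the column names (`id`, `city`, `temp_c`, `temp_celsius`), and the plausible temperature range are all invented for this example; treating a repeated `id` as a duplicate sample is also an assumption that will not fit every dataset.

```python
import csv
import io

# Hypothetical raw data, invented for illustration only.
RAW = """id,city,temp_c,temp_celsius
1,Boston,21.5,21.5
2,boston,,
2,boston,,
3,BOSTON ,999,999
"""

def load_rows(text):
    """Parse CSV text into a list of dicts (one per sample)."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Return (cleaned_rows, report) after handling a few common issues."""
    report = {"duplicates": 0, "missing": 0, "out_of_range": 0}
    seen, cleaned = set(), []
    for row in rows:
        # Duplicate samples: skip rows whose id was already seen (assumption).
        if row["id"] in seen:
            report["duplicates"] += 1
            continue
        seen.add(row["id"])
        # Inconsistent value encoding: normalize case and whitespace.
        city = row["city"].strip().title()
        # Missing values: represent an empty field as None.
        temp = float(row["temp_c"]) if row["temp_c"] else None
        if temp is None:
            report["missing"] += 1
        # Incorrect values: flag readings outside an assumed plausible range.
        elif not -90.0 <= temp <= 60.0:
            report["out_of_range"] += 1
            temp = None
        # Duplicate variables: keep temp_c, drop the redundant temp_celsius.
        cleaned.append({"id": int(row["id"]), "city": city, "temp_c": temp})
    return cleaned, report

rows = load_rows(RAW)
cleaned, report = clean(rows)
```

Whether an implausible value should be dropped, set to missing, or corrected depends on the analysis; the sketch only sets it to missing and counts it.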

Model selection and/or design

What kind of analysis are you performing?
What questions are you hoping to answer?
What will be done with the answers once you find them?

Types of Models

Supervised
- Model is fit (or “trained”) to minimize error on a “training” dataset for which the “answer” is known.
- The fitted model can then be used to make predictions on new samples (a predictive model).

Unsupervised
- Model tries to provide information about the structure inherent to the dataset, without any “correct answer” or “ground truth” being provided.
- Examples: Clustering, Nearest-Neighbors, etc.
- May be either predictive or descriptive models.
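
To make the distinction concrete, here is a small sketch with one example of each type, written in plain Python. The choice of a least-squares line for the supervised case and one-dimensional k-means for the unsupervised case is an assumption made for illustration, as are the function names and the numbers.

```python
def fit_line(xs, ys):
    """Supervised: least-squares fit of y = a*x + b to known answers ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def kmeans_1d(values, k=2, iterations=20):
    """Unsupervised: group values into k >= 2 clusters; no labels are given."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

# Made-up numbers for illustration.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # training data follow y = 2x + 1
prediction = a * 10 + b                       # prediction for a new sample, x = 10
centers = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5])
```

The supervised function needs the answers `ys` in order to fit; the clustering function receives only the values and reports the structure it finds, here two cluster centers.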

Running the data analysis or training the model

Descriptive Modeling

Generally, you just run the analysis and report the findings. For example, you might analyze historical housing data for an area to see how housing prices have fluctuated as industry came to or left the area.
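
A descriptive analysis of the housing example could be as simple as the sketch below, which summarizes prices per year and the change from one year to the next. The sale records are invented for illustration, and using the median as the summary statistic is an assumption; a real study would choose its summaries to suit the question.

```python
from statistics import median

# Hypothetical sale records (year, price in dollars), invented for illustration.
sales = [
    (2018, 200_000), (2018, 220_000), (2018, 210_000),
    (2019, 230_000), (2019, 250_000),
    (2020, 190_000), (2020, 210_000), (2020, 200_000),
]

def median_by_year(records):
    """Group prices by year and return the median price for each year."""
    by_year = {}
    for year, price in records:
        by_year.setdefault(year, []).append(price)
    return {year: median(prices) for year, prices in sorted(by_year.items())}

def year_over_year_change(medians):
    """Return the fractional change in median price relative to the prior year."""
    years = sorted(medians)
    return {
        later: (medians[later] - medians[earlier]) / medians[earlier]
        for earlier, later in zip(years, years[1:])
    }

medians = median_by_year(sales)
changes = year_over_year_change(medians)
```

Nothing is predicted here: the output only describes what the recorded data show, which is then interpreted alongside outside knowledge such as when an employer arrived or left.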

Predictive Modeling

Predictive models, whether supervised or unsupervised, are often “trained” on an initial sample of the kind of data that you are interested in. Then, you can make predictions about new samples as needed.

Evaluating the results, drawing conclusions

Whether your model is descriptive or predictive, you need some way to evaluate and interpret the results.

This ties back to the original question you are hoping to answer.

How well did your analysis answer the question?

What insights does it give you into the process that produced the data you are analyzing?
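
For a predictive model, one common way to approach "how well did it answer the question" is to compare its error on held-out samples against a trivial baseline. The sketch below assumes mean absolute error as the metric and "always predict the training mean" as the baseline; the model, the numbers, and the names are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def model(x):
    """Hypothetical model fitted earlier on training data: y = 2x + 1."""
    return 2 * x + 1

# Made-up held-out samples that were not used for fitting.
test_x = [5, 6, 7]
test_y = [11.5, 12.5, 15.5]

# Baseline (assumption): always predict the mean of the training answers.
train_mean_y = 6.0

model_mae = mean_absolute_error(test_y, [model(x) for x in test_x])
baseline_mae = mean_absolute_error(test_y, [train_mean_y] * len(test_y))
beats_baseline = model_mae < baseline_mae
```

A model that cannot beat such a baseline on unseen data offers little insight into the process that produced the data, however well it fit the training set.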

Image: https://www.flickr.com/photos/153724200@N07/40932460914

Communicating the findings to the community

Results are not useful until someone knows about them!

In industry, this usually means informing management about the findings and their implications or potential ramifications.

In research/academia, this means presenting at a conference, publishing in a journal, writing a book, or teaching in a classroom.

Image: https://www.flickr.com/photos/sissa-official/15763669965

Archiving, storing, and making data, code, etc. available for the future

It is important that the process of your experiment is properly archived so that it can be repeated in the future.

The “digital age” makes that easier in some ways, and harder in others.

How to locate the information
Archival storage
Controlled access
How to verify the data hasn’t changed
Algorithms are forever, but code isn’t
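
One widely used way to verify that archived data hasn’t changed is to record a cryptographic checksum when the data is archived and recompute it later. Below is a minimal sketch using SHA-256 from Python's standard `hashlib` module; the helper names and the file name `dataset.csv` are hypothetical, and a temporary file stands in for the archived dataset.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_unchanged(path, recorded_digest):
    """Compare the file's current digest with the one recorded at archive time."""
    return sha256_of_file(path) == recorded_digest

# Demonstration on a temporary file standing in for an archived dataset.
with tempfile.TemporaryDirectory() as folder:
    data_file = Path(folder) / "dataset.csv"  # hypothetical file name
    data_file.write_text("id,value\n1,42\n")
    recorded = sha256_of_file(data_file)      # store this alongside the archive
    intact = is_unchanged(data_file, recorded)
    data_file.write_text("id,value\n1,43\n")  # simulate a silent change
    tampered = not is_unchanged(data_file, recorded)
```

The recorded digest is only useful if it is stored somewhere the data itself cannot overwrite, for example in the published paper or a separate repository.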

Image: https://www.flickr.com/photos/doctorow/49501794586 - Cory Doctorow
