The Data Analysis Life Cycle

Data scientists must be knowledgeable about every aspect of the data life cycle, from experimental design to archival storage.
The main phases are:

  • Experimental Design
  • Data Acquisition
  • Cleaning and Organization
  • Model selection and/or design
  • Running the data analysis or training the model
  • Evaluating the results, drawing conclusions
  • Communicating the findings to the community
  • Archiving, storing, and making data, code, etc. available for the future

Cleaning and Organization

Common data quality issues

  • Non-standard storage format
  • Missing values
  • Duplicate samples
  • Duplicate variables
  • Incorrect values
  • Inconsistent value encoding
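
Several of the issues above can be handled in one cleaning pass. The following is a minimal sketch, not a general-purpose cleaner. It assumes a hypothetical record layout with `id`, `state`, and `price` fields, a hypothetical table of state spellings, and the rule that a negative price is incorrect; none of these come from a real dataset.

```python
def clean_records(records):
    """Return cleaned copies of `records` (a list of dicts).

    Hypothetical schema: each record has 'id', 'state', and 'price'.
    """
    # Inconsistent value encoding: map known spellings to one canonical code.
    state_codes = {"ny": "NY", "new york": "NY", "n.y.": "NY"}
    seen_ids = set()
    cleaned = []
    for rec in records:
        # Duplicate samples: keep only the first record seen for each id.
        if rec["id"] in seen_ids:
            continue
        seen_ids.add(rec["id"])

        raw_state = (rec.get("state") or "").strip()
        state = state_codes.get(raw_state.lower(), raw_state.upper()) or None

        # Missing values: blanks and non-numeric text become None.
        try:
            price = float(rec.get("price"))
        except (TypeError, ValueError):
            price = None
        # Incorrect values: a negative price is assumed to be an error.
        if price is not None and price < 0:
            price = None

        cleaned.append({"id": rec["id"], "state": state, "price": price})
    return cleaned


# Made-up records, for illustration only.
raw = [
    {"id": 1, "state": "new york", "price": "250000"},
    {"id": 1, "state": "new york", "price": "250000"},  # duplicate sample
    {"id": 2, "state": " ny ", "price": ""},            # missing value
    {"id": 3, "state": "N.Y.", "price": "-5"},          # incorrect value
]
cleaned = clean_records(raw)
```

Marking bad values as `None` rather than dropping the record keeps the decision about how to handle missing data (drop, impute, or flag) for the modeling step.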

Model selection and/or design

  • What kind of analysis are you performing?

  • What questions are you hoping to answer?

  • What will be done with the answers once you find them?

Types of Models

  • Supervised
    • Model is fit (or “trained”) to minimize error on a “training” dataset in which the “answer” for each sample is known.
    • The fitted model can then be used to make predictions on new samples. (predictive models)
  • Unsupervised
    • Model tries to provide information about the structure inherent to the dataset, without any “correct answer” or “ground truth” being provided.
    • Examples: Clustering, Nearest-Neighbors, etc.
    • May be either predictive or descriptive models.
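
The difference between the two model types can be sketched with two small functions. This is an illustrative example on made-up numbers: `fit_line` stands in for a supervised model (a least-squares line fit to known answers), and `two_means` stands in for an unsupervised one (a one-dimensional, two-cluster variant of k-means). Both function names are hypothetical, and real projects would normally use a library instead.

```python
from statistics import mean


def fit_line(xs, ys):
    """Supervised: least-squares fit of y = slope * x + intercept.

    The known answers `ys` are what the fit minimizes error against.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def two_means(values, iterations=10):
    """Unsupervised: find two cluster centers in 1-D data; no answers are given."""
    low, high = min(values), max(values)  # initial guesses for the centers
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not near_high:  # all values identical: only one cluster exists
            break
        low, high = mean(near_low), mean(near_high)
    return low, high


# Supervised: train on samples with known answers, then predict a new sample.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
prediction = slope * 5 + intercept

# Unsupervised: only the values themselves are provided.
centers = two_means([1.0, 1.2, 0.8, 9.0, 9.5, 8.5])
```

The supervised call needs both inputs and answers, while the unsupervised call receives only the values and reports structure (two groups) that it finds in them.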

Running the data analysis or training the model

Descriptive Modeling

Generally, you just run the analysis and report the findings. For example, you might analyze historical housing data for an area to see how housing prices have fluctuated as industry came to or left the area.
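
A descriptive analysis of this kind might look like the sketch below. The sale prices are made up purely for illustration, and `yearly_change` is a hypothetical helper; the point is only that the output is a summary of existing data, with no prediction involved.

```python
from statistics import median

# Made-up sale prices by year, for illustration only.
sales_by_year = {
    2018: [200_000, 220_000, 210_000],
    2019: [230_000, 250_000, 240_000],
    2020: [180_000, 190_000, 200_000],
}


def yearly_change(sales):
    """Return {year: (median price, percent change from the previous listed year)}."""
    summary = {}
    previous = None
    for year in sorted(sales):
        mid = median(sales[year])
        # The first year has nothing to compare against.
        change = None if previous is None else 100 * (mid - previous) / previous
        summary[year] = (mid, change)
        previous = mid
    return summary


summary = yearly_change(sales_by_year)
for year, (mid, change) in summary.items():
    label = "n/a" if change is None else f"{change:+.1f}%"
    print(year, mid, label)
```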

Predictive Modeling

Predictive models are often “trained” on an initial sample of the kinds of data that you are interested in (whether supervised or unsupervised). Then, you can make predictions about new values as needed.

Evaluating the results, drawing conclusions

Whether your model is descriptive or predictive, you need some way to evaluate and interpret the results.

This ties back to the original question you are hoping to answer.

How well did your analysis answer the question?

What insights does it give you into the process that produced the data you are analyzing?
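
For a predictive model, one common way to ask "how well did the analysis answer the question?" is to measure error on samples that were held out of training and compare it with a trivial baseline. The sketch below assumes a hypothetical, already-trained model `y = 2x + 1` and made-up test values; mean squared error is one possible metric among many.

```python
from statistics import mean


def mean_squared_error(actual, predicted):
    """Average of the squared differences between answers and predictions."""
    return mean([(a - p) ** 2 for a, p in zip(actual, predicted)])


def evaluate(model, baseline_value, test_x, test_y):
    """Return (model error, baseline error) on held-out samples."""
    model_error = mean_squared_error(test_y, [model(x) for x in test_x])
    # Baseline: always predict the same value, ignoring the input.
    baseline_error = mean_squared_error(test_y, [baseline_value] * len(test_y))
    return model_error, baseline_error


def model(x):
    """Hypothetical model, assumed to have been trained earlier."""
    return 2 * x + 1


train_y = [3, 5, 7, 9]                # answers seen during training
test_x, test_y = [5, 6], [11.5, 12.5]  # held-out samples, made up

model_error, baseline_error = evaluate(model, mean(train_y), test_x, test_y)
```

If the model does not clearly beat the baseline on held-out data, it is giving little insight into the process that produced the data.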

Communicating the findings to the community

Results are not useful until someone knows about them!

In industry, this usually means informing management about the findings and their implications or potential ramifications.

In research/academia, this means presenting at a conference, publishing in a journal, writing a book, or teaching in a classroom.

Archiving, storing, and making data, code, etc. available for the future

It is important to archive the data, code, and procedure of your experiment properly so that it can be repeated in the future.

The “digital age” makes that easier in some ways, and harder in others.

  • How to locate the information
  • Archival storage
  • Controlled access
  • How to verify the data hasn’t changed
  • Algorithms are forever, but code isn’t.
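
One way to verify that archived data hasn’t changed is to record a cryptographic checksum when the data is archived and recompute it later. The following is a minimal sketch using SHA-256 from Python's standard library; the helper names `sha256_of` and `is_unchanged` and the file name `results.csv` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_digest):
    """True if the file still matches the digest recorded at archive time."""
    return sha256_of(path) == recorded_digest


# Demonstration on a temporary file with made-up contents.
with tempfile.TemporaryDirectory() as folder:
    data_file = Path(folder) / "results.csv"  # hypothetical file name
    data_file.write_bytes(b"id,price\n1,250000\n")
    recorded = sha256_of(data_file)           # store this alongside the archive

    unchanged_before = is_unchanged(data_file, recorded)
    data_file.write_bytes(b"id,price\n1,999999\n")  # simulate a later modification
    unchanged_after = is_unchanged(data_file, recorded)
```

The recorded digest is only useful if it is stored somewhere the data file's editors cannot silently change, for example in a separate, access-controlled location.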
