Python Tools

Python Tools (Introduction)

This note will be accompanied by a live demonstration of a Python-based analysis of a dataset (“Wine Quality”).

Setting up a Python environment

https://www.python.org/ : The official Python site has the most current documentation, and links to tutorials, forums, etc.

Installing Packages

https://packaging.python.org/tutorials/installing-packages/ : How to install Python packages.

  • You will at least want to make sure you have the pip package manager installed.
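
One way to confirm that pip is visible to the interpreter you are actually running is to ask Python itself. The following is a minimal sketch using only the standard library; the helper name pip_is_available is hypothetical and chosen for illustration.

```python
import importlib.util
import sys


def pip_is_available() -> bool:
    """Return True if the running interpreter can locate the pip module."""
    return importlib.util.find_spec("pip") is not None


if pip_is_available():
    print(f"pip is available for {sys.executable}")
else:
    # ensurepip ships with most standard Python installations.
    print("pip not found; try: python -m ensurepip --upgrade")
```

Checking from inside Python avoids the common confusion where a pip command on the PATH belongs to a different Python installation than the one you are using.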

Package Management for Projects

https://docs.astral.sh/uv/ : uv is an extremely fast Python package and project manager.

For a user who works on multiple “small” projects, the ideal is to use an environment manager such as uv, so that each project lives in its own directory: a “\(\textrm{project} = \textrm{directory}\)” logical model, as seen with the yellow boxes above.

Jupyter Notebooks and Jupyter Lab

https://jupyter.org/ : “Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages.”

  • Jupyter Lab is a browser-based front-end for Jupyter: https://jupyterlab.readthedocs.io
    • You can install Jupyter Lab in your uv virtual environment, then start it with the command: uv run jupyter lab
    • This doesn’t require local installation privileges, but you must be able to connect to the Jupyter Lab instance via a web browser.

Python Language Fundamentals


See Computational and Inferential Thinking, Chapter 3: https://www.inferentialthinking.com/chapters/03/programming-in-python.html


However, we will not focus on the datascience library used in that book; instead, we will use widely available, standard libraries (next slides).

Python Libraries for Data Science

Although almost any library could be used in a data science analysis, the following libraries are very commonly used for data analysis, wrangling, cleaning, and visualization, as well as mathematical modeling.

  • pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. The pandas library focuses on tabular data in a format called “DataFrame” that supports a rich set of access, query, and filtering methods. pandas also makes it easy to read and write structured text files.
  • NumPy: NumPy is the fundamental package for scientific computing with Python. It provides a rich set of operators for vector and matrix math operations, and much more.
  • SciPy: SciPy provides many user-friendly and efficient numerical routines, such as routines for numerical integration, interpolation, optimization, linear algebra, and statistics.
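
To make the descriptions above concrete, here is a small sketch of pandas filtering and NumPy vector math. It covers pandas and NumPy only, and the table values are made up for illustration; they are not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, invented for this example only.
samples = pd.DataFrame(
    {
        "sample": ["a", "b", "c", "d"],
        "alcohol": [9.5, 10.0, 12.5, 11.0],
        "quality": [5, 5, 7, 6],
    }
)

# pandas: select rows with a boolean mask.
strong = samples[samples["alcohol"] > 10.5]

# NumPy: vectorized arithmetic on the whole column at once.
alcohol = samples["alcohol"].to_numpy()
centered = alcohol - alcohol.mean()

# NumPy: Pearson correlation between two columns.
r = np.corrcoef(samples["alcohol"], samples["quality"])[0, 1]

print(strong)
print(centered)
print(round(r, 3))
```

The same pattern (build or load a DataFrame, filter it, then hand columns to NumPy for numerical work) appears in most analyses that use these libraries.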

Python Libraries for Data Science (2)

  • Matplotlib: Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.
  • Seaborn: Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Seaborn directly supports some advanced graph types that Matplotlib does not, such as pairwise correlation multi-plot graphs.
  • scikit-learn: scikit-learn provides “simple and efficient tools for predictive data analysis.” Essentially, it provides many data preprocessing tools, including feature extraction from text and images, and basic machine learning models.
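
As an illustration of the scikit-learn workflow, the sketch below fits a basic linear model to synthetic data. The data, the true slope and intercept, and the random seeds are all assumptions chosen for the example; this is not the analysis used in the demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y depends linearly on x, plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=200)

# Hold out a quarter of the rows to evaluate the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# score() returns the coefficient of determination (R^2) on the held-out rows.
r2 = model.score(X_test, y_test)
print(model.coef_, model.intercept_, r2)
```

Most scikit-learn estimators follow the same fit / predict / score interface, so swapping in a different model usually changes only the line that constructs it.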

Python Libraries for Data Science (3)

  • TensorFlow: TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state of the art in ML and developers easily build and deploy ML-powered applications. We recommend using the Keras front-end to make creating models easier.
  • Statsmodels: Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration. Statsmodels provides common statistical tools similar to those in R.
  • Pingouin: Pingouin is an open-source statistical package written in Python 3 and based mostly on Pandas and NumPy.

Wine Quality Dataset


Wine Quality (UCI ML Repository)
https://archive.ics.uci.edu/ml/datasets/Wine+Quality


Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests (see [Cortez et al., 2009]).
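
A possible way to combine the two datasets with pandas is sketched below. It assumes the files are semicolon-delimited and contain a "quality" column; verify both against your downloaded copy. The helper name load_wine is hypothetical, and the inline rows are made up so the example runs without downloading anything.

```python
import io

import pandas as pd


def load_wine(source, color):
    """Read one semicolon-delimited wine file and tag each row with its color.

    `source` may be a file path or a file-like object.
    """
    df = pd.read_csv(source, sep=";")
    df["color"] = color
    return df


# Made-up stand-ins for the real files (illustrative values only).
red_sample = io.StringIO('"alcohol";"pH";"quality"\n9.0;3.4;5\n10.0;3.2;6\n')
white_sample = io.StringIO('"alcohol";"pH";"quality"\n11.0;3.1;6\n')

# Stack both tables into one DataFrame, keeping track of the wine color.
wines = pd.concat(
    [load_wine(red_sample, "red"), load_wine(white_sample, "white")],
    ignore_index=True,
)

mean_quality = wines.groupby("color")["quality"].mean()
print(mean_quality)
```

With the real data, you would pass the paths of the downloaded red and white files to load_wine in place of the in-memory samples.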