Python Tools

Python Tools (Introduction)

This note will be accompanied by a live demonstration of a Python-based analysis of a dataset (“Wine Quality”).

Setting up a Python environment

https://www.python.org/ : The official Python site has the most current documentation, and links to tutorials, forums, etc.

Installing Packages

https://packaging.python.org/tutorials/installing-packages/ : How to install Python packages.

You will at least want to make sure you have the pip package manager installed.
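
One quick way to confirm that pip is visible to the interpreter you are actually running is to ask Python itself. The following is a minimal sketch using only the standard library; the helper name pip_available is hypothetical and chosen for illustration.

```python
import importlib.util
import sys


def pip_available() -> bool:
    """Return True if the pip module can be found by this interpreter."""
    # find_spec returns None when the module cannot be located.
    return importlib.util.find_spec("pip") is not None


# Report which interpreter is running and whether it can see pip.
print("Python version:", sys.version.split()[0])
print("pip available:", pip_available())
```

If this reports False, the installation guide linked above describes how to get pip for your platform.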

Package Management for Projects

https://docs.astral.sh/uv/ : uv is an extremely fast Python package and project manager.

Use a package manager to install third-party libraries and dependencies to extend the abilities of your Python toolset.
You can do this with pip and a virtual environment tool alone, but uv combines these tasks and makes it easier.
To install uv, follow the installation instructions in the uv documentation linked above.
The uv documentation also includes getting-started guides that are useful once you have uv installed.

For a user who works on several “small” projects, the ideal is to use an environment manager like uv, so that each project lives in its own directory with its own environment: a “\(\textrm{project} = \textrm{directory}\)” logical model, as seen with the yellow boxes above.
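
The “project = directory” idea can be sketched with the standard library alone. Note the assumptions: this uses Python's built-in venv module as a stand-in for uv, the helper create_project and the project names are hypothetical, and the environments are created in a temporary directory so nothing is left behind.

```python
import tempfile
import venv
from pathlib import Path


def create_project(root: Path, name: str) -> Path:
    """Create a project directory that contains its own virtual environment."""
    project_dir = root / name
    project_dir.mkdir(parents=True, exist_ok=True)
    # ".venv" inside the project directory is an assumed naming convention.
    # with_pip=False keeps the sketch fast; a real project would want pip.
    venv.create(project_dir / ".venv", with_pip=False)
    return project_dir


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical project names, one directory (and one environment) each.
    projects = [create_project(Path(tmp), n) for n in ("wine-analysis", "earthquakes")]
    # Every environment carries its own pyvenv.cfg marker file.
    has_own_env = [(p / ".venv" / "pyvenv.cfg").is_file() for p in projects]
    print(has_own_env)
```

Each project directory ends up with an isolated environment, so packages installed for one project do not affect the other. A tool like uv automates this bookkeeping.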

Jupyter Notebooks and Jupyter Lab

https://jupyter.org/ : “Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages.”

I recommend using Visual Studio Code (https://code.visualstudio.com/) with the Python extension pack as the front-end for Jupyter notebooks. (See https://code.visualstudio.com/blogs/2021/11/08/custom-notebooks.)
- VS Code even understands virtual environments, such as the ones created by uv.

Or, use Jupyter Lab as a browser-based front-end for Jupyter: https://jupyterlab.readthedocs.io
- You can install Jupyter Lab in your uv virtual environment, then run it with uv run jupyter lab.
- A hosted instance doesn’t require local installation privileges, but you must be able to connect to the Jupyter Lab instance via a web browser.

Python Language Fundamentals

See Computational and Inferential Thinking, Chapter 3: https://www.inferentialthinking.com/chapters/03/programming-in-python.html

But, we will not focus on the datascience library they use – instead, we will use widely-available and standard libraries (next slides).

Python Libraries for Data Science

Although almost any library could be used in a data science analysis, the following libraries are very commonly used for data analysis, wrangling, cleaning, and visualization, as well as mathematical modeling.

pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. The pandas library focuses on tabular data in a format called “DataFrame” that supports a rich set of access, query, and filtering methods. It also makes input/output from/to structured text files easy.
NumPy: NumPy is the fundamental package for scientific computing with Python. It provides a rich set of operators for vector and matrix math operations, and much more.
SciPy: SciPy provides many user-friendly and efficient numerical routines, such as routines for numerical integration, interpolation, optimization, linear algebra, and statistics.
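
To make these three descriptions concrete, here is a small sketch of one typical operation from each library. The table values are made up for illustration only and are not taken from any real dataset.

```python
import numpy as np
import pandas as pd
from scipy import integrate

# pandas: a small DataFrame (made-up values), then filtering and grouping.
df = pd.DataFrame(
    {
        "sample": ["a", "b", "c", "d"],
        "alcohol": [9.4, 9.8, 10.5, 11.2],
        "quality": [5, 5, 6, 7],
    }
)
high = df[df["quality"] >= 6]  # boolean-mask filtering
mean_by_quality = df.groupby("quality")["alcohol"].mean()

# NumPy: vector and matrix math.
v = np.array([1.0, 2.0, 3.0])
m = np.eye(3) * 2.0  # 3x3 matrix with 2.0 on the diagonal
product = m @ v      # matrix-vector product

# SciPy: numerical integration of x^2 over [0, 1].
area, abs_err = integrate.quad(lambda x: x**2, 0.0, 1.0)

print(high)
print(mean_by_quality)
print(product)
print(area)
```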

Python Libraries for Data Science (2)

Matplotlib: Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.
Seaborn: Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Seaborn directly supports some advanced graph types that Matplotlib does not, such as pairwise correlation multi-plot graphs.
scikit-learn: scikit-learn provides “simple and efficient tools for predictive data analysis.” Essentially, it provides many data processing tools (including some feature extraction for images and text) and basic machine learning models for tasks such as classification, regression, and clustering.
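
As a rough illustration of the scikit-learn workflow (split the data, fit a model, score it), the sketch below fits a linear regression to synthetic data. The data-generating formula, noise level, and random seed are all assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y = 3*x0 - 2*x1 + 1 + noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(0, 0.1, size=200)

# Hold out a quarter of the rows to evaluate the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # coefficient of determination on held-out data

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("R^2 on test data:", r2)
```

Most scikit-learn estimators follow the same fit / predict / score pattern, so swapping in a different model usually changes only the line that constructs it.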

Python Libraries for Data Science (3)

TensorFlow: TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications. We recommend using the Keras front-end to make creating models easier.
Statsmodels: Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration. Statsmodels provides common statistical tools similar to those in R.
Pingouin: Pingouin is an open-source statistical package written in Python 3 and based mostly on Pandas and NumPy.

Wine Quality Dataset

Wine Quality (UCI ML Repository): https://archive.ics.uci.edu/ml/datasets/Wine+Quality

Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests (see [Cortez et al., 2009]).
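
A sketch of how such a file might be loaded with pandas follows. Several things here are assumptions rather than facts from this note: the direct download URL, the semicolon separator, and the helper name load_wine. The rows in the in-memory sample are made-up placeholders (with only a few columns) so that the example runs without network access.

```python
import io

import pandas as pd

# Assumed direct link to the red-wine CSV; verify against the repository page.
RED_WINE_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "wine-quality/winequality-red.csv"
)


def load_wine(source) -> pd.DataFrame:
    """Read a semicolon-separated wine file from a path, URL, or file object."""
    return pd.read_csv(source, sep=";")


# Made-up placeholder rows, used instead of a download.
sample = io.StringIO(
    '"fixed acidity";"pH";"alcohol";"quality"\n'
    "7.0;3.30;9.5;5\n"
    "8.1;3.20;10.0;6\n"
    "6.5;3.40;11.0;7\n"
)
wine = load_wine(sample)

print(wine.head())
print(wine.describe())
```

With network access, calling load_wine(RED_WINE_URL) would be the corresponding step for the real data, assuming the URL above is still valid.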

Dataset Links

UCI ML Repository: https://archive.ics.uci.edu/ml/index.php
Kaggle Dataset Search: https://www.kaggle.com/datasets
Google Dataset Search: https://datasetsearch.research.google.com/
Microsoft Research Open Data: https://msropendata.com/
AwesomeData List of public datasets: https://github.com/awesomedata/awesome-public-datasets

Dataset Links (2)

NASA Datasets: http://data.nasa.gov
Earthquakes: https://earthquake.usgs.gov/
SpaceX Data API: https://github.com/r-spacex/SpaceX-API
This one isn’t really a dataset, but it is related and useful: CSV Faker: https://github.com/pereorga/csvfaker/blob/master/README.rst
