Deep Learning & How to Choose the Right Model

Deep Learning

Although the term deep learning is commonly heard, it is not well defined.

For our purposes, “deep” refers to any neural-network-based model that contains more than a single “hidden layer”. This includes everything from Multi-Layer Perceptrons to convolutional or recurrent neural networks with hundreds of hidden layers.

Deep Learning: A subset of machine learning that uses artificial neural networks with many layers to analyze and learn from complex data. (“Many” can be understood to mean “more than one”, but in practice it usually means significantly more than one.)

Multi-Layer Perceptrons (MLPs)

This is the original “neural network”.

Useful if you have a “traditional” (bag of variables) dataset, need a nonlinear model, and have enough data to train it.

Beware of making MLPs too “deep” - they can become hard to train.
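
To make the idea of stacked hidden layers concrete, here is a minimal sketch of an MLP forward pass in plain Python. The layer sizes and weights are hand-picked for illustration (they are assumptions, not learned values), and a real model would use a library such as PyTorch or TensorFlow instead.

```python
def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum of all inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Element-wise nonlinearity; without it, stacked layers collapse into one linear map."""
    return [max(0.0, v) for v in values]

def mlp_forward(x, layers):
    """Run x through every layer; ReLU follows each layer except the output layer."""
    for i, (weights, biases) in enumerate(layers):
        x = dense(x, weights, biases)
        if i < len(layers) - 1:
            x = relu(x)
    return x

# Hypothetical hand-picked weights: 2 inputs -> 3 hidden -> 2 hidden -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]], [0.0, -0.5]),
    ([[1.0, 1.0]], [0.0]),
]

output = mlp_forward([2.0, 1.0], layers)
print(output)  # [2.0]
```

With two hidden layers, this toy network already counts as “deep” under the definition used here.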

Convolutional Neural Network (CNN)

Based (loosely) on concepts taken from models of how the human visual cortex operates.
Rely on convolutional operations to provide some translation invariance.
Learn visual features in a hierarchical fashion.
Also proven useful in non-vision applications.

Convolutional Neural Networks https://anhreynolds.com/blogs/cnn.html
Understanding Convolutions: https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1
Convolution Arithmetic: https://github.com/vdumoulin/conv_arithmetic
10 CNN Architectures: https://towardsdatascience.com/illustrated-10-cnn-architectures-95d78ace614d

Image from https://anhreynolds.com/blogs/cnn.html

Convolution

Each \(N\times N\) patch in the Input is “compared” (via dot product) to the filter (or kernel) and the result creates a single pseudo-pixel in the Output.
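
The patch-by-patch comparison can be sketched in a few lines of plain Python. This is an illustrative version with no padding and a stride of 1 (both are assumptions); like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide an N x N kernel over a 2D image (no padding, stride 1).

    Each output value is the dot product of the kernel with one N x N patch.
    """
    n = len(kernel)
    out_h = len(image) - n + 1
    out_w = len(image[0]) - n + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for a in range(n):
                for b in range(n):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output

image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [0, 1, 2, 3],
]
kernel = [
    [1, 0],
    [0, 1],
]
result = conv2d_valid(image, kernel)
print(result)  # [[6, 8, 4], [12, 14, 8], [8, 10, 12]]
```

A 4×4 input with a 2×2 kernel gives a 3×3 output: each dimension shrinks by N − 1.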

Image from https://anhreynolds.com/blogs/cnn.html

Convolution Filters : Edge detection

An example kernel that will provide vertical edge detection. Notice how it responds highly at the boundary between the “lighter” and “darker” pixels in the input.

In practice, we don’t hand-craft the filters—we let the network learn them. (In other words, the values in the filter are weights (or parameters) in the model.)
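
As a rough illustration of edge detection, the sketch below applies a hand-written vertical-edge kernel to a small image whose left half is light and whose right half is dark. The kernel values (a Prewitt-style filter) and the pixel values are assumptions chosen for the example; they are not necessarily those in the figure.

```python
def conv2d_valid(image, kernel):
    """Dot product of the kernel with every N x N patch (no padding, stride 1)."""
    n = len(kernel)
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(n) for b in range(n))
            for j in range(len(image[0]) - n + 1)
        ]
        for i in range(len(image) - n + 1)
    ]

# Light pixels (10) on the left, dark pixels (0) on the right.
image = [[10, 10, 10, 0, 0, 0] for _ in range(4)]

# Assumed vertical-edge kernel: positive on the left column, negative on the right.
vertical_edge = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

edges = conv2d_valid(image, vertical_edge)
print(edges)  # [[0, 30, 30, 0], [0, 30, 30, 0]]
```

The response is zero over the uniform regions and large only where a patch straddles the light/dark boundary. A trained network would arrive at filter values like these through gradient descent rather than by hand.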

Image from https://anhreynolds.com/blogs/cnn.html

Sequential and Generative Models

Recurrent Neural Networks (RNNs) & LSTMs

Designed for sequential/timeseries data; output depends on previous inputs.
LSTMs added gating to preserve long-range dependencies.
Largely superseded by Transformers for most sequence and NLP tasks.
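
The dependence on previous inputs comes from a hidden state that is fed back in at every step. Here is a minimal sketch of a scalar recurrent cell; the weight values are arbitrary assumptions, and real RNNs use weight matrices and learned parameters.

```python
import math

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Scalar recurrent cell: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Same weights, two sequences that differ only in the first input.
states_a = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
states_b = rnn_forward([0.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
print(states_a)
print(states_b)
```

The last state of the first sequence is still nonzero even though its final two inputs are zero, so the first input has been “remembered”. It also shrinks at every step, which is the fading-memory problem that LSTM gating was designed to reduce.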

Generative Adversarial Networks (GANs)

Train a discriminative and a generative model in competition.
Produced impressive image generation results through ~2022.
Largely superseded by Diffusion Models for generative tasks.
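
The competition between the two models is usually written as a two-player minimax game. As a sketch, using the value function from the original GAN formulation (Goodfellow et al., 2014, which is not among the references listed here), with \(D\) the discriminator, \(G\) the generator, \(p_{\text{data}}\) the data distribution, and \(p_z\) a noise prior:

\[
\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
\]

The discriminator tries to make this quantity large by scoring real samples near 1 and generated samples near 0, while the generator tries to make it small by producing samples the discriminator scores near 1.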

LSTM Cell image: https://upload.wikimedia.org/wikipedia/commons/5/56/LSTM_cell.svg

Autoencoder

The autoencoder is one of the few unsupervised deep learning models.
It learns to reproduce its input as precisely as possible by first encoding it to a latent feature-space representation, then decoding that representation back to the original dimensionality.

Can be used for compression, denoising, and more.
Can be used for pre-training weights for “few shot learning”.
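
The encode-then-decode structure can be illustrated with a toy example that squeezes 2D points through a one-number latent code. The weights below are hand-picked assumptions that suit data lying on the line y = 2x; a real autoencoder would learn its weights by minimizing the reconstruction error.

```python
def encode(point):
    """Compress a 2D point to a single latent number (hand-picked weights)."""
    x, y = point
    return (x + 2.0 * y) / 5.0

def decode(z):
    """Expand the latent number back to the original two dimensions."""
    return (z, 2.0 * z)

def reconstruction_error(point):
    """Squared distance between the input and its reconstruction."""
    rx, ry = decode(encode(point))
    return (point[0] - rx) ** 2 + (point[1] - ry) ** 2

on_line = (1.0, 2.0)    # follows the pattern y = 2x
off_line = (2.0, -1.0)  # does not follow the pattern

print(reconstruction_error(on_line))   # 0.0
print(reconstruction_error(off_line))  # 5.0
```

Points that match the structure the code captures are reproduced exactly, while points that do not are reconstructed poorly. That gap is what makes the reconstruction error useful for tasks such as denoising.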

What is AE used for?: https://towardsdatascience.com/auto-encoder-what-is-it-and-what-is-it-used-for-part-1-3e5c6f017726
Autoencoders (Stanford): http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/
Autoencoders: https://www.jeremyjordan.me/autoencoders/
LSTM Autoencoders: https://machinelearningmastery.com/lstm-autoencoders/

Autoencoder Image: https://upload.wikimedia.org/wikipedia/commons/2/23/Autoencoder-BodySketch.svg

Diffusion Models

The dominant approach for generative image, audio, and video tasks since about 2022.
Learn to reverse a gradual noise-addition process (denoising diffusion).
More stable to train than GANs; produce higher-quality and more diverse outputs.
Controllable via text prompts — a language model guides the generation process.
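
The gradual noise-addition process that the model learns to reverse can be sketched as follows. This is an illustrative version of the forward (noising) step only, using a linear schedule with the step count and beta range reported in the DDPM paper; the function names and the three-number “image” are assumptions made for the example.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variance added at each step, increasing linearly."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for all s up to t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, 1).
    """
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]  # stand-in for image pixels
slightly_noisy = add_noise(x0, 10, alpha_bars, rng)
almost_pure_noise = add_noise(x0, 999, alpha_bars, rng)
print(alpha_bars[0], alpha_bars[-1])
```

Early steps keep almost all of the signal, while by the last step almost none is left. Training teaches a network to predict the added noise so that the process can be run backwards from pure noise.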

Examples: Stable Diffusion, DALL-E 3, Sora (video), Stable Audio (audio).

Hugging Face Diffusers library: https://huggingface.co/docs/diffusers
DDPM paper: Ho et al. “Denoising Diffusion Probabilistic Models,” 2020. https://arxiv.org/abs/2006.11239

Transformer Networks

Transformer networks were introduced for natural language processing (NLP) but have since become the dominant architecture across many domains — text, vision, audio, and time-series.

The key feature of transformer networks is the self-attention mechanism, which allows the network to weigh different parts of the input differently based on relevance. This replaces the recurrent state used in RNNs.
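
The weighting by relevance can be sketched with scaled dot-product attention, softmax(QKᵀ/√d_k)V, as defined in the Transformer paper. In this simplified version the token vectors are used directly as queries, keys, and values; a real transformer first applies learned linear projections, so treat that shortcut as an assumption of the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every position to this query.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Output is the relevance-weighted average of the value vectors.
        mixed = [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors; no learned projections (simplifying assumption).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights shows how much one token draws from every other token. All rows can be computed independently, which is why attention parallelizes well compared with a recurrent state.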

Vision Transformers (ViT) apply the same architecture to images (split into patches), and now match or exceed CNNs on many vision benchmarks.
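
Splitting an image into patches is the step that turns a picture into a sequence a transformer can process. Here is a minimal sketch for a single-channel image stored as a list of rows; the function name and the 4×4 example are assumptions, and a real ViT would also project each flattened patch to an embedding and add position information.

```python
def image_to_patches(image, patch_size):
    """Split an H x W grid into non-overlapping patches, each flattened to a list."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[top + r][left + c]
                     for r in range(patch_size)
                     for c in range(patch_size)]
            patches.append(patch)
    return patches

# 4 x 4 image whose pixel values are 0..15, read row by row.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = image_to_patches(image, 2)
print(patches)
# [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

The four patches play the same role that word tokens play in a text model.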

Images: A. Vaswani et al., “Attention Is All You Need,” arXiv:1706.03762 [cs], Dec. 2017 [Online]. Available: http://arxiv.org/abs/1706.03762.

Transformer : Key Features

Self-attention mechanism: allows the network to weigh different parts of the input sequence differently based on relevance
No recurrent or convolutional layers: allows for parallel processing and faster training time
Multi-head attention: allows the network to learn more complex relationships between different parts of the input sequence
Advantages
- Improved performance over traditional RNN-based models
- Parallelizable and efficient
- Can handle longer input sequences
Disadvantages
- Every position attends to every other position (a fully connected design), so time and memory costs grow quadratically with sequence length.

Transformer Paper: https://arxiv.org/abs/1706.03762
Transformer Paper explained: https://towardsdatascience.com/attention-is-all-you-need-discovering-the-transformer-paper-73e5ff5e0634
How Transformers Work: https://towardsdatascience.com/transformers-141e32e69591

Large Language Models (LLMs)

An LLM (Large Language Model) is a transformer-based model pre-trained on massive text corpora, then fine-tuned to follow instructions. LLMs can generate fluent text and perform a broad range of language and reasoning tasks.

LLM: Key Features

Pre-trained on massive text corpora, then fine-tuned for instruction-following via RLHF (Reinforcement Learning from Human Feedback)
Capable of text generation, summarization, translation, coding, reasoning, and more
Typically use transformer networks as the underlying architecture
Open-weight models (Llama, Mistral, Falcon) now available for local or on-premise deployment

Advantages
- Can produce high-quality, human-like text across a wide variety of tasks
- In-context learning: provide examples in the prompt — no retraining required
- Can be adapted with minimal labeled data via prompting or lightweight fine-tuning
Disadvantages
- Known to “hallucinate” (give incorrect answers with high confidence)
- Legal issues surrounding training datasets and potential for copyright violation
- High energy and compute cost for training and inference
- Outputs can reflect biases present in training data
- Sensitive to prompt phrasing; reasoning is not formally verifiable

LLM: Practical Techniques for Data Scientists

Retrieval-Augmented Generation (RAG): Augment an LLM with a searchable knowledge base; retrieved text chunks are injected into the prompt at inference time. Reduces hallucination and grounds responses in private or up-to-date data.
Parameter-Efficient Fine-Tuning (PEFT / LoRA): Fine-tune only a small set of added low-rank parameters rather than the full model. Methods like LoRA and QLoRA allow adapting a 7B–70B model on a single GPU.
Prompt Engineering and Inference-Time Compute: Structuring inputs (zero-shot, few-shot, chain-of-thought) to elicit better outputs without any retraining.
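
To show the shape of a RAG pipeline, here is a toy sketch that retrieves the most relevant text chunk and injects it into a prompt. Retrieval here is simple word overlap, standing in for the embedding-based search a real system would use; the documents, function names, and prompt wording are all made up for illustration, and the final call to an LLM is left out.

```python
def tokenize(text):
    """Lowercase and split into a set of words, dropping basic punctuation."""
    for mark in "?.,":
        text = text.replace(mark, " ")
    return set(text.lower().split())

def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(documents,
                    key=lambda doc: len(question_words & tokenize(doc)),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents, top_k=1):
    """Inject the retrieved chunks into the prompt ahead of the question."""
    chunks = retrieve(question, documents, top_k)
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical private knowledge base.
documents = [
    "The cafeteria opens at 8 am on weekdays.",
    "Expense reports are due on the fifth of each month.",
    "The VPN requires two-factor authentication.",
]

prompt = build_prompt("When are expense reports due?", documents)
print(prompt)
```

The model never needs retraining: updating the knowledge base changes what gets retrieved, and therefore what the model sees.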

Hugging Face PEFT library: https://huggingface.co/docs/peft
LoRA paper: Hu et al., 2021. https://arxiv.org/abs/2106.09685
LangChain / LlamaIndex: popular orchestration frameworks for RAG pipelines
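
The low-rank idea behind LoRA can be sketched numerically: the pre-trained weight matrix W stays frozen, and only two thin matrices B and A are trained, giving an effective weight of W + BA. The matrix sizes and values below are assumptions for illustration; the LoRA paper additionally scales the update and initializes B to zero, which is omitted here.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_weight(w_frozen, b_matrix, a_matrix, scale=1.0):
    """Effective weight W + scale * (B @ A); only B and A would be trained."""
    delta = matmul(b_matrix, a_matrix)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_frozen, delta)]

def trainable_counts(d_out, d_in, rank):
    """Parameters updated by full fine-tuning versus by a rank-r adapter."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora

w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen, 3 x 3
b = [[1.0], [0.0], [2.0]]                                # 3 x 1 (rank 1)
a = [[0.5, 0.0, -0.5]]                                   # 1 x 3

adapted = lora_weight(w, b, a)
print(adapted)  # [[1.5, 0.0, -0.5], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

full, lora = trainable_counts(4096, 4096, 8)
print(full, lora)  # 16777216 65536
```

For a hypothetical 4096×4096 layer and rank 8, the adapter trains 256 times fewer parameters than full fine-tuning, which is why large models can be adapted on modest hardware.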

Transfer Learning

Deep learning models are difficult to train and often require massive labeled datasets.
Many real-world tasks have similarities though…

Is identifying birds that different from identifying objects in an image? They are both visual tasks… They both require us to use similar parts of our vision system. It makes sense that a network trained for one task might be able to become quickly proficient at a different (but similar) task.

Train a system on task A. Then, starting from the pre-learned weights \(W_A\) (except perhaps the top layer), train for additional epochs on a specialized task B. This “retraining” should take less time and fewer examples than the original training.
Heavily used in applied machine learning, such as biomedical imaging, agricultural applications, etc.
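
The recipe of reusing \(W_A\) and retraining only the top layer can be sketched in plain Python. In this toy version the “pre-trained” weights and the task B data are invented for illustration, the frozen part is a single ReLU layer, and the new top layer is a linear head trained by gradient descent; in practice this would be done with a framework such as PyTorch or Keras.

```python
def features(x, w_a):
    """Frozen layer from task A: ReLU of a linear map; w_a is never updated."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w_a]

def predict(x, w_a, head):
    """New top layer: a linear head on top of the frozen features."""
    return sum(h * f for h, f in zip(head, features(x, w_a)))

def mean_squared_error(data, w_a, head):
    return sum((predict(x, w_a, head) - y) ** 2 for x, y in data) / len(data)

def train_head(data, w_a, head, learning_rate=0.1, epochs=200):
    """Gradient descent on the head only; w_a is read but never modified."""
    head = list(head)
    for _ in range(epochs):
        grads = [0.0] * len(head)
        for x, y in data:
            feats = features(x, w_a)
            error = sum(h * f for h, f in zip(head, feats)) - y
            for i, f in enumerate(feats):
                grads[i] += 2.0 * error * f / len(data)
        head = [h - learning_rate * g for h, g in zip(head, grads)]
    return head

w_a = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical weights "learned" on task A
task_b = [([1.0, 0.0], 3.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 2.0)]

initial_head = [0.0, 0.0]
loss_before = mean_squared_error(task_b, w_a, initial_head)
new_head = train_head(task_b, w_a, initial_head)
loss_after = mean_squared_error(task_b, w_a, new_head)
print(loss_before, loss_after, new_head)
```

Only the two head weights are learned from the three task B examples, which is the sense in which transfer learning needs fewer examples than training from scratch.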

Few-Shot Learning

Labeled data is usually hard to get. (Correctly labeled data is even harder.)
We need techniques to train networks with fewer labeled examples.
The trick is to first train the network on a task whose labels can be generated automatically; the final training then requires less labeled data.
Advances in Few-Shot Learning: https://towardsdatascience.com/advances-in-few-shot-learning-a-guided-tour-36bc10a68b77
Few-Shot Learning survey paper: https://arxiv.org/abs/1904.05046
Prototypical Networks for Few-Shot Learning: https://www.cs.toronto.edu/~zemel/documents/prototypical_networks_nips_2017.pdf

In LLMs: Few-shot learning can be accomplished by providing some example (input and output) pairs as part of the prompt, prior to providing the input for an unknown. The LLM will then use the pattern from the examples to construct a prediction for the unknown.
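
A few-shot prompt of this kind is just a string with the examples placed ahead of the new input. The sketch below builds one; the “Input:”/“Output:” labels, the instruction text, and the sentiment examples are arbitrary choices for illustration, not a required format.

```python
def few_shot_prompt(examples, query, instruction="Classify the sentiment."):
    """Place labeled (input, output) pairs before the unlabeled input."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # The final entry has no output: the LLM is expected to continue the pattern.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

examples = [
    ("The battery lasts all day.", "positive"),
    ("The screen cracked within a week.", "negative"),
]
prompt = few_shot_prompt(examples, "Setup was quick and painless.")
print(prompt)
```

The resulting text is sent to the LLM as-is; changing the task means changing only the examples, not the model.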

Finding Existing Models

You can often use an existing model that can either apply directly to your task or can be fine-tuned through transfer learning to fit your task.

Hugging Face (https://huggingface.co) has emerged as the central hub for pre-trained models, datasets, and live demos. The transformers, diffusers, and datasets libraries provide a unified interface to thousands of models.

Models: https://huggingface.co/models
Datasets: https://huggingface.co/datasets
Spaces (live demos): https://huggingface.co/spaces

Other sources:

PyTorch Hub: https://pytorch.org/hub/
Keras Applications: https://keras.io/api/applications/
Eleuther AI: https://www.eleuther.ai/releases
Model Zoo (older resource): https://modelzoo.co/

Choosing a Model by Task

This article looks at three kinds of machine learning tasks (Classification, Regression, and Clustering) and presents some of the best-known models for each.
https://elitedatascience.com/machine-learning-algorithms

Here is a similar article that looks at many more kinds of machine learning tasks (and of course the same ones as above as well).
https://www.dataquest.io/blog/top-10-machine-learning-algorithms-for-beginners/

Papers with Code maintains a repository of state-of-the-art research papers and provides open-source implementations and evaluation metrics.
https://paperswithcode.com/sota

Learn More

Deep Learning Book (by Ian Goodfellow, Yoshua Bengio, and Aaron Courville) - a comprehensive textbook covering a wide range of deep learning topics. https://www.deeplearningbook.org/

Papers with Code - a website that aggregates recent research papers and provides open-source implementations and evaluation metrics available with each of them. https://paperswithcode.com

MIT Deep Learning Series - a collection of video lectures by prominent researchers, designed to give a broad overview of the field of machine learning and deep learning. https://deeplearning.mit.edu/

Coursera Deep Learning Specialization - a series of online courses providing a graduate-level introduction to deep learning. https://www.coursera.org/specializations/deep-learning

PyTorch.org - the leading open-source platform for deep learning, with broad adoption across both research and industry. https://pytorch.org/

TensorFlow.org - a widely-used open-source platform for constructing and training machine learning models, including deep learning. https://www.tensorflow.org/
