Deep Learning & How to Choose the Right Model

Deep Learning


Although the term deep learning is commonly heard, it is not well defined.


For our purposes, “deep” refers to any neural-network-based model that contains more than one “hidden layer”. This includes everything from Multi-Layer Perceptrons up to convolutional or recurrent neural networks with hundreds of hidden layers.

Deep Learning
A subset of machine learning that uses artificial neural networks with many layers to analyze and learn from complex data.
“Many” can be understood to mean “more than one”, but in practice it usually means significantly more than one.

Multi-Layer Perceptrons (MLPs)


This is the original “neural network”.


Useful when you have a “traditional” tabular (bag-of-variables) dataset, need a nonlinear model, and have enough data to train it.


Beware of making MLPs too “deep” - they can become hard to train.

A Multi-Layer Perceptron
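To make the “more than one hidden layer” definition concrete, here is a minimal NumPy sketch of an MLP forward pass. The layer sizes, the ReLU activation, and the random (untrained) weights are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_forward(x, layers):
    """Forward pass through a list of (W, b) pairs; ReLU on every layer except the last."""
    h = x
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:
            h = relu(h)
    return h

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]  # 4 inputs -> two hidden layers of 8 units -> 3 outputs (illustrative)
layers = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

batch = rng.normal(size=(5, 4))        # 5 samples with 4 variables each
outputs = mlp_forward(batch, layers)
print(outputs.shape)                   # (5, 3)
```

With two hidden layers this network already counts as “deep” under the definition used here; making it deeper only means adding entries to `sizes`.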

Convolutional Neural Network (CNN)


  • Based (loosely) on concepts taken from models of how the human visual cortex operates.
  • Rely on convolutional operations to provide some translation invariance.
  • Learn visual features in a hierarchical fashion.
  • Also proven useful in non-vision applications.

Convolution


Each \(N\times N\) patch in the Input is “compared” (via dot product) to the filter (or kernel) and the result creates a single pseudo-pixel in the Output.

Convolution Filters: Edge Detection

An example kernel that provides vertical edge detection. Notice how it responds strongly at the boundary between the “lighter” and “darker” pixels in the input.

In practice, we don’t hand-craft the filters—we let the network learn them. (In other words, the values in the filter are weights (or parameters) in the model.)
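The patch-by-patch dot product and the vertical-edge response can be sketched in a few lines of NumPy. The 6×6 input, the pixel values, and the specific 3×3 kernel below are assumptions chosen for illustration; the operation is the unflipped “convolution” (cross-correlation) that deep learning libraries typically use.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a square kernel over the image (no padding, stride 1).

    Each N x N patch is multiplied element-wise with the kernel and summed,
    producing one output pixel per patch position.
    """
    k = kernel.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + k, j:j + k]
            out[i, j] = np.sum(patch * kernel)
    return out

# 6x6 input: "lighter" pixels (10) in the left half, "darker" pixels (0) in the right half
image = np.zeros((6, 6))
image[:, :3] = 10.0

# A hand-crafted vertical-edge kernel (assumed here; a CNN would learn these values)
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

response = conv2d_valid(image, kernel)
print(response)  # every row is [0, 30, 30, 0]: large values only at the light/dark boundary
```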

Sequential and Generative Models

Recurrent Neural Networks (RNNs) & LSTMs

  • Designed for sequential or time-series data; the output at each step depends on previous inputs.
  • LSTMs added gating to preserve long-range dependencies.
  • Largely superseded by Transformers for most sequence and NLP tasks.
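The dependence on previous inputs comes from a hidden state that is carried from one step to the next. Below is a minimal sketch of a plain (non-LSTM) recurrent layer; the dimensions, the tanh activation, and the random weights are illustrative assumptions.

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, b):
    """Run a simple recurrent layer over a sequence xs of shape (seq_len, input_dim)."""
    h = np.zeros(Wh.shape[0])               # hidden state starts at zero
    states = []
    for x in xs:
        h = np.tanh(x @ Wx + h @ Wh + b)    # new state mixes current input and previous state
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 6, 3, 4    # illustrative sizes
Wx = rng.normal(scale=0.5, size=(input_dim, hidden_dim))
Wh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))
states = rnn_forward(xs, Wx, Wh, b)

# Perturb only the first input and rerun: later states change as well
xs_perturbed = xs.copy()
xs_perturbed[0] += 1.0
states_perturbed = rnn_forward(xs_perturbed, Wx, Wh, b)
print(np.abs(states - states_perturbed).max(axis=1))  # per-step change in the hidden state
```

The effect of the early perturbation typically shrinks at later steps, which is the long-range dependency problem that LSTM gating was introduced to address.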

LSTM Cell

Generative Adversarial Networks (GANs)

  • Train a discriminative and a generative model in competition.
  • Produced impressive image generation results through ~2022.
  • Largely superseded by Diffusion Models for generative tasks.

Diagram of a GAN

LSTM Cell image: https://upload.wikimedia.org/wikipedia/commons/5/56/LSTM_cell.svg

Autoencoder

  • The autoencoder is one of the few unsupervised deep learning models.
  • It learns to reproduce its input as precisely as possible by first encoding it to a latent feature-space representation, then decoding that representation back to the original dimensionality.
  • Can be used for compression, denoising, and more.
  • Can be used to pre-train weights for “few-shot learning”.

Autoencoder Image: https://upload.wikimedia.org/wikipedia/commons/2/23/Autoencoder-BodySketch.svg
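The encode-then-decode idea can be sketched without a training loop. As a deliberate simplification, the example below uses a linear encoder and decoder taken from the SVD of the data (the PCA solution) instead of a neural network trained by gradient descent; the synthetic data and the names `encode` and `decode` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples in 10 dimensions that actually lie on a 2-D subspace
latent_true = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent_true @ mixing

mean = X.mean(axis=0)
Xc = X - mean

# The top-2 right singular vectors serve as tied encoder/decoder weights
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:2].T                       # shape (10, 2)

def encode(x):
    return (x - mean) @ W          # 10-D input -> 2-D latent representation

def decode(z):
    return z @ W.T + mean          # 2-D latent representation -> 10-D reconstruction

codes = encode(X)
X_hat = decode(codes)
mse = np.mean((X - X_hat) ** 2)
print(codes.shape, mse)            # reconstruction error is near zero for this data
```

A real autoencoder replaces the two linear maps with nonlinear networks and learns them by minimizing the same reconstruction error.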

Diffusion Models

  • The dominant approach for generative image, audio, and video tasks from ~2022 to the present.
  • Learn to reverse a gradual noise-addition process (denoising diffusion).
  • More stable to train than GANs; produce higher-quality and more diverse outputs.
  • Controllable via text prompts: a text encoder (often a pre-trained language model) conditions the generation process.

Examples: Stable Diffusion, DALL-E 3, Sora (video), Stable Audio (audio).
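The “gradual noise-addition process” can be sketched as follows. This shows only the forward (noising) half in the style of DDPM; the linear schedule, the number of steps, and the 8×8 stand-in image are assumptions, and the learned reverse (denoising) network is not shown.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # per-step noise variances (assumed linear schedule)
alpha_bars = np.cumprod(1.0 - betas)    # fraction of the original signal variance left at step t

def add_noise(x0, t, rng):
    """Sample x_t directly from x_0: sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return x_t, noise

rng = np.random.default_rng(0)
x0 = np.ones((8, 8))                    # stand-in for a clean image
for t in (0, 499, 999):
    x_t, _ = add_noise(x0, t, rng)
    print(t, float(alpha_bars[t]), float(x_t.mean()))
```

By the last step almost none of the original signal remains. In DDPM-style training, a network is trained to predict `noise` from `x_t` and `t`, and generation runs the process in reverse starting from pure noise.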

Transformer Networks

Transformer networks were introduced for natural language processing (NLP) but have since become the dominant architecture across many domains — text, vision, audio, and time-series.

The key feature of transformer networks is the self-attention mechanism, which allows the network to weigh different parts of the input differently based on relevance. This replaces the recurrent state used in RNNs.
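A minimal sketch of single-head scaled dot-product self-attention, following the formulation in the Vaswani et al. paper cited on this slide. The sequence length, dimensions, and random projection matrices are illustrative assumptions; a real transformer adds multiple heads, masking, and learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # relevance of every position to every other
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8              # illustrative sizes
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)               # (5, 8) (5, 5)
```

Every position is processed in the same matrix products, which is why no recurrent state is needed and the computation parallelizes well.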

Vision Transformers (ViT) apply the same architecture to images (split into patches), and now match or exceed CNNs on many vision benchmarks.

Images: A. Vaswani et al., “Attention Is All You Need,” arXiv:1706.03762 [cs], Dec. 2017 [Online]. Available: http://arxiv.org/abs/1706.03762.

Transformer: Key Features

  • Self-attention mechanism: allows the network to weigh different parts of the input sequence differently based on relevance
  • No recurrent or convolutional layers: allows for parallel processing and faster training time
  • Multi-head attention: allows the network to learn more complex relationships between different parts of the input sequence
  • Advantages
    • Improved performance over traditional RNN-based models
    • Parallelizable and efficient
    • Can handle longer input sequences
  • Disadvantages
    • All-to-all attention is expensive: time and memory grow quadratically with sequence length, and models tend to have large parameter counts.

Large Language Models (LLMs)

An LLM (Large Language Model) is a transformer-based model pre-trained on massive text corpora, then fine-tuned to follow instructions. LLMs can generate fluent text and perform a broad range of language and reasoning tasks.

LLM: Key Features

  • Pre-trained on massive text corpora, then fine-tuned for instruction-following via RLHF (Reinforcement Learning from Human Feedback)
  • Capable of text generation, summarization, translation, coding, reasoning, and more
  • Typically use transformer networks as the underlying architecture
  • Open-weight models (Llama, Mistral, Falcon) are now available for local or on-premise deployment
  • Advantages
    • Can produce high-quality, human-like text across a wide variety of tasks
    • In-context learning: provide examples in the prompt — no retraining required
    • Can be adapted with minimal labeled data via prompting or lightweight fine-tuning
  • Disadvantages
    • Known to “hallucinate” (give incorrect answers with high confidence)
    • Legal issues surrounding training datasets and potential for copyright violation
    • High energy and compute cost for training and inference
    • Outputs can reflect biases present in training data
    • Sensitive to prompt phrasing; reasoning is not formally verifiable

LLM: Practical Techniques for Data Scientists

Retrieval-Augmented Generation (RAG)
Augment an LLM with a searchable knowledge base; retrieved text chunks are injected into the prompt at inference time. Reduces hallucination and grounds responses in private or up-to-date data.
Parameter-Efficient Fine-Tuning (PEFT / LoRA)
Fine-tune only a small set of added low-rank parameters rather than the full model. Methods like LoRA and QLoRA allow adapting a 7B–70B model on a single GPU.
Prompt Engineering and Inference-Time Compute
Structuring inputs — zero-shot, few-shot, chain-of-thought — to elicit better outputs without any retraining.
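To illustrate why the low-rank approach in LoRA is cheap, the sketch below adds a rank-r update to a single frozen weight matrix and counts the trainable parameters. The layer size and rank are assumptions, the scaling factor used in practice is omitted, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 512, 512, 8                  # layer size and LoRA rank (illustrative)

W = rng.normal(size=(d_out, d_in))            # frozen pre-trained weight (random stand-in)
A = rng.normal(scale=0.01, size=(r, d_in))    # trainable, small random initialization
B = np.zeros((d_out, r))                      # trainable, zero-initialized so the update starts at 0

def lora_forward(x):
    """y = x W^T + x (B A)^T; only A and B would receive gradient updates."""
    return x @ W.T + (x @ A.T) @ B.T

x = rng.normal(size=(4, d_in))
full_params = W.size
lora_params = A.size + B.size
print(full_params, lora_params, lora_params / full_params)  # 262144 8192 0.03125
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the pre-trained one; fine-tuning then updates about 3% as many parameters for this layer.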

Transfer Learning

  • Deep learning models are difficult to train from scratch and often require massive labeled datasets.
  • Many real-world tasks have similarities though…

Is identifying birds that different from identifying objects in an image? They are both visual tasks… They both require us to use similar parts of our vision system. It makes sense that a network trained for one task might be able to become quickly proficient at a different (but similar) task.

  • Train a system on task A. Starting from the pre-learned weights \(W_A\) (except perhaps the top layer), train for additional epochs on a specialized task B. This “retraining” should take less time and fewer examples than the original training.
  • Heavily used in applied machine learning, such as biomedical imaging, agricultural applications, etc.

Few-Shot Learning

In LLMs: Few-shot learning can be accomplished by providing some example (input and output) pairs as part of the prompt, prior to providing the input for an unknown. The LLM will then use the pattern from the examples to construct a prediction for the unknown.
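A few-shot prompt is just text, so building one needs no special library. The helper below is a hypothetical sketch: the function name, the “Input:/Output:” layout, and the sentiment examples are assumptions, and the resulting string would still need to be sent to whichever LLM you use.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Format (input, output) example pairs followed by the unanswered query."""
    parts = [instruction] if instruction else []
    for text, label in examples:
        parts.append(f"Input: {text}\nOutput: {label}")
    parts.append(f"Input: {query}\nOutput:")   # left open for the model to complete
    return "\n\n".join(parts)

examples = [
    ("The battery died after one day.", "negative"),
    ("Setup took two minutes and it just works.", "positive"),
]
prompt = build_few_shot_prompt(
    examples,
    query="The screen is gorgeous but the speakers crackle.",
    instruction="Classify the sentiment of each review as positive or negative.",
)
print(prompt)
```

Ending the prompt on an open “Output:” line encourages the model to continue the pattern set by the examples.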

Finding Existing Models

You can often use an existing model that can either apply directly to your task or can be fine-tuned through transfer learning to fit your task.

Hugging Face (https://huggingface.co) has emerged as the central hub for pre-trained models, datasets, and live demos. The transformers, diffusers, and datasets libraries provide a unified interface to thousands of models.

Other sources:

Choosing a Model by Task


This article looks at three kinds of machine learning tasks (classification, regression, and clustering) and presents some of the best-known models for each.
https://elitedatascience.com/machine-learning-algorithms


Here is a similar article that covers many more kinds of machine learning tasks, including the three above.
https://www.dataquest.io/blog/top-10-machine-learning-algorithms-for-beginners/


Papers with Code maintains a repository of state-of-the-art research papers, along with open-source implementations and evaluation metrics.
https://paperswithcode.com/sota

Learn More

Deep Learning Book (by Ian Goodfellow, Yoshua Bengio, and Aaron Courville) - a comprehensive textbook covering a wide range of deep learning topics. https://www.deeplearningbook.org/

Papers with Code - a website that aggregates recent research papers together with their open-source implementations and evaluation metrics. https://paperswithcode.com

MIT Deep Learning Series - a collection of video lectures by prominent researchers, designed to give a broad overview of the field of machine learning and deep learning. https://deeplearning.mit.edu/

Coursera Deep Learning Specialization - a series of online courses providing a graduate-level introduction to deep learning. https://www.coursera.org/specializations/deep-learning

PyTorch.org - a leading open-source platform for deep learning, with broad adoption across both research and industry. https://pytorch.org/

TensorFlow.org - a widely-used open-source platform for constructing and training machine learning models, including deep learning. https://www.tensorflow.org/
