Foundation Models

Foundation Model #

AI models trained on massive datasets to perform a wide range of tasks with minimal fine-tuning.

  • Are large deep learning neural networks trained on massive, diverse datasets (text, images, audio, or multiple modalities).

  • Contain millions or billions of parameters.

  • Are designed for general-purpose intelligence: a broad range of tasks rather than a single one.

  • Act as base models for building specialised AI applications.

  • Use transfer learning, so knowledge learned from one task can be reused for others.

  • Can be adapted using fine-tuning or prompting (a minimal prompting sketch follows the diagram below).

Foundation models are trained once on diverse data and adapted many times to solve different tasks.


flowchart LR

    %% -------------------
    %% Data Sources
    %% -------------------
    subgraph DATA_GROUP["Data"]
        TEXT[Textual Data]
        STRUCT[Structured Data]
        SPEECH[Speech]
        SIGNALS[3D Signals]
        IMAGES[Images]
    end

    %% -------------------
    %% Foundation Model
    %% -------------------
    subgraph FM_GROUP["Foundation Model"]
        FM[Pre-trained Model]
    end

    %% -------------------
    %% Adaptation Layer
    %% -------------------
    subgraph ADAPT_GROUP["Adaptation"]
        FT[Fine-tuning]
        PROMPT[Prompting]
        RAG[RAG]
    end

    %% -------------------
    %% Tasks
    %% -------------------
    subgraph TASKS_GROUP["Tasks"]
        IE[Information Extraction]
        OR[Object Recognition]
        IF[Instruction Following]
        IC[Image Captioning]
        SA[Sentiment Analysis]
        QA[Question Answering]
    end

    %% -------------------
    %% Connections
    %% -------------------
    TEXT -->|Training| FM
    STRUCT -->|Training| FM
    SPEECH -->|Training| FM
    SIGNALS -->|Training| FM
    IMAGES -->|Training| FM

    FM -->|Adaptation| FT
    FM -->|Adaptation| PROMPT
    FM -->|Adaptation| RAG

    FT --> IE
    FT --> SA

    PROMPT --> IF
    PROMPT --> QA

    RAG --> IE
    RAG --> QA
    RAG --> IC

    %% -------------------
    %% Styling
    %% -------------------
    style TEXT fill:#C8E6C9
    style STRUCT fill:#C8E6C9
    style SPEECH fill:#C8E6C9
    style SIGNALS fill:#C8E6C9
    style IMAGES fill:#C8E6C9

    style FM fill:#90CAF9

    style FT fill:#BBDEFB
    style PROMPT fill:#BBDEFB
    style RAG fill:#BBDEFB

    style IE fill:#FFCCBC
    style OR fill:#FFCCBC
    style IF fill:#FFCCBC
    style IC fill:#FFCCBC
    style SA fill:#FFCCBC
    style QA fill:#FFCCBC

    style DATA_GROUP stroke:none,fill:transparent
    style FM_GROUP stroke:none,fill:transparent
    style ADAPT_GROUP stroke:none,fill:transparent
    style TASKS_GROUP stroke:none,fill:transparent
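One pre-trained model can be steered toward different tasks purely by changing the prompt, with no retraining. A minimal sketch of prompting-based adaptation, assuming the Hugging Face transformers library is installed; the small gpt2 checkpoint is only a stand-in for a real foundation model, so its outputs are illustrative at best:

```python
from transformers import pipeline

# One pre-trained model, loaded once (gpt2 is a small stand-in for a foundation model).
generator = pipeline("text-generation", model="gpt2")

# Task 1: question answering, expressed purely as a prompt.
qa_prompt = "Q: What is the capital of France?\nA:"
print(generator(qa_prompt, max_new_tokens=20)[0]["generated_text"])

# Task 2: summarisation, again expressed only as a prompt to the same model.
summary_prompt = (
    "Summarise in one sentence: Foundation models are trained once on "
    "diverse data and adapted many times to solve different tasks.\nSummary:"
)
print(generator(summary_prompt, max_new_tokens=30)[0]["generated_text"])
```

The model weights never change between the two tasks; only the input text does.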

Traditional ML vs Foundation Models #

| Feature | Traditional ML Models | Foundation Models |
| --- | --- | --- |
| Training data | Small, task-specific datasets | Massive, diverse datasets |
| Model size | Small to medium | Very large (millions/billions of parameters) |
| Purpose | Single, specific task | General-purpose |
| Reusability | Limited | High |
| Training approach | Train from scratch per task | Pre-train once, adapt many times |
| Transfer learning | Rare or minimal | Core design principle |
| Examples | Linear Regression, SVM, Decision Trees | GPT, BERT, CLIP |
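The "Training approach" row is the key practical difference. A minimal contrast sketch, assuming scikit-learn and transformers are installed; the four-example training set and the default sentiment checkpoint are purely illustrative:

```python
# Traditional ML: train a task-specific model from scratch on labelled data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great product", "terrible service", "loved it", "awful experience"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (tiny toy dataset)

traditional = make_pipeline(TfidfVectorizer(), LogisticRegression())
traditional.fit(texts, labels)  # training happens here, once per task

# Foundation model: reuse a model already pre-trained on massive text corpora.
from transformers import pipeline

pretrained = pipeline("sentiment-analysis")  # downloads a pre-trained checkpoint

print(traditional.predict(["good value"]))
print(pretrained("good value"))
```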

Why Foundation Models Are Different #

  • Traditional ML models are built for one problem at a time.
  • Foundation models learn general representations of language, vision, or sound.
  • This enables them to be reused across many applications with minimal additional training (see the embedding sketch below).
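One way to see "general representations" concretely is to take embeddings from a pre-trained encoder and reuse them, without any task-specific training, for more than one purpose. A minimal sketch, assuming transformers and PyTorch are installed and using bert-base-uncased as an illustrative checkpoint:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load a pre-trained encoder once; its representations come from massive text corpora.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden state into a single sentence vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, tokens, 768)
    return hidden.mean(dim=1).squeeze(0)

query = embed("How do I reset my password?")
doc_a = embed("Steps to recover a forgotten account password.")
doc_b = embed("Our office is closed on public holidays.")

# The same general-purpose vectors can back semantic search, duplicate detection,
# clustering, and more -- no retraining, just reuse of the representations.
cos = torch.nn.functional.cosine_similarity
print("query vs doc_a:", cos(query, doc_a, dim=0).item())
print("query vs doc_b:", cos(query, doc_b, dim=0).item())
```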

Areas of Application #

  • Natural Language Processing (NLP)
  • Computer Vision
  • Speech Recognition
  • Multimodal AI (text + images + audio)

Common Foundation Model Examples #

GPT (Generative Pre-trained Transformer) #

  • Focus: Text generation and understanding
  • Tasks: Chatbots, summarisation, translation, code generation
  • Example: ChatGPT

BERT (Bidirectional Encoder Representations from Transformers) #

  • Focus: Language understanding
  • Tasks: Search, question answering, sentiment analysis (question answering is shown in the example below)
  • Used heavily in search engines
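A minimal sketch of extractive question answering with a BERT-family model, assuming transformers is installed; the pipeline downloads a default SQuAD-fine-tuned checkpoint, and the context text is just an example:

```python
from transformers import pipeline

# The default checkpoint for this pipeline is a BERT-family model fine-tuned on SQuAD.
qa = pipeline("question-answering")

result = qa(
    question="What are foundation models adapted with?",
    context=(
        "Foundation models are pre-trained once on massive datasets and then "
        "adapted to downstream tasks using fine-tuning, prompting, or RAG."
    ),
)
print(result)  # dict with 'score', 'start', 'end', and the extracted 'answer'
```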

CLIP (Contrastive Language–Image Pretraining) #

  • Focus: Text + Image understanding
  • Tasks: Image classification using text prompts (see the zero-shot example below)
  • Enables multimodal AI systems
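A minimal sketch of zero-shot image classification with CLIP, assuming transformers, PyTorch, and Pillow are installed; openai/clip-vit-base-patch32 is the public checkpoint, and photo.jpg is a placeholder for any local image:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder path: any local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Encode the image and the candidate text labels into the same embedding space.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher image-text similarity => more likely label; no task-specific training needed.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```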

Foundation Models in the AI Stack #

flowchart TB
    DATA[Massive Datasets]
    FM[Foundation Model]
    TASK1[NLP Tasks]
    TASK2[Vision Tasks]
    TASK3[Speech Tasks]

    DATA --> FM
    FM --> TASK1
    FM --> TASK2
    FM --> TASK3

How Foundation Models Are Used #

  • Pre-trained once on large datasets
  • Adapted using:
    • Fine-tuning
    • Prompt engineering
    • Task-specific heads (see the sketch after this list)
  • Serve as the backbone for modern AI systems, including Large Language Models (LLMs)
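A minimal sketch of attaching a task-specific head to a pre-trained encoder, assuming transformers is installed; bert-base-uncased and the three-label setup are illustrative, and the newly initialised head is what fine-tuning would train:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Reuse the pre-trained encoder; a new, randomly initialised classification head
# with 3 output labels is attached on top (transformers warns about this at load time).
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

inputs = tokenizer("Foundation models are reusable.", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 3); meaningless until fine-tuned
print(logits)

# Fine-tuning (e.g. with transformers' Trainer or a plain PyTorch loop) would then
# update the head -- and optionally the encoder -- on task-specific labelled data.
```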

FM Summary #

  • Foundation models are general-purpose AI models.
  • They power modern systems like LLMs and multimodal AI.
  • They reduce cost, time, and complexity in AI development.
  • They represent a major shift from task-specific ML to scalable intelligence.
