Reinforcement Learning (RL) #
RL is learning by trial and error.
Reinforcement Learning (RL) is a type of machine learning where an autonomous agent learns to make decisions by interacting with an environment.
Instead of being told the correct answer, the agent:
- takes actions
- observes outcomes
- receives rewards or penalties
- gradually learns a strategy that maximises long-term reward
Reinforcement Learning teaches an agent how to act, not what to predict.
```mermaid
flowchart LR
    A[Agent] -->|Action| B[Environment]
    B -->|State + Reward| A
```
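Here is a minimal sketch of that loop in Python. The environment, action set, and reward values are all made up for illustration; they do not come from any particular RL library.

```python
import random

# A toy environment: the agent walks along a number line and is
# rewarded for reaching position 3.
class ToyEnvironment:
    def reset(self):
        self.position = 0
        return self.position                 # initial state

    def step(self, action):
        self.position += action              # action is -1 or +1
        reward = 1.0 if self.position == 3 else -0.1
        done = self.position == 3
        return self.position, reward, done   # state + reward back to the agent

env = ToyEnvironment()
state = env.reset()
for t in range(100):                         # cap the episode length
    action = random.choice([-1, 1])          # a naive random "policy"
    state, reward, done = env.step(action)   # environment responds
    if done:
        break
```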
1. Agent #
- The learner and decision-maker
- Chooses actions based on its current policy
2. Environment #
- Everything the agent interacts with
- Responds to actions by changing state and giving rewards
3. State #
- A snapshot of the environment at a given time
4. Action #
- A choice made by the agent
5. Reward #
- A feedback signal from the environment
- Positive reward → good outcome
- Negative reward → bad outcome
6. Policy #
- The agent’s strategy
- A mapping from states to actions
- Goal: maximise cumulative reward over time
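These six pieces map directly onto code. The sketch below uses a made-up set of integer states, a tabular policy, and a discount factor to show what "cumulative reward over time" means; every name in it is illustrative.

```python
# State: an integer position on a tiny track; Action: a direction to move.
states = [0, 1, 2, 3]
actions = ["left", "right"]

# Policy: the agent's strategy, a mapping from states to actions.
policy = {0: "right", 1: "right", 2: "right", 3: "right"}

# Reward: the feedback signal for taking an action in a state.
def reward(state, action):
    return 1.0 if (state, action) == (2, "right") else 0.0

# Cumulative (discounted) reward over time:
# G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([0.0, 0.0, 1.0]))  # 0.9**2 * 1.0 = 0.81
```

The value returned by `discounted_return` is what the agent tries to drive up by improving its policy.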
What Is an Autonomous Agent? #
An autonomous agent is a system that:
- observes its environment
- makes decisions independently
- acts without direct human instruction
Examples #
- Robots
- Self-driving cars
- Game-playing agents (Chess, Go, Atari)
- Recommendation systems
How Reinforcement Learning Differs from Other ML Types #
Reinforcement Learning vs Supervised Learning #
Supervised Learning
- Uses labelled data
- Learns from examples of correct answers
- Goal: minimise prediction error
Reinforcement Learning
- No labelled “correct” actions
- Learns from rewards and penalties
- Goal: maximise long-term reward
RL is not told what the right action is — it must discover it.
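One way to see the difference is in the data each approach consumes. A small, hypothetical sketch:

```python
# Supervised learning: every input comes with the correct answer (a label).
supervised_example = {"input": [0.2, 0.7], "label": "cat"}

# Reinforcement learning: the agent only sees a scalar reward after acting;
# the correct action is never revealed and must be discovered.
rl_experience = {"state": [0.2, 0.7], "action": "move_left", "reward": -0.1}
```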
Reinforcement Learning vs Unsupervised Learning #
Unsupervised Learning
- Finds hidden patterns in data
- No explicit objective tied to actions
- Focuses on structure (clusters, embeddings)
Reinforcement Learning
- Learns by interaction
- Uses a reward function
- Focuses on decision-making
RL learns what to do, not just what exists.
A Key Conceptual Difference #
Supervised & Unsupervised Learning:
- Assume data points are independent
- Learn a static model from a dataset
Reinforcement Learning:
- Data arrives as sequences
- State → Action → Reward → Next State
- Decisions affect future data
Reinforcement learning models sequential decision-making, which closely mirrors how humans and animals learn.
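In code, that sequence typically shows up as transition tuples. A hypothetical example of one short episode:

```python
# One episode: a sequence of (state, action, reward, next_state) transitions.
# Each next_state depends on the action taken, so the data points are not
# independent: the agent's decisions shape what it sees next.
trajectory = [
    (0, "right", 0.0, 1),
    (1, "right", 0.0, 2),
    (2, "right", 1.0, 3),
]

for state, action, reward, next_state in trajectory:
    print(f"s={state} --{action}--> s'={next_state}, r={reward}")
```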
Reinforcement Learning from Human Feedback (RLHF) #
RLHF is a modern extension of reinforcement learning where humans help define what “good” looks like.
How RLHF Works #
- Humans evaluate model outputs (e.g., ranking answers)
- A reward model is trained from this feedback
- Reinforcement learning optimises the agent using this reward model
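A minimal sketch of the middle step, training a reward model from pairwise human preferences. It assumes a simple lookup-table "model" and a Bradley-Terry style objective; it is illustrative only and does not follow any specific RLHF library.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Human feedback: pairs where the first output was preferred over the second.
preferences = [
    ("joke A", "joke B"),
    ("answer with explanation", "one-word answer"),
]

# A stand-in reward model: a table of scores, one per output.
scores = {out: 0.0 for pair in preferences for out in pair}

# Pairwise objective: maximise the probability that the preferred
# output scores higher than the rejected one.
learning_rate = 0.1
for _ in range(100):
    for chosen, rejected in preferences:
        p_correct = sigmoid(scores[chosen] - scores[rejected])
        # Gradient step on -log(p_correct)
        scores[chosen] += learning_rate * (1 - p_correct)
        scores[rejected] -= learning_rate * (1 - p_correct)

# The learned scores now act as the reward signal that RL optimises against.
print(scores)
```

In a real system the reward model is typically a neural network that scores arbitrary outputs, but the idea is the same: preferred outputs should score higher.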
Why RLHF Matters #
Some concepts are hard to define mathematically:
- humour
- helpfulness
- politeness
- safety
Humans can easily judge these qualities, even if they can’t formalise them.
Example #
It’s hard to write a formula for “funny”, but easy for humans to rate jokes.
That feedback becomes a reward signal to improve the model.
RLHF is widely used in:
- Large Language Models
- Chatbots
- AI assistants
Where Reinforcement Learning Is Used #
- Robotics
- Autonomous vehicles
- Game-playing AI
- Recommendation systems
- Resource allocation
- Training LLM behaviour (via RLHF)
Summary #
- Reinforcement Learning trains agents, not predictors
- Learning happens through trial and error
- Rewards guide behaviour instead of labels
- Decisions are sequential and interdependent
- RLHF brings humans into the reward loop
Reinforcement Learning teaches machines how to behave by rewarding good decisions and punishing bad ones.