Reinforcement Learning (RL) #
RL is learning by trial and error.
Reinforcement Learning (RL) is a type of machine learning where an autonomous agent learns to make decisions by interacting with an environment.
Instead of being told the correct answer, the agent:
- takes actions
- observes outcomes
- receives rewards or penalties
- gradually learns a strategy that maximises long-term reward
Reinforcement Learning teaches an agent how to act, not what to predict.
```mermaid
flowchart LR
    A[Agent] -->|Action| B[Environment]
    B -->|State + Reward| A
```
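Here is a minimal sketch of that loop in Python. The environment, action set, and reward values are all made up for illustration; they do not come from any particular RL library.

```python
import random

# A toy environment: the agent walks along a number line and is
# rewarded for reaching position 3.
class ToyEnvironment:
    def reset(self):
        self.position = 0
        return self.position                 # initial state

    def step(self, action):
        self.position += action              # action is -1 or +1
        reward = 1.0 if self.position == 3 else -0.1
        done = self.position == 3
        return self.position, reward, done   # state + reward back to the agent

env = ToyEnvironment()
state = env.reset()
for t in range(100):                         # cap the episode length
    action = random.choice([-1, 1])          # a naive random "policy"
    state, reward, done = env.step(action)   # environment responds
    if done:
        break
```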
1. Agent #
- The learner and decision-maker
- Chooses actions based on its current policy
2. Environment #
- Everything the agent interacts with
- Responds to actions by changing state and giving rewards
3. State #
- A snapshot of the environment at a given time
4. Action #
- A choice made by the agent
5. Reward #
- A feedback signal from the environment
- Positive reward → good outcome
- Negative reward → bad outcome
6. Policy #
- The agent’s strategy
- A mapping from states to actions
- Goal: maximise cumulative reward over time
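These six pieces map directly onto code. The sketch below uses a made-up set of integer states, a tabular policy, and a discount factor to show what "cumulative reward over time" means; every name in it is illustrative.

```python
# State: an integer position on a tiny track; Action: a direction to move.
states = [0, 1, 2, 3]
actions = ["left", "right"]

# Policy: the agent's strategy, a mapping from states to actions.
policy = {0: "right", 1: "right", 2: "right", 3: "right"}

# Reward: the feedback signal for taking an action in a state.
def reward(state, action):
    return 1.0 if (state, action) == (2, "right") else 0.0

# Cumulative (discounted) reward over time:
# G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([0.0, 0.0, 1.0]))  # 0.9**2 * 1.0 = 0.81
```

The value returned by `discounted_return` is what the agent tries to drive up by improving its policy.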
What Is an Autonomous Agent? #
An autonomous agent is a system that:
- observes its environment
- makes decisions independently
- acts without direct human instruction
Examples #
- Robots
- Self-driving cars
- Game-playing agents (Chess, Go, Atari)
- Recommendation systems
How Reinforcement Learning Differs from Other ML Types #
Reinforcement Learning vs Supervised Learning #
Supervised Learning
- Uses labelled data
- Learns from examples of correct answers
- Goal: minimise prediction error
Reinforcement Learning
- No labelled “correct” actions
- Learns from rewards and penalties
- Goal: maximise long-term reward
RL is not told what the right action is — it must discover it.
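One way to see the difference is in the data each approach consumes. A small, hypothetical sketch:

```python
# Supervised learning: every input comes with the correct answer (a label).
supervised_example = {"input": [0.2, 0.7], "label": "cat"}

# Reinforcement learning: the agent only sees a scalar reward after acting;
# the correct action is never revealed and must be discovered.
rl_experience = {"state": [0.2, 0.7], "action": "move_left", "reward": -0.1}
```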
Reinforcement Learning vs Unsupervised Learning #
Unsupervised Learning
- Finds hidden patterns in data
- No explicit objective tied to actions
- Focuses on structure (clusters, embeddings)
Reinforcement Learning
- Learns by interaction
- Uses a reward function
- Focuses on decision-making
RL learns what to do, not just what exists.
A Key Conceptual Difference #
Supervised & Unsupervised Learning:
- Assume data points are independent
- Learn a static model from a dataset
Reinforcement Learning:
- Data arrives as sequences
- State → Action → Reward → Next State
- Decisions affect future data
Reinforcement learning models sequential decision-making, which closely mirrors how humans and animals learn.
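In code, that sequence typically shows up as transition tuples. A hypothetical example of one short episode:

```python
# One episode: a sequence of (state, action, reward, next_state) transitions.
# Each next_state depends on the action taken, so the data points are not
# independent: the agent's decisions shape what it sees next.
trajectory = [
    (0, "right", 0.0, 1),
    (1, "right", 0.0, 2),
    (2, "right", 1.0, 3),
]

for state, action, reward, next_state in trajectory:
    print(f"s={state} --{action}--> s'={next_state}, r={reward}")
```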
Reinforcement Learning from Human Feedback (RLHF) #
RLHF is a modern extension of reinforcement learning where humans help define what “good” looks like.
How RLHF Works #
- Humans evaluate model outputs (e.g., ranking answers)
- A reward model is trained from this feedback
- Reinforcement learning optimises the agent using this reward model
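A minimal sketch of the middle step, training a reward model from pairwise human preferences. It assumes a simple lookup-table "model" and a Bradley-Terry style objective; it is illustrative only and does not follow any specific RLHF library.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Human feedback: pairs where the first output was preferred over the second.
preferences = [
    ("joke A", "joke B"),
    ("answer with explanation", "one-word answer"),
]

# A stand-in reward model: a table of scores, one per output.
scores = {out: 0.0 for pair in preferences for out in pair}

# Pairwise objective: maximise the probability that the preferred
# output scores higher than the rejected one.
learning_rate = 0.1
for _ in range(100):
    for chosen, rejected in preferences:
        p_correct = sigmoid(scores[chosen] - scores[rejected])
        # Gradient step on -log(p_correct)
        scores[chosen] += learning_rate * (1 - p_correct)
        scores[rejected] -= learning_rate * (1 - p_correct)

# The learned scores now act as the reward signal that RL optimises against.
print(scores)
```

In a real system the reward model is typically a neural network that scores arbitrary outputs, but the idea is the same: preferred outputs should score higher.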
Why RLHF Matters #
Some concepts are hard to define mathematically:
- humour
- helpfulness
- politeness
- safety
Humans can easily judge these qualities, even if they can’t formalise them.
Example #
It’s hard to write a formula for “funny”, but easy for humans to rate jokes.
That feedback becomes a reward signal to improve the model.
RLHF is widely used in:
- Large Language Models
- Chatbots
- AI assistants
Where Reinforcement Learning Is Used #
- Robotics
- Autonomous vehicles
- Game-playing AI
- Recommendation systems
- Resource allocation
- Training LLM behaviour (via RLHF)
Summary #
- Reinforcement Learning trains agents, not predictors
- Learning happens through trial and error
- Rewards guide behaviour instead of labels
- Decisions are sequential and interdependent
- RLHF brings humans into the reward loop
Reinforcement Learning teaches machines how to behave by rewarding good decisions and punishing bad ones.