Reinforcement Learning

Reinforcement Learning is a powerful machine learning technique that allows an agent to learn how to make decisions by interacting with an environment. In this course, we will explore key terms and vocabulary related to Reinforcement Learning in the context of AI technologies for the Marine Industry.

**Agent**: An entity that interacts with the environment in Reinforcement Learning. The agent takes actions based on the information it receives from the environment and receives rewards or penalties based on those actions.

**Environment**: The external system or domain in which the agent operates. It provides feedback to the agent in the form of rewards or punishments based on the actions taken.

**State**: A representation of the current situation of the agent within the environment. The state provides information about the agent's position, condition, and any other relevant variables.

**Action**: A decision made by the agent to transition from one state to another. Actions can be discrete (e.g., moving left or right) or continuous (e.g., adjusting speed).

**Reward**: A scalar value that the agent receives from the environment after taking an action. Rewards can be positive (encouraging the agent to repeat the action) or negative (discouraging the agent from repeating the action).

**Policy**: A strategy that the agent uses to determine which action to take in a given state. The policy can be deterministic (always choosing the same action) or stochastic (choosing actions based on probabilities).

**Value Function**: A function that estimates the expected cumulative reward that an agent can achieve from a given state or state-action pair. Value functions help the agent evaluate the goodness of different actions or states.

**Q-Value**: The expected cumulative reward of taking a particular action in a given state and following a specific policy thereafter. Q-values are used in Q-Learning and other algorithms to estimate the value of actions.

**Exploration**: The process by which the agent tries out different actions to learn about the environment and improve its policy. Exploration is essential for discovering optimal strategies.

**Exploitation**: The process by which the agent selects actions based on its current knowledge to maximize immediate rewards. Exploitation is crucial for making efficient decisions.

**Discount Factor**: A parameter that determines the importance of future rewards in Reinforcement Learning. A discount factor of 0 means the agent only cares about immediate rewards, while a discount factor of 1 considers all future rewards equally.

**Markov Decision Process (MDP)**: A mathematical framework used to model sequential decision-making problems in Reinforcement Learning. An MDP consists of states, actions, transition probabilities, and rewards.
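
Written out (standard notation, not specific to this course), an MDP is the tuple:

```latex
% An MDP as a 5-tuple: states, actions, transition kernel, rewards, discount.
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma),
\qquad P(s' \mid s, a), \qquad R(s, a), \qquad \gamma \in [0, 1]
```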

**Bellman Equation**: An equation that expresses the relationship between the value of a state or state-action pair and the values of its successor states. The Bellman equation is used in many Reinforcement Learning algorithms.
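
In the MDP notation above, the Bellman equation for the state-value function of a policy takes the standard form:

```latex
% Bellman expectation equation: the value of s is the expected immediate
% reward plus the discounted value of the successor state.
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)
             \left[ R(s, a) + \gamma V^{\pi}(s') \right]
```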

**Exploration-Exploitation Tradeoff**: The dilemma faced by the agent in deciding whether to explore new actions or exploit known actions to maximize rewards. Balancing exploration and exploitation is a critical challenge in Reinforcement Learning.

**Policy Gradient**: A class of algorithms in Reinforcement Learning that directly optimize the policy of the agent to maximize rewards. Policy gradient methods are often used in continuous action spaces.
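
As an illustrative sketch only (not an algorithm prescribed by this course), the snippet below applies a REINFORCE-style policy-gradient update to a softmax policy over three discrete actions of a toy bandit; the mean rewards and all constants are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([0.2, 0.5, 0.8])  # hypothetical mean reward per action
theta = np.zeros(3)                       # policy parameters (softmax logits)
alpha = 0.1                               # learning rate

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)             # sample an action from the policy
    r = rng.normal(true_rewards[a], 0.1)   # noisy reward from the environment
    grad_log = -probs                      # gradient of log pi(a | theta) ...
    grad_log[a] += 1.0                     # ... for a softmax policy
    theta += alpha * r * grad_log          # REINFORCE: ascend r * grad log pi

print("learned action probabilities:", softmax(theta))
```

The highest-reward action should end up with most of the probability mass; practical policy-gradient methods subtract a baseline to reduce the variance of this estimate.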

**Temporal Difference (TD) Learning**: A family of learning algorithms that update value estimates using the difference between the current estimate and a target built from the observed reward and the estimated value of the next state. TD learning is a fundamental concept in Reinforcement Learning.
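
A minimal tabular TD(0) sketch on the classic five-state random walk (the environment and constants are chosen purely for illustration):

```python
import numpy as np

# Five non-terminal states 1..5; states 0 and 6 are terminal.
# Exiting on the right yields reward +1, everything else reward 0.
rng = np.random.default_rng(1)
alpha, gamma = 0.1, 1.0
V = np.zeros(7)  # value estimates; terminal entries stay 0

for episode in range(1000):
    s = 3  # every episode starts in the middle state
    while s not in (0, 6):
        s_next = s + rng.choice([-1, 1])          # uniformly random policy
        r = 1.0 if s_next == 6 else 0.0
        # TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print(V[1:6])  # should approach the true values 1/6, 2/6, ..., 5/6
```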

**Deep Reinforcement Learning**: A combination of Reinforcement Learning and deep neural networks. Deep Reinforcement Learning algorithms use neural networks to approximate value functions or policies in complex environments.

**Actor-Critic**: A class of algorithms that combines elements of policy-based methods (actor) and value-based methods (critic) in Reinforcement Learning. Actor-Critic methods are effective in handling high-dimensional action spaces.

**SARSA**: A model-free Reinforcement Learning algorithm that updates Q-values from state-action-reward-state-action (SARSA) transitions. SARSA is an on-policy algorithm: it learns the Q-values of the policy it is actually following.

**Q-Learning**: A model-free Reinforcement Learning algorithm that updates Q-values based on the maximum Q-value of the next state. Q-Learning is an off-policy algorithm that learns the optimal Q-values regardless of the policy being followed.
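
A minimal tabular sketch (the corridor environment and every constant are invented for illustration); the commented line is the only place where SARSA would differ:

```python
import numpy as np

# Toy corridor: states 0..5, actions 0 = left, 1 = right.
# Reaching state 5 ends the episode with reward +1.
rng = np.random.default_rng(2)
n_states, n_actions = 6, 2
alpha, gamma, eps = 0.1, 0.9, 0.1
Q = np.zeros((n_states, n_actions))

for episode in range(500):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy behaviour policy
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Off-policy target: max over next actions. SARSA (on-policy) would use
        # Q[s_next, a_next] for the action actually chosen in s_next instead.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1)[:-1])  # greedy policy: every non-terminal state goes right
```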

**Deep Q-Network (DQN)**: A deep reinforcement learning algorithm that combines Q-Learning with deep neural networks. DQN is known for its success in learning to play Atari games directly from pixels.

**Policy Iteration**: An algorithm that iteratively evaluates and improves the policy of the agent based on the value function. Policy iteration alternates between policy evaluation and policy improvement steps.

**Value Iteration**: An algorithm that iteratively computes the optimal value function by maximizing the expected cumulative reward. Value iteration converges to the optimal policy by updating value estimates.
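
A compact sketch of value iteration on a tiny invented three-state, two-action MDP:

```python
import numpy as np

# P[a][s, s'] = transition probabilities; R[s, a] = expected rewards.
# The numbers are arbitrary, but each row of P sums to 1.
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],   # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],   # action 1
])
R = np.array([[0.0, 0.0], [0.0, 0.5], [1.0, 1.0]])
gamma, tol = 0.95, 1e-8

V = np.zeros(3)
while True:
    # Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * V(s')
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V_new = Q.max(axis=1)                 # Bellman optimality backup
    if np.abs(V_new - V).max() < tol:
        break
    V = V_new

print("optimal values:", V, "greedy policy:", Q.argmax(axis=1))
```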

**Monte Carlo Methods**: A class of algorithms in Reinforcement Learning that learn value functions by sampling sequences of states, actions, and rewards. Monte Carlo methods estimate value functions directly from experience.

**On-Policy Learning**: A Reinforcement Learning approach where the agent learns the value of the policy it follows. On-policy learning updates the policy and value function simultaneously.

**Off-Policy Learning**: A Reinforcement Learning approach where the agent learns the value of a different policy from the one it follows. Off-policy learning decouples the policy being evaluated from the policy being improved.

**Function Approximation**: A technique in Reinforcement Learning that uses parameterized functions to approximate value functions or policies. Function approximation allows learning in high-dimensional or continuous state spaces.

**Curriculum Learning**: A training strategy in Reinforcement Learning where the difficulty of tasks is gradually increased to facilitate learning. Curriculum learning helps agents learn more complex policies efficiently.

**Transfer Learning**: A technique in Reinforcement Learning where knowledge or policies learned in one task are applied to improve learning in a related task. Transfer learning accelerates learning in new environments.

**Multi-Agent Reinforcement Learning**: A framework in Reinforcement Learning where multiple agents interact with each other and the environment. Multi-agent reinforcement learning is used to model cooperative or competitive scenarios.

**Simulated Annealing**: A stochastic optimization technique inspired by the annealing process in metallurgy. In Reinforcement Learning it occasionally appears as an exploration heuristic, accepting apparently worse actions with a probability that decreases over time.

**Stochastic Environment**: An environment where the outcomes of actions are probabilistic rather than deterministic. Stochastic environments pose challenges in learning optimal policies.

**Deterministic Environment**: An environment where the outcomes of actions are predictable and involve no randomness. Optimal policies are generally easier to learn in deterministic environments.

**Reward Shaping**: A technique in Reinforcement Learning where additional rewards are provided to guide the agent towards desirable behaviors. Reward shaping can accelerate learning by providing informative feedback.

**Sparse Rewards**: A challenge in Reinforcement Learning where rewards are only given infrequently, making it difficult for the agent to learn. Sparse rewards require exploration strategies to discover rewarding actions.

**Function Approximation Error**: The discrepancy between the true value function and the approximated value function. Function approximation errors can lead to suboptimal policies and slow convergence in Reinforcement Learning.

**Policy Evaluation**: The process of estimating the value function of a policy by simulating interactions with the environment. Policy evaluation is a crucial step in Reinforcement Learning algorithms.

**Policy Improvement**: The process of updating the policy based on the estimated value function to maximize rewards. Policy improvement aims to make the agent's decisions more optimal over time.

**Generalization**: The ability of an agent to apply learned knowledge from one state to similar states in the environment. Generalization allows agents to adapt to new situations efficiently.

**Overfitting**: The phenomenon where a learning algorithm performs well on training data but fails to generalize to unseen data. Overfitting can lead to suboptimal policies in Reinforcement Learning.

**Underfitting**: The phenomenon where a learning algorithm fails to capture the underlying structure of the data. Underfitting can result in poor performance and slow learning in Reinforcement Learning.

**Bias-Variance Tradeoff**: The balance between bias (systematic errors) and variance (random errors) in learning algorithms. Finding the right balance is crucial for achieving good generalization in Reinforcement Learning.

**Function Approximation Techniques**: Various methods used to approximate value functions or policies in Reinforcement Learning, including linear regression, neural networks, decision trees, and more.

**Exploration Strategies**: Techniques used by agents to explore the environment and learn optimal policies, such as epsilon-greedy, softmax, Upper Confidence Bound (UCB), and Thompson sampling.

**Reward Function**: A function that maps states and actions to rewards in Reinforcement Learning. The reward function defines the goal of the agent and guides its learning process.

**Discounted Return**: The sum of discounted rewards that an agent receives over time. Discounted return is used to evaluate the long-term performance of policies in Reinforcement Learning.
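
In symbols (the standard definition):

```latex
% Discounted return from time step t, with discount factor 0 <= gamma <= 1:
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}
```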

**Batch Reinforcement Learning**: A setting in Reinforcement Learning where the agent learns from a fixed dataset of experiences rather than interacting with the environment in real-time. Batch reinforcement learning is useful for offline learning.

**Online Reinforcement Learning**: A setting in Reinforcement Learning where the agent learns by interacting with the environment in real-time. Online reinforcement learning is used when the environment is dynamic or unknown.

**Model-Based Reinforcement Learning**: An approach in Reinforcement Learning where the agent learns a model of the environment (transition dynamics and rewards) to make decisions. Model-based RL can improve sample efficiency.

**Model-Free Reinforcement Learning**: An approach in Reinforcement Learning where the agent learns directly from experience without building an explicit model of the environment. Model-free RL is more flexible but requires more data.

**Curse of Dimensionality**: The challenge in Reinforcement Learning where the complexity of the state or action space grows exponentially with the number of dimensions. The curse of dimensionality can lead to computational inefficiency.

**Function Approximation Challenges**: Issues such as vanishing gradients, overfitting, and catastrophic forgetting that arise when using function approximation in Reinforcement Learning. Addressing these challenges is crucial for successful learning.

**Natural Policy Gradient**: An optimization technique that accounts for the geometry of the policy space in Reinforcement Learning. Natural policy gradients can improve convergence and stability of policy updates.

**Trust Region Policy Optimization (TRPO)**: An algorithm in Reinforcement Learning that constrains policy updates to ensure small changes and stable learning. TRPO prevents large policy changes that can lead to instability.

**Proximal Policy Optimization (PPO)**: An algorithm in Reinforcement Learning that simplifies the trust region approach by using a clipped objective function. PPO is efficient and straightforward to implement compared to TRPO.

**Deterministic Policy Gradient (DPG)**: An algorithm in Reinforcement Learning that optimizes deterministic policies using gradient-based methods. DPG is suitable for continuous action spaces with deterministic policies.

**Deep Deterministic Policy Gradient (DDPG)**: An algorithm that combines DPG with deep neural networks for learning continuous control policies. DDPG is effective in high-dimensional action spaces.

**Twin Delayed Deep Deterministic Policy Gradient (TD3)**: An extension of DDPG that uses two Q-functions and delayed policy updates for improved stability. TD3 is known for its robust performance in complex environments.

**Distributed Reinforcement Learning**: A framework where multiple agents learn simultaneously and share experiences in a distributed environment. Distributed reinforcement learning can accelerate learning and handle large-scale tasks.

**Meta-Reinforcement Learning**: A higher-level learning approach where the agent learns how to learn efficient policies across different tasks. Meta-reinforcement learning enables rapid adaptation to new environments.

**Imitation Learning**: A technique in Reinforcement Learning where the agent learns by imitating expert demonstrations rather than exploring the environment. Imitation learning can accelerate learning in complex tasks.

**Inverse Reinforcement Learning**: A framework in Reinforcement Learning where the agent infers the reward function from observed behavior. Inverse reinforcement learning is used to learn reward functions from demonstrations.

**Off-Policy Evaluation**: A technique in Reinforcement Learning that estimates the performance of a policy without executing it. Off-policy evaluation is useful for evaluating policies in simulation or using historical data.

**Off-Policy Correction**: A method in Reinforcement Learning that corrects for the bias introduced by learning from off-policy data. Off-policy correction ensures accurate value estimates in off-policy learning.

**Multi-armed Bandit Problem**: A classic problem in Reinforcement Learning where the agent must choose between multiple actions (arms) with unknown rewards. The multi-armed bandit problem is used to study exploration-exploitation tradeoffs.

**Contextual Bandit**: A variant of the multi-armed bandit problem where the rewards depend on contextual information. Contextual bandits are used in personalized recommendation systems and online advertising.

**Epsilon-Greedy Strategy**: An exploration strategy in Reinforcement Learning where the agent chooses a random action with probability epsilon and the best action with probability 1-epsilon. Epsilon-greedy balances exploration and exploitation.
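
A minimal sketch of the selection rule (the function name and signature are illustrative):

```python
import numpy as np

def epsilon_greedy(q_values: np.ndarray, eps: float, rng: np.random.Generator) -> int:
    """With probability eps pick a random action, otherwise the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(q_values.argmax())

rng = np.random.default_rng(0)
print(epsilon_greedy(np.array([0.1, 0.5, 0.2]), eps=0.1, rng=rng))  # usually 1
```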

**Upper Confidence Bound (UCB)**: An exploration strategy in Reinforcement Learning that selects actions based on their estimated value and uncertainty. UCB balances exploration by considering both rewards and confidence intervals.
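
The classic UCB1 rule can be written as (standard form, with exploration constant c > 0):

```latex
% N_t(a) counts how often action a has been tried up to step t; the square-root
% bonus shrinks as an action is sampled more often.
a_t = \arg\max_{a} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]
```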

**Thompson Sampling**: A probabilistic exploration strategy in Reinforcement Learning that samples actions according to their posterior distribution. Thompson sampling is effective in handling uncertainty and improving exploration.

**Stochastic Gradient Descent (SGD)**: An optimization technique used to update the parameters of neural networks in Reinforcement Learning. SGD minimizes the loss function by adjusting the weights of the network.

**Experience Replay**: A technique in Reinforcement Learning that stores experiences in a replay buffer and samples mini-batches for training neural networks. Experience replay improves sample efficiency and stabilizes learning.
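
A minimal replay-buffer sketch (the class name and default capacity are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences drop off first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform sampling breaks the correlation between consecutive
        # transitions, which stabilizes neural-network training.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```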

**Reward Scaling**: A normalization technique in Reinforcement Learning that scales rewards to a specific range to facilitate learning. Reward scaling prevents large rewards from dominating the learning process.

**Action Space**: The set of possible actions that the agent can take in a given state. The action space can be discrete, continuous, or both, depending on the task.

**State Space**: The set of all possible states that the agent can encounter in the environment. The state space defines the complexity of the problem and the information available to the agent.

**Policy Search**: A class of algorithms in Reinforcement Learning that search for optimal policies by directly manipulating policy parameters. Policy search methods can be gradient-based or evolutionary.

**Value-Based Methods**: A class of algorithms in Reinforcement Learning that estimate value functions to make decisions. Value-based methods focus on learning the value of states or state-action pairs.

**Policy-Based Methods**: A class of algorithms in Reinforcement Learning that directly optimize policies to maximize rewards. Policy-based methods learn the best actions to take in each state.

**Temporal Difference Error**: The difference between the bootstrapped target (the observed reward plus the discounted value estimate of the next state) and the current value estimate. Temporal difference errors drive learning by updating value estimates.
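
In symbols, the one-step form is:

```latex
% TD error: bootstrapped target minus current estimate.
\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)
```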

**Batch Size**: The number of experiences sampled from the replay buffer for training neural networks in Reinforcement Learning. Batch size affects the stability and convergence of learning algorithms.

**Learning Rate**: A hyperparameter that controls the size of updates to neural network weights in Reinforcement Learning. The learning rate influences the speed and stability of learning.

**Policy Gradient Theorem**: A mathematical result that provides the gradient of the expected reward with respect to policy parameters in Reinforcement Learning. The policy gradient theorem guides the optimization of policies.
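
One standard statement of the theorem:

```latex
% Gradient of the expected return J with respect to policy parameters theta:
\nabla_{\theta} J(\theta)
  = \mathbb{E}_{\pi_{\theta}}\!\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi_{\theta}}(s, a) \right]
```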

**Bellman Optimality Equation**: An equation that describes the optimal value function in Reinforcement Learning. The Bellman optimality equation is used to derive optimal policies in Markov Decision Processes.
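
In the MDP notation used earlier:

```latex
% Bellman optimality equation for the optimal state-value function V*:
V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)
           \left[ R(s, a) + \gamma V^{*}(s') \right]
```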

**Markov Property**: A property of states in Reinforcement Learning where the future state depends only on the current state and action. Markov states have no memory of previous states.

**Dyna-Q Algorithm**: An algorithm in Reinforcement Learning that combines Q-learning with a model of the environment for planning. Dyna-Q uses simulated experiences to improve Q-value estimates.

**Differential Q-Learning**: A variant of Q-learning for the average-reward (continuing) setting that learns differential Q-values, i.e., action values measured relative to the long-run average reward rather than as discounted sums.

**Hindsight Experience Replay**: A technique in Reinforcement Learning that replays experiences with different goals to improve learning. Hindsight experience replay allows agents to learn from failures.

**Adversarial Reinforcement Learning**: A framework where an agent learns in the presence of adversaries with opposing objectives, for example an opponent that perturbs the environment. Adversarial reinforcement learning is used to model competitive scenarios and to train robust policies.

**Self-Play**: A technique in Reinforcement Learning where an agent learns by playing against itself. Self-play is used to improve performance and facilitate learning in zero-sum games.

**Bootstrapping**: Updating a value estimate from other learned estimates, i.e., using the predicted value of a successor state in the update target instead of waiting for the complete return. TD methods bootstrap; Monte Carlo methods do not.

**Model Predictive Control (MPC)**: A control method, closely related to model-based Reinforcement Learning, that uses a model of the environment to predict future states and compute optimal actions over a receding horizon. MPC is widely used in robotics and control systems.

**Action-Value Function**: A function that estimates the expected cumulative reward of taking a particular action in a given state. Action-value functions are used in Q-learning and SARSA algorithms.

**Multi-Step Returns**: A technique in Reinforcement Learning that estimates the value of states or actions by summing rewards over multiple time steps. Multi-step returns improve sample efficiency.
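
Written out for n steps (standard notation):

```latex
% n-step return: n observed rewards followed by a bootstrapped value estimate.
G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n}
            + \gamma^{n} V(s_{t+n})
```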

**Recurrent Neural Networks (RNNs)**: A type of neural network architecture used in Reinforcement Learning to model sequential data. RNNs are effective in handling temporal dependencies in environments.

**Long Short-Term Memory (LSTM)**: A variant of recurrent neural networks that can capture long-term dependencies in sequences. LSTMs are used in Reinforcement Learning to model complex dynamics.

**Curiosity-Driven Exploration**: A strategy in Reinforcement Learning where the agent explores the environment based on curiosity or novelty. Curiosity-driven exploration encourages agents to discover new states.

**Intrinsic Motivation**: A concept in Reinforcement Learning where agents are driven by internal goals rather than external rewards. Intrinsic motivation can improve exploration and learning efficiency.

**Episodic Reinforcement Learning**: A setting in which interaction with the environment is divided into episodes, each ending in a terminal state after which the environment resets. Episodic tasks contrast with continuing tasks, which have no natural endpoint.

Key takeaways

  • Reinforcement Learning is a powerful machine learning technique that allows an agent to learn how to make decisions by interacting with an environment.
  • The agent takes actions based on the information it receives from the environment and receives rewards or penalties in response.
  • The environment provides feedback to the agent in the form of rewards or punishments based on the actions taken.
  • The state captures the agent's position, condition, and any other relevant variables at a given moment.
  • An action is a decision made by the agent to transition from one state to another.
  • Rewards can be positive (encouraging the agent to repeat an action) or negative (discouraging it).
  • The policy can be deterministic (always choosing the same action in a state) or stochastic (choosing actions according to probabilities).