Data Mining and Analysis

Expert-defined terms from the Undergraduate Certificate in AI for Indirect Tax Management course at Greenwich School of Business and Finance. Free to read, free to share, paired with a globally recognised certification pathway.

A #

Association Rule Mining #

Association rule mining is a data mining technique used to discover interesting relations between variables in large databases. It identifies strong rules in the data using measures of interestingness such as support and confidence.
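
A minimal sketch of the idea in plain Python, using a tiny invented transaction list (the items and the example rule are purely illustrative):

```python
# Toy transactions (invented for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule {bread} -> {milk}: confidence = support(both together) / support(antecedent).
antecedent, consequent = {"bread"}, {"milk"}
rule_support = support(antecedent | consequent)
confidence = rule_support / support(antecedent)
print(rule_support, confidence)  # 0.5 and roughly 0.67
```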

Attribute #

An attribute is a characteristic or feature of an object. In the context of data mining, an attribute is a column in a database table that describes a characteristic of the object being studied.

Attribute Selection #

Attribute selection is the process of selecting a subset of relevant attributes for use in data mining tasks. This can help improve the accuracy and efficiency of data mining algorithms.

B #

Bayesian Network #

A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies using a directed acyclic graph (DAG). It is used to reason about uncertain knowledge and make probabilistic inferences.

Binning #

Binning is the process of dividing a continuous attribute into a set of intervals, or bins. This is often done to simplify the data and make it more suitable for data mining algorithms.
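
A rough sketch, assuming NumPy is available (the age values, bin edges, and labels below are invented for illustration):

```python
import numpy as np

# Hypothetical continuous values (ages) to discretise into bins.
ages = np.array([3, 17, 25, 42, 58, 71])

# Bin edges chosen purely for illustration.
edges = np.array([0, 18, 40, 65, 100])
bin_index = np.digitize(ages, edges)  # interval index for each value
labels = ["child", "young adult", "middle-aged", "senior"]
print([labels[i - 1] for i in bin_index])
```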

C #

Classification #

Classification is a data mining task that involves building a model that can predict the class or category of a given instance based on its attributes.

Clustering #

Clustering is a data mining task that involves grouping similar instances together based on their attributes.

CNN (Convolutional Neural Network) #

A CNN is a type of neural network that is commonly used for image recognition tasks. It is designed to automatically and adaptively learn spatial hierarchies of features from images.

Correlation #

Correlation is a statistical measure that describes the degree to which two variables move in relation to each other.
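
A small sketch using NumPy's Pearson correlation (the two variables below are invented):

```python
import numpy as np

# Two invented variables; Pearson correlation measures their linear co-movement.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
print(round(r, 3))           # close to +1: strong positive correlation
```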

D #

Decision Tree #

A decision tree is a type of machine learning model that uses a tree-like structure to make decisions based on the attributes of a given instance.
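
A minimal sketch, assuming scikit-learn is installed; the tiny dataset and attribute names are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Invented training data: attributes are [age, owns_home], labels are the class.
X = [[25, 0], [40, 1], [35, 1], [22, 0]]
y = ["no", "yes", "yes", "no"]

model = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(model.predict([[30, 1]]))  # class predicted for a new instance
```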

Dimensionality Reduction #

Dimensionality reduction is the process of reducing the number of attributes or features in a dataset. This can help improve the performance and interpretability of data mining algorithms.

Discriminant Analysis #

Discriminant analysis is a statistical technique used to classify or predict the group membership of instances based on their attributes.

E #

Ensemble Learning #

Ensemble learning is a machine learning technique that involves combining the predictions of multiple models to improve the accuracy and robustness of the final prediction.
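
One simple combination scheme is majority voting; a minimal sketch with invented predictions from three hypothetical classifiers:

```python
from collections import Counter

# Hypothetical predictions from three separately trained classifiers for one instance.
predictions = ["spam", "spam", "not spam"]

# Majority vote: the class predicted most often wins.
final = Counter(predictions).most_common(1)[0][0]
print(final)  # "spam"
```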

Entropy #

Entropy is a measure of the disorder or randomness of a system. In the context of data mining, entropy is used to evaluate the purity of a set of instances and determine the optimal split in a decision tree.
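
The usual form is Shannon entropy over the class proportions; a small sketch in plain Python:

```python
import math

def entropy(class_counts):
    """Shannon entropy: H = -sum(p_i * log2(p_i)) over the class proportions."""
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c > 0)

print(entropy([5, 5]))   # 1.0: a maximally impure two-class set
print(entropy([10, 0]))  # 0.0: a perfectly pure set
```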

F #

Feature Engineering #

Feature engineering is the process of creating new features or attributes from the existing data in a dataset. This can help improve the performance and interpretability of data mining algorithms.

Feature Selection #

Feature selection is the process of selecting a subset of relevant features or attributes for use in data mining tasks. This can help improve the accuracy and efficiency of data mining algorithms.
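
A sketch of one simple filter-style approach, assuming scikit-learn is installed (the iris dataset and the choice of k are just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features whose ANOVA F-score against the class labels is highest.
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # (150, 4) -> (150, 2)
```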

G #

Genetic Algorithm #

A genetic algorithm is an optimization technique inspired by the process of natural selection. It is used to find the optimal solution to a problem by iteratively evolving a population of candidate solutions.
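
A toy sketch in plain Python, maximising the number of 1s in a bit string (the population size, mutation rate, and generation count below are arbitrary illustrative choices):

```python
import random

def fitness(bits):
    return sum(bits)                      # objective: as many 1s as possible

def mutate(bits, rate=0.05):
    return [1 - b if random.random() < rate else b for b in bits]

def crossover(a, b):
    cut = random.randrange(1, len(a))     # single-point crossover
    return a[:cut] + b[cut:]

population = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for generation in range(50):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]             # selection: keep the fittest
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(20)]
    population = parents + children       # next generation

print(max(fitness(p) for p in population))  # should approach 20
```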

Graphical Model #

A graphical model is a statistical model that represents a set of variables and their dependencies using a graph. It is used to reason about uncertain knowledge and make probabilistic inferences.

H #

Hierarchical Clustering #

Hierarchical clustering is a clustering technique that involves grouping instances into a hierarchical structure of clusters.
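
A brief agglomerative example, assuming SciPy is installed (the points and the cut level are invented):

```python
from scipy.cluster.hierarchy import linkage, fcluster

points = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9], [9.0, 9.0]]

Z = linkage(points, method="average")            # build the merge hierarchy
labels = fcluster(Z, t=3, criterion="maxclust")  # cut into at most 3 flat clusters
print(labels)
```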

Homoscedasticity #

Homoscedasticity is the property of a statistical model where the variance of the errors is constant across all levels of the predictor variables.

I #

Instance #

An instance is a single observation or data point in a dataset.

Information Gain #

Information gain is a measure of the reduction in entropy that results from splitting a set of instances based on a given attribute. It is used to evaluate the quality of a split in a decision tree.
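
A worked sketch on an invented split of 10 instances:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Invented example: 10 instances (6 "yes", 4 "no") split by an attribute into
# a branch of 4 (all "yes") and a branch of 6 (2 "yes", 4 "no").
parent = entropy([6, 4])
children = (4 / 10) * entropy([4, 0]) + (6 / 10) * entropy([2, 4])
print(round(parent - children, 3))  # information gain of the split, about 0.42
```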

Instance-Based Learning #

Instance-based learning is a machine learning technique that involves making predictions based on the similarity of the given instance to instances in the training data.

K #

K-Means Clustering #

K-means clustering is a clustering technique that involves partitioning a set of instances into k clusters based on their attributes.
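
A minimal sketch, assuming scikit-learn is installed (the points and the choice of k are invented):

```python
from sklearn.cluster import KMeans

points = [[1, 1], [1.5, 2], [8, 8], [8, 8.5], [1, 0.5], [9, 8]]

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)           # cluster assignment for each point
print(model.cluster_centers_)  # the two learned centroids
```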

K-Nearest Neighbors (KNN) #

KNN is an instance-based learning algorithm used for classification and regression tasks. It works by finding the k instances in the training data that are most similar to the given instance and using their labels or values to make a prediction.
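
A from-scratch sketch of the classification case (the toy training set and the value of k are invented):

```python
import math
from collections import Counter

# Invented training data: (attribute vector, class label) pairs.
train = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"), ([6.0, 6.0], "B"), ([5.8, 6.2], "B")]

def knn_predict(query, k=3):
    nearest = sorted(train, key=lambda item: math.dist(query, item[0]))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]  # majority label among the k neighbours

print(knn_predict([1.1, 0.9]))  # "A"
```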

L #

Latent Variable #

A latent variable is a hidden or unobserved variable that is not directly measured but is inferred from other observed variables.

Linear Regression #

Linear regression is a statistical model used to predict a continuous variable based on one or more predictor variables.
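
A least-squares sketch with NumPy (the data points are invented):

```python
import numpy as np

# Fit y = slope * x + intercept (approximately) by ordinary least squares.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.2, 5.9, 8.1, 10.2])

A = np.column_stack([x, np.ones_like(x)])    # design matrix [x, 1]
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print(round(slope, 2), round(intercept, 2))  # roughly 2.0 and 0.0
```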

Logistic Regression #

Logistic regression is a statistical model used to predict a binary variable based on one or more predictor variables.
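
A minimal sketch, assuming scikit-learn is installed (a single invented predictor and a binary outcome):

```python
from sklearn.linear_model import LogisticRegression

X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]  # one predictor variable
y = [0, 0, 0, 1, 1, 1]                          # binary outcome

model = LogisticRegression().fit(X, y)
print(model.predict([[2.5], [4.5]]))            # predicted classes
print(model.predict_proba([[4.5]]))             # estimated class probabilities
```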

N #

Naive Bayes #

Naive Bayes is a probabilistic machine learning algorithm used for classification tasks. It is based on Bayes' theorem and assumes that the attributes are independent of each other given the class.
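
A minimal Gaussian Naive Bayes sketch, assuming scikit-learn is installed (the attributes and labels are invented):

```python
from sklearn.naive_bayes import GaussianNB

# Invented data: attributes are [height_cm, weight_kg].
X = [[180, 80], [175, 75], [160, 55], [155, 50]]
y = ["adult", "adult", "teen", "teen"]

model = GaussianNB().fit(X, y)
print(model.predict([[170, 68]]))  # class with the highest posterior probability
```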

Neural Network #

A neural network is a machine learning model inspired by the structure and function of the human brain. It consists of interconnected nodes or neurons that process and transmit information.

O #

Overfitting #

Overfitting is a common problem in machine learning where a model is excessively complex and fits the training data too closely, resulting in poor generalization performance on new data.

Outlier #

An outlier is a data point that is significantly different from the other data points in the dataset.
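
One simple way to flag outliers is a z-score rule; a sketch with NumPy (the data and the threshold of 2 are illustrative only):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])

# Standardise, then flag points far from the mean in units of standard deviation.
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 2])  # the value 95 stands out
```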

P #

Principal Component Analysis (PCA) #

PCA is a dimensionality reduction technique that involves transforming the original attributes into a new set of uncorrelated attributes called principal components.
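
A sketch of PCA via the singular value decomposition of the centred data, using NumPy (the data matrix is invented):

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
X_centred = X - X.mean(axis=0)      # centre each attribute

U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
projected = X_centred @ Vt[:1].T    # scores on the first principal component
print(Vt[0], projected.ravel())
```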

Probability Density Function (PDF) #

A PDF is a function that describes the probability distribution of a continuous random variable.

Probability Mass Function (PMF) #

A PMF is a function that describes the probability distribution of a discrete random variable.

R #

Random Forest #

Random forest is an ensemble learning algorithm that involves building multiple decision trees and combining their predictions to make a final prediction.
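
A minimal sketch, assuming scikit-learn is installed (the data and the tree count are invented):

```python
from sklearn.ensemble import RandomForestClassifier

X = [[25, 0], [40, 1], [35, 1], [22, 0], [50, 1], [28, 0]]
y = ["no", "yes", "yes", "no", "yes", "no"]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.predict([[30, 1]]))  # combined vote of the individual trees
```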

Regression #

Regression is a data mining task that involves building a model that can predict a continuous variable based on its attributes.

Resampling #

Resampling is a technique used to estimate the performance of a machine learning algorithm by repeatedly drawing samples from the training data and evaluating the algorithm on those samples.
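
One common resampling scheme is k-fold cross-validation; a sketch assuming scikit-learn is installed (the dataset and fold count are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Train and evaluate on 5 different train/test splits drawn from the same data.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())
```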

S #

SVM (Support Vector Machine) #

SVM is a machine learning algorithm used for classification and regression tasks. It works by finding the hyperplane that maximally separates the classes or predicts the values with the minimum error.
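
A minimal classification sketch, assuming scikit-learn is installed (the points and the kernel choice are invented):

```python
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
y = [0, 0, 0, 1, 1, 1]

model = SVC(kernel="linear").fit(X, y)
print(model.predict([[1.5, 1.5], [8.5, 8.5]]))  # sides of the separating hyperplane
```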

Supervised Learning #

Supervised learning is a machine learning paradigm where the model is trained on labeled data to make predictions on new, unseen data.
