Natural Language Processing

Expert-defined terms from the Graduate Certificate in Machine Learning in Polymer Science and Engineering course at Greenwich School of Business and Finance. Free to read, free to share, paired with a globally recognised certification pathway.

Natural Language Processing (NLP) #

Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and human language. It involves the development of algorithms and models that enable computers to understand, interpret, and generate human language. NLP encompasses a range of tasks such as text classification, sentiment analysis, machine translation, and speech recognition.

NLP is essential for applications such as chatbots and virtual assistants. It enables machines to process and analyze large volumes of text data, extract valuable insights, and communicate effectively with humans. NLP algorithms often rely on machine learning techniques to train models on annotated text data and improve their performance over time.

Tokenization #

Tokenization is the process of breaking down a text into smaller units called tokens. These tokens can be words, phrases, or characters, depending on the task at hand. Tokenization is a crucial step in NLP because it helps convert raw text data into a format that can be easily processed by machine learning models.

For example, consider the sentence "Natural Language Processing is fascinating." After tokenization, the sentence may be broken down into tokens such as ["Natural", "Language", "Processing", "is", "fascinating"]. Tokenization can also involve removing punctuation, numbers, and special characters to clean the text data before further analysis.

Stemming #

Stemming is a text normalization technique that aims to reduce words to their root form. The goal of stemming is to treat different variations of a word as the same word, which helps improve text analysis and information retrieval tasks. Stemming algorithms remove suffixes from words to extract their core meaning.

For example, the words "running" and "runs" may both be stemmed to "run" (irregular forms such as "ran" are typically beyond simple suffix-stripping and require lemmatization). While stemming can help simplify text data and reduce the vocabulary size, it may also lead to inaccuracies or loss of meaning in some cases. Stemming is commonly used in information retrieval systems, search engines, and text mining applications.
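
A toy suffix-stripping stemmer can make the idea concrete. This is a sketch, not a real algorithm like Porter's, which applies ordered rules with conditions on the remaining stem:

```python
def naive_stem(word: str) -> str:
    # Try longer suffixes first; "ning" handles doubled consonants
    # like "running" in a very crude way.
    for suffix in ("ning", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("running"))  # run
print(naive_stem("runs"))     # run
print(naive_stem("ran"))      # ran  (irregular form is untouched)
```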

Lemmatization #

Lemmatization is another text normalization technique that, like stemming, aims to reduce words to their base or dictionary form, called the lemma. However, unlike stemming, lemmatization considers the context of the word and ensures that the resulting lemma is a valid word in the language. Lemmatization often involves using dictionaries or language rules to map words to their root forms.

For example, the words "better," "best," and "good" may all be lemmatized to "good." Lemmatization is more accurate than stemming but can be computationally more expensive and slower. It is commonly used in applications where word sense disambiguation is crucial, such as in machine translation or sentiment analysis.
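
The dictionary-lookup core of lemmatization can be sketched as below. The lookup table here is an illustrative assumption; real lemmatizers combine large lexicons with part-of-speech information:

```python
# Toy lemma table for illustration only.
LEMMA_TABLE = {"better": "good", "best": "good", "ran": "run", "mice": "mouse"}

def lemmatize(word: str) -> str:
    # Fall back to the lowercased word when no lemma is known.
    return LEMMA_TABLE.get(word.lower(), word.lower())

print(lemmatize("Better"))  # good
print(lemmatize("ran"))     # run
```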

Bag of Words (BoW) #

The Bag of Words (BoW) model is a simple and commonly used representation of text data. It involves converting a document or text into a sparse vector where each element corresponds to the frequency of a particular word in the document. The order of words is disregarded in the BoW model; only which words occur, and how often, is considered.

For example, consider two sentences: "Machine learning is fascinating" and "Machine learning is challenging." In the BoW representation, the vocabulary may consist of ["Machine", "learning", "is", "fascinating", "challenging"], and the sentences may be represented as [1, 1, 1, 1, 0] and [1, 1, 1, 0, 1], respectively. BoW is used in text classification, sentiment analysis, and information retrieval tasks.

Term Frequency-Inverse Document Frequency (TF-IDF) #

Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents. TF-IDF combines two metrics: term frequency (TF), which measures how often a word appears in a document, and inverse document frequency (IDF), which penalizes words that are common across all documents.

TF-IDF is calculated as the product of the term frequency and the inverse document frequency for each word in a document. Words with high TF-IDF scores are considered important for distinguishing the document from others in the collection. TF-IDF is commonly used in information retrieval, text mining, and document clustering to identify key terms and improve the representation of text data.
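
One common variant of the formula can be sketched directly (many implementations smooth the IDF term to avoid division by zero; this sketch assumes the term appears in at least one document):

```python
import math

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    # TF: relative frequency of the term in this document.
    tf = doc.count(term) / len(doc)
    # IDF: log of (number of documents / documents containing the term).
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
# "the" appears in every document, so its IDF (and hence TF-IDF) is zero.
print(tf_idf("the", corpus[0], corpus))  # 0.0
print(round(tf_idf("cat", corpus[0], corpus), 3))  # 0.135
```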

Word Embeddings #

Word embeddings are dense vector representations of words in a high-dimensional space where words with similar meanings are closer to each other. Word embeddings capture semantic relationships between words and enable machine learning models to understand and process natural language more effectively. Word embeddings are typically learned from large text corpora using neural network-based models like Word2Vec, GloVe, or FastText.

For example, in a word embedding space, the vectors for "king" and "queen" may lie close together, and vector arithmetic such as "king" - "man" + "woman" may yield a vector near "queen." Word embeddings can be used to initialize neural networks, improve the performance of NLP models, and support tasks like text classification, named entity recognition, and machine translation. Pre-trained word embeddings are also available for different languages and domains.

Named Entity Recognition (NER) #

Named Entity Recognition (NER) is a task in NLP that involves identifying and classifying named entities in text into predefined categories such as persons, organizations, locations, and dates. NER systems aim to extract and label entities of interest to facilitate information retrieval, question answering, and text summarization tasks.

For example, in the sentence "Apple is headquartered in Cupertino," NER would identify "Apple" as an organization and "Cupertino" as a location. NER models are often trained using annotated data and utilize techniques like conditional random fields (CRF) or recurrent neural networks (RNNs) to predict named entities in text. NER is essential for applications like information extraction, sentiment analysis, and entity linking.

Part-of-Speech Tagging (POS Tagging) #

Part-of-Speech Tagging (POS Tagging) is a process in NLP that involves assigning grammatical categories (such as noun, verb, adjective) to words in a sentence. POS tagging helps analyze the syntactic structure of text data and understand the relationships between words in a sentence. POS tagging is often used in grammar checking, information retrieval, and text-to-speech systems.

For example, in the sentence "The quick brown fox jumps over the lazy dog," POS tagging would label "fox" as a noun, "jumps" as a verb, and "lazy" as an adjective. POS tagging can be performed using rule-based approaches, statistical models, or deep learning techniques like recurrent neural networks (RNNs) or transformers.

Sentiment Analysis #

Sentiment analysis, also known as opinion mining, is a task in NLP that involves determining the emotional tone or attitude expressed in a piece of text. Sentiment analysis can classify text as positive, negative, or neutral and assign sentiment scores to quantify the intensity of emotions. Sentiment analysis is used in social media monitoring, customer feedback analysis, and market research to understand public opinion and sentiment trends.

For example, in the sentence "The movie was amazing and the acting was superb," sentiment analysis would classify the text as positive. Sentiment analysis models can be based on machine learning algorithms like support vector machines (SVM), recurrent neural networks (RNNs), or transformers, and often require annotated data for training. Challenges in sentiment analysis include handling sarcasm, irony, and context-dependent sentiments.
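
Before trained models, a lexicon-based scorer is the simplest approach; the word lists below are illustrative assumptions, and a scorer like this fails on exactly the sarcasm and context issues mentioned above:

```python
# Toy sentiment lexicons for illustration only.
POSITIVE = {"amazing", "superb", "great", "good"}
NEGATIVE = {"terrible", "boring", "bad", "awful"}

def sentiment(text: str) -> str:
    # Count positive minus negative word hits and map the score to a label.
    words = text.lower().replace(",", "").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("The movie was amazing and the acting was superb"))  # positive
```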

Text Classification #

Text classification is a task in NLP that involves categorizing text documents into predefined classes or categories. Text classification is used in spam detection, sentiment analysis, topic modeling, and document categorization. Machine learning algorithms like Naive Bayes, support vector machines (SVM), and deep learning models are commonly used for text classification tasks.

For example, in email spam detection, text classification can determine whether an incoming email is spam or legitimate. Text classification models are trained on labeled data and learn to map text features to class labels. Challenges in text classification include handling imbalanced datasets, noisy text data, and multi-label classification scenarios. Hyperparameter tuning and feature selection are crucial for improving text classification performance.
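
A compact sketch of the Naive Bayes classifier mentioned above, with add-one smoothing, applied to a hypothetical four-email training set:

```python
import math
from collections import Counter, defaultdict

def train(docs):
    # docs: list of (text, label) pairs. Count words per class.
    counts, totals, priors = defaultdict(Counter), Counter(), Counter()
    for text, label in docs:
        priors[label] += 1
        for w in text.lower().split():
            counts[label][w] += 1
            totals[label] += 1
    vocab = {w for c in counts.values() for w in c}
    return counts, totals, priors, vocab

def predict(text, model):
    counts, totals, priors, vocab = model
    n = sum(priors.values())
    best, best_lp = None, float("-inf")
    for label in priors:
        # Log prior plus smoothed log likelihood of each word.
        lp = math.log(priors[label] / n)
        for w in text.lower().split():
            lp += math.log((counts[label][w] + 1) / (totals[label] + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train([("win money now", "spam"), ("free prize win", "spam"),
               ("meeting at noon", "ham"), ("project meeting notes", "ham")])
print(predict("win free money", model))  # spam
```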

Machine Translation #

Machine translation is the task of automatically translating text from one language to another. Machine translation systems aim to preserve the meaning and fluency of the source text while producing a grammatically correct translation in the target language. Machine translation is used in applications like language localization, cross-border communication, and multilingual content generation.

Chatbots #

Chatbots, or conversational agents, are AI-powered applications that simulate human-like conversations with users through text or speech interfaces. Chatbots use natural language processing (NLP) and machine learning techniques to understand user queries, provide responses, and perform tasks autonomously. Chatbots are used in customer service, information retrieval, virtual assistants, and e-commerce to enhance user experience and automate interactions.

For example, chatbots like Amazon Alexa, Apple Siri, and Google Assistant can answer user questions and perform tasks through natural language interaction. Chatbots can be rule-based or AI-driven, with AI chatbots leveraging NLP models like language models, intent classifiers, and dialogue managers to engage users in more natural conversations. Challenges in chatbot development include maintaining context, handling user ambiguity, and ensuring privacy and security.

Speech Recognition #

Speech recognition, also known as automatic speech recognition (ASR), is the task of converting spoken language into written text. Speech recognition systems use acoustic models, language models, and speech processing algorithms to transcribe audio input. Speech recognition is used in voice assistants, dictation software, voice-controlled devices, and accessibility tools to enable hands-free communication and interaction.

For example, speech recognition systems like Google Speech-to-Text, Apple Siri, and Amazon Alexa can transcribe spoken commands, dictate text messages, and interact with users through voice input. Speech recognition models are trained on speech data and language resources to recognize phonemes, words, and sentences accurately. Challenges in speech recognition include dealing with accents, background noise, and speech variations across speakers.

Dependency Parsing #

Dependency parsing is a syntactic analysis task in NLP that involves identifying the grammatical relationships between words in a sentence. Dependency parsing represents the syntactic structure of a sentence as a directed graph, where words are nodes and dependencies are edges. Dependency parsing helps understand the dependency relationships between words and construct parse trees for linguistic analysis.

For example, in the sentence "The cat chased the mouse," dependency parsing would identify "chased" as the root, with "cat" as its subject and "mouse" as its object. Dependency parsing models can be based on transition-based or graph-based algorithms and utilize dependency labels to capture different syntactic relations. Dependency parsing is used in applications like information extraction, question answering, and machine translation.

Language Modeling #

Language modeling is the task of predicting the probability of a sequence of words. Language models learn the statistical properties of natural language data and estimate the likelihood of word sequences occurring in a text. Language modeling is used in machine translation, speech recognition, text generation, and information retrieval to capture syntactic and semantic patterns in language.

For example, a language model trained on a large corpus of text can generate coherent sentences or predict the next word in a sequence. Language models like GPT-3 (Generative Pre-trained Transformer 3) use transformer architectures to model long-range dependencies and generate human-like text. Language modeling tasks include next-word prediction, text completion, and language understanding, which are crucial for various NLP applications.
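
Next-word prediction can be illustrated with a maximum-likelihood bigram model; this sketch uses an assumed toy corpus and omits the smoothing that real models require:

```python
from collections import Counter, defaultdict

# Toy corpus with "." as a sentence boundary token.
corpus = "the cat sat . the cat ran . the dog ran .".split()

# Count how often each word follows each preceding word.
bigrams = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    bigrams[prev][word] += 1

def p_next(prev: str, word: str) -> float:
    # P(word | prev) estimated as count(prev, word) / count(prev, *).
    total = sum(bigrams[prev].values())
    return bigrams[prev][word] / total

# 2 of the 3 occurrences of "the" are followed by "cat".
print(p_next("the", "cat"))  # 0.666...
```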

Text Generation #

Text generation is the task of automatically producing coherent and meaningful text from a given input or prompt. Text generation models use language models, recurrent neural networks (RNNs), or transformer architectures to generate sentences, paragraphs, or entire documents. Text generation is used in chatbots, content creation, storytelling, and dialogue systems to generate human-like text responses.

For example, text generation models like OpenAI's GPT-3 can generate realistic news articles, poems, and programming code based on the input provided. Text generation models are trained on large text corpora and utilize techniques like beam search, sampling, and temperature scaling to control the diversity and fluency of generated text. Challenges in text generation include maintaining coherence, avoiding biases, and controlling the output length.

Topic Modeling #

Topic modeling is an unsupervised learning technique in NLP that involves identifying the underlying topics or themes in a collection of documents. Topic modeling algorithms like Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) cluster words into topics based on their co-occurrence patterns. Topic modeling is used in document clustering, information retrieval, and content recommendation to discover hidden patterns in text data.

For example, given a set of news articles, topic modeling can identify topics like politics, sports, and technology. Topic modeling can help organize and summarize large text corpora, extract key themes, and improve information retrieval and recommendation systems. Challenges in topic modeling include selecting the number of topics, interpreting results, and handling noisy or ambiguous text data.

Text Summarization #

Text summarization is the task of condensing a longer text into a shorter, more concise version that preserves its key information. Text summarization can be extractive, where sentences or phrases from the original text are selected to form the summary, or abstractive, where new sentences are generated to convey the summary. Text summarization is used in news aggregation, document summarization, and content generation to provide users with concise and informative summaries.

For example, given a long article on a news website, a text summarization system can produce a short summary of the main points. Text summarization models can leverage techniques like sentence scoring, attention mechanisms, and encoder-decoder architectures to produce accurate and coherent summaries. Challenges in text summarization include maintaining coherence, avoiding redundancy, and handling diverse text genres and styles.
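
The sentence-scoring idea behind extractive summarization can be sketched with word frequencies: score each sentence by how frequent its words are across the whole text, then keep the top-scoring sentences in their original order:

```python
import re
from collections import Counter

def summarize(text: str, n: int = 1) -> str:
    # Split into sentences on terminal punctuation, keeping the punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Score each sentence by the corpus frequency of its words.
    freq = Counter(re.findall(r"\w+", text.lower()))
    scores = [sum(freq[w] for w in re.findall(r"\w+", s.lower())) for s in sentences]
    # Pick the n highest-scoring sentences, restored to document order.
    top = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:n])
    return " ".join(sentences[i] for i in top)

text = "The cat sat. The cat and the dog sat together. Birds fly."
print(summarize(text))  # The cat and the dog sat together.
```

Note that raw frequency scoring favors long sentences with common words; real systems normalize by sentence length and down-weight stop words.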

Named Entity Disambiguation #

Named Entity Disambiguation is the task of resolving ambiguities in named entities by linking each mention to the correct real-world entity. Named entities like person names, organization names, and location names may refer to multiple entities with similar names, which can lead to ambiguity in text analysis tasks. Named Entity Disambiguation helps disambiguate named entities and improve the accuracy of information extraction and entity linking systems.

For example, in the sentence "Apple is a leading tech company," Named Entity Disambiguation would link "Apple" to the technology company rather than the fruit. Named Entity Disambiguation models use context information, entity embeddings, and knowledge graphs to resolve entity references and disambiguate named entities accurately. Challenges in Named Entity Disambiguation include handling rare entities, entity aliases, and cross-lingual disambiguation.

Text Clustering #

Text clustering, also known as document clustering, is an unsupervised learning technique that groups similar text documents together without predefined labels. Text clustering algorithms like k-means, hierarchical clustering, and DBSCAN partition text data into clusters to identify patterns, topics, or themes in the documents. Text clustering is used in information retrieval, data exploration, and recommendation systems to organize and categorize text data efficiently.

For example, given a collection of research papers, text clustering can group papers by research area or topic. Text clustering algorithms rely on text similarity metrics, clustering criteria, and evaluation metrics to determine the optimal clustering structure. Challenges in text clustering include selecting the number of clusters, handling noise, and interpreting cluster results effectively.

Text Preprocessing #

Text preprocessing is a crucial step in NLP that involves cleaning, normalizing, and transforming raw text into a format suitable for analysis. Text preprocessing tasks include tokenization, stemming, lemmatization, stop-word removal, and spell checking to improve the quality and consistency of text data. Text preprocessing helps reduce noise, improve model performance, and enhance the interpretability of NLP systems.

For example, text preprocessing may involve converting text to lowercase, removing punctuation and stop words, and correcting spelling errors. Text preprocessing pipelines often include data cleaning, text normalization, and feature engineering steps to prepare text data for machine learning models. Challenges in text preprocessing include handling multilingual text, domain-specific jargon, and noisy text sources.
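
A minimal pipeline combining the steps above (the stop-word list is an illustrative assumption; real pipelines use curated lists):

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to"}

def preprocess(text: str) -> list[str]:
    text = text.lower()                      # normalize case
    text = re.sub(r"[^a-z\s]", " ", text)    # strip punctuation and digits
    tokens = text.split()                    # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The Quick, Brown Fox is running!"))
# ['quick', 'brown', 'fox', 'running']
```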

Text Similarity #

Text similarity is the task of measuring how alike two pieces of text are in meaning or form. For example, text similarity measures like cosine similarity and Jaccard similarity quantify how close two documents or sentences are to each other. Text similarity algorithms leverage word embeddings, semantic models, and machine learning techniques to capture the semantic relationships between words and sentences accurately. Challenges in text similarity include handling synonyms, polysemy, and varying text lengths in the comparison.
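
Jaccard similarity, one of the simplest measures, compares the word sets of two texts; a sketch:

```python
def jaccard(a: str, b: str) -> float:
    # |intersection| / |union| of the two word sets.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

# 3 shared words out of 5 distinct words overall.
print(jaccard("machine learning is fun", "deep learning is fun"))  # 0.6
```

Because it ignores word order and counts, Jaccard is a surface-level measure; it treats "machine" and "computer" as entirely different, which is exactly the synonym problem noted above.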

Language Translation Evaluation #

Language translation evaluation is the process of assessing the quality and accuracy of machine-translated text. Language translation evaluation metrics like BLEU (Bilingual Evaluation Understudy), METEOR (Metric for Evaluation of Translation with Explicit Ordering), and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measure the adequacy and fluency of machine translations.

For example, language translation evaluation metrics compare the output of a machine translation system against one or more human reference translations. Language translation evaluation helps researchers and developers benchmark machine translation models, improve translation quality, and compare different systems objectively. Challenges in language translation evaluation include handling subjective judgments, domain-specific translations, and language-specific nuances.
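
The simplest ingredient of BLEU is clipped unigram precision, sketched below; BLEU proper combines clipped n-gram precisions up to 4-grams with a brevity penalty, so this is only the first building block:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    # Each candidate word counts at most as often as it appears
    # in the reference ("clipping"), divided by candidate length.
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    return overlap / sum(cand.values())

# 5 of the 6 candidate words are matched in the reference.
print(unigram_precision("the cat is on the mat", "the cat sat on the mat"))
```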

Text Generation Evaluation #

Text generation evaluation is the process of assessing the quality and coherence of automatically generated text. Text generation evaluation metrics like perplexity, BLEU (Bilingual Evaluation Understudy), and human evaluation scores measure the fluency, relevance, and overall quality of generated text.
