08 July · 1 min read

How do LLMs handle out-of-vocabulary words?

Large Language Models (LLMs) handle out-of-vocabulary (OOV) words, that is, words not present in their tokenizer's vocabulary, in several ways:

1. Subword tokenization: Break unknown words into smaller known units, such as prefixes, suffixes, and roots, using algorithms like BPE, WordPiece, or SentencePiece.

2. Unknown-token fallback: Map OOV words to a special token, like <UNK> or <OOV>, to mark them as unknown.

3. Character-level modeling: Process text at the character level, rather than word level, to handle OOV words.

4. Vocabulary expansion: Add new tokens to the model's vocabulary, typically by extending the embedding matrix and fine-tuning so the new entries learn useful representations.

5. Pre-training on large datasets: Expose the model to a vast amount of text data, increasing the chances of encountering rare or unknown words.

6. Using word embeddings: Represent words as vectors (embeddings) that can capture semantic relationships, even for OOV words.

7. Generative modeling: Generate text at the subword or character level, which lets the model compose novel words it has never seen as whole units.
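Strategies 1 and 2 above can be sketched together. The snippet below is a minimal, illustrative greedy longest-match subword tokenizer (WordPiece-style); the vocabulary, the `##` continuation marker, and the example words are assumptions chosen for the demo, not any real model's vocabulary:

```python
# Toy WordPiece-style tokenizer: greedily match the longest known subword.
# "##" marks a subword that continues a previous piece (illustrative convention).
VOCAB = {"un", "##believ", "##ably", "the", "cat", "[UNK]"}

def tokenize(word):
    """Split a word into known subwords; fall back to [UNK] if no match exists."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest candidate first, shrinking until one is in VOCAB.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the word
            if candidate in VOCAB:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # strategy 2: whole word becomes the unknown token
        tokens.append(piece)  # strategy 1: represent the word by its subwords
        start = end
    return tokens

print(tokenize("cat"))           # in-vocabulary word: kept whole
print(tokenize("unbelievably"))  # OOV word: decomposed into known subwords
print(tokenize("xyzzy"))         # no known subwords: maps to [UNK]
```

Real tokenizers add details (per-character fallback, byte-level alphabets so `[UNK]` is almost never needed), but the core idea is the same: an unseen word is still representable as a sequence of seen pieces.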

By employing these strategies, LLMs can effectively handle OOV words, improving their robustness and ability to generalize to unseen data.
