Essential LLM Terms Every Professional Should Understand

Model Training and Data Glossary

Training a Large Language Model (LLM) such as ChatGPT, Gemini, or Claude is a complex journey that begins long before we ask the model for our poems, emails, and code. In this guide, we’ll break down some of the most important training and data terms for LLMs, walking through them in sequence to give you a simple, easy-to-follow foundation.

Understanding these terms isn’t just for ML and AI experts, most of whom already know them well. It’s even more important for business users who want to confidently adopt AI and integrate it effectively into their processes.

1. Corpus

Explanation

A corpus is the giant collection of text used to train the model: books, websites, articles, code, and more. Think of it as the model’s reading material before it learned anything. In general, the larger and more diverse the corpus a model learns from, the more knowledgeable and effective it becomes at understanding and generating text. However, if a model’s training data does not include recent information, it will struggle to answer questions about current events or recent developments, simply because it never saw that information during training.

Example

OpenAI’s GPT models (like GPT-3.5 and GPT-4o) were trained on a large corpus of text from the web, including Wikipedia, public websites, and books.
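To get a feel for what a corpus looks like, here is a minimal sketch that loads a small, openly available text corpus and peeks at its raw contents. It assumes the Hugging Face datasets library is installed; any large text collection would serve the same role.

```python
# Minimal sketch: inspecting a small public text corpus.
# Assumes the Hugging Face "datasets" library (pip install datasets).
from datasets import load_dataset

# WikiText-2 is a small, openly available corpus of Wikipedia articles.
corpus = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

print(f"Rows in corpus: {len(corpus)}")
print(corpus[10]["text"][:200])  # peek at one snippet of raw training text
```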

2. Tokens

Explanation

These are the smallest pieces of text the model understands. During both training and text generation, the raw text is split into smaller units called tokens. These tokens can represent words, subwords, or even individual characters, depending on how the model is designed. The process of breaking down text into these smaller, manageable pieces is called tokenization. In LLMs, both learning and generation happen in terms of tokens — the model doesn’t operate on raw text directly but instead processes sequences of tokens to understand input and predict output.

Example

“Fantastic” might be broken into parts like Fan + tas + tic by an LLM’s tokenizer.
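You can see tokenization in action with a short sketch. This example assumes OpenAI’s open-source tiktoken library, which exposes the tokenizers used by GPT models; the exact pieces vary by tokenizer.

```python
# Tokenization sketch using OpenAI's tiktoken library.
import tiktoken

# cl100k_base is the encoding used by GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Fantastic"
token_ids = enc.encode(text)                       # text -> token IDs
pieces = [enc.decode([tid]) for tid in token_ids]  # token IDs -> text pieces

print(token_ids)  # e.g. a short list of integers
print(pieces)     # the subword pieces the model actually sees
```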

3. Prompts

Explanation

A prompt is the input text or instruction you give to a trained model (like GPT) to get a response. It’s how you “talk” to the model and tell it what you want. When you type a prompt, the model reads it and interprets it based on everything it has learned. The clearer the prompt, the more accurate the model’s response.

Example

  • Prompt – Write a story about a dragon who can’t breathe fire.
  • Response – The model writes a story based on your instruction.
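Programmatically, sending a prompt can look like the following minimal sketch. It assumes the official OpenAI Python SDK with an API key in your environment; the model name is illustrative.

```python
# Minimal sketch: sending a prompt via the OpenAI Python SDK.
# Assumes OPENAI_API_KEY is set in the environment; model name is illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write a story about a dragon who can't breathe fire."}
    ],
)

print(response.choices[0].message.content)  # the model's generated story
```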
4. Pretraining

Explanation

The first phase of training, where the model learns general language patterns from the large text datasets made available to it (the corpus). The model does not have any specific task in mind; this is a general learning step where it absorbs everything it is given. At the end of pretraining, we get a large language model with broad, generic knowledge. It is a jack of all trades.

Example

The model learns things like grammar, facts, and reasoning just from reading text, e.g., it learns that “The sky is blue” is correct while “Sky blue is the” is not. GPT-4 and Gemini are examples of models built on this kind of large-scale pretraining.
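To make this concrete, here is a small sketch that loads a pretrained model (GPT-2, a freely available example) with the Hugging Face transformers library and lets it continue a sentence using only what it learned during pretraining.

```python
# Sketch: using a pretrained model (GPT-2) to continue text.
# Assumes the Hugging Face "transformers" library is installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The sky is", return_tensors="pt")
# The model predicts likely next tokens purely from what it saw in pretraining.
outputs = model.generate(**inputs, max_new_tokens=5,
                         pad_token_id=tokenizer.eos_token_id)

print(tokenizer.decode(outputs[0]))  # e.g. a plausible continuation like "blue ..."
```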

5. Fine-tuning

Explanation

Fine-tuning is the process of taking a pretrained language model and training it further on a smaller, specific dataset with a particular task or domain in mind. Pretrained LLMs have already learned general language skills from vast amounts of text, but they may not have access to your internal company policies. Even if they do, they give equal weight to everything in their knowledge base. After fine-tuning, the model becomes proficient in a specific task by combining its general language understanding with the specialized knowledge.

Example

To launch an LLM-based chatbot for Zaufany, we can fine-tune an existing model (e.g., GPT-4) using our company’s data, including product and service details, past customer queries and responses, and internal policies and procedures. This allows the chatbot to respond accurately and contextually to customer inquiries.
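As a sketch, kicking off such a fine-tuning job through the OpenAI API might look like this. The training file name and its contents are hypothetical, and the base model name is illustrative; the data would be prompt/response examples in the chat fine-tuning format.

```python
# Sketch: launching a fine-tuning job via the OpenAI Python SDK.
# "zaufany_support.jsonl" is a hypothetical file of prompt/response
# examples; the base model name is illustrative.
from openai import OpenAI

client = OpenAI()

# Upload the company-specific training data.
training_file = client.files.create(
    file=open("zaufany_support.jsonl", "rb"),
    purpose="fine-tune",
)

# Start fine-tuning a base model on that data.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)

print(job.id)  # track this job until the fine-tuned model is ready
```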

6. Context Window

Explanation

The context window is the amount of text (in tokens) the model can “see” and “remember” at a given point in time when generating a response. It acts like short-term memory: the model can pay attention to only a limited amount of text at once. When you are having a conversation with a model, it reads the preceding text, not just the last question, to figure out what to say next. If the conversation or document gets too long, the oldest parts may be “forgotten”, i.e. truncated (cut off), because they fall outside the context window.

Example

If a model has a context window of 8,000 tokens and your conversation grows to 10,000 tokens, the earliest 2,000 tokens fall outside the window, and the model can no longer refer to them when generating its next response.
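A common workaround is to count tokens and keep only the most recent ones. Here is a minimal sketch, again assuming the tiktoken library; the 8,000-token window size is illustrative.

```python
# Sketch: truncating a long conversation to fit a context window.
# Assumes tiktoken; the 8,000-token window size is illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
CONTEXT_WINDOW = 8000

def fit_to_window(conversation: str) -> str:
    """Keep only the most recent tokens that fit in the window."""
    tokens = enc.encode(conversation)
    if len(tokens) <= CONTEXT_WINDOW:
        return conversation
    # Drop the oldest tokens; the model "forgets" what falls outside.
    return enc.decode(tokens[-CONTEXT_WINDOW:])

history = "User: Hi!\nAssistant: Hello!\n" * 2000     # a very long chat
print(len(enc.encode(history)))                       # total tokens in the chat
print(len(enc.encode(fit_to_window(history))))        # at most 8000
```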

    Looking for hands-on LLM training sessions tailored for students or professionals? Reach out to us to explore how you can upskill your team and unlock real value with AI.