A language model is not a search engine, and it is not thinking. It is a statistical machine that learns patterns in text and generates the next word that is most probable given everything that came before. A 70-billion-parameter transformer is just a very large pattern-matching engine, trained on billions of examples of human text, tuned to predict the next token (roughly: the next few characters) as accurately as possible. When you give it a prompt, it doesn't "look up" the answer. It generates the answer one token at a time, each token selected from a probability distribution over all possible next tokens.
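To make the "one token at a time" loop concrete, here is a minimal sketch with a toy vocabulary and a hypothetical `next_token_distribution` function standing in for the model (a real transformer would compute this distribution from billions of parameters; everything here is illustrative):

```python
import random

# Toy stand-in for a trained model: maps a context string to a
# probability distribution over candidate next tokens. A real
# transformer computes this with attention layers; it's hard-coded here.
def next_token_distribution(context: str) -> dict[str, float]:
    if context.endswith("The cat"):
        return {" sat": 0.6, " ran": 0.3, " flew": 0.1}
    return {" on": 0.5, " the": 0.3, " mat": 0.2}

def generate(prompt: str, n_tokens: int) -> str:
    text = prompt
    for _ in range(n_tokens):
        dist = next_token_distribution(text)
        tokens, probs = zip(*dist.items())
        # Sample the next token from the distribution -- the model
        # never "looks up" an answer, it draws from probabilities.
        text += random.choices(tokens, weights=probs, k=1)[0]
    return text

print(generate("The cat", 3))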
This is profound and limiting at once. It means the model can only do what the training data taught it to do—generate text that "looks like" its training data. It means hallucinations are inevitable (the model will confidently generate false information that is statistically plausible). It means the model has no access to real-time information, no memory of previous conversations beyond the current context window, and no ability to "think" in the philosophical sense. And yet it means the model can synthesize patterns across domains, make analogies, draft arguments, and produce writing that is often indistinguishable from human-generated text.
The transformer architecture, introduced in Vaswani et al.'s 2017 paper "Attention Is All You Need," solves a critical problem: how to process and relate every word in a sequence to every other word, so that the model can understand that "her" in "The queen lifted her crown" refers to the queen, not someone else.
Self-Attention: The Core Mechanism
Self-attention works like this: for every word in the input, the model creates three vectors—a query (what am I looking for?), a key (what am I?), and a value (what information do I carry?). It then computes how much each word should "attend to" each other word by comparing queries to keys. This attention score (a number between 0 and 1) determines how much of each word's value gets mixed into the final representation.
Imagine you're reading a sentence: "The bank approved the loan." The word "bank" needs to figure out: am I referring to a financial institution or a riverbank? It attends to nearby words like "approved" and "loan"—these words have high relevance (high attention score), so their patterns flow into bank's final representation, biasing it toward "financial institution." The word "the" has low relevance, so it contributes less.
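Below is a minimal sketch of the computation just described, for a single attention head, using random vectors in place of learned embeddings (all sizes and values are illustrative, not from any real model):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head.
    X: (seq_len, d_model) token embeddings."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv     # queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how much each word attends to each other word
    weights = softmax(scores, axis=-1)   # each row sums to 1: the attention scores
    return weights @ V                   # mix values in proportion to attention

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8      # e.g. "The bank approved the loan"
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): one refined representation per word
```

The division by sqrt(d_k) keeps the dot products from growing with vector dimension, which would otherwise push the softmax toward near-one-hot attention.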
This happens in parallel across many "heads" of attention—different pattern-detection heads that each look for different types of relationships (syntactic, semantic, positional, etc.). The transformer then stacks many layers of this attention mechanism, each refining the model's understanding of the input.
Why This Matters for Output Quality
Modern language models like GPT-3 or Claude stack dozens of attention heads per layer across dozens of layers, adding up to billions of parameters, with some models processing up to 200k tokens of context. The more parameters (roughly: more attention heads and layers), the more nuanced the patterns the model can detect. But more parameters also means more memory to hold the weights, more compute for every generated token, and slower, more expensive training and inference.
Language models are trained on a simple objective: given text up to position N, predict the word at position N+1. This is called next-token prediction. The model is shown billions of examples: (prompt → next word), and it adjusts its parameters to minimize prediction error.
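In code, that objective is just cross-entropy between the model's predicted distribution at each position and the token that actually came next. Here is a minimal PyTorch sketch, with random logits standing in for a real model's output:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 12
tokens = torch.randint(0, vocab_size, (seq_len,))  # a toy training sequence

inputs  = tokens[:-1]   # text up to position N ...
targets = tokens[1:]    # ... predicts the token at position N+1

# A real model would produce these logits by running `inputs` through
# its attention layers; random values stand in for that here.
logits = torch.randn(seq_len - 1, vocab_size)

loss = F.cross_entropy(logits, targets)  # the prediction error to minimize
print(loss.item())
```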
Tokens, Not Words
A token is not always a word. It's usually a subword unit—roughly 4 characters on average. So "unhappiness" might be split into tokens like ["un", "happy", "ness"]. This matters because context limits are measured in tokens, not words; rare or unusual words split into more tokens than common ones; and character-level tasks (counting letters, reversing strings) are hard for the model, which sees token IDs rather than individual characters.
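You can inspect tokenization directly with a tokenizer library. A sketch using OpenAI's tiktoken package (assuming it is installed; the exact split depends on the tokenizer, so the pieces you see may differ from the example above):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI models
ids = enc.encode("unhappiness")
pieces = [enc.decode([i]) for i in ids]
print(ids, pieces)  # token IDs and the subword piece each one maps to
```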
Temperature and Sampling
After training, the model has learned a probability distribution over next tokens. When you use the model, you can set a "temperature" parameter: near 0, the distribution sharpens and the model almost always picks the single most probable token (predictable, often repetitive); higher temperatures flatten the distribution, so less likely tokens get sampled more often (varied, riskier).
This is why the same prompt can produce different outputs even with the same model—you're sampling from a probability distribution, not retrieving a stored answer.
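A minimal sketch of how temperature reshapes the distribution before sampling (toy logits, not from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)

def probs_at_temperature(logits, temperature):
    # Dividing logits by the temperature sharpens (<1) or flattens (>1)
    # the distribution before the softmax turns scores into probabilities.
    scaled = np.asarray(logits) / temperature
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for four candidate tokens
for t in (0.1, 1.0, 2.0):
    p = probs_at_temperature(logits, t)
    token = rng.choice(len(p), p=p)  # sample one token id
    print(f"T={t}: probs={p.round(3)}, sampled token {token}")
```

At T=0.1 the probabilities are nearly one-hot; at T=2.0 they approach uniform, which is exactly why reruns of the same prompt diverge.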
Synthesis and Analogy
If the training data contained examples of poetry, essays, code, scientific papers, and news articles, the model learned patterns across all of them. It can therefore carry patterns across domains: explain a scientific result in the register of a news article, draft an essay whose argument borrows the structure of a paper, or describe a piece of code through a poetic analogy.
This is often framed as "creativity," but it's more accurate to call it interpolation in pattern space—the model is finding a position in the learned pattern landscape that fits your constraints.
Reasoning and Multi-Step Problem Solving
Transformers can do reasoning tasks: solving math problems, debugging code, analyzing arguments. But how?
Research suggests the model learns to simulate reasoning steps. It doesn't "understand" math the way a calculator does. Instead, it learned the shape of worked solutions in its training data, patterns like: problem, restated givens, Step 1, Step 2, "therefore, the answer is X."
And it reproduces this pattern structure. This is why asking the model to work step by step often improves accuracy (it steers generation into the worked-solution pattern), and why performance degrades on problems whose steps resemble nothing in the training data.
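A sketch of what exploiting that pattern looks like in practice: the same question framed bare versus inside a worked-solution scaffold (prompt text only, with the model call omitted; the wording is illustrative):

```python
question = "A shirt costs $25 and is discounted 20%. What is the final price?"

# Bare prompt: invites an immediate answer-shaped guess.
bare_prompt = question

# Scaffolded prompt: steers generation into the step-by-step
# worked-solution pattern the model absorbed during training.
scaffolded_prompt = (
    f"{question}\n"
    "Work through this step by step, then state the final answer.\n"
    "Step 1:"
)
print(bare_prompt)
print(scaffolded_prompt)
```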
What It Cannot Do
A hallucination occurs when the model generates false information with high confidence. Why does this happen?
The model's training objective is to predict the next token. It was never trained to say "I don't know." It was trained on text where humans always provide an answer—sometimes correct, sometimes incorrect, but always formatted as an answer. So the model learned to generate answer-shaped text, even when the correct answer is unknown.
Additionally, if multiple plausible continuations fit the statistical pattern, the model will pick one—and that picked continuation might be factually false. For example: ask who currently holds a given office, and a model trained before the most recent election will confidently name the previous holder.
The model generated a statistically plausible answer that is factually outdated. It has no access to "truth"—only to patterns in text.
A surprising empirical finding: larger models are better at almost everything. The relationship between model size and prediction error follows a power law: each doubling of parameter count buys a small but consistent reduction in loss, and with it a broad improvement in task performance.
This matters because capability gains have so far come mostly from scale (more parameters, more data, more compute) rather than from new algorithms, and the smoothness of the power law made those gains predictable enough to justify ever-larger training runs.
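A sketch of what that power law implies, using the approximate parameter-scaling exponent reported by Kaplan et al. (2020) (the constant comes from that paper; treat the numbers as illustrative):

```python
# Kaplan et al. (2020): loss scales roughly as L(N) ~ (N_c / N)**alpha_N,
# where N is the parameter count.
ALPHA_N = 0.076  # empirical exponent for parameter count

doubling_factor = 2 ** -ALPHA_N
print(f"Each doubling of parameters multiplies loss by ~{doubling_factor:.3f}")
print(f"i.e. roughly a {100 * (1 - doubling_factor):.1f}% reduction in loss")
```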
Early language models had small context windows: GPT-3 launched with 2,048 tokens (roughly 1,500 words). Modern models have much larger windows, with some now accepting 100k to 200k tokens, enough to hold entire books.
A larger context window allows the model to take in whole documents or codebases at once, sustain long conversations without losing the beginning, and ground its output in more of your own material.
But a larger context window also means more compute per token (attention cost grows with the square of sequence length), more memory, and a documented tendency to use information buried in the middle of very long contexts less reliably.
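A sketch of that quadratic cost, counting entries in a single head's attention matrix at different context lengths (real implementations add heads, layers, and optimizations, so this is a lower-bound illustration):

```python
# One attention score per (query, key) pair: n tokens -> n * n scores.
for n in (4_096, 32_768, 200_000):
    entries = n * n
    print(f"{n:>7} tokens -> {entries:>14,} scores per head per layer")
```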
The model is best used when you want a first draft, a synthesis across sources you will verify, analogies and reframings, or high-volume variations that a human will filter.
The model is poorly suited when factual accuracy must be guaranteed, when the answer depends on real-time or post-training information, or when the output will ship without human review.
Psychology: Language models as extensions of cognition. The transformer's attention mechanism parallels human selective attention—focusing on relevant information while filtering noise. But where human attention is flexible and context-dependent, transformer attention is fixed once trained. This creates a hybrid cognitive tool: the model provides breadth (fast access to patterns across domains) that the human mind must filter for depth.
Creative Practice: Transformers as collaborative thought partners. The model's capacity for synthesis and analogy makes it useful for generating unexpected combinations or reframing problems. But the model cannot evaluate its own output, cannot know when it's hallucinating, and cannot develop personal voice (it averages across training data). Useful for ideation; dangerous for final output without human judgment.
History: Language models as compression of historical pattern. Transformers are trained on historical documents, so they've learned how humans have solved similar problems before. This makes them useful for research and analogy, but also reproduces historical biases and assumptions without flagging them.
The Sharpest Implication
Language models will not get smarter in the ways you might expect. More parameters don't teach the model to reason differently—they give it access to more patterns to memorize and combine. This means scaling will keep improving fluency and breadth, but hallucination, the inability to evaluate its own output, and the need for human judgment are structural features, not bugs the next size-up will fix.
Generative Questions