Beyond the Keyword: How AI Taught Search to Understand You
We've all screamed at our search bars, right?
That feeling when you search for "soda" and get zero results, just because the file you know is in there says "pop." Or you search for "laptop with a good graphics card" and the search just shows you every single "laptop" and every single "graphics card" in the store, but not the one you want.
This is the search problem. For decades, we've been trying to get computers to just... get us. To understand what we mean, not just what we type.
This isn't a single "aha!" moment; it's an evolutionary journey. We're going to walk that path together, from a "Dumb Counter" to a "Smart Sorter" and finally to an "AI Mind Reader." This is the story of how search evolved from TF-IDF to BM25 to SPLADE.
📚 Chapter 1: The "Dumb Counter" Librarian (TF-IDF)
In the beginning, we had TF-IDF (Term Frequency-Inverse Document Frequency).
It's a very clever, very mathematical way of finding keywords, but let's think of it as a librarian who's super fast but... not very smart. This librarian finds books for you by following two simple rules:
- Term Frequency (TF): How many times is my word in this book?
  - The logic: "You searched for 'dragon.' This book mentions 'dragon' 50 times. This other one mentions it once. You probably want the first book." Simple enough.
- Inverse Document Frequency (IDF): How special is this word in the whole library?
  - The logic: This is the clever part. The word "the" is in every book. It's not special at all, so it gets a "specialness" score of 0. But a word like "gorgonzola" is only in a few books. It's super special! It gets a high score.
The TF-IDF score is just TF × IDF. This means a word is "important" if it's common in this one book but rare in the rest of the library.
For its time, this was genius. But it had huge, game-breaking flaws.
- The "Keyword Stuffing" Flaw: What if a spammer just wrote "CHEAP LAPTOP CHEAP LAPTOP" 1,000 times? Our "Dumb Counter" librarian would think, "WOW! This must be the best document ever on cheap laptops!" It's easily fooled by spam.
- The "Length" Flaw: A 1,000-page encyclopedia that mentions "raven" 10 times would get a higher score than a 1-page poem named "The Raven" that mentions it 5 times. The encyclopedia has more mentions, but the poem is clearly more about ravens. TF-IDF just couldn't figure that out.
We needed a librarian that wasn't just fast, but smarter.
📊 How TF-IDF Works
Here's TF-IDF's simple counting approach in action:
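Below is a minimal Python sketch of the classic TF × IDF scoring, using a made-up three-document "library" (toy data, purely illustrative):

```python
# Toy TF-IDF: term frequency in one document times "specialness" across the library.
import math
from collections import Counter

docs = {
    "poem": "the raven quoth the raven nevermore " + "raven " * 3,
    "encyclopedia": "the raven is a large black bird " + "filler words " * 500 + "raven " * 10,
    "shop": "our shop sells pop and fizzy pop in bottles",
}

def tf_idf(term: str, doc_id: str) -> float:
    words = docs[doc_id].split()
    tf = Counter(words)[term]                            # how often the word appears here
    df = sum(term in d.split() for d in docs.values())   # how many documents contain it
    idf = math.log(len(docs) / df) if df else 0.0        # rare in the library = "special"
    return tf * idf

print(tf_idf("raven", "poem"))          # 5 mentions in a short poem
print(tf_idf("raven", "encyclopedia"))  # 11 mentions in a huge doc -> scores even higher
print(tf_idf("soda", "shop"))           # 0.0 -- the document says "pop", so "soda" never matches
```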
The problem is clear: if the exact word "soda" doesn't appear, the document gets a score of zero. No understanding, just counting.
🏆 Chapter 2: The "Smart Sorter" Librarian (BM25)
This is the "Glow-Up" of TF-IDF. BM25 (Best Matching 25) is the algorithm that powered almost every major search engine for decades. It's the default for most databases today, and for good reason: it's a phenomenally smart sorter.
BM25 is basically TF-IDF's kid who went to college and learned from its parent's mistakes. It fixes the two big flaws perfectly.
- The Fix for "Keyword Stuffing" (Term Saturation): BM25 is like a friend who gets bored easily.
  - The first time you say "laptop," it's super interested (big score boost!).
  - The fifth time, it's like, "Yeah, I get it, 'laptop'" (a much smaller boost).
  - The 100th time, it's just ignoring you (basically zero extra score). This scoring curve "saturates," which makes keyword stuffing totally pointless. Problem solved.
- The Fix for "Length" (Length Normalization): BM25 is length-aware. It starts by calculating the average document length of the whole collection. It knows the "average" document is, say, 300 words.
  - So, when it sees the 1-page poem (100 words) with 5 "raven" mentions, it thinks, "Whoa! 5 mentions in such a short document? This must be insanely relevant!"
  - When it sees the 1,000-page encyclopedia, it thinks, "Only 10 'raven' mentions in a doc this huge? Meh." It "normalizes" the score, giving the short, concise poem the win.
BM25 was king for a very long time. It's fast, efficient, and gives great results.
...But it still had that one, fundamental, dumb problem. It's just a counter. A really, really smart counter, but still a counter. It has no idea what words mean.
If you search for "soda," it will never match "pop." If you search for "beverage for a party," it will never match "drinks for a celebration."
To solve this, we had to leave the world of "counting" and enter the world of "understanding." We had to bring in AI.
📊 How BM25 Improves Upon TF-IDF
Here's how BM25's smart weighting works:
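Here's a small, self-contained sketch of BM25's per-term score. The k1 and b values are the common defaults; the document statistics are made up for the example:

```python
# Toy BM25 per-term score: saturating term frequency plus length normalization.
import math

def bm25_term(tf, doc_len, avg_doc_len, n_docs, docs_with_term, k1=1.5, b=0.75):
    idf = math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    length_norm = 1 - b + b * (doc_len / avg_doc_len)        # long docs get penalized
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)   # extra mentions flatten out

# Term saturation: 100 repetitions barely beat 5.
for tf in (1, 5, 100):
    print(tf, round(bm25_term(tf, 300, 300, 1000, 50), 2))   # climbs fast, then flattens out

# Length normalization: 5 "raven" mentions in a 100-word poem beat
# 10 mentions buried in a 100,000-word encyclopedia.
print(round(bm25_term(5, 100, 300, 1000, 50), 2))        # the poem scores far higher
print(round(bm25_term(10, 100_000, 300, 1000, 50), 2))   # the encyclopedia barely registers
```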
BM25 is smarter, but it still can't connect "soda" with "pop" conceptually!
🤖 Chapter 3: The "AI Mind Reader" (SPLADE)
This is the new frontier of Learned Sparse Retrieval (LSR). Its most famous and effective model: SPLADE (Sparse Lexical and Expansion Model).
This is a total game-changer. SPLADE is built on a Transformer AI model (specifically, the BERT architecture). This AI has read basically the entire internet. It doesn't just know words; it knows concepts.
Here's the magic, step-by-step:
- Core Mechanism: SPLADE leverages the Masked Language Modeling (MLM) head of BERT. This is the part of the model that's trained to fill in the blanks, which forces it to understand the context of every single word.
- Understand & Expand: Instead of discarding the MLM head, SPLADE uses it to predict the relevance of every word in its vocabulary (around 30,000 terms) based on the input document. When you give it "A guide to French cheese," the model expands the term list to include all related concepts it learned: 'brie', 'camembert', 'gourmet', 'wine', etc.
- Prune & Sparsify: Here's the genius part (the "Sparse" in SPLADE): the model is trained with strong regularization to be a minimalist. It forces the scores of 99%+ of the vocabulary to be zero, leaving behind only the absolute most important and descriptive "tags" for this document.
The result is a "Smart Tag Cloud."
- A BM25 "tag cloud" for that doc would just be: [guide, french, cheese]
- A SPLADE "tag cloud" (its sparse vector) would be: [guide, french, cheese, brie, camembert, gourmet, wine, dairy]
Now... when your user searches for "best brie recipe," SPLADE sees a match! Your document is found, even though it never contained the word "brie." The vocabulary mismatch problem is finally solved.
📌 Not the Only Player: Learned Sparse Retrieval (LSR)
While SPLADE is the most widely adopted, it is part of a broader, active research field called Learned Sparse Retrieval (LSR). Other notable models you might encounter include uniCOIL and DeepImpact. They all aim to achieve the same goal - neural-powered keyword search - but differ slightly in their training and approach (e.g., DeepImpact often expands documents before applying the learned scores, while SPLADE does it end-to-end).
📊 How SPLADE Expands Your Query
Let's see how SPLADE transforms a simple query into a rich concept map:
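Here's a rough sketch of how a SPLADE-style sparse vector can be produced with Hugging Face transformers. The checkpoint name is an assumption (any SPLADE-family model with an MLM head follows the same pattern), and the exact expansion terms and weights depend on the model you load:

```python
# Sketch: turn a query into a SPLADE-style sparse vector (vocab-sized, mostly zeros).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "naver/splade-cocondenser-ensembledistil"  # assumed public SPLADE checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

inputs = tokenizer("french cheese", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                      # (1, seq_len, vocab_size)

# SPLADE pooling: log-saturated ReLU of the MLM logits, max-pooled over the sequence.
weights = torch.max(
    torch.log1p(torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1),
    dim=1,
).values.squeeze(0)                                      # one weight per vocabulary term

# Show the heaviest "tags" -- the original words plus learned expansions.
top = torch.topk(weights, 10)
for weight, token_id in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tokenizer.decode([token_id]):>12}  {weight:.2f}")
```

In practice you keep only the nonzero (term id, weight) pairs and store them as a sparse vector in your search engine.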
Notice how SPLADE not only keeps the original terms ("french", "cheese") but intelligently adds related concepts ("brie", "camembert", "gourmet", "dairy", "wine") with appropriate weights. The higher the weight, the more important the term is to understanding the query.
💡 Wait... Why Not Just Use "AI Search" (Dense Vectors)?
This is the central question. You've heard "semantic search" and "dense vectors." If SPLADE uses AI, and semantic search uses AI, aren't they the same?
Nope! They are two different AI strategies, good at opposite things.
- Dense Vectors (Semantic Search):
  - Analogy: A "GPS Coordinate" for meaning, or a "Vibe Check."
  - How it works: It takes your whole document and squashes its entire meaning into a list of ~768 numbers, like [0.1, -0.4, 0.9, ...] (see the sketch after this list).
  - Good at: Finding holistic concepts. It knows "sad songs" is close to "lyrics about a broken heart." It's great at finding the "vibe."
  - Bad at: Specificity. It "averages out" the meaning. If you search for a specific product ID like "SKU-ABC-123," the dense vector just sees "some product ID" and gets confused. It loses the specific keyword.
- SPLADE (Learned Sparse Search):
  - Analogy: A "Smart Tag Cloud" or an "AI-powered Index."
  - How it works: It creates a huge list (~30,000 slots) that is mostly zeros but has high scores on very specific keywords, including its smart expansions.
  - Good at: Precision. It loves specific keywords. It will see "SKU-ABC-123" and put a massive importance score on that exact term, making it impossible to miss.
  - Bad at: Vague "vibes." A search for "that feeling you get on a rainy day" would be hard for SPLADE, but easy for a dense vector.
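For comparison, here's what the dense side looks like in code. This is a sketch using sentence-transformers; the model name is an assumption, and this particular checkpoint produces 384 dimensions rather than 768, but the idea is the same:

```python
# Sketch: squash a whole sentence into one fixed-size dense vector.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed public embedding model
vibe = encoder.encode("lyrics about a broken heart")

print(vibe.shape)   # (384,) -- every slot is filled; no single slot maps to a word
print(vibe[:5])     # the first few numbers of the "GPS coordinate" for meaning
```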
The ultimate setup isn't one or the other. It's Hybrid Search: using both at the same time. You get the "vibe" search from dense vectors and the "precision" search from SPLADE.
📦 Putting It All Together in Your Vector DB
This is where a modern vector database like Qdrant becomes so powerful. It's designed for this new, hybrid world. It doesn't force you to choose.
🔄 How Hybrid Search Works
Here's how Qdrant executes a hybrid search, combining dense and sparse vectors with RRF fusion:
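Here's a sketch of what that looks like with the qdrant-client Query API (version 1.10+ assumed); the collection name, named vectors, and example query vectors are hypothetical placeholders:

```python
# Sketch: one hybrid query -- dense + sparse candidates fused with Reciprocal Rank Fusion.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

dense_query = [0.1, -0.4, 0.9]              # stand-in for a real dense embedding
sparse_query = models.SparseVector(         # stand-in for a SPLADE query vector
    indices=[1023, 4521, 8099],             # vocabulary term ids with nonzero weight
    values=[1.8, 0.9, 0.4],                 # learned importance of each term
)

results = client.query_points(
    collection_name="articles",
    prefetch=[
        models.Prefetch(query=dense_query, using="dense", limit=50),    # "vibe" candidates
        models.Prefetch(query=sparse_query, using="sparse", limit=50),  # keyword candidates
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # merge the two ranked lists
    limit=10,
)
```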
The beauty of hybrid search is that it runs both searches in parallel, then intelligently combines the results. Documents that appear highly ranked in both searches get a significant boost, ensuring you get the most relevant results.
🎯 Beyond Retrieval: The Final Step is Reranking
You now have a Hybrid Search system that is both broad (high Recall from Dense) and specific (high Precision from Sparse). This is called First-Stage Retrieval - you've successfully identified a short list of, say, 50 potential documents.
But for a true production-grade system (especially for Retrieval-Augmented Generation, or RAG), we need one more step: Reranking.
Reranking is the quality control filter. It is done by a highly accurate but slow model called a Cross-Encoder.
| Concept | The Analogy | The Technical Difference |
|---|---|---|
| First-Stage Models (Dense/SPLADE) | The Matchmaker: They look at your query and your document separately, scoring them only on vector similarity. Fast, but lacks nuance. | Separate Encoding: Query and Document are encoded into vectors independently. |
| The Problem | Query: "Why is brie the best cheese for wine?" | Hybrid search might find a document that mentions brie highly and another that mentions wine highly, but misses the document that explicitly links the two. |
| Reranking (Cross-Encoder) | The Literary Critic: It reads your query and the document together, analyzing how every word in the query interacts with every word in the document. Slow, but highly nuanced. | Joint Encoding: Query and Document are fed into the Transformer network at the same time. |
| The Solution | The Cross-Encoder instantly sees that the document titled "Pairing Brie with Chardonnay Wine" is the perfect answer, even if the similarity score from the first stage wasn't the absolute highest. | It correctly promotes the most contextually relevant document to the #1 spot. |
Hybrid Search gets the candidates; Reranking picks the winner.
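To make that last step concrete, here's a minimal reranking sketch with the sentence-transformers CrossEncoder class; the checkpoint name and candidate documents are placeholders:

```python
# Sketch: rerank first-stage candidates by scoring (query, document) pairs jointly.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed public checkpoint

query = "Why is brie the best cheese for wine?"
candidates = [                                   # e.g. the top results from hybrid search
    "Pairing Brie with Chardonnay Wine",
    "A history of French dairy farming",
    "Ten wines to bring to a party",
]

# The cross-encoder reads query and document together and returns one relevance score each.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)

for doc, score in reranked:
    print(f"{score:.3f}  {doc}")
```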
The Journey Continues
We've come a long, long way from just counting words. The "search problem" is finally being solved by combining these ideas. We started with TF-IDF (a dumb counter), got smarter with BM25 (a great sorter), and now, with AI models like SPLADE, we're teaching search to understand what we mean, not just what we say.
The future isn't dense vs. sparse. It's both, with a final, highly-accurate reranker to guarantee quality.