Why Your Search Returns Nothing — And How MongoDB Vector Search Fixes It
Keyword search can only find what's literally there. When users search for 'laptop bag' and your documents say 'notebook carrying case,' regex won't help. Vector search understands meaning — and MongoDB Atlas supports it natively.
The hidden results problem
Your search works. Or at least, it appears to. Users type a query, results come back, nobody complains too loudly. But there's a class of failure that's almost invisible: the results that should have appeared but didn't.
Consider a product catalog for an online store. A user types "laptop bag" into the search bar. The system returns the two products whose title or description contains the phrase "laptop bag." Seems fine.
But there are a dozen other relevant products in the catalog. A "notebook carrying case," a "padded sleeve for 15-inch computers," a "tech commuter backpack with device compartment." None of them contain the literal string "laptop bag." So none of them appear.
The user sees two results and assumes that's all you carry. The system silently hid the most relevant products because the search mechanism is structurally incapable of understanding what the user meant.
How keyword search actually works
Most internal search implementations I've encountered use some form of regex or substring matching. The query string is scanned against a set of indexed fields — productName, description, category, specifications — and any document containing that exact character sequence is returned.
```javascript
db.products.find({
  $or: [
    { productName: { $regex: "laptop bag", $options: "i" } },
    { description: { $regex: "laptop bag", $options: "i" } },
    { category: { $regex: "laptop bag", $options: "i" } },
    { specifications: { $regex: "laptop bag", $options: "i" } },
  ]
})
```
This works when the user's language exactly matches the document's language. Searching "laptop" finds documents that say "laptop." Searching "bag" also works because it's a substring of "bags."
But the approach is purely lexical. It has no concept of meaning.
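The gap is easy to reproduce. A minimal sketch of the same substring logic in Python, using invented product descriptions:

```python
import re

# Hypothetical catalog entries -- invented for illustration
descriptions = [
    "Padded notebook carrying case for 15-inch computers",
    "Tech commuter backpack with device compartment",
    "Canvas laptop bag with shoulder strap",
]

def keyword_match(query: str, text: str) -> bool:
    """Case-insensitive substring match, like the $regex query above."""
    return re.search(re.escape(query), text, re.IGNORECASE) is not None

hits = [d for d in descriptions if keyword_match("laptop bag", d)]
# Only the product that literally says "laptop bag" matches; the
# semantically identical carrying case and backpack are invisible.
```

Three relevant products, one result returned.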
Where it breaks
The failure modes are systematic, not edge cases:
| User searches for | Expects to find | Why keyword search fails |
|---|---|---|
| "laptop bag" | Notebook sleeves, tech backpacks | The product says "carrying case," not "bag" |
| "winter jacket" | Parkas, puffer coats, insulated shells | The product says "thermal outerwear" |
| "kids tablet" | Educational devices, learning pads | The product says "children's interactive screen" |
| "gift for a runner" | Running shoes, fitness trackers, hydration gear | No field contains the concept of "gift for a runner" |
| "something for a road trip" | Coolers, car chargers, travel pillows | Conceptual queries have no literal match |
No amount of field indexing can anticipate every way a user might express their intent. The limitation isn't in the implementation — it's in the paradigm.
The denormalization band-aid
One common reaction is to denormalize: pull related data from other collections into the searchable document. Say your catalog has a products collection with basic metadata, but the rich keyword-friendly descriptions live in a separate productDetails collection linked by SKU.
```javascript
// Before: lean product document with references
{
  "_id": "prod_2241",
  "productName": "TechShield Commuter Pack",
  "brand": "TechShield",
  "skus": ["TS-441", "TS-442", "TS-443"]
}

// After: enriched with detail metadata
{
  "_id": "prod_2241",
  "productName": "TechShield Commuter Pack",
  "brand": "TechShield",
  "skus": ["TS-441", "TS-442", "TS-443"],
  "variantNames": [
    "TechShield Padded Laptop Bag 15-inch, Black",
    "TechShield Padded Laptop Bag 15-inch, Navy",
    "TechShield Padded Laptop Sleeve 13-inch, Gray"
  ]
}
```
Now a search for "laptop bag" will match this product because the string appears in variantNames. This works as a tactical fix. But it introduces a trade-off: every product document must be updated whenever variant data changes, the redundancy must be maintained over time, and you're still playing catch-up with user vocabulary.
A user who searches for "backpack for my MacBook" still won't match "Padded Laptop Bag" unless you keep expanding the denormalized fields. You're patching a fundamentally lexical system one synonym at a time.
Vector search: matching by meaning
Vector search takes a completely different approach. Instead of comparing character sequences, it compares meaning.
The core idea: convert text into high-dimensional numerical representations called embeddings. These are generated by machine-learning models (Voyage AI, OpenAI's text-embedding-3-small, open-source models like nomic-embed-text) trained on massive text corpora. The models learn semantic relationships between words and concepts.
In embedding space:
- Words with similar meanings cluster close together (small vector distance)
- Words with different meanings are far apart (large vector distance)
```
"laptop bag"      → [0.021, -0.187, 0.443, 0.078, ..., 0.312]   (768 dimensions)
"notebook sleeve" → [0.019, -0.174, 0.451, 0.065, ..., 0.298]   (nearby)
"refrigerator"    → [-0.342, 0.501, -0.113, 0.227, ..., -0.089] (distant)
```
When a user searches for "laptop bag," the query is converted into an embedding and compared against the pre-computed embeddings of all documents. Results are ranked by cosine similarity. The "notebook carrying case" appears — not because of a string match, but because the model understands that carrying cases and bags for laptops inhabit the same semantic neighborhood.
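Cosine similarity is just the dot product of the two vectors divided by the product of their magnitudes. A toy illustration in three dimensions (real embeddings have hundreds; these values are invented, not model output):

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b, normalized by their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" -- invented values for illustration only
laptop_bag      = [0.8, 0.1, 0.1]
notebook_sleeve = [0.7, 0.2, 0.1]
refrigerator    = [0.1, 0.1, 0.9]

sim_related   = cosine_similarity(laptop_bag, notebook_sleeve)
sim_unrelated = cosine_similarity(laptop_bag, refrigerator)
# Semantically close concepts score near 1; unrelated ones score much lower.
```

Ranking candidates by this score is, conceptually, all a vector search engine does; the engineering lies in doing it efficiently over millions of high-dimensional vectors.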
MongoDB Atlas Vector Search: implementation
MongoDB Atlas supports vector search natively. No separate search infrastructure, no Elasticsearch sidecar, no data synchronization pipeline. It runs on your existing cluster.
Step 1: Generate embeddings
For each document, concatenate the semantically meaningful fields and pass them through an embedding model:
```python
def build_embedding_text(product):
    parts = [
        product.get("productName", ""),
        product.get("brand", ""),
        product.get("description", ""),
        product.get("category", ""),
        product.get("specifications", ""),
    ]
    return " | ".join(part for part in parts if part)
```
For the commuter pack, this produces:
```
"TechShield Commuter Pack | TechShield | Durable backpack with padded
device compartment and organizer pockets | Bags & Accessories | Water-
resistant nylon, fits up to 15-inch devices"
```
The resulting embedding captures the concept — "a bag for carrying tech devices, backpack form factor, protective padding." Store it as a new field on the document:
```javascript
{
  "_id": "prod_2241",
  "productName": "TechShield Commuter Pack",
  "embedding": [0.019, -0.174, 0.451, 0.065, "...", 0.298]
}
```
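The write path can be sketched as one function that builds the source text and attaches the vector, independent of which model you call. `attach_embedding` and `fake_embed` are illustrative names, not library APIs; in production, `embed_fn` would wrap a real embedding client (Voyage AI, OpenAI, or a local model):

```python
def build_embedding_text(product):
    parts = [
        product.get("productName", ""),
        product.get("brand", ""),
        product.get("description", ""),
        product.get("category", ""),
        product.get("specifications", ""),
    ]
    return " | ".join(part for part in parts if part)

def attach_embedding(product, embed_fn):
    """Return a copy of the document with an 'embedding' field added."""
    return {**product, "embedding": embed_fn(build_embedding_text(product))}

# Stub model for demonstration -- replace with a real embedding API call
def fake_embed(text):
    return [0.0] * 768  # must match the index's numDimensions

doc = attach_embedding(
    {"_id": "prod_2241", "productName": "TechShield Commuter Pack"},
    fake_embed,
)
```

Keeping the embedding call behind a function parameter also makes it easy to swap models later, though a model swap means re-embedding every document.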
Step 2: Create a vector search index
Define the index in MongoDB Atlas:
```json
{
  "type": "vectorSearch",
  "fields": [
    {
      "path": "embedding",
      "type": "vector",
      "numDimensions": 768,
      "similarity": "cosine"
    }
  ]
}
```
The numDimensions must match your embedding model's output size. Cosine similarity is the standard choice for text embeddings.
Step 3: Query with $vectorSearch
At search time, embed the user's query with the same model and pass it to the $vectorSearch aggregation stage:
```javascript
db.products.aggregate([
  {
    $vectorSearch: {
      index: "product_vector_index",
      path: "embedding",
      queryVector: embedQuery("laptop bag"),
      numCandidates: 100,
      limit: 20
    }
  },
  {
    $project: {
      productName: 1,
      brand: 1,
      category: 1,
      score: { $meta: "vectorSearchScore" }
    }
  }
])
```
A search for "laptop bag" now returns:
| Rank | Product | Score |
|---|---|---|
| 1 | TechShield Commuter Pack | 0.92 |
| 2 | SlimGuard Notebook Sleeve 15" | 0.89 |
| 3 | UrbanGear Padded Carrying Case | 0.86 |
| 4 | ProTravel Tech Backpack | 0.81 |
The TechShield Commuter Pack ranks first even though its listing never contains the phrase "laptop bag": the embedding places a "commuter pack with padded device compartment" squarely in the semantic neighborhood of the query.
Why this is fundamentally better
| Dimension | Keyword / Regex Search | Vector Search |
|---|---|---|
| Matching mechanism | Exact substring match | Semantic similarity |
| Handles synonyms | No ("bag" ≠ "case" ≠ "sleeve") | Yes (understands equivalence) |
| Handles paraphrasing | No | Yes ("something to carry my laptop in" → bags) |
| Requires denormalization | Yes — must copy data into searchable fields | No — meaning is captured in the embedding |
| Maintenance burden | High — keep redundant fields in sync | Low — re-embed only when source text changes |
| Typo tolerance | No ("laptpo bag" fails) | Partial (embeddings are robust to minor variations) |
| Conceptual queries | Impossible | Yes ("gear for tech commuters" surfaces relevant products) |
| Ranking quality | Binary (match or no match) | Continuous relevance score |
The most significant advantage is the last one. Keyword search is binary — either a document contains the string or it doesn't. Vector search produces a relevance score, which means results can be ranked by how closely they match the user's intent.
Hybrid search: the pragmatic choice
Pure vector search has one weakness: exact matches. If a user types the precise product name — "TechShield Commuter Pack 15-inch Black" — keyword search will nail it immediately, while vector search might rank it highly but not necessarily first.
MongoDB Atlas supports hybrid search — combining full-text search scores with vector similarity scores using Reciprocal Rank Fusion (RRF):
```javascript
db.products.aggregate([
  {
    $vectorSearch: {
      index: "product_vector_index",
      path: "embedding",
      queryVector: embedQuery("laptop bag"),
      numCandidates: 100,
      limit: 50
    }
  },
  {
    $unionWith: {
      coll: "products",
      pipeline: [
        {
          $search: {
            index: "product_text_index",
            text: {
              query: "laptop bag",
              path: ["productName", "brand", "description", "category"]
            }
          }
        },
        { $limit: 50 }
      ]
    }
  }
  // Reciprocal Rank Fusion to merge and re-rank results
])
```
This gives you the best of both worlds:
- Exact product name searches are handled crisply by keyword matching
- Exploratory or conceptual queries ("something waterproof for hiking with my laptop") are handled by vector similarity
- Both signals are fused into a single ranked result set
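Reciprocal Rank Fusion itself is simple: each document earns 1 / (k + rank) from every result list it appears in, and the contributions are summed (k is a smoothing constant, commonly 60). A standalone sketch of the merge step, assuming each pipeline returns an ordered list of document IDs (the IDs below are invented):

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked ID lists; each appearance contributes 1 / (k + rank)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["prod_2241", "prod_1187", "prod_0042"]  # semantic ranking
keyword_hits = ["prod_2241", "prod_1187"]               # exact-match ranking

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# Documents found by both pipelines outrank those found by only one.
```

Because only ranks are used, RRF needs no score normalization between the two pipelines, whose raw scores live on incompatible scales.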
Enriching embeddings for better results
While vector search solves the hidden results problem without denormalization, you can further improve quality by including related data in the embedding source text. This is a lighter-weight cousin of the denormalization approach — instead of restructuring documents for keyword scanning, you append context to the text that gets embedded:
```python
def build_enriched_embedding_text(product, variant_names):
    base = build_embedding_text(product)
    variants = " | ".join(variant_names)
    return f"{base} | Variants: {variants}"
```
This gives the embedding model richer context, strengthening the semantic signal for terms that appear in variant details but not in the main product listing. The document structure remains unchanged — only the embedding benefits.
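A quick check of what the model actually sees, re-stating the earlier helper so the snippet stands alone (product values invented for illustration):

```python
def build_embedding_text(product):
    parts = [
        product.get("productName", ""),
        product.get("brand", ""),
        product.get("description", ""),
    ]
    return " | ".join(part for part in parts if part)

def build_enriched_embedding_text(product, variant_names):
    base = build_embedding_text(product)
    variants = " | ".join(variant_names)
    return f"{base} | Variants: {variants}"

text = build_enriched_embedding_text(
    {"productName": "TechShield Commuter Pack", "brand": "TechShield"},
    ["TechShield Padded Laptop Bag 15-inch, Black"],
)
# The phrase "Laptop Bag" now reaches the embedding model even though
# it appears nowhere in the product document itself.
```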
When to adopt this
If your search is backed by regex or basic $text queries in MongoDB, the path forward is clear:
- Immediate: Audit your highest-traffic search queries. Identify which ones return fewer results than they should. This quantifies the hidden results problem in your system.
- Short-term: If vector search adoption needs time, consider targeted denormalization for the worst-performing queries. This buys time without architectural change.
- Medium-term: Implement MongoDB Atlas Vector Search. Generate embeddings from your document metadata, create the index, and validate with A/B testing against the current search.
- Long-term: Adopt hybrid search combining keyword and vector signals. Extend to additional surfaces — product discovery, recommendations, conversational search.
The result is a search experience where users find what they're looking for, even when they don't use the exact words that appear in your data. That's not a nice-to-have. For any system where search drives engagement or revenue, it's the difference between a product that feels smart and one that feels broken.
This article is part of a series on databases and data infrastructure.
Related Posts

- Building a Production Embedding Pipeline with MongoDB Atlas and Voyage AI: Generating embeddings is the easy part. Keeping them in sync as your data changes — at scale, without downtime — is where the real engineering lives. Here's how to build the full pipeline on MongoDB Atlas.
- Database Patterns You Should Know Before Choosing Your Next Database: The choice between Postgres and MongoDB isn't about which is 'better.' It's about understanding the access patterns, consistency requirements, and operational constraints of your system.
- Building for Scale: Architecture Patterns That Actually Work: Most scaling advice is generic. Here are the patterns that have consistently worked across real systems handling millions of requests — and the ones that sound good but fail in practice.