I recently came across On the Theoretical Limitations of Embedding-Based Retrieval, a paper that overlaps with my interest in better understanding the design decisions made when designing, training, or selecting models. I enjoy this type of work and wish to see more of it, but this blog post is going to take a more critical look at the results. Papers in this vein need to be rigorous, or at the very least well designed, to be of use to practitioners.

Concurrently, this video critique of the same paper by Yannic Kilcher came out. While there is overlap, the video has significantly more background on IR (which the reader is assumed to have for this post) and, in my opinion, doesn't dive as deeply into the methodological issues of the paper. Moreover, this blog will detail additional experiments and suggestions for how this work could be used in the future. I recommend watching the video first, especially for people unfamiliar with IR or newer to the field.

Now on to the show. The blog has the following structure: first the paper's three contributions are discussed, followed by the experiments I would have liked to see, finishing up with an analysis of future directions based on this paper.

  1. proposing theoretical bounds on the minimal number of dimensions needed
  2. measuring empirical bounds on the dimensions needed for specific properties
  3. practical experiments
  4. experiments I’d like to see
  5. implications and future directions

Theoretical Bounds

Rather than go over the proof itself, I will present the results and an interpretation of their implications. A different notation from the paper is used for simplicity, and the terminology, mathematical or otherwise, is defined below. Note these definitions will be used throughout the post:

$C$ is the collection of documents $c$. $c$ is used to avoid confusion with dimension $d$

$Q$ is the set of queries $q$

$|X|$ refers to the size of a set $X$; concretely, $|C|$ is the number of documents and $|Q|$ the number of queries

$rel$ is a $|C| \times |Q|$ matrix representing the relationships, such that $rel_{ij} = 1$ iff a document and query are related and $rel_{ij} = 0$ otherwise

$e_q$ and $e_c$ are respectively vectors that capture the queries and documents

$\hat{rel}$ is the approximation of $rel$ given by $e_q$ and $e_c$

$rank$ refers to the rank of a matrix; ranking refers to the act of ordering items; order will be used to refer to the order in which a model lists items

$signrank$ is the sign rank of a sign matrix $M$, i.e., the minimum rank over all real matrices with the same sign pattern as $M$, where $M$ is a matrix with entries in $\{-1, 1\}$

$k$ refers to the number of elements used; it will be used in the context of top-$k$

With these definitions in hand, we move on to the theorem itself. The paper lays out three properties of $rel$ that can be preserved and their relationship to $\hat{rel}$. These are:

  1. row-wise order, given by $rank_{rop}(rel)$; this represents the smallest number of dimensions that preserves the ground-truth order between embeddings
  2. local threshold rank, given by $rank_{rt}(rel)$; this is the smallest number of dimensions that preserves a certain distance $\tau_q$ for a given row $q$
  3. global threshold rank, given by $rank_{gt}(rel)$; this is the smallest number of dimensions that preserves a certain distance $\tau$ for all rows

Property 1's utility is intuitive; what is the point of a ranking system that doesn't preserve order? As far as I can tell, properties 2 and 3 don't have a purpose besides being used to establish a bound on the number of dimensions $d$ required. The proof is an inequality that ties all three of these properties together and finds bounds defined using $signrank$ over $rel$, where the binary matrix $rel$ is first mapped to the sign matrix $2\,rel - 1$.

$$signrank(rel) - 1 \le rank_{rop}(rel) = rank_{rt}(rel) \le rank_{gt}(rel) < signrank(rel)$$

The long and short of it is that the sign rank gives, to within one dimension, the minimal embedding size required (look at both ends of the inequality) to preserve the three properties listed above, most importantly order preservation. I think there are two primary limitations for practical use that need to be further explored:

  1. Sign rank is difficult to compute. The authors briefly touch on this in the related works section, but essentially using sign rank directly (as of the date of publishing) is a non-starter; a toy illustration follows this list
  2. The effect of changes to $rel$ w.r.t. the $d$ required, ideally in practice, i.e., does the minimum number of dimensions required change dramatically under $p(y|x)$ shifts (e.g., new data, new split, new domain)?
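To make the objects in the inequality a bit more concrete, here is a toy illustration (mine, not the paper's) that builds a tiny $rel$ matrix and computes the ordinary rank of its sign matrix $2\,rel - 1$. The ordinary rank is only an easy-to-compute upper bound on the sign rank, since the sign matrix trivially realizes its own sign pattern; computing the sign rank itself is the hard part.

```python
import numpy as np

# Toy ground-truth relevance matrix: 4 documents x 3 queries.
# rel[i, j] = 1 iff document i is relevant to query j.
rel = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
])

# Map {0, 1} entries to the sign matrix in {-1, +1} used by the theorem.
sign_matrix = 2 * rel - 1

# The ordinary matrix rank upper-bounds the sign rank (the sign matrix itself
# has the required sign pattern). The sign rank, i.e. the minimum rank over
# all matrices with this sign pattern, is what is actually hard to compute.
print("rank(2*rel - 1) =", np.linalg.matrix_rank(sign_matrix))
```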

The effect of changes to $rel$ will be discussed later. For now we will look at the method the authors propose for estimating a bound instead of computing sign rank.

Empirical Bounds on Performance

Since sign rank is difficult to compute, the authors propose using "free embedding optimization". The setup is that, for an embedding of size $d$, they try to find the maximum $|C|$ that can be represented, regardless of the underlying task.

The method avoids using text representations altogether. Rather, the goal is to directly optimize embeddings $e_c$ and $e_q$ that approximate $rel$. Direct optimization removes a confounding variable (the text encoder) and establishes an upper bound on performance. $|Q| + |C|$ embeddings are initialized randomly and then optimized to fit $rel$. Performance is evaluated with top-k accuracy, i.e., what percentage of the top-k items are correct. The idea is that an embedding of dimension $d$ can only represent so many documents, so a method is needed for finding how many documents can be represented. For a given $d$, the authors find the largest $|C|$ that can be fit: they increase $|C|$ until the top-k accuracy decreases, and this serves to indicate the upper limit on what an embedding of that size can represent.
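To make the setup concrete, here is a minimal sketch of free-embedding optimization written with JAX and optax (the paper's released code is JAX-based). The loss, initialization scale, and hyperparameters below are my own stand-ins rather than the paper's exact recipe, and I use a queries × documents orientation for $rel$ (the transpose of the definition above) since it is more convenient in code.

```python
import jax
import jax.numpy as jnp
import optax


def free_embedding_fit(rel, d, steps=2000, lr=1e-2, seed=0):
    """Directly optimize query and document embeddings to fit a binary rel matrix.

    rel: (num_queries, num_docs) array with rel[i, j] = 1 iff query i matches doc j.
    Returns the optimized embeddings (e_q, e_c).
    """
    rel = jnp.asarray(rel, dtype=jnp.float32)
    num_q, num_c = rel.shape
    key_q, key_c = jax.random.split(jax.random.PRNGKey(seed))
    params = {
        "e_q": 0.1 * jax.random.normal(key_q, (num_q, d)),
        "e_c": 0.1 * jax.random.normal(key_c, (num_c, d)),
    }

    def loss_fn(p):
        # Approximate rel with dot-product scores pushed through a sigmoid.
        scores = p["e_q"] @ p["e_c"].T
        return optax.sigmoid_binary_cross_entropy(scores, rel).mean()

    opt = optax.adam(lr)
    opt_state = opt.init(params)

    @jax.jit
    def step(p, s):
        loss, grads = jax.value_and_grad(loss_fn)(p)
        updates, s = opt.update(grads, s, p)
        return optax.apply_updates(p, updates), s, loss

    for _ in range(steps):
        params, opt_state, _ = step(params, opt_state)
    return params["e_q"], params["e_c"]


def top_k_accuracy(e_q, e_c, rel, k=2):
    """Fraction of queries whose predicted top-k exactly matches the true top-k.

    Assumes each query has exactly k relevant documents, as in the paper's setup.
    """
    scores = e_q @ e_c.T
    pred = jnp.argsort(-scores, axis=1)[:, :k]
    true = jnp.argsort(-jnp.asarray(rel, dtype=jnp.float32), axis=1)[:, :k]
    hits = [set(map(int, p)) == set(map(int, t)) for p, t in zip(pred, true)]
    return sum(hits) / len(hits)
```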

One question that naturally arises is how many queries the dataset should have and what $rel$ should look like. In other words, what dataset should we use to establish these bounds? The solution the authors arrive at is that every single top-k combination of documents needs to be represented. This leads to an interesting issue, which is that there are $\binom{|C|}{k}$ queries. This produces very large values of $|Q|$, as the authors point out.
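To get a feel for how quickly this blows up, the number of queries needed to cover every top-k subset of documents is just a binomial coefficient. A quick back-of-the-envelope check:

```python
from math import comb

# If every top-k subset of the |C| documents must appear as a query,
# then |Q| = C(|C|, k).
for num_docs in (10, 100, 1_000, 10_000):
    for k in (2, 5):
        print(f"|C| = {num_docs:>6,}, k = {k}: |Q| = {comb(num_docs, k):,}")
```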

For this reason the authors only evaluate small values of $d$ (in the range $[4, 45]$) and a small $k$ (2). Using the 41 values calculated, a 3rd-degree polynomial is fit.

$$f(d) = -10.5322 + 4.0309\,d + 0.0520\,d^2 + 0.0037\,d^3$$

Using this polynomial, the maximum number of documents for which an embedding of a given $d$ can get the top-2 correct is extrapolated. The values found are:

| num dimensions | num items |
| --- | --- |
| 512 | 500k |
| 768 | 1.7M |
| 1,024 | 4M |
| 3,072 | 107M |
| 4,096 | 250M |
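As a sanity check, these values can be roughly reproduced by plugging the larger dimensions into the fitted cubic. The printed coefficients are rounded, so the numbers come out slightly different from those reported, but the order of magnitude is the same:

```python
def f(d):
    # Cubic fit as printed in the paper (coefficients rounded).
    return -10.5322 + 4.0309 * d + 0.0520 * d**2 + 0.0037 * d**3

for d in (512, 768, 1024, 3072, 4096):
    print(f"d = {d:>5}: ~{f(d):,.0f} documents")
```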

These results establish the crux of the authors' argument that embeddings are unable to represent enough data for "web-scale" retrieval. Looking a bit deeper, it's unclear that this is the case. The following subsections discuss methodological and experimental design issues with this approach.

Methodological Issues

Curve Fitting and Extrapolation

The use of extrapolation can be fraught with issues. In this case the authors extrapolate 1-2 orders of magnitude beyond limited data and, as far as I can tell, don't really justify their use of a 3rd-order polynomial. That being said, the fit has only one real zero and increases monotonically, so at least there won't be drops at higher values.

More notably, the authors don't provide any error bars for their estimates. Below is an example of what their estimates would look like with error bars, using 95% CIs computed both analytically (from the standard deviation) and via bootstrapping.

[Figure: extrapolated estimates with error bars and 95% CIs (analytic and bootstrap)]

The 95% confidence intervals around the estimates are rather large, nearly 50%. Below is a table with the ranges:

| dimensions | Mean | Analytic CI Lower | Analytic CI Upper | Bootstrap CI Lower | Bootstrap CI Upper |
| --- | --- | --- | --- | --- | --- |
| 512 | 509,025.71 | 332,616.85 | 685,434.56 | 90,597.41 | 720,449.71 |
| 768 | 1,698,767.93 | 1,072,650.10 | 2,324,885.76 | 225,187.76 | 2,447,487.23 |
| 1024 | 4,005,321.60 | 2,483,952.83 | 5,526,690.37 | 438,732.75 | 5,827,806.20 |
| 3072 | 107,062,547.63 | 63,930,717.43 | 150,194,377.84 | 6,684,349.74 | 158,563,955.12 |
| 4096 | 253,473,998.20 | 150,616,883.99 | 356,331,112.41 | 14,319,291.91 | 376,219,318.51 |

In short, the authors' estimates could be seriously off, even assuming there are no other issues. Interestingly, the bootstrap estimate seems to prefer a more conservative number of items. I suspect that is because the fit underestimates at larger numbers of dimensions, i.e., at the tail end of the original data the points sit above the curve plotted by the polynomial. So when bootstrapping, samples that are primarily composed of smaller values tend to estimate a very small number of items. I don't know if this trend would hold for higher dimensions though, since the experimental results seem to hover around the fitted line.
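For reference, here is roughly how such intervals can be produced. It assumes the 41 $(d, \max|C|)$ points from the paper's sweep are available as `dims` and `max_docs` arrays (not reproduced here) and uses a plain percentile bootstrap, resampling points and refitting the cubic; the exact settings behind the table above may differ, and the analytic interval would instead come from the covariance of the fitted coefficients.

```python
import numpy as np


def bootstrap_cubic_ci(dims, max_docs, targets, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CIs for a cubic fit, evaluated at `targets`.

    dims, max_docs: the empirical (d, max |C|) points from the sweep.
    targets: dimensions at which to extrapolate (e.g. 512, 768, ...).
    """
    rng = np.random.default_rng(seed)
    dims = np.asarray(dims, dtype=float)
    max_docs = np.asarray(max_docs, dtype=float)
    targets = np.asarray(targets, dtype=float)

    preds = np.empty((n_boot, len(targets)))
    for b in range(n_boot):
        idx = rng.integers(0, len(dims), size=len(dims))  # resample points
        coeffs = np.polyfit(dims[idx], max_docs[idx], deg=3)
        preds[b] = np.polyval(coeffs, targets)

    lower = np.percentile(preds, 100 * alpha / 2, axis=0)
    upper = np.percentile(preds, 100 * (1 - alpha / 2), axis=0)
    point = np.polyval(np.polyfit(dims, max_docs, deg=3), targets)
    return point, lower, upper

# point, lo, hi = bootstrap_cubic_ci(dims, max_docs, targets=[512, 768, 1024, 3072, 4096])
```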


Moving Goal Posts

Even if we take the extrapolated estimates at face value, the paper focuses on "web-scale" data, a term that is not well defined in the paper or in general. Moreover, the scales reached according to their estimates are rather large (250M documents with the correct top-2 at 4,096 dimensions). Even "just" 250M documents will cover most use cases, assuming the value is correct. The claim that single-vector embeddings are limited, based on these estimates, feels a bit contrived. In my personal experience, it's rare to have datasets with more than a few million items that need to be mapped or retrieved.

Experimental Design Issues

The methodological issues are just the cherry on top. There are more fundamental issues that make even the adjusted estimates unlikely to be representative of real-world performance.

Top-k Hyperparameter

The top-k value is critical in establishing bounds on $d$. A $k$ of 2 is used, but it's unclear how other values would have affected these estimates. What number of documents can be represented for $k$ of 5, 10, or 20, which are common values at which to report performance metrics in information retrieval?

The bigger issue is that top-k doesn't always have a clear analog in practice. How do we even choose $k$ for a given problem? For user-facing applications this might be easier, since we can suppose people won't look at more than 2, 5, or 10 results. What about pipelines with multiple steps? What about RAG, which top-k is suitable then? This might seem like a nitpick, but when your entire method for estimating the minimum dimension required for a given corpus relies on a hyperparameter, it is important to have an answer to the question of which $k$ value to pick.

Query/Corpus Relationship

The design choice to include every possible top-k combination as a query may seem principled, but in practice it is a poor one for two reasons:

  1. It makes the bounds computationally difficult to estimate, because $\binom{|C|}{k}$ grows very quickly for large values of $|C|$ and $k$
  2. It's unclear how close this bound is to typical IR query-document relationships, at least as captured in datasets. Is this bound close regardless of the task used, or much higher than what a real task needs?

The number of dimensions found by this method should be higher than for real tasks, since covering every combination is likely harder to fit than the query sets found in practice. Section 5.4 and Figure 6 seem to indicate that this would be the case, although those experiments are used in a different context. I personally believe this casts doubt on how representative the results are; assuming no other issues, these bounds are likely rather conservative.

Variability Based on Initial Conditions

Optimization-based methods, as far as I'm aware, are not guaranteed to converge to a global optimum or to consistent solutions across initializations. The paper appears to run each free-embedding experiment only once per configuration of $d$. Without repeated runs over different initializations, how can we be sure the maximum $|C|$ found is accurate, even in a setting where we are free to overfit? A straightforward solution would be to repeat each optimization several times with different seeds and report the variance (or a confidence interval) on the maximum $|C|$ obtained. This would validate the robustness of the empirical bound and account for any issues arising from optimization settings.

Practical Performance

While the empirical bound is interesting, as discussed it doesn't touch on what we can expect in practice. The practical section of the paper primarily serves to introduce a new dataset, LIMIT. The rationale is that current datasets don't have a high enough ratio of queries to documents and therefore can't fully test the possible combinations. Section 5.4 and Figure 6 demonstrate that a denser $rel$ matrix makes the task more difficult in terms of recall@100.

LIMIT simulates the query-document relationship by creating user profiles consisting of "attributes" and simple few-word queries. Attributes are statements such as "likes pepperoni pizza". Users are limited to 50 attributes per profile and queries ask for one attribute at a time, so it is a relatively straightforward task. The attributes and subsequent profiles are generated by Gemini 2.5 and checked for dupes and hypernyms (we'll circle back to this).

My critiques of this section are twofold:

  1. LIMIT is artificially, and unnecessarily, difficult
  2. No practical implications of the theorem are demonstrated

Artificial Difficulty

The authors pitch LIMIT as an easy dataset that BM25 can handle but neural models can't. There are three design choices that make the value of this claim questionable IMO:

  1. BM25 is used to filter profiles that are likely to be false positives
  2. Removing hypernyms from the attributes benefits models that, like the BM25 baseline, don't capture the semantics of words
  3. The domain used is unlike anything the models were trained on. Section 5.3 aims to show this isn't the case, but the experiment uses a split with new attributes, so it's fundamentally a different distribution of the dataset. It would not be shocking if the model couldn't learn $p(y|x)$ given how different $p(x)$ is. This isn't accounted for in their explanation.

For me the best example of this is that the ColBERT results are quite a bit better than single embeddings but still lower than BM25. Given that ColBERT is said to transfer across domains better (I've heard this on Twitter from the late-interaction fans, but don't have a citation at the moment), I think the difference can be accounted for by the three reasons listed above.

Another oddity worth mentioning is that about half the embedding models used aren't trained with MRL, which seems like a confounding variable for a paper trying to show how limited dimensionality causes issues. The authors acknowledge this, but IMO don't really explain why it is a valid design choice.

Missed Opportunities

Overall the practical section of the paper felt like a missed opportunity to actually show whether the theorem does (or doesn't) show up in practice. The authors even have code that could be used to test the $rel$ of real datasets using the free-embedding method (more on that later). IMO this would provide a more realistic picture of the bounds. You could even use this method on your own dataset to get an idea of the scale required. Instead, the section's focus is on showing how difficult LIMIT is for neural models and justifying the design choices used to make it that way.

Experiments I’d like to have seen

The theorem is a really good starting point for further work and I think there are a couple of directions in which to take it. This section is primarily focused on applying the theory to the real world and understanding the practical implications and utility of such a theory.

I've attempted the following experiments, but haven't gotten results that make sense. The top-k accuracy appears fixed for specific datasets: for example, nfcorpus always gets 5% top-2 accuracy and scifact 90%, regardless of the number of dimensions or other factors.

I used WSL with an RTX 4090 and WSL with an RTX 3090 (WSL is supported experimentally by JAX) and hit similar issues. I can't tell if the lack of correct results is due to:

  1. Hardware/environment configuration
  2. Misuse of the original code (the majority of the code used is copied from the paper)
  3. Issues in the underlying code

I've spent a fair amount of time debugging the code (2) and suspect the issue is likely 1 or 3. I have limited access to supported hardware and configurations, so it's hard to debug 1 at the moment. For now I've given up, but will try again later.

Estimates of minimum dimensions over real datasets

I'd like to see, for a given top-k accuracy, an estimate of the minimum dimensions required over a real dataset. For simplicity, smaller datasets could be used to start. I propose a grid search across six embedding sizes (64, 128, 256, 512, 1024, 2048) for two BEIR datasets:

  1. nfcorpus
  2. scifact

For each dataset, the code would go something like this:

  1. Download and unpack the benchmark data.
  2. Load documents, queries and relevance judgments (qrels).
  3. Convert the sparse relevance information into a binary ground truth map required by the optimizer.
  4. Call optimize_embeddings (the authors' original code) with the current dimensionality, together with a set of hyperparameters (learning rate, temperature, early-stopping settings, etc.) pulled from DEFAULT_EXPERIMENT_PARAMS.
  5. Calculate top-k accuracy
  6. Find the elbow where performance stops increasing for a given number of dimensions

This method is a bit rudimentary, but it would give an idea of where adding dimensions stops improving performance; a rough sketch of the loop follows.
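Here is that sketch, using the standard `beir` loader to build the binary qrels matrix. The fitting step is left as a pluggable `fit_and_score` callable because I don't want to assume the exact interface of the paper's `optimize_embeddings`; the free-embedding sketch earlier in the post, or a thin wrapper around the paper's code with `DEFAULT_EXPERIMENT_PARAMS`, could be dropped in.

```python
import numpy as np
from beir import util
from beir.datasets.data_loader import GenericDataLoader


def load_binary_qrels(dataset: str, split: str = "test"):
    """Download a BEIR dataset and turn its qrels into a binary |Q| x |C| matrix."""
    url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
    data_dir = util.download_and_unzip(url, "datasets")
    corpus, queries, qrels = GenericDataLoader(data_folder=data_dir).load(split=split)

    doc_index = {doc_id: i for i, doc_id in enumerate(corpus)}
    query_ids = list(queries)
    rel = np.zeros((len(query_ids), len(doc_index)), dtype=np.float32)
    for qi, qid in enumerate(query_ids):
        for doc_id, score in qrels.get(qid, {}).items():
            if score > 0 and doc_id in doc_index:
                rel[qi, doc_index[doc_id]] = 1.0
    return rel


def grid_search(dataset: str, fit_and_score, dims=(64, 128, 256, 512, 1024, 2048), k=2):
    """fit_and_score(rel, d, k) -> top-k accuracy; e.g. the free-embedding sketch
    above, or a wrapper around the paper's optimizer and its default settings."""
    rel = load_binary_qrels(dataset)
    results = {}
    for d in dims:
        results[d] = fit_and_score(rel, d, k)
        print(f"{dataset}: d={d} -> top-{k} accuracy {results[d]:.3f}")
    return results

# for name in ("nfcorpus", "scifact"):
#     grid_search(name, fit_and_score=my_fit_and_score)  # my_fit_and_score: hypothetical wrapper
```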

Variability of the method over multiple runs

In parallel to estimating minima, it would make sense to check how much the initial settings affect results.

We would add another variable: each combination of (dataset, dim) would be rerun three times using distinct seeds (I picked 0, 42, 13044). Each seed is passed directly into optimize_embeddings, which (presumably) seeds JAX's PRNG and the Adam optimizer state.

If the results for top-k accuracy vary across seeds, then it would be clear that an aggregation method is necessary to account for this discrepancy.
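A small sketch of that aggregation, again treating the fitting routine as a pluggable callable that accepts a seed rather than assuming the exact interface of the paper's code:

```python
import numpy as np


def seed_variability(rel, d, fit_and_score, seeds=(0, 42, 13044), k=2):
    """Rerun the fit with different seeds and report the spread in top-k accuracy.

    fit_and_score(rel, d, k, seed) is assumed to seed the PRNG used to
    initialize the embeddings (and ideally the optimizer state as well).
    """
    accs = np.array([fit_and_score(rel, d, k, seed=s) for s in seeds])
    return {"mean": accs.mean(), "std": accs.std(ddof=1),
            "min": accs.min(), "max": accs.max()}
```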

Effects of Changing $rel$

The relationship between the bound on $d$ and $rel$ raises some questions about how changes in $rel$ affect the dimensions required. How drastically does increasing the number of documents or queries change the dimensions required to capture a given problem? Moreover, how do we guarantee that the relationships in the training data, and the minimum dimensionality determined from them, are adequate for the task on held-out data? If we need a representative dataset to accurately compute minimum dimensions, the method becomes harder to use, especially since new, unseen queries are a relatively frequent occurrence given the long-tailed nature of queries.

These shifts naturally occur in three ways IMO:

  1. the task itself, i.e., shifts of $p(y|c,q)$, which is the function capturing the ground-truth relationship
  2. new relationships, e.g., when new queries or documents are added
  3. transfer between splits or domains

Practically each of these can be tested in the following ways:

  1. use k-fold splits and see how variable $d$ is across folds (a sketch is given after this list)
  2. see how $d$ changes when splits (e.g., train, val, dev) are combined (which should be more representative of the overall task)
  3. for transfer, there are two sub-experiments:
    1. using the existing splits and measuring the difference in $d$ across splits separately
    2. the same task across different domains
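As a sketch of the first of these, here is how the k-fold variability check could look. `estimate_min_dim` is a hypothetical routine standing in for a grid search over dimensions (plus elbow detection) on a qrel matrix:

```python
import numpy as np
from sklearn.model_selection import KFold


def dim_across_folds(rel, estimate_min_dim, n_splits=5, seed=0):
    """Estimate the minimum dimension separately on query subsets (folds over
    rel's rows, which are queries in the convention of the earlier sketches)
    and report how much the estimate varies."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    dims = []
    for _, fold_idx in kf.split(np.arange(rel.shape[0])):
        dims.append(estimate_min_dim(rel[fold_idx]))
    return float(np.mean(dims)), float(np.std(dims, ddof=1))
```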

What’s next?

Observing Patterns in the Real World

This has been alluded to in the previous sections and experimented with to a small extent. Beyond just estimating the dimensions required, it would be nice to see whether the predicted pattern is reflected in real performance. Specifically, we would expect performance gains to diminish around the theoretical limit. This would likely present as a knee around that number of dimensions; in other words, performance should stop increasing around the value found.

In practice this would look like performing free-embedding experiments over BEIR (and/or other IR datasets), followed by performing the tasks with embeddings of different dimensions. We would expect knees to occur near the theoretical limit in the performance chart. Maybe there is no knee, or it is shifted, but that in and of itself would be an interesting finding. Even a negative result like this could be used to understand how efficient learned representations are: if we can compute tight bounds on the dimensions required, we can ask how close our representations actually come to those bounds.
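One simple way to locate such a knee is the max-distance-to-chord heuristic (the idea behind the kneedle algorithm). This is my own quick version, not anything from the paper, and it assumes the scores are not all identical:

```python
import numpy as np


def find_knee(dims, scores):
    """Return the dimension where the performance curve bends the most: the
    point with maximum distance to the straight line connecting the first and
    last (dim, score) points, after normalizing both axes to [0, 1]."""
    x = np.asarray(dims, dtype=float)
    y = np.asarray(scores, dtype=float)
    xn = (x - x.min()) / (x.max() - x.min())
    yn = (y - y.min()) / (y.max() - y.min())
    p0 = np.array([xn[0], yn[0]])
    chord = np.array([xn[-1], yn[-1]]) - p0
    chord /= np.linalg.norm(chord)
    # Perpendicular distance of each point to the chord (2D cross product).
    vx, vy = xn - p0[0], yn - p0[1]
    dist = np.abs(vx * chord[1] - vy * chord[0])
    return dims[int(np.argmax(dist))]

# e.g. find_knee([64, 128, 256, 512, 1024, 2048], accuracy_per_dim)
```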

Computable Sign Rank or Better Estimates of its Bounds

I'm unaware of the other uses of sign rank, but if this theory is correct, a generalized and efficient way of computing (or tightly bounding) sign rank over large sparse matrices would be very important to IR practitioners. It would allow us to quickly estimate the minimum dimensionality required. This falls outside of my area of expertise, but it would be of immense value to practitioners given that the reliability of the free-optimization method is unclear at the moment.

Synthetic Estimates of Dimensionality

Assuming that we could reliably compute bounds and that they are accurate, could we use synthetic approaches to estimate the true dimension needed for a problem given a small sample? Let's say I only have 100 examples; would it be possible to synthetically grow $rel$ and get a good estimate of the true dimensionality required for a given task? Essentially, figure out whether we can approximate $p(y|x)$ and whether that gives us reliable estimates of $d$.

Alternative Uses of Single Embeddings

The authors, having found "major limitations" of embeddings, discuss alternatives. These include sparse representations and multi-vector representations. However, given the weaknesses outlined above, it's unclear to me that such a limitation has even been demonstrated.

Assuming a fundamental limit does exist, even using the estimates presented by the paper, a large number of documents could still be represented by a single vector. It could make sense to subdivide a corpus into different areas, or to keep multiple representations for a single document that are used under different circumstances. It is worth noting that at that point other representations might make more sense; the choice heavily depends on the specific problem and the type of similarity required.

Generalizing the Theorem to Reranking

On a tangent, let's accept the premise that single embeddings have a limit on the number of documents they can represent and that it is too small for most use cases. Say we decide to rely primarily on rerankers; there are still practical reasons to use single-embedding models (speed, software support). What if, instead of using single vectors to get the top-k directly, we just used them to feed candidates to a reranker? Instead of getting the top-k exactly right, all that matters is that the top-k appear somewhere in some larger top-k'. This would open the door to using single vectors as efficient first-stage representations and sizing them so we have guarantees on k'. For example, guaranteeing the top-k land in the top-k' could be highly valuable and would let us estimate worst-case costs for the rerankers.
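As a quick illustration of what "top-k within some larger k'" would mean operationally, here is a small helper (my own sketch of the metric, not something from the paper) that measures, per query, whether all of the true top-k documents survive a first-stage cut to the single-vector model's top-k':

```python
import numpy as np


def topk_within_topk_prime(scores, rel, k=2, k_prime=100):
    """Fraction of queries whose k relevant documents all appear in the
    first-stage top-k' retrieved by the single-vector scores.

    scores: (num_queries, num_docs) similarity scores from the embedding model.
    rel:    (num_queries, num_docs) binary relevance, assumed k ones per row.
    """
    retrieved = np.argsort(-scores, axis=1)[:, :k_prime]
    hits = []
    for q in range(rel.shape[0]):
        relevant = set(np.flatnonzero(rel[q])[:k])
        hits.append(relevant.issubset(set(retrieved[q])))
    return float(np.mean(hits))
```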

Closing Remarks

This paper is a step in the right direction, providing theory to guide design choices when it comes to text representations. However, it fails to provide practical insight. The proof itself seems sound, but bluntly, the practical findings, practical contributions, and conclusions of the paper are, in my opinion, at best deeply flawed.

I think this paper should be followed up with much more thorough experiments over existing data to see whether traces of this theorem can be detected in practice. Even if we can't see a correlation between estimated bounds on dimensions and performance, that in and of itself is interesting; it points to other issues we may need to fix. Until that's done, the proof is neat but may not have any use for practitioners. In short, based on this paper alone there is no clear reason to abandon single embedding vectors or to increase their size endlessly. For now, keep using empirical measurements of performance over your data and evaluate the design tradeoffs of any potential alternatives, such as multi-vector or sparse representations.

Uncited References

https://news.ycombinator.com/item?id=45068986
https://arxiv.org/abs/2407.15462
https://arxiv.org/abs/2407.13218
https://arxiv.org/pdf/2205.13147

