A perennial task at work is mapping, specifically risk-to-control mapping. We’ve used mapping models across multiple areas of regulation, clients, and tasks. Even when a model is deployed on the same task, performance drops for new combinations of regulation, client, or both. There are ways of overcoming this, like additional training, but ideally models would work just as well regardless of the source of the text. There may be multiple causes for the drop, such as slight differences across clients in what counts as a good match (i.e., changes in p(y|x)), but IMO this likely isn’t the primary culprit. Transfers between clients or areas of regulation often introduce new language that the underlying model has limited or no exposure to, for example changes in word choice or acronyms that only make sense given external context. These issues are common in regulatory text and client-specific documents. Even for mappings with the same annotators, there can be drops for new “areas” (i.e., it’s a safe assumption that p(y|x) is the same and p(x) is the only change).

One approach to solve this would be model-focused (i.e., training the model to be more robust to domain shift). However, this blog post focuses on a data-centric approach, specifically normalization.

Description of approach

While doing error analysis on mapping tasks, what stood out was the number of acronyms, spelling errors, and stray words in other languages in the examples that performed poorly. Success in this context is measured by improved generalization to new “domains” (defined by area of regulation and client). A while ago I attended a colloquium where one of the authors of DictBERT presented their work. Assuming that rare or otherwise unknown words are a cause of the performance drops, a word replacement approach like DictBERT’s seemed viable, especially for the smaller encoder models we typically use for mapping.

A replacement-based normalization approach has a handful of advantages:

  1. it can be modified for new clients
  2. it runs quickly
  3. it is “interpretable” to some degree (we monitor the replacements and measure their effect on certain key metrics)

There are some downsides, primarily:

  1. annotation cost: unlike the original work, we can’t always rely on an existing dictionary, since acronyms are often domain-specific and require annotation
  2. replacements are not dynamic (the best replacement might change with context)
  3. it isn’t entirely clear which words should be replaced (DictBERT uses a heuristic to select them)

Issue 2 is a fundamental limitation of a replacement-based approach, but there are ways to mitigate the risk. Issue 1 can be addressed with models (specifically via prompting), and issue 3 can be worked out through trial and error.

Architecture

A pipeline is used to normalize text. There are three primary components to the pipeline (a rough sketch of the first two follows the list):

  1. Tokenizer. A simple regex that produces word-like tokens along with their character offsets, which are used later for insertion
  2. Replacement dictionary. It contains the replacements and additional metadata about them (e.g., ambiguity)
  3. Downstream model, which uses the replaced text (typically an encoder like BERT)
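
To make this concrete, here is a minimal sketch of the first two components. The regex, the ReplacementEntry fields, and the example entry are illustrative assumptions, not the production implementation.

import re
from dataclasses import dataclass

# Simple word-like tokenizer: each token comes back with its character offsets
# so replacements can be re-inserted into the original string later.
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9'-]*")

def tokenize(text):
    return [(m.group(0), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]

@dataclass
class ReplacementEntry:
    replacement: str   # e.g. "Asia Pacific"
    kind: str          # e.g. "acronym", "spelling", "rare_word"
    ambiguous: bool    # ambiguous entries can be skipped or sent for review

# The replacement dictionary maps a lower-cased surface form to its entry.
replacement_dict = {
    "apac": ReplacementEntry("Asia Pacific", "acronym", ambiguous=False),
}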

Training

The pipeline has a “training phase” that selects which words need to be replaced and what their replacements should be. The current training flow is as follows:

flowchart TD
    %% Training pipeline
    T1[Raw Text] -->|raw string| Tok[Tokenizer]
    Tok -->|tokens + char positions| Meta[Create Metadata]
    Meta -->|token list, surrounding W-word windows| Sel["Word Selection (tf/df filter)"]
    Sel -->|candidate tokens + their windows| Gen[Replacement Generation]
    Gen -->|up to 20 W-word windows → model predicts: type, ambiguity, replacement, justification| Post[Post‑processing]
    Post -->|filtered replacements| Ins[Insertion]
    Ins -->|original text with replacements applied| Out[Output Text]

    classDef stage fill:#f9f9f9,stroke:#333,stroke-width:1.5px;
    class Tok,Meta,Sel,Gen,Post,Ins,Out stage;

There is still room for improvement in this flow. The next step is integrating annotation to verify the model outputs (e.g., replacement, ambiguity). Moreover, the post-processing step does not currently handle recursion (rare words appearing inside the replacement of another rare word).
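
To illustrate the replacement-generation step, here is a hedged sketch of how the model call could be structured. Only the output fields (type, ambiguity, replacement, justification) come from the flow above; the prompt wording is an assumption, and call_llm is a placeholder for whatever completion client is used (GPT-4o in our case).

import json

PROMPT_TEMPLATE = """You will see a candidate word and up to 20 short text windows
in which it appears. Decide whether it is an acronym, a misspelling, a rare word,
or something else, and propose a normalized replacement if one exists.

Word: {word}
Windows:
{windows}

Respond with JSON: {{"type": "...", "ambiguous": true/false, "replacement": "..." or null, "justification": "..."}}"""

def generate_replacement(word, windows, call_llm):
    """`call_llm` is any str -> str completion function (e.g. a GPT-4o wrapper)."""
    prompt = PROMPT_TEMPLATE.format(
        word=word, windows="\n".join(f"- {w}" for w in windows[:20])
    )
    result = json.loads(call_llm(prompt))
    # Post-processing: drop ambiguous candidates and candidates with no usable replacement.
    if result.get("ambiguous") or not result.get("replacement"):
        return None
    return result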

Inference

After the replacements have been learned, the next step is to apply the dictionary to the inputs of the downstream task. The flow looks as follows:

flowchart TD
    %% Inference pipeline
    I1[Raw Text] -->|raw string| ITok[Tokenizer]
    ITok -->|tokens + char positions| IRep[Replacement Engine]
    IRep -->|original text with replacements applied| IOut[Text for Downstream Task]

    classDef stage fill:#f9f9f9,stroke:#333,stroke-width:1.5px;
    class ITok,IRep,IOut stage;
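
A minimal sketch of the replacement engine, reusing the tokenize function and replacement_dict sketched earlier; working right to left keeps the character offsets valid as the string changes.

def apply_replacements(text, replacement_dict):
    """Apply the learned dictionary; work right to left so character offsets stay valid."""
    tokens = tokenize(text)  # (token, start, end) triples from the tokenizer sketched earlier
    for token, start, end in reversed(tokens):
        entry = replacement_dict.get(token.lower())
        if entry is not None and not entry.ambiguous:
            text = text[:start] + entry.replacement + text[end:]
    return text

# apply_replacements("Controls for the APAC region", replacement_dict)
# -> "Controls for the Asia Pacific region"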

Preliminary Results

The results are on proprietary data, so the underlying data can’t be shared. At a high level the approach appears to work, but the effect is small. Multiple splits are used to better understand the effect of the data. The downstream mapping model is a BERT-based architecture, with a fair number of variants.

Word replacement was used in two settings:

  1. four held-out splits over different sets, with models trained on unexpanded data
  2. four held-out splits over different sets, with models trained on expanded data

For the word replacement, only words that occur in more than 2 documents and more than 5 times overall are considered. About 22k rare words were found. GPT-4o was used to generate the replacements. Roughly a third of the candidates ended up without a replacement, so approximately 3.5% of words were replaced on the held-out splits. Based on a subjective analysis, the missing replacements were due to ambiguous candidates or candidates without enough context to determine their meaning.
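
For reference, the frequency filter described above could look roughly like the sketch below. How a word is judged “rare” in the first place (e.g., absence from a common-word vocabulary) is a separate step not shown here.

from collections import Counter

def select_candidates(tokenized_docs, min_df=3, min_tf=6):
    """Keep tokens that appear in more than 2 documents and more than 5 times overall,
    so there are enough context windows to pin down a meaning."""
    tf = Counter(tok.lower() for doc in tokenized_docs for tok in doc)
    df = Counter()
    for doc in tokenized_docs:
        df.update({tok.lower() for tok in doc})
    return {w for w in tf if df[w] >= min_df and tf[w] >= min_tf}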

In setting 1, replacement leads to a 1–2% improvement in recall@k for k = [1, 2, 5, 10, 20] on domain-specific models. Interestingly, commercial models don’t improve; more fine-grained analysis is needed to understand why. In setting 2 there are further improvements of 2–4%, likely because similarity to the post-training data has increased.
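
For clarity, recall@k here is the standard definition: the fraction of relevant items that appear in the top k of the ranking. A minimal sketch:

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

# e.g. [recall_at_k(ranking, gold_controls, k) for k in (1, 2, 5, 10, 20)]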

Caveats

Bluntly, the data I work with is unique, and I highly doubt these gains are universal. The task is rather different from most IR or even similarity tasks, and the data often has a fair number of rare acronyms. Your mileage will vary, and currently I have no way of knowing in advance when this approach is worthwhile for a given dataset.

Suspected Mechanisms (Hypotheses as to Why this Works)

I think there are multiple reasons replacement could work. I’m proposing three mechanisms of action here so that future experiments can focus on determining why and when normalization through word replacement helps.

Distributional shift

Word insertion/replacement might reshape the text so that its distribution is more similar to the pre-training or post-training distribution(s) of the downstream model. An argument could be made that this happens implicitly, since the replacement model (being a language model) will likely favor web-like words and phrases from its own pre-training corpus, but I have no proof of this. The main counterargument is that the model used to replace words may not be the same one doing the downstream task: GPT-4o and BERT overlap in training data, but GPT-4o has a much larger corpus and likely a different distribution.

Measuring the shift created by word replacement requires a definition of domain. For our purposes we’ll use the token distribution, since this makes it easier to shape the output. For open-source models with available training data, we can measure the token distribution and use logit biases to make replacements more similar to the pre-training corpus. Similarity to pre-training data has been shown to improve performance across a number of tasks, though that previous work is on decoders rather than the encoders used here for mapping. The corpus could also be made more similar to the fine-tuning data. In general, shaping the replacements toward a target distribution can be achieved through logit biasing.
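
As a sketch of the logit-bias idea, assuming an API that accepts a logit_bias map keyed by token id (as the OpenAI chat completions API does), one could upweight tokens that are frequent in whichever corpus the replacements should resemble. The tokenizer choice and the frequency-to-bias scaling below are illustrative assumptions.

import math
from collections import Counter
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def build_logit_bias(reference_texts, top_k=200, max_bias=2.0):
    """Small positive biases for the most frequent token ids in a reference corpus,
    nudging the replacement model toward that distribution without forcing tokens."""
    counts = Counter(tid for text in reference_texts for tid in enc.encode(text))
    bias = {}
    for tid, c in counts.most_common(top_k):
        # Grows slowly with frequency and is capped; the scaling is an illustrative choice.
        bias[tid] = min(max_bias, 0.25 * math.log1p(c))
    return bias

# The resulting dict can be passed as `logit_bias` to an API that supports it.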

To measure the effect of the change in distribution, I would propose measuring the change in similarity to both the pre-training and post-training distributions and how that correlates with performance. In other words, do replacements that make a given item more (or less) similar to the pre-training distribution improve performance? Both quantities can be framed as binary:

  1. whether or not a text is closer to a given distribution after replacement
  2. whether or not the mapping metric (let’s say NDCG) improves

Framing the results as binary is naive, but it allows for a simple statistical test; the phi coefficient would work well here. This assumes the replacements actually move some items closer to the distribution and some further away. In all likelihood it would make more sense to use a continuous measure of correlation between each document’s similarity to the distributions and its performance.
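
A sketch of the proposed test: build the 2x2 contingency table over items (closer to the reference distribution vs. not, metric improved vs. not) and compute the phi coefficient directly.

import math

def phi_coefficient(closer, improved):
    """Phi coefficient between two binary variables:
    closer[i]   - item i is nearer the reference distribution after replacement
    improved[i] - item i's mapping metric (e.g. NDCG) went up"""
    a = sum(1 for x, y in zip(closer, improved) if x and y)
    b = sum(1 for x, y in zip(closer, improved) if x and not y)
    c = sum(1 for x, y in zip(closer, improved) if not x and y)
    d = sum(1 for x, y in zip(closer, improved) if not x and not y)
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0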

Injecting Knowledge

Beyond the distribution of words, sometimes knowledge is simply missing. Acronyms or other words rarely or never seen during pre-training or post-training can be expanded into more common forms, which effectively injects external knowledge that the base model may not reliably “infer” on its own. This matters because rare or domain-specific words likely carry the most “information”; this is the principle behind tf-idf (favoring rare words). If critical rare words are missed, the task can’t be done effectively. For example, if a document refers to a region but uses an acronym, the mapping will miss it (e.g., APAC vs. Asia Pacific). While surveying the data I found a few sources of rare words. These categories are not exhaustive, but they cover most observed cases:

  1. Errors
    1. Spelling mistakes
    2. Combined words
  2. Acronyms
  3. Rare words
  4. Other
    1. Translations

The first question is the impact of each category and the relative improvement from each; potentially some types of information are more critical or useful than others. A simple analysis relating the number of changes of each type to performance would make sense here, as sketched below.
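
A hedged sketch of that analysis: aggregate, per document, the number of replacements of each type and the change in the mapping metric, then summarize the metric change by category. The record format is an assumption.

from collections import defaultdict
from statistics import mean

def improvement_by_category(records):
    """records: one dict per document, e.g.
    {"counts": {"acronym": 3, "spelling": 1}, "delta_metric": 0.02}
    Returns the mean metric change across documents containing each replacement type."""
    deltas = defaultdict(list)
    for rec in records:
        for category, n in rec["counts"].items():
            if n > 0:
                deltas[category].append(rec["delta_metric"])
    return {cat: mean(vals) for cat, vals in deltas.items()}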

More Efficient Representations of Input Text

Some words are not well tokenized. Replacing poorly tokenized words with more common equivalents may reduce representational inefficiency while producing more meaningful tokens. Subjectively, I do think the replacement tokens are more meaningful:

APAC
[AP][AC]

Asia Pacific
[Asia][ Pacific]

There are counterexamples where the original form produces fewer tokens; see the examples below:

AML
[AML]

Anti Money Laundering
[Anti][ Money][ Laundering]

FATF
[F][AT][F]

Financial Action Task Force
[Financial][ Action][ Task][ Force]

Regardless, the tokens do appear meaningful. I think there are two separate claims here that need to be measured. The first is the correlation between changes in length and performance. Given that documents may have multiple changes, there are confounding variables, so it may be worth looking only at global changes. To measure correlation, we’ll again frame the problem as binary: texts that stay the same length or get shorter vs. those that get longer, and whether there is a correlation with performance.

The meaningfulness of the tokens is trickier to measure and define. My only thought here is to measure how many sub-words are used (i.e., the fertility of the tokenizer). For now, I’d focus on the first measure; a sketch of both measurements is below.
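
A small sketch of how both measurements could be run with a BERT-style tokenizer from Hugging Face transformers (an assumption about tooling): count subword tokens before and after replacement, and divide by the word count for a crude fertility estimate.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def token_stats(text):
    """Return (subword token count, tokens per whitespace word, i.e. fertility)."""
    n_tokens = len(tokenizer.tokenize(text))
    n_words = max(1, len(text.split()))
    return n_tokens, n_tokens / n_words

# Compare e.g. token_stats("FATF") with token_stats("Financial Action Task Force")
# to check whether a replacement shortens or lengthens the input and how fertile each form is.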

Next Steps

This post presented a word-replacement-based pipeline to normalize text and improve generalization for mapping. Word replacement is simple, fast, and partially interpretable, and in preliminary experiments it led to small but consistent recall improvements on held-out domains, especially when combined with data expansion. The idea is that some of the performance drop in new settings comes from acronyms, rare terms, spelling errors, and client-specific phrasing the model hasn’t seen. The gains are small and likely task-specific, but they suggest normalization can be a useful tool in some settings.

To better understand when and why this approach works, the next step is to measure the underlying mechanisms of action. Specifically, I suspect they are: changes in token distribution relative to pre-training or fine-tuning data, explicit injection of missing knowledge through acronym or rare-word expansion, and changes in tokenization “efficiency”. Each of these mechanisms can be measured, for example by correlating distributional similarity or global changes in text length with downstream performance, or by breaking improvements down by replacement type. Note that the data used here has a high density of rare acronyms and domain-specific language, so it is important to replicate this work on non-proprietary datasets and more standard benchmarks to see how general these results are. The broader goal is not to claim that word replacement is a good general technique, but to better characterize the cases where normalization improves generalization and to make that decision more principled.

