At work I've become more involved in supervising and aiding with model deployment (and redeployment). Deployment often involves domains different from the training data, i.e., OOD (out of domain) settings. During model development, automated metrics and measures are used to evaluate and tune models, and these can be mismatched with users' expectations (i.e., real-world performance), particularly when OOD. In practice these models don't work as well in new domains and performance isn't always up to snuff. This is where a good evaluation strategy and troubleshooting techniques for addressing performance mismatches come into play.

This post isn't about measuring general model capabilities, but rather performance on a specific task that primarily uses text as input; only problems framed as classification will be discussed because classification performance is easier to quantify and even qualify. While this may sound like a major limitation, evaluation of open-ended tasks such as summarization can often be framed as classification (e.g., checking factuality) or as LLM-as-a-judge, so this framework can help with generative workflows as well.

IMO there's a lack of practical, actionable guidance that goes beyond the standard advice of using a train/val/test split and checking metrics. Papers aim to present methods, and even surveys, while very useful, are written for researchers rather than practitioners. They can serve as a starting point, but often leave basic (i.e., unpublishable) yet viable techniques and know-how unstated. Blogs do exist, but often discuss each of the topics below independently rather than as a toolkit. To that end, the goal of this post is to discuss how to make in situ (i.e., real world) performance for classification problems a known quantity. At the end of this post you should be able to:

  1. Identify issues that can cause a mismatch between desired, measured, and actual performance
  2. Choose tools for tackling these issues
  3. Apply a practical approach for identifying and addressing the underlying cause

Sources of Error

Based on my experience and reading I believe there are primarily three types of errors that can cause poor in situ performance.

  1. Evaluation errors
  2. Model errors
  3. Data errors

Note that these errors likely won't exist in isolation; having one of these issues can obscure the other(s) or even cause downstream issues. For example, poor data quality may make training a model harder, and poorly chosen metrics may make a model look like it isn't under-fit. So while these issues are discussed separately, that may not be the case in practice. The distinctions drawn are meant to make reasoning about, discussing, and debugging causes of poor performance easier.

Evaluation Errors

I would define evaluation errors as shortcomings in how model performance is measured, in particular through automated metrics. Since models are mostly, if not always, selected based on automated metrics, any mismatch between user expectations and what the measures/metrics/benchmarks capture can lead to a mismatch in performance. In my experience these issues arise primarily from two things:

  1. Evaluation set construction issues
    1. test-train bleed; having training examples in the eval means the model has seen the “answers” and will artificially increase performance numbers
    2. lack of real world data in splits; OOD performance isn’t measured in this case and models tend to have lower OOD performance
    3. use of single splits; these provide no information about impact of real world trends on performance (i.e., not all OOD is the same)
  2. Inappropriate choice of metrics
    1. metrics that give artificially high values
    2. metrics that are not invariant to imbalance; e.g., accuracy
    3. metrics that are not robust to higher number of classes; e.g., accuracy
    4. metrics don’t reflect what you are actually interested in

Data Errors

Data errors are issues due to the composition or quality of data. Poor quality data may not even reflect the underlying task well or may be hard to train on altogether.

  1. Problem Formulation Errors:
    1. Unsolvable examples given the available data; e.g., critical information is missing, such as a sentiment analysis example where the part of the review carrying the sentiment is missing
    2. Labeling errors; instructions are unclear or otherwise conflicting about what to label when, which leads to confusion and inconsistencies within the data
  2. Lack of coverage; i.e., narrow domain/lack of diversity. Note this is related to, but distinct from, evaluation set construction issues: just because the eval set is well constructed doesn't mean the train set is as well

Model Errors

Model errors are related to training itself, rather than how models are evaluated or the data the model is trained on. Basically your model stinks and it has to do with training. This could include:

  1. poorly framing the problem; e.g., use of regression or a classification scheme that is too complicated
  2. class imbalance; when improperly handled, a mismatch between the sizes of the different classes can cause issues
  3. poor feature selection; if features necessary for the task are present in the data but not given to the model, why would the model work properly?
  4. under-fitting; the model isn't trained enough to work properly, e.g., training loss is high due to poorly selected hyper-parameters

Tools to Fix and Troubleshoot Issues

Now that we've outlined the main causes of poor real-world performance, we'll cover tools and strategies to identify and fix them. These are organized roughly by where they fall in the model development lifecycle, though they're not locked to that order; use what's appropriate given your data and stage of development. When to use these tools is discussed in the next section; this section just covers what tools exist.

Some tools, like interpretability methods, are omitted because I haven't found them broadly useful for this kind of troubleshooting and frankly lack experience using them. Interpretability makes sense to me in the context of tabular data, but given how commonplace embeddings are in NLP I'm unsure how to employ current techniques effectively. If that changes, I'll update this post accordingly.

Annotation

This is the starting point of any classification problem. Poor annotation leads to data problems and IMO data problems are easier to prevent than to fix. If you’re inheriting a dataset, consider re-annotating part or all of it, especially if:

  1. It's a public dataset; reannotation efforts have shown how noisy public labels can be
  2. Examples were labeled by a single annotator, which removes a fair number of checks and QA done during the process
  3. It relies heavily on synthetic data or weak labels that were not validated by humans

When creating your own dataset, IMO quality trumps size, especially for evaluation/validation sets. I've heard people say that something is better than nothing; while that is true to some extent, if your method of measuring performance is misleading you may have been better off just using "vibes". When creating a dataset here's what I aim for:

  1. No malformed or unusable inputs (e.g., empty strings, corrupted files)
  2. Clear task definitions and label guidelines
  3. A feedback loop between data annotation and early model performance

To achieve this, I typically use a workflow like this:

  1. Initial gathering and cleanup of data
    • High-level task design, specifically identifying required inputs
    • Identifying all relevant sources of data
    • Remove unusable inputs (too much noise, text too short, invalid metadata)
    • Deduplicate where appropriate; repeated inputs are likely to over-represent specific examples
    • Start with a small pilot and use it to flush out issues with examples, prepare “gold datasets”, and establish QA processes
  2. Annotation process
    • Have “gold” examples you can use as a reference to detect poor annotators. You may want to use these to actively stop poor annotators to reduce the cost of the process
    • Feed previous examples back to annotators to test intra-annotator consistency
    • Write clear, detailed instructions that not only address the process in general, but cover edge cases adequately
    • Enable flagging of unclear or ambiguous examples. In general have methods of taking and analyzing user feedback
    • Use multiple annotators per item. I use 3–4; it doesn't need to be an odd number, since you aren't using majority voting, you are looking for disagreement
    • Focus first on building a high-quality evaluation set, then move on to training data
    • Limit the time annotators can spend in one go or in total. This can help avoid long stretches of poorly annotated data
  3. Validation
    • Check inter-annotator agreement and filter hard data. IRT (item response theory) deals with this; no need to go overboard
    • Measure intra-annotator inconsistency through repeated examples. This may be a sign of poor annotators or fatigue
    • Remove annotators that performed poorly on gold references (ideally you would do this dynamically so as not to incur the cost of poor examples)
    • Review data distribution (length, complexity, modality, etc.) to ensure coverage of all relevant examples
    • Spot-check with manual review
    • Potentially relaunch annotation with lessons learned or improved rules
    • Actually review poor examples and have a panel of people to decide whether or not they should be included

I know this seems like a lot, but poor data has pernicious effects. It can run the gamut from total failures due to missing classes down to specific examples being misclassified, which can be hard (read: impossible) to detect. It's better to spend the time upfront making sure the data is good. Don't forget annotation is not a set-and-forget thing. Data drift occurs, and you may better define (or entirely redefine) your problem over time. You may also notice issues, mistakes, or changes you'd like to make. It is OK to start again, or to relabel just parts of your data.

Quality Control and Data Exploration

Once you have data, quality control and understanding what is present in the data are important. That having been said, I think the techniques used to explore data are often overkill for the purposes of QC. I've seen people spend weeks doing data analysis instead of labeling, manually reviewing, and just training models, which would have helped them understand the data better.

The following subsections focus on how to properly sample and inspect data; this is a three-step process:

  1. Sampling
  2. Manual analysis
  3. Labeling poor data

Sampling

There are different sampling methods, below is a table discussing them

| Method | When to Use | Strengths | Caveats / Notes | How to Do It |
| --- | --- | --- | --- | --- |
| Random Sampling | General-purpose use; when you want an unbiased snapshot of the dataset | Captures dominant patterns; simple to implement | Rare or subtle patterns may be missed if sample size is small | Use `random.sample()` or equivalent; ensure sample size is sufficient to detect trends |
| Metadata-Based Sampling | When data has meaningful human structure (e.g., document sections, categories) | Allows targeted inspection; works well with structured or parsed data | Requires meaningful metadata; may need iterative refinement (e.g., upsampling problematic areas) | Sample uniformly across metadata categories (e.g., subsections); optionally bias toward problem areas in a second pass |
| Distance-Based Sampling | When you want to detect redundancy, edge cases, or outliers | Good for finding duplicates, out-of-distribution (OOD) samples, or structural imbalances | Needs a meaningful distance metric (e.g., embeddings); clustering may be overkill in many scenarios | Compute pairwise distances or use dimensionality reduction + clustering; sample from fringes or dense areas |
| Class-Based Sampling | When class imbalance exists or you want to assess intra-class consistency | Ensures minority classes are represented; helps spot class-specific issues | Assumes label availability; may not reveal input-level errors outside of specific classes | Stratify by label; sample equally or proportionally across classes, especially minority ones |

You can mix and match methods, i.e., take the union of samples from several methods. The size of the sample required can be estimated by calculating the desired power, margin of error, confidence level, etc. [Sample size determination](https://en.wikipedia.org/wiki/Sample_size_determination) can be quite complicated, but practically speaking take a size that is statistically meaningful yet can still be checked by you/your team. I tend to sample 50-300 examples, but that's not to say you should do the same; the sample size chosen isn't always the most principled statistically speaking.
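To make this concrete, here is a minimal sketch of mixing random and class-based sampling for a review set. It assumes a pandas DataFrame `df` with a `label` column; the function name and sample sizes are illustrative, not a prescription.

```python
import pandas as pd

def sample_for_review(df: pd.DataFrame, n_random: int = 150, n_per_class: int = 20) -> pd.DataFrame:
    # Random sampling: an unbiased snapshot of the dataset.
    random_part = df.sample(n=min(n_random, len(df)), random_state=42)

    # Class-based sampling: make sure minority classes show up in the review set.
    per_class = (
        df.groupby("label", group_keys=False)
          .apply(lambda g: g.sample(n=min(n_per_class, len(g)), random_state=42))
    )

    # Mix and match: union of both samples, deduplicated by index.
    return pd.concat([random_part, per_class]).loc[lambda d: ~d.index.duplicated()]
```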

Manual Analysis

With a sample picked out, the next step is to make sure the task is feasible. IMO the best way to validate data is to try to perform the task yourself on the sample. If you can't do the task, your model probably can't either. Some of you might reasonably say your data requires expertise you don't have. While this is fair and may be a limiting factor in some cases, it is worth trying to learn the underlying process. I'm not a compliance expert, but through years of working on problems in the space I can read regs and evaluate controls at a basic level, which helps with troubleshooting. I recommend you do the same; I personally feel that the idea of the machine learning expert revolutionizing a field with no understanding of it, just through their toolkit, is naive. If need be, involve an expert in this step and get their feedback on the current data.

The goal here is to develop a list of labels for types of noise or labeling issues (see the next section for ideas).

Labeling Poor Data

At the risk of being a bit meta, the idea is to label the labeled data. This can help flag potential issues in the annotation instructions and prevent downstream issues caused by poor labels. Different schemes can be employed for labeling poor examples. A binary scheme could work. More specific classes can be used as well. I personally tend to use the following:

  1. Mislabeled
  2. Unsolvable example
  3. Noisy input
  4. Ambiguous example, i.e., multiple classes could apply here

Split Construction

QCing the data itself is just the first step; another is verifying the dataset splits. You want splits that reflect the change in performance expected when moving to a new domain. Incorrect split construction can give artificially high results. I think two things matter here:

  1. proper split differentiation
  2. the use of multiple validation splits (Fine-grained splits)

Testing for Split Overlap

Check whether your train and test sets are truly distinct. Overlap, commonly called train-test bleed, and especially near-duplicates, can inflate performance. Techniques like text similarity scoring or embedding-based nearest-neighbor checks can help here. You can use exact match against the val set to remove overlapping examples from the train set, or looser methods (e.g., n-gram overlap, embeddings, etc.). I would just check for large overlaps (99.5%+ Jaccard similarity).
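Here is a rough sketch of the kind of near-duplicate check I mean, using character n-gram Jaccard similarity between eval and train texts. The function names, threshold, and the brute-force pairwise loop are my own illustration (fine for modest dataset sizes, not optimized):

```python
def ngrams(text: str, n: int = 5) -> set:
    # Character n-grams; word n-grams work just as well.
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def flag_probable_bleed(eval_texts, train_texts, n: int = 5, threshold: float = 0.995):
    # Flag eval examples whose nearest train example has near-total n-gram overlap.
    train_grams = [ngrams(t, n) for t in train_texts]
    flagged = []
    for i, text in enumerate(eval_texts):
        grams = ngrams(text, n)
        best = max(
            (len(grams & tg) / len(grams | tg) for tg in train_grams if grams | tg),
            default=0.0,
        )
        if best >= threshold:
            flagged.append((i, best))   # likely train-test bleed; consider dropping from train
    return flagged
```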

You can also use statistical methods to check how "similar" splits are more globally; KL divergence over n-gram hashes could work here. To me it is unclear whether these techniques tell you anything meaningful, so I leave them alone for now. The choice of method, underlying representation, etc. affects the results so much that it becomes easy to torture the data and find misleading trends.

Fine-grained Splits

Beyond the standard train/val/test or k-fold setup, you can use tailored splits to diagnose specific issues. This would be akin to cohort analysis in data analytics, but done ahead of time. Example splits include, but are not limited to:

  • holding out a specific source of data and seeing how well the model generalizes to it
  • separating out recent or historic data
  • harder examples that are statistically unlikely, but correct, such as [green bananas](https://ar5iv.labs.arxiv.org/html/1906.10169)

Fine-grained splits are especially useful when you suspect OOD performance varies depending on which domain you're shifting into, for example when transferring models between clients or types of regulation. Ideally there would be a data-based method to estimate the shift, but having specific splits with held-out data works well in a pinch. In practice, I tend to use 2-3 splits for each type of shift I'm concerned with. For example, if I'm concerned with shifts between types of regulation, jurisdiction, and the underlying source of the documents, I would have 6-9 splits, each one holding out data with a new regulation, a new jurisdiction, or a new source. Having multiple splits for each type of shift gives an idea of the variability for a given shift (e.g., not all new sources are equally hard).
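A minimal sketch of building such held-out splits from a metadata column, assuming a pandas DataFrame with one column per shift type (the column and function names are illustrative):

```python
import pandas as pd

def held_out_splits(df: pd.DataFrame, shift_column: str, n_splits: int = 3) -> dict:
    # Build one held-out eval split per value of the metadata column we expect to shift on
    # (e.g., "source", "jurisdiction", "regulation_type"); column names are placeholders.
    values = df[shift_column].value_counts().index[:n_splits]
    splits = {}
    for value in values:
        eval_split = df[df[shift_column] == value]     # "new" domain, never seen in training
        train_split = df[df[shift_column] != value]
        splits[f"{shift_column}={value}"] = (train_split, eval_split)
    return splits
```

Repeating this for each shift type you care about gives you the 2-3 splits per shift described above.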

Metrics/Measures

Books could be written on metrics and they still might not cover all the intricacies. This section aims to highlight some key lessons about metric selection, at the risk of not covering everything.

Aggregate Metrics/Measures Lie (Sometimes)

Often when comparing and selecting models a single number is used. This is achieved, implicitly or explicitly, through aggregation. Examples of explicit aggregation include, but are not limited to, values like F1, which is the harmonic mean of precision and recall, and MAP (mean average precision). Sometimes people present averages of averages when looking at performance over a benchmark.

Here's a concrete example of when this can be an issue. We often need to "map" risks to controls, in other words make sure there are procedures in place to identify and address business risks that arise from operating a business. We often frame this as a retrieval problem. In this context there is a high cost to missing a mapped control (e.g., failing to identify that a risk is already well addressed could take days of people-work to create a new control, on top of the problems that come with duplicated controls), while retrieving extra documents, though annoying, may take just a few minutes each to triage. For mapping, high recall is more important. However, if an aggregate measure like F1 is used, two models with different recalls can wind up with similar scores.

Implicit aggregation applies to most metrics (including precision and recall) simply because we are summing and, in a sense, averaging results. As a rule of thumb, if you go from many numbers to one, you are likely losing information. This isn't necessarily a bad thing, as long as the aggregation doesn't affect the ability to do the downstream task. However, for a lot of commonly used metrics this isn't the case. A good example is accuracy.

Overall accuracy reflects performance well under specific conditions:

  1. Balanced classes
  2. Few classes

If classes are not balanced, you can get a high accuracy yet perform poorly compared to a random baseline (see below). This is a well known issue, often discussed in the context of rare-event detection like credit-card fraud. Fraud being a rare event, a model could just predict non-fraud and get 99.9% accuracy (essentially 100 minus the rate of fraud).

What is less widely discussed is that this also occurs with a large number of classes. Even if the classes are balanced, you can be entirely unable to classify a specific class with minimal impact on overall accuracy. If the data is balanced, one class being entirely misclassified only lowers accuracy by $$1/n$$ of the data (i.e., $$100/n$$ percentage points). As the number of classes $$n$$ grows, missing a class has a smaller and smaller impact. With 20-30 classes you could be unable to properly classify 10% of your data and still be "happy" with performance.
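A quick numerical illustration of this: with 20 balanced classes, a model that never gets class 0 right still reports 95% accuracy, while per-class recall makes the failure obvious.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

n_classes, per_class = 20, 100
y_true = np.repeat(np.arange(n_classes), per_class)

# Perfect predictions everywhere except class 0, which is always predicted as class 1.
y_pred = y_true.copy()
y_pred[y_true == 0] = 1

print(accuracy_score(y_true, y_pred))                 # 0.95 -- looks "fine"
print(recall_score(y_true, y_pred, average=None)[0])  # 0.0  -- class 0 is never found
```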

Beyond aggregation causing issues with measuring performance, it also obscures the cost of bad performance. Let's take F1: it assumes that the cost of a false positive and a false negative are the same. This may or may not be true; the risk-to-control mapping example above is a case from my own work where it isn't.

Frank Harrell discusses this in much more detail in [this post](https://www.fharrell.com/post/class-damage/), which I highly recommend. He also goes beyond this and discusses the issues with threshold-dependent metrics, though I haven't found a practical way to use that in my practice yet.

Ok, but What Does this Mean in Practice?

We just discussed the limitations of metrics, especially around granularity and obscuring the cost of misclassification. Enough waxing poetic; I'll discuss what I do. Note that this isn't a hard and fast rule, just what seems to work best. Your mileage may vary.

The biggest win IMO is to avoid accuracy. If you insist on using accuracy, at least present it per class; otherwise it is highly susceptible to class imbalance or a larger number of classes. At the very least use precision and recall. These tend to work well for most tasks, especially ranking or filtering. I mostly use classifiers in this context (or to add metadata that will later be used for filtering), so precision and recall give me a good idea of how the model will cause downstream issues. F1 is an aggregate measure; make sure you are fine with the implications if you do use it. I personally don't anymore, since it's a bit arbitrary. For reporting I use scikit-learn's classification report. It works well, reports per-class performance, and includes precision and recall.
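For reference, this is roughly what that looks like with scikit-learn's `classification_report` (the labels below are toy data for illustration):

```python
from sklearn.metrics import classification_report

# Toy labels; use your eval set's labels and your model's predictions.
y_true = ["risk", "control", "control", "other", "risk", "other"]
y_pred = ["risk", "control", "other",   "other", "risk", "control"]

# Per-class precision, recall, F1, and support, plus macro/weighted averages,
# so class-specific failures aren't hidden behind a single aggregate number.
print(classification_report(y_true, y_pred, digits=3))
```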

Baselines

If you’ve ever read a machine learning paper you’ve likely seen baselines. Baselines are models besides the one being studied or proposed that serve as sanity checks and comparative tools. What counts as a baseline depends on context, but here are common types and what they’re good for:

  1. Strong baseline models – If your model underperforms a known strong system, it might be under-trained or misconfigured
  2. Previous systems – Useful for tracking regressions and incremental improvements
  3. Random or weak baselines – Surprisingly useful. Can expose issues like poor metric choice or eval datasets that are too easy
  4. Partial feature models – Helps isolate feature importance and detect over-reliance on particular input types (e.g., text-only doing just as well as a multimodal model)
  5. Partial training – If you have a more complicated procedure to train your model (e.g., use of post-training, multiple steps of finetuning), ablations that only use one or some of the steps may tell you if your approach is overly complicated or even underperforms simpler methods

Used properly, baselines can uncover:

  • Flawed metrics
  • Data quality issues
  • Under-training
  • Spurious correlations or overfit features
  • Shortcuts or potential overfitting due to complex training

When selecting baselines, identify which types are applicable and capture information important for your use case. The more baselines you have covering different issues, the better off you are. Make sure you select strong baselines: while a weak baseline might make publishing a paper easier, if you are going to deploy the model you are only going to cause yourself headaches.
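As a concrete example of cheap weak baselines, scikit-learn's `DummyClassifier` covers the majority-class and frequency-weighted random cases; the toy labels below are illustrative:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

# Toy labels; in practice use your actual train/eval labels.
train_labels = ["a"] * 80 + ["b"] * 15 + ["c"] * 5
eval_labels = ["a"] * 40 + ["b"] * 8 + ["c"] * 2

# DummyClassifier ignores the features, but the API still needs something to fit on.
X_train, X_eval = [[0]] * len(train_labels), [[0]] * len(eval_labels)

for strategy in ("most_frequent", "stratified"):
    baseline = DummyClassifier(strategy=strategy, random_state=0).fit(X_train, train_labels)
    print(f"--- {strategy} baseline ---")
    print(classification_report(eval_labels, baseline.predict(X_eval), zero_division=0))
```

If a baseline like this gets within spitting distance of your model, that is a strong hint the metric or the eval set is the problem.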

Training

How to troubleshoot training is complex and beyond the scope of this post. The specifics will vary based on the type of model used, which in turn changes the technique(s) used to train it and the hyperparameters available.

That having been said, the model doesn't exist in isolation. Models may be more or less susceptible to phenomena in the data like class imbalance. Rather than cover how to address these issues, I will present one way of keeping an eye out for them.

Tracking Experiments

Experiment tracking tools like MLflow or Weights & Biases are par for the course when it comes to experimentation. They help keep tabs on the different experiments run and their results, which is helpful when trying to debug poor training or optimize performance. Tracking tools also keep a record of what has been tried, the results, and the resulting artifacts. I currently use them for tracking:

  1. Loss curves – Especially for neural models, these can reveal training instability, underfitting, or poor hyperparameter choices
  2. Performance - It’s easy to lose track of what models performed how. Having it logged and in a sortable format is great for addressing that
  3. Hyperparameter trends – Logging all configurations makes it easier to identify which knobs actually matter when it comes to automated metrics
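A minimal sketch of what this looks like with MLflow; the hyperparameters and the stubbed train/evaluate functions are placeholders for whatever your actual training code does:

```python
import random
import mlflow

# Placeholder hyperparameters; swap in your real configuration.
params = {"lr": 3e-5, "batch_size": 16, "model": "distilbert-base-uncased"}

def train_one_epoch(params):   # stub standing in for real training
    return random.random()

def evaluate(params):          # stub standing in for real evaluation
    return {"macro_recall": random.random()}

with mlflow.start_run(run_name="example-run"):
    mlflow.log_params(params)
    for epoch in range(3):
        # Logging the loss curve and a per-epoch metric makes underfitting and
        # hyperparameter effects visible across runs.
        mlflow.log_metric("train_loss", train_one_epoch(params), step=epoch)
        mlflow.log_metric("macro_recall", evaluate(params)["macro_recall"], step=epoch)
```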

Error Analysis

Error analysis aims to understand the errors made by a model and, ideally, find trends within them. You want to understand where and why your model fails. This is critical in high-stakes or compliance-heavy domains. For instance, if your regulatory parser consistently fails on one regulation type, that's more than just a model bug; it could skew your whole dataset. Moreover, it exposes you to risk and liability.

There are multiple error analysis approaches, but I think they can broadly be grouped into two categories.

Annotation Based (Manual)

Like QCing data, inspection of model errors is often where the real insights come from, but only if done thoughtfully. This has two aspects:

  1. Actually finding issues (proper sampling)
  2. A meaningful error taxonomy

Sampling Schemes

The QCing section covered most sampling techniques, but other approaches become available after a model has been trained. Namely, these include using logits as embeddings or, alternatively, using logits to measure the model's "confusion".

A model’s “confusion” can be measured using entropy. Taking the probabilities of every class can give us an idea of how certain it is.

$$ H(\mathbf{p}) = -\sum_{i=1}^{C} p_i \log p_i $$

The more evenly the probability is distributed, the higher the entropy, hence the use of "confusion". Keep in mind, "confusion" doesn't necessarily indicate an error per se, but high-entropy examples are likely near the decision boundary and can indicate where more information is needed. You can sample evenly across bins over these values, or simply take the K most "confusing" examples; I tend to do the latter. For more details on this approach to sampling I recommend Human-in-the-Loop Machine Learning by Robert Monarch. It is only covered in one chapter, so you may want to borrow a copy if this is the only thing that interests you. That having been said, it's a great book IMO.
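A small sketch of entropy-based sampling, assuming you already have an array of predicted class probabilities (the function name and `k` are illustrative):

```python
import numpy as np

def most_confusing(probs: np.ndarray, k: int = 100) -> np.ndarray:
    # probs: (n_examples, n_classes) predicted class probabilities.
    # Entropy H(p) = -sum_i p_i log p_i; higher means the model is less certain.
    eps = 1e-12
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    # Indices of the k highest-entropy ("most confusing") examples, for manual review.
    return np.argsort(entropy)[::-1][:k]
```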

Labeling Schemes

Once you've sampled data that is likely misclassified, the next step is to label it. There are different avenues to take here: you can label bad data you find, you can track the type of confusion (see confusion matrices below), or you can categorize mistakes. I tend to wing this part based on what I see, so I don't currently have specific guidance.

Model Based

While manual work is IMO best and in some respects unavoidable, it doesn't scale, and even with sampling techniques issues can be missed due to the finite time we have. Models can help find trends missed by manual processes due to the sheer scale of data they can process. However, the number of parameters and assumptions involved makes torturing the data quite a bit easier, so be careful with the "insights" gleaned during this process.

Correlation Between Metadata and Performance

Your data lives in context—social, temporal, topical. Some metadata to consider:

  • Time (e.g., year, season)
  • Source (e.g., author, publication)
  • Document type or structure
  • Demographics (e.g., race, gender, etc.—use responsibly)

Beyond measuring how well the model generalizes to held-out cohorts with fine-grained evaluation sets, you might want, or need, to know how model performance varies across these contexts for ethical or legal reasons. Model performance that varies widely across any of these is a signal worth investigating. Work with stakeholders to understand which dimensions matter and which should remain consistent, then analyze model performance accordingly. This may be important in real-world settings where discrimination or setting-specific failures are a concern. The result can be visualized as a matrix of metadata values versus performance.

Multivariate correlation might be more difficult, not necessarily in terms of implementation, but rather reliability: the number of possible correlations is high, so false positives are bound to happen. That having been said, this can be visualized using error trees (in our case with metadata rather than the features themselves).
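A minimal sketch of the univariate version with pandas, assuming a DataFrame of eval predictions with metadata columns (all column names and values below are illustrative):

```python
import pandas as pd

results = pd.DataFrame({
    "source": ["reg_a", "reg_a", "reg_b", "reg_b", "reg_b"],
    "year":   [2022,    2023,    2022,    2023,    2023],
    "label":  ["x", "y", "x", "y", "x"],
    "pred":   ["x", "y", "y", "y", "x"],
})
results["correct"] = results["label"] == results["pred"]

# Accuracy per metadata value, and as a source-by-year matrix.
print(results.groupby("source")["correct"].mean())
print(results.pivot_table(index="source", columns="year", values="correct", aggfunc="mean"))
```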

Confusion Matrices

Confusion matrices are conceptually similar to correlation analysis, but focused on errors only. They track the true vs. predicted class, which helps show whether two classes are often confused for one another. Again, scikit-learn has a tool for the job.
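A short example using scikit-learn's `confusion_matrix` and `ConfusionMatrixDisplay` (toy labels for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Toy labels; use your eval labels and predictions.
y_true = ["risk", "control", "control", "other", "risk", "other"]
y_pred = ["risk", "control", "other",   "other", "control", "other"]

labels = ["risk", "control", "other"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)  # rows = true class, columns = predicted class

ConfusionMatrixDisplay(cm, display_labels=labels).plot()
plt.show()
```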

Clustering

Correlations are based on metadata, but trends might be specific to the data itself. Perhaps certain areas of your text embedding space are doing worse. One solution is to cluster based on input embeddings and/or output embeddings and see if there are clusters that underperform. Keep in mind this method is hit or miss IMO: it may or may not find trends, the failures might not fall neatly into one region, or the way the space is divided might obscure the issue. I wouldn't spend too much time barking up this tree; unsupervised clustering is akin to divination.
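If you do want to try it, a minimal sketch looks something like this; the random arrays stand in for real embeddings and per-example correctness:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 32))   # stand-in for real text embeddings
correct = rng.random(500) > 0.2           # stand-in for real per-example eval results

# Cluster the inputs and check whether any cluster underperforms.
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(embeddings)
for c in range(10):
    mask = clusters == c
    print(f"cluster {c}: n={mask.sum():4d}, accuracy={correct[mask].mean():.2f}")
```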

Pilot Studies

Nothing substitutes for real users in real contexts. Pilots let you observe performance, kind of, in the wild. Feedback is often qualitative—but valuable.

Some tools to capture this:

  1. Logging + lightweight feedback (thumbs up/down, comment boxes, etc.)
  2. Structured user studies with measurable tasks
  3. Surveys to gauge satisfaction or frustration

These surface not only model errors but UI/UX issues (which, while outside scope here, can tank adoption just as easily).

Scenarios

These scenarios describe real-world issues and how to diagnose and "fix" them. This guide assumes you have a dataset, metrics, and a model. Each scenario describes the state of your system and some steps to diagnose the cause. The suggestions are meant to be followed in order, but feel free to read or try things in whatever order you'd like. In general, I suggest data → evaluation method → model; this order reflects how hard the issues are to fix.

Metrics are bad, subjective performance is bad

Garbage Data

Check this first. If your data is bad, any downstream fixes will likely be a waste of time. Possible data issues include:

  1. Uncleaned or noisy data; noise adds randomness and can lead to spurious correlations
  2. Inconsistent labeling; different annotators, unclear guidelines, etc…
  3. Too little data; especially problematic with many labels—e.g., 5 examples for 5 labels won’t yield anything meaningful
  4. Train-test mismatch; if your evaluation data is substantially different from your training data, your model will underperform. This could be a signal that key subsets are missing from your training set.
  5. Missing features to complete the task

Keep in mind noise can be in either the training or the eval set. As a first step inspect both, starting with the eval set. A poor eval set will throw everything off, no matter how good the rest of your setup is. If you can't complete (or even understand) most examples yourself, that may be a sign the problem is poorly defined or unrealistic.

Under-trained or Inappropriate Model

Simple, but worth checking. For example, if your loss is high and you're underperforming compared to baselines, something's likely wrong with training. Make sure the task is even possible first; otherwise, your loss doesn't mean much. Baselines will also highlight this issue: if your model is underperforming strong baselines or existing systems, that may be a sign of under-training. For prompt-based or zero-/few-shot models, "training" can be as simple as tweaking instructions or providing good examples. Techniques like TextGrad exist, but I've had limited success. Honestly, hand-tuning against a validation set usually works fine.

If performance is similar to baselines but the loss looks good, you may be barking up the wrong tree; I would revisit your working assumptions about the problem.

Problem Framing

The same task can be solved multiple ways. Some are more natural for the model.

One common challenge is parsing bullet points or structured lists from free text, like regulations. For instance:

  • “Applicants must provide: (a) proof of residency, (b) proof of income, and (c) identification documents.” → Should become a list of items.

We tried IOB tagging, which works, but nesting breaks things down. Precision/recall per tag looked fine, but subjectively the output was messy. We're experimenting with structured parsers (think arc parsers), which reduce the number of output classes and capture hierarchical structure better. Time will tell, but changing the annotation scheme reduces the number of classes and, more importantly, the number of minority classes.

Task too difficult

Some tasks might be too difficult for models. If you've done the previous steps and are still having issues, it may be worth considering that you are in this state. There are a few options here:

  1. Look for missing information
  2. Change the framing of the problem
  3. Further annotation efforts
  4. Further break down the problem
  5. Integrate people into the loop
  6. Give up (i.e., focus on other problems you can solve)

Metrics are good, but subjective performance isn’t

This mostly occurs when the evaluation approach is inadequate or otherwise mismatched to the downstream task. There are many causes of this, which we’ll discuss below.

Poor Eval Strategy

Evaluation has two components:

  1. choice of metric
  2. construction of your eval set

Poor Choice of Metric

Some metrics are gameable or don’t capture what matters. For example, we once used top-3 accuracy. The problem? Three dominant classes made it easy to score well by just guessing them. I proposed using a random baseline based on class frequency to show that while the model beat random, it wasn’t doing anything impressive.

Compare to random baselines. If they do well, your metric or data split might be flawed.
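As a sketch of that check: the top-3 accuracy of a simplified frequency-based baseline that always returns the three most frequent classes is just their combined frequency, which you can compute directly from label counts (toy labels below):

```python
from collections import Counter

# Toy eval labels; in practice use your eval set's labels.
eval_labels = ["a"] * 50 + ["b"] * 25 + ["c"] * 15 + ["d"] * 7 + ["e"] * 3

counts = Counter(eval_labels)
top3 = [label for label, _ in counts.most_common(3)]

# A baseline that always "guesses" the three most frequent classes gets this top-3 accuracy:
baseline_top3_acc = sum(counts[label] for label in top3) / len(eval_labels)
print(top3, baseline_top3_acc)   # ['a', 'b', 'c'] 0.9
```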

Test-Train Bleed

If there’s leakage, your evaluation metrics will be inflated. Techniques to find leakage are discussed above in the split section.

Easy Evaluation Set

If your evaluation dataset is composed of mostly trivial cases, models will score well without being able to solve the harder problems.

A simple way to check: have annotators rate example difficulty or build a difficulty classifier. Another way of finding difficult examples is to use inter-annotator disagreement. However, be cautious with inter-annotator disagreement, as difficulty isn't the only factor that contributes to it; more commonly ambiguity does as well, and including ambiguous data can cause issues.

Once you have selected a method of identifying difficult examples, you can stratify performance by difficulty bucket. With those buckets you can ensure there is enough hard data and see how well the model performs on it. I would hold off on this test due to the amount of effort and come back to it towards the end.

Domain Shift between Eval Set and the “Real World”

If the dataset isn't leaking or otherwise too easy, domain shift can also be a culprit. Fine-grained evaluation can help here by estimating the expected drop. If the drop is larger than desired, adding data to reduce it, or outright adding data similar to the domain in question, is important. Textual similarity and annotation can be used to this end to expand the training dataset.

Class-Specific Failure

If you're not using class-aware metrics, you'll miss whether the model is struggling with a specific class. Try:

  1. Per-class precision and recall
  2. Confusion matrices
  3. Correlation between data distribution and performance

For example, if minority classes have lower recall, that’s normal but not acceptable. Ways to mitigate:

  • Annotate more data: ideally for underperforming classes
  • Resample: up/down sampling
  • Reweight updates: only possible for some model types

Problem Framing

Poor framing may make a problem intractable, but it may also have more subtle effects such as poor per-class performance. This tends to happen when a framing creates either

  1. a large number of classes
  2. minority classes

If there are alternative framings that address these issues, consider them. There may be ways to convert or otherwise simplify the problem:

  1. Construct changes (a fundamentally different approach to the same problem, e.g., arc parser vs. IOB tagging)
  2. Decomposition (e.g., use a taxonomy to turn a single-step process into a multi-step one)

Metadata Correlation to Failure

Sometimes performance depends on metadata even when it shouldn’t.

Maybe your data comes from multiple sources, and one source has poor labeling. Or a language variant performs worse. Or users from one group see degraded results.

To diagnose:

  • Group by metadata.
  • Measure outlier performance.
  • Run significance tests (even basic ones like Welch’s t-test or bootstrapping).

Just be cautious about overfitting to combinations of attributes. It’s easy to start solving the eval instead of the task.
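A small sketch of the significance-test step, using Welch's t-test on per-example correctness for two groups (the arrays below are toy data; a bootstrap over group means works similarly):

```python
import numpy as np
from scipy.stats import ttest_ind

# Per-example correctness (0/1) for two metadata groups; toy data for illustration.
rng = np.random.default_rng(0)
group_a = (rng.random(200) > 0.15).astype(float)   # roughly 85% accuracy
group_b = (rng.random(180) > 0.30).astype(float)   # roughly 70% accuracy

# Welch's t-test (unequal variances) on per-example correctness.
stat, p_value = ttest_ind(group_a, group_b, equal_var=False)
print(f"accuracy A={group_a.mean():.2f}, B={group_b.mean():.2f}, p={p_value:.4f}")
```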

Domain-Specific Failures

Another type of metadata that can be added is "domain". I'm not aware of a rigorous definition of domain in NLP; a lot of "OOD" papers will use data from another source or about another subject and call that a new domain. I would break down the approaches to defining domain into:

  1. Topic-based: define domains based on subject matter (e.g., legal, medical, social media), then use the same stratification techniques as for metadata
  2. Source-based: where text comes from (web text, wiki text, book text)
  3. Clustering-based: Cluster examples using embeddings or bag-of-words. This is a more quantifiable definition, but is sample dependent

Domains can be tagged, either through rules or models, and in turn the same correlation analysis can be applied.

Training Data-Specific Causes

Poor training examples can cause downstream issues, especially if they are prevalent in your data. If something fails, check:

  • Are there similar examples that also fail?
  • Are they rare?
  • Are they mislabeled?

I would follow the “data analysis” section. Using similarity search (e.g., cosine over embeddings) can help identify patterns. But be careful: changing too many things based on isolated examples is a great way to overfit to your evaluation set.
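A minimal sketch of that similarity search with cosine similarity over embeddings; the random arrays stand in for your real train embeddings and the embedding of a failing example:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(1000, 384))   # stand-in for real train embeddings
failing_embedding = rng.normal(size=(1, 384))     # stand-in for one failing eval example

similarity = cosine_similarity(failing_embedding, train_embeddings)[0]
nearest = np.argsort(similarity)[::-1][:10]       # indices of the 10 most similar training examples
print(nearest, similarity[nearest])
# Inspect these neighbors: are they rare, mislabeled, or simply absent?
```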

Metrics are good, performance is good

Sit and wait. Lol well not quite, but this is a good place to be. Keep monitoring performance and collect user feedback. Focus shifts from “how do we fix this?” to “how do we maintain and update this?”

I don’t have much more advice here—this is an area I’m still learning about. I’ll either update this post or write a follow-up once I have more concrete practices to recommend.

Closing Remarks

Model deployment is a key part of the lifecycle. Despite recent progress, models remain imperfect. What hasn't changed, arguably the only constant, is the need for thoughtful, careful evaluation. To that end, this blog post:

  1. provided a taxonomy of common issues associated with training models
  2. identified common remedies
  3. detailed a step by step approach to identify and resolve common issues

Ultimately the advice in this post is a starting point, not an end-all-be-all. As techniques, problems, and inputs change, some of this advice may no longer be applicable or there may be better alternatives. Don't be afraid to add new techniques to your practice, but it is important to start somewhere when it comes to understanding model performance.

