
Bigger Isn't Always Better: Modern LLMs vs. BERT for Clinical Text Classification

We benchmarked seven modern architectures against our published Bio+ClinicalBERT baseline for classifying antibiotic indications from EHR free-text. On this task, a 110M-parameter BERT matched or came within one to three F1 points of models up to 245x its size. The pattern is consistent with domain pretraining mattering more than scale, but the comparison comes with caveats.

Tags: NLP, clinical-AI, transformers, LLMs, EHR

In our paper last year (Communications Medicine, 2025), we reported that a fine-tuned Bio+ClinicalBERT achieved F1 scores of 0.97 internally and 0.96 to 0.98 externally for sorting free-text antibiotic indications into 11 infection source categories. On that task it outperformed regex, n-grams with XGBoost, and zero-shot GPT-4.

Since publication a number of new model families have appeared with claims of strong biomedical or clinical performance. We wanted to know how they would compare to our 110M-parameter BERT baseline given the same training data, the same task, and the same evaluation protocol.

The short version: on this task we did not see a clear scaling advantage. The four best models in our comparison all happened to be pretrained on biomedical or clinical text, regardless of parameter count. This is one task on one dataset, so we’d be cautious about generalising too far. The detailed numbers and confounds are below.

Data Flow

Benchmark pipeline: from EHR prescriptions through data splits, model fine-tuning, to evaluation

Setup

We re-used the task, data and evaluation from the original paper: multi-label classification (one binary decision per category) of approximately 4,000 labelled antibiotic indications from Oxford University Hospitals into 12 categories. The internal test set was held-out Oxford data (n=2,000) and the external test set came from Banbury (n=2,000). We used weighted F1 as the primary metric and a fixed prediction threshold of 0.5.
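As a concrete reading of the metric, the sketch below computes support-weighted F1 for multi-label predictions at a fixed 0.5 threshold in plain Python. The function names and toy data are illustrative, and treating a probability of exactly 0.5 as a positive prediction is an assumption; the evaluation code used for the benchmark may differ in such details.

```python
from typing import List, Sequence


def binarize(probs: Sequence[Sequence[float]], threshold: float = 0.5) -> List[List[int]]:
    """Turn per-class probabilities into 0/1 predictions (assumes p >= threshold is positive)."""
    return [[1 if p >= threshold else 0 for p in row] for row in probs]


def per_class_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> List[float]:
    """F1 for each label column; a class with no true and no predicted positives scores 0.0."""
    scores = []
    for c in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 0 and p[c] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def weighted_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Average of per-class F1, weighted by the number of true positives per class (support)."""
    f1s = per_class_f1(y_true, y_pred)
    support = [sum(row[c] for row in y_true) for c in range(len(f1s))]
    total = sum(support)
    return sum(f * s for f, s in zip(f1s, support)) / total if total else 0.0


# Toy example with two categories (made-up values, not benchmark data).
toy_true = [[1, 0], [1, 0], [0, 1], [1, 0]]
toy_probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.1]]
toy_pred = binarize(toy_probs)
print("per-class F1:", per_class_f1(toy_true, toy_pred))
print("weighted F1:", weighted_f1(toy_true, toy_pred))
```

Because the average is weighted by support, a rare class that scores 0.00 moves the headline number very little, which is why the per-class tables matter.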

All models were trained on the same training split and evaluated identically. The smaller models (≤396M parameters) were fully fine-tuned. The larger ones used LoRA or QLoRA so they would fit on our single Quadro GV100 (32GB VRAM). We ran fp16 mixed precision throughout. We used a single random seed per model and did not run a full hyperparameter sweep, so the absolute numbers below should be read with that in mind.
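To illustrate why the largest model needed 4-bit quantisation on a 32GB card, here is a back-of-the-envelope estimate of weight memory alone. It is a rough sketch: it ignores activations, optimizer state, LoRA adapter weights, and quantisation overhead, and it takes 1 GB as 10^9 bytes. The function name is ours.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


VRAM_GB = 32  # single Quadro GV100, as described above

# Parameter counts as listed in the model table (nominal sizes).
models = [("MedGemma 4B", 4e9), ("BioMistral 7B", 7e9), ("MedGemma 27B", 27e9)]

for name, params in models:
    fp16 = weight_memory_gb(params, 16)
    nf4 = weight_memory_gb(params, 4)
    print(
        f"{name}: fp16 weights ~{fp16:.1f} GB "
        f"({'fits' if fp16 < VRAM_GB else 'does not fit'}), "
        f"4-bit weights ~{nf4:.1f} GB"
    )
```

On these assumptions the 27B model's fp16 weights alone exceed the card's memory, while the 4B and 7B models leave headroom for LoRA training.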

| Model | Params | Pretraining Domain | Tokenization | Fine-tuning | Hardware |
|---|---|---|---|---|---|
| Bio+ClinicalBERT | 110M | PubMed + MIMIC-III | WordPiece | Full | Single GPU |
| BioClinical-ModernBERT | 396M | PubMed + Clinical (50.7B tokens) | WordPiece | Full | Single GPU |
| MedGemma 4B | 4B | Medical multimodal | SentencePiece | LoRA (r=16) | Single GPU |
| MedGemma 27B | 27B | Medical multimodal | SentencePiece | QLoRA 4-bit NF4 | Single GPU |
| BioMistral 7B | 7B | PubMed (Mistral base) | BPE | LoRA (r=16) | Single GPU |
| Qwen2.5-7B | 7B | General web + code | BPE | LoRA (r=16) | Single GPU |
| CANINE-S | 121M | Wikipedia (104 languages) | Character | Full | Single GPU |
| ByT5-small | 300M | mC4 multilingual | Byte | Full | Single GPU |

Overall Results

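| Model | Params | Training Time | Eval F1 | Oxford F1 | Banbury F1 |
|---|---|---|---|---|---|
| Bio+ClinicalBERT | 110M | 3.4 min | 0.92 | 0.97 | 0.96 |
| MedGemma 4B | 4B | 10 min | 0.93 | 0.97 | 0.98 |
| BioClinical-ModernBERT | 396M | 13 min | 0.93 | 0.97 | 0.98 |
| MedGemma 27B (QLoRA) | 27B | 8.8 hr | 0.94 | 0.98 | 0.99 |
| BioMistral 7B | 7B | 58 min | 0.87 | 0.91 | 0.89 |
| Qwen2.5-7B | 7B | 55 min | 0.58 | 0.67 | 0.64 |
| CANINE-S | 121M | 9 min | 0.65 | 0.79 | 0.79 |
| ByT5-small | 300M | 18 min | 0.18 | 0.15 | 0.20 |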

Four models reached 0.96 F1 or higher on both test sets: Bio+ClinicalBERT, BioClinical-ModernBERT, MedGemma 4B, and MedGemma 27B. The remaining four, ranging from 121M to 7B parameters, scored noticeably lower (0.15 to 0.91 across the two test sets).

The four top models share a property: each was pretrained on biomedical or clinical text. They differ widely in size (110M to 27B), tokenizer (WordPiece vs SentencePiece), and training regime (full fine-tuning vs LoRA/QLoRA). Within this group, the gap between the smallest and largest model is one F1 point on Oxford and three on Banbury. We can’t rule out that with a wider sweep this gap would widen, but at our budget it stayed small.

The four lower-performing models are more heterogeneous in size, but none of them was pretrained on clinical text; BioMistral is the only one with biomedical (PubMed) pretraining. We can’t fully separate “pretraining domain” from “tokenization”, because the BPE-tokenized models (BioMistral, Qwen) and the character/byte models (CANINE-S, ByT5) also tend to fragment clinical abbreviations differently from WordPiece or SentencePiece tokenizers. So this comparison is suggestive, not clean.

Per-Class Breakdown (Oxford Test Set)

The aggregate scores hide a lot, so we looked at per-class F1.

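| Category | ModernBERT | MedGemma 27B | BioMistral | Qwen2.5 | CANINE-S |
|---|---|---|---|---|---|
| Urinary | 0.97 | 0.98 | 0.94 | 0.18 | 0.00 |
| Respiratory | 0.99 | 0.99 | 0.86 | 0.64 | 0.90 |
| Abdominal | 0.96 | 0.97 | 0.92 | 0.77 | 0.85 |
| Neurological | 0.91 | 0.93 | 0.84 | 0.00 | 0.00 |
| Skin/Soft Tissue | 0.95 | 0.95 | 0.84 | 0.73 | 0.73 |
| ENT | 0.92 | 0.95 | 0.84 | 0.07 | 0.00 |
| Orthopaedic | 0.91 | 0.95 | 0.77 | 0.21 | 0.00 |
| Other Specific | 0.89 | 0.93 | 0.67 | 0.04 | 0.00 |
| No Specific Source | 0.98 | 0.98 | 0.93 | 0.76 | 0.94 |
| Prophylaxis | 0.99 | 0.99 | 0.96 | 0.86 | 0.96 |
| Uncertainty | 0.97 | 0.96 | 0.91 | 0.67 | 0.96 |
| Not Informative | 0.99 | 1.00 | 0.85 | 0.00 | 0.00 |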

Two patterns are worth flagging.

First, Qwen2.5-7B scores 0.00 on neurological and “not informative”, below 0.10 on ENT and “other specific”, and around 0.2 on urinary and orthopaedic. CANINE-S shows the same pattern more severely: respiratory and prophylaxis are reasonable, but six of the twelve categories sit at 0.00. After fine-tuning these models appear to predict the high-frequency labels and rarely (or never) trigger on the rare ones at threshold 0.5. This is a known failure mode in imbalanced multi-label classification, and we’d expect it to be at least partly mitigated by class-weighting or threshold tuning. We did not apply those techniques here, so the table reflects out-of-the-box behaviour with the same training recipe.

Second, BioMistral falls in between. It produces predictions on every category but tends to score 5 to 15 points below the domain-pretrained encoders on most classes. This is consistent with PubMed pretraining helping somewhat but not closing the gap to models that also saw clinical text.
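The threshold tuning mentioned under the first pattern was not part of this benchmark. As a minimal sketch of what it could look like, the code below picks, for each category, the threshold that maximises F1 on a held-out validation split. The function names, the 0.05-step grid, and the toy data are all assumptions for illustration; thresholds should be chosen on validation data, never on the test sets.

```python
from typing import List, Optional, Sequence


def f1_binary(truth: Sequence[int], preds: Sequence[int]) -> float:
    """F1 for a single binary label; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(truth, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, preds) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_thresholds(
    val_probs: Sequence[Sequence[float]],
    val_true: Sequence[Sequence[int]],
    grid: Optional[Sequence[float]] = None,
) -> List[float]:
    """Per-class threshold maximising validation F1; the lowest threshold wins ties."""
    if grid is None:
        grid = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    thresholds = []
    for c in range(len(val_true[0])):
        truth = [row[c] for row in val_true]
        best_t, best_f1 = 0.5, -1.0
        for t in grid:
            preds = [1 if row[c] >= t else 0 for row in val_probs]
            score = f1_binary(truth, preds)
            if score > best_f1:
                best_t, best_f1 = t, score
        thresholds.append(best_t)
    return thresholds


# Toy single-class example: the positives never reach 0.5, so a fixed
# 0.5 threshold would score F1 = 0 on this class.
toy_probs = [[0.32], [0.37], [0.12], [0.07]]
toy_true = [[1], [1], [0], [0]]
print("tuned thresholds:", tune_thresholds(toy_probs, toy_true))
```

This only helps when a model ranks rare-class examples above the rest but with low absolute probabilities; it cannot recover a class the model has not learned to separate.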

Per-Class Breakdown (Banbury External Test Set)

The Banbury data comes from a different hospital site, which provides a (limited) test of generalisation. The strong models stay strong; the weaker ones shift around but do not catch up.

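| Category | ModernBERT | MedGemma 27B | BioMistral | Qwen2.5 | CANINE-S |
|---|---|---|---|---|---|
| Urinary | 0.99 | 1.00 | 0.97 | 0.21 | 0.00 |
| Respiratory | 0.99 | 1.00 | 0.81 | 0.72 | 0.96 |
| Abdominal | 0.95 | 0.98 | 0.90 | 0.64 | 0.76 |
| Neurological | 1.00 | 1.00 | 1.00 | 0.26 | 0.00 |
| Skin/Soft Tissue | 0.97 | 0.97 | 0.94 | 0.84 | 0.78 |
| ENT | 0.92 | 0.96 | 0.87 | 0.00 | 0.00 |
| Orthopaedic | 0.96 | 0.98 | 0.85 | 0.22 | 0.00 |
| Other Specific | 0.86 | 0.91 | 0.56 | 0.12 | 0.00 |
| No Specific Source | 0.99 | 0.99 | 0.84 | 0.62 | 0.94 |
| Prophylaxis | 0.99 | 0.99 | 0.98 | 0.86 | 0.96 |
| Uncertainty | 0.99 | 0.98 | 0.95 | 0.82 | 0.97 |
| Not Informative | 0.99 | 1.00 | 0.93 | 0.00 | 0.00 |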

MedGemma-27B reached F1 = 1.00 on four categories (urinary, respiratory, neurological, not informative). ModernBERT did the same on neurological. With a test set of n=2,000 and class imbalance, perfect F1 should be read as “no errors observed in this sample” rather than as a guarantee. Both top models stayed at or above 0.86 across all categories on Banbury, which is the kind of consistency one would want before considering deployment, though deployment would also require prospective validation we haven’t done here.
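To make “no errors observed in this sample” more concrete: if a model makes zero errors on n independent examples, the exact one-sided 95% upper bound on its true error rate is 1 - 0.05^(1/n), which is close to the familiar “rule of three” (3/n). The sketch below computes both. The per-class example counts are hypothetical, since per-class supports are not reported here, and a rounded F1 of 1.00 may also conceal a small number of errors.

```python
def zero_error_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on the error rate after 0 errors in n independent trials.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


def rule_of_three(n: int) -> float:
    """Common approximation to the 95% bound above; slightly conservative."""
    return 3.0 / n


# Hypothetical per-class test counts, for illustration only.
for n_examples in (10, 50, 200):
    exact = zero_error_upper_bound(n_examples)
    approx = rule_of_three(n_examples)
    print(f"n={n_examples}: error rate could still be up to {exact:.3f} (rule of three: {approx:.3f})")
```

For a rare category with only a few dozen test examples, a perfect score is therefore compatible with a true error rate of several percent.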

Spot-Check on Variant Inputs

We constructed 19 hand-written test inputs to probe model behaviour on the kinds of variation we expect in real clinical free-text: correct spellings, common typos, hyphenation variants, and common abbreviations. This is not a formal robustness benchmark and 19 inputs is far too small to draw quantitative conclusions, but the qualitative differences are still informative.

[Per-model result columns (ModernBERT, MedGemma 27B, BioMistral, Qwen2.5, CANINE-S) not shown]

| Input | Expected |
|---|---|
| sepsis | no_specific_source |
| sespis (typo) | no_specific_source |
| osteomyelitis | orthopaedic |
| ostoemyelitis (typo) | orthopaedic |
| pneumonia | respiratory |
| pnuemonia (typo) | respiratory |
| uti | urinary |
| cellulitis | skin_soft_tissue |
| celulitis (typo) | skin_soft_tissue |
| hap | respiratory |
| cap | respiratory |
| lrti | respiratory |
| nec fasc | skin_soft_tissue |
| peri-prosthetic infection | orthopaedic |
| periprosthetic infection | orthopaedic |

A few things we’d flag, with the caveat that this is a probe rather than a benchmark.

MedGemma-27B was the only model in our comparison that handled every typo and hyphenation variant in this small set. It is also the largest and most expensive model, so this might reflect scale, training regime, or both. We can’t separate the two from this data.

ModernBERT failed on three of the nineteen inputs: two typos and one hyphenation variant (“peri-prosthetic” vs “periprosthetic”). This kind of failure is likely recoverable with input normalisation in a deployed system.

BioMistral did not classify “hap” or “cap” correctly (these are common shorthand for hospital-acquired and community-acquired pneumonia). For a model pretrained on PubMed this was unexpected to us. One possible explanation is that BPE tokenisation splits these short strings into subwords that don’t align with the abbreviations as used in clinical practice, but we didn’t verify this with a tokenisation analysis.

CANINE-S handled typos relatively well, which is consistent with the hypothesis that character-level processing helps with surface-level noise. It still failed on the underlying classification task on most categories, so this robustness, on its own, is not enough to make it a useful model for the task.
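The input normalisation suggested above for ModernBERT's failures was not something we built or tested. As a rough sketch of the idea, the code below lowercases the text, removes intra-word hyphens, collapses whitespace, and then snaps near-misses onto a vocabulary of known terms using the standard library's `difflib.get_close_matches`. The vocabulary, the 0.8 similarity cutoff, and the function name are all assumptions; a real system would need a curated term list and care with short abbreviations, where fuzzy matching can easily pick the wrong term.

```python
import difflib
import re
from typing import Sequence

# Hypothetical vocabulary of canonical terms; a deployed system would curate this list.
KNOWN_TERMS = [
    "sepsis",
    "osteomyelitis",
    "pneumonia",
    "cellulitis",
    "periprosthetic infection",
]


def normalise(text: str, vocabulary: Sequence[str] = KNOWN_TERMS, cutoff: float = 0.8) -> str:
    """Clean an indication string and map close misspellings onto a known term."""
    cleaned = text.lower().strip()
    cleaned = re.sub(r"(?<=\w)-(?=\w)", "", cleaned)  # peri-prosthetic -> periprosthetic
    cleaned = re.sub(r"\s+", " ", cleaned)            # collapse runs of whitespace
    match = difflib.get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    # Fall back to the cleaned text when nothing in the vocabulary is close enough.
    return match[0] if match else cleaned


for raw in ["sespis", "ostoemyelitis", "pnuemonia", "celulitis", "Peri-Prosthetic  infection", "hap"]:
    print(f"{raw!r} -> {normalise(raw)!r}")
```

Inputs with no close match, such as the abbreviation “hap”, pass through unchanged and are left for the classifier to handle.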

What This Suggests, and What It Doesn’t

We’d state the takeaways as follows.

On this task, with this dataset, and at our compute budget, we did not see scale alone produce a meaningful gain over a domain-pretrained 110M-parameter encoder. Bio+ClinicalBERT sat within one F1 point of MedGemma-27B on the Oxford test set and within three on Banbury. The 27B model is roughly 245x larger, required 4-bit quantisation to fit on the GPU, and took roughly 155x longer to train. The marginal gain on the headline F1 was small. The 27B model did appear more robust on our small typo probe, but we’d want a larger and more systematic robustness evaluation before treating that as an established advantage.

General-purpose LLMs underperformed in our setup even after fine-tuning. Qwen2.5-7B scored 0.67 F1 on Oxford and scored below 0.10 F1 on four of twelve categories at threshold 0.5. BioMistral, which is a domain-adapted Mistral, came in around 6 F1 points below the smaller BERT. We’d hesitate to claim general LLMs “can’t” do this task. With class balancing, threshold tuning, more careful prompting at fine-tuning time, and longer training they may close some of the gap. We did not run those experiments.

Character and byte-level models did poorly on this task in our setup. CANINE-S produced predictions only for the most frequent categories, and ByT5-small scored 0.15 F1, close to chance. This is consistent with the view that, without medical pretraining, character-level signal alone is not sufficient to learn the mapping from clinical surface forms to category labels with our training set size (~4,000 examples). It is not evidence that character-level architectures are a dead end for clinical NLP in general.

The cost-performance picture in our benchmark is summarised below.

F1 scores and cost-performance scatter plot across all benchmarked models

The four domain-pretrained models cluster at high F1 with a range of training times. At our compute budget, training time scaled with model size much faster than F1 did.
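The size and training-time ratios quoted for MedGemma 27B against Bio+ClinicalBERT follow directly from the results table. The short check below reproduces them, assuming the nominal parameter counts (110M and 27B) and the reported training times (3.4 minutes and 8.8 hours).

```python
# Nominal parameter counts and training times as reported in the results table.
bert_params, medgemma_params = 110e6, 27e9
bert_minutes, medgemma_minutes = 3.4, 8.8 * 60

size_ratio = medgemma_params / bert_params
time_ratio = medgemma_minutes / bert_minutes

print(f"size ratio: ~{size_ratio:.0f}x")
print(f"training-time ratio: ~{time_ratio:.0f}x")
```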

Limitations

We want to be explicit about what this comparison is and isn’t.

It is a single task. Multi-label classification of antibiotic indications into 12 source categories is one slice of clinical NLP. Tasks involving longer documents, reasoning over multiple notes, or generation are not represented here, and our results probably say little about how these models compare on those.

It is two sites in one country. Both Oxford and Banbury are UK NHS hospitals. Behaviour on US, EU, or other clinical text styles may differ.

The training regime is not held constant. Smaller models were fully fine-tuned; larger models used LoRA or QLoRA. With more GPU memory and a fuller hyperparameter sweep, the larger models might gain ground. The single-GPU constraint is a real-world budget choice, but it is a constraint nonetheless.

We used one random seed per model and did not estimate variance. The differences within the top group (one F1 point on Oxford, up to three on Banbury) may be within the range we’d expect to see from seed variation alone, so the ordering inside that group should not be over-interpreted.

We used a fixed prediction threshold of 0.5 and did not apply class weighting. Per-class threshold tuning or balanced sampling would likely improve scores for the weaker models, particularly on rare classes. We chose to keep the recipe simple and identical across models, which favours models that can reach reasonable per-class F1 without those interventions.

The robustness probe is 19 inputs. It surfaces interesting differences but should not be treated as a quantitative robustness measure.
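On the missing variance estimate: one inexpensive partial check is a bootstrap confidence interval on the test-set F1. The sketch below resamples test examples with replacement and reports a percentile interval. Note the assumption loudly: this captures test-set sampling variability only, not seed-to-seed training variance, which would need repeated training runs. Function names, the 1,000 resamples, and the synthetic data are illustrative.

```python
import random
from typing import Callable, List, Sequence, Tuple

Labels = Sequence[Sequence[int]]


def weighted_f1(y_true: Labels, y_pred: Labels) -> float:
    """Support-weighted mean of per-class F1 for 0/1 label matrices."""
    total_support, weighted_sum = 0, 0.0
    for c in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[c] and p[c])
        fp = sum(1 for t, p in zip(y_true, y_pred) if not t[c] and p[c])
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[c] and not p[c])
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        weighted_sum += f1 * (tp + fn)
        total_support += tp + fn
    return weighted_sum / total_support if total_support else 0.0


def bootstrap_ci(
    y_true: Labels,
    y_pred: Labels,
    metric: Callable[[Labels, Labels], float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for a metric, resampling examples with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Synthetic two-class data with roughly 10% label flips (not benchmark data).
demo_rng = random.Random(1)
demo_true = [[demo_rng.randint(0, 1), demo_rng.randint(0, 1)] for _ in range(200)]
demo_pred = [[lab if demo_rng.random() < 0.9 else 1 - lab for lab in row] for row in demo_true]
low, high = bootstrap_ci(demo_true, demo_pred, weighted_f1)
print(f"weighted F1 = {weighted_f1(demo_true, demo_pred):.3f}, 95% interval ~ [{low:.3f}, {high:.3f}]")
```

If two models' intervals overlap heavily, that is a further reason not to read much into their ordering.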

Practical Suggestions

With those caveats in mind, if we were starting a similar clinical text classification task tomorrow, here is what we would actually do.

We would start with Bio+ClinicalBERT. On this task it gave us 0.97 F1 in 3.4 minutes of training on a single GPU, and on a different but related task it would be the cheapest baseline to run. If it worked, we would stop there.

If we wanted a more modern encoder for longer-context inputs, BioClinical-ModernBERT performed at the same level on this benchmark and brings 8192-token context. We did not exercise the long-context capability here.

If robustness to noisy free-text input was a priority, MedGemma-4B was the best balance in our comparison: similar headline F1 to BERT, better behaviour on our (small) typo probe, around 10 minutes to train with LoRA. We would want to validate the robustness more rigorously before relying on it.

We would not, on the strength of this benchmark, choose a 7B+ general-purpose LLM for a comparable supervised classification task. The training is slower, the deployment is heavier, and on our task the F1 was lower. We would still consider large LLMs for zero-shot use cases, settings without labelled data, or generation tasks that don’t fit the supervised classification frame at all.

Domain pretraining looked, on this task, more important than scale. We aren’t claiming this generalises. We’re claiming the data we collected here is consistent with that hypothesis and with prior work pointing in the same direction, and that we would want to see scaling claims for clinical NLP demonstrated on labelled supervised tasks before treating them as established.


This work extends Yuan et al. (2025), “Transformers and large language models are efficient feature extractors for electronic health record studies”, Communications Medicine.