Ready-to-mind

Large Language Models are Surprisingly Small

This post also covers some thoughts on LLM evaluations

So much intelligence in a single hard disk?

How did we compress all human knowledge into less than a terabyte?

This has been blowing my mind. How can we fit all this intelligence into the palm of a hand?

But actually, I think I was wrong to be so impressed by this. The total storage needed for all the English text on Wikipedia is about 23 GB; LLMs use far more than that. If a thumb drive held those 23 GB of Wikipedia text along with a really good search program, I would get the same feeling of holding so much knowledge in such a small device.
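To make the comparison concrete, here is a back-of-envelope sketch. The 23 GB figure is from above; the parameter counts and the 2 bytes per parameter (fp16/bf16 weights) are illustrative assumptions, not measurements of any particular model.

```python
WIKIPEDIA_TEXT_GB = 23  # approximate size of the English Wikipedia text dump

def weights_gb(n_params_billions: float, bytes_per_param: int = 2) -> float:
    """Storage for model weights in GB, assuming fp16/bf16 (2 bytes/param)."""
    return n_params_billions * 1e9 * bytes_per_param / 1e9

# Compare a few illustrative model sizes against the text dump.
for size_b in (7, 70, 405):
    gb = weights_gb(size_b)
    print(f"{size_b:>4}B params -> {gb:,.0f} GB "
          f"({gb / WIKIPEDIA_TEXT_GB:.1f}x the Wikipedia text dump)")
```

Even a mid-sized open model already rivals the text dump in raw bytes, and frontier-scale weights dwarf it.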


Evals of parametric memory

But this made me wonder: how good are LLMs at general knowledge? How lossy has their compression of facts been? If we made an LLM answer cloze-style questions about Wikipedia, how much of it would it be able to reproduce?
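Generating such a probe is mechanical. Here is a minimal sketch of turning a Wikipedia sentence into a cloze question by blanking one word; the "content word" heuristic (any word of five or more letters) and the example sentence are my own placeholders. Scoring would then be exact-match between the model's completion and the held-out word.

```python
import random
import re

def make_cloze(sentence: str, rng: random.Random):
    """Blank out one long-ish word to produce a cloze question and its answer.
    Returns None if the sentence has no candidate word."""
    words = re.findall(r"[A-Za-z]{5,}", sentence)
    if not words:
        return None
    answer = rng.choice(words)
    # Replace only the first occurrence of the chosen word with a blank.
    question = re.sub(rf"\b{re.escape(answer)}\b", "_____", sentence, count=1)
    return question, answer

rng = random.Random(0)
sentence = ("The Eiffel Tower is a wrought-iron lattice tower "
            "on the Champ de Mars in Paris, France.")
q, a = make_cloze(sentence, rng)
print(q)  # the sentence with one word blanked
print(a)  # the held-out word the model should recover
```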

These benchmarks are among the easiest that LLMs can be fine-tuned to perform well on. So one might think they are useless, since they can be passed without much effort. However, they can still be helpful probes into a particular LLM, especially if there are many such benchmarks across lots of specialist domains. For example, if you are in the business of marking medical school exams, then having an eval of medical school knowledge at hand would be helpful: you could test any new AI system against it and adopt whichever one passes the benchmark for your use case.


How good are LLMs at medical knowledge?

The best fine-tuned model on Hugging Face gets about 90% of multiple-choice medical questions correct. That may well be superhuman, but it's far from what I want if I'm using an LLM as a second opinion for medical advice.


Making an eval

Making an eval from scratch seems like a lot of work. That's why in many cases we piggyback on the immense collective intellectual effort behind standardised tests: they are already out there, so it's easy to turn them into programmatic evals.
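"Turning a test into a programmatic eval" mostly means a loop over questions and a grader. Here is a minimal sketch; `ask` is a stand-in for whatever model API you use, and the answer-parsing (take the first letter of the reply) is a deliberately naive assumption.

```python
from dataclasses import dataclass

@dataclass
class MCQ:
    question: str
    options: dict[str, str]  # e.g. {"A": "4", "B": "5"}
    answer: str              # the correct option key

def score(questions: list[MCQ], ask) -> float:
    """Fraction of questions the model answers correctly."""
    correct = 0
    for q in questions:
        prompt = (q.question + "\n"
                  + "\n".join(f"{k}. {v}" for k, v in q.options.items())
                  + "\nAnswer with a single letter.")
        # Naive grading: compare the first letter of the reply to the key.
        if ask(prompt).strip().upper().startswith(q.answer):
            correct += 1
    return correct / len(questions)

# Usage with a dummy "model" that always answers "A":
qs = [MCQ("2 + 2 = ?", {"A": "4", "B": "5"}, "A"),
      MCQ("Capital of France?", {"A": "Rome", "B": "Paris"}, "B")]
print(score(qs, lambda prompt: "A"))  # -> 0.5
```

Real harnesses mostly add robustness around this loop: better answer extraction, retries, and prompt variants.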

If you work in a specialist domain and want a test ready so you know whether a new model is good in your domain, is it worth your time to create an eval? In many cases you can probably rely on the closest relevant one. If I want to know how good an LLM is at quantum computing, I might get enough of a sense from a maths eval.

However, maybe in the future we will want LLMs, or at least AI systems, to get 100% accuracy in specific domains before we feel comfortable using them. For example, if I deploy my AI to help summarise news in Singapore, maybe I want to be sure it knows every small detail of Singaporean law, geography, culture, et cetera. It's a computer system, right? It might as well be made to be perfect.

How big does a benchmark need to be to be meaningful? I think if it is publicly released then it cannot be small (~100 questions), because it will be easy to fine-tune a model to saturate (i.e. do perfectly well on) it. But if it is not going to be public, then around 100 difficult questions can be enough to tell you whether a model is good for your use case.
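There is also a purely statistical angle to benchmark size. A quick sketch of the normal-approximation 95% confidence interval for a measured accuracy shows how fuzzy a 100-question score is; the 80% accuracy used here is just an example value.

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% CI for accuracy p
    measured on n independent questions."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 1000, 10000):
    print(f"n={n:>5}: 80% accuracy is measured to within "
          f"+/- {100 * ci_half_width(0.8, n):.1f} points")
```

With 100 questions the interval is roughly plus or minus 8 points, so two models scoring 82% and 88% may not be distinguishable; at 10,000 questions it shrinks to under a point.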

Will anyone buy an eval?

Maybe, if you sell an eval together with a fine-tuning service. But then you are not blinded against the test, so it won't be convincing.

If there are lots and lots of evals, and they need to be run many times during training and fine-tuning, then maybe a paid service that implements them all makes sense, or many of them exist separately and someone aggregates them.

Maybe someone faced with many models needs to pick the right one for the job. In that case they might buy an eval, or spend money to make one, so they can choose well.

Maybe governments might buy evals to keep up with the state of AI capability and alignment.

A market analysis organisation might want to set up or buy evals to monitor things like job displacement by AI.

It seems like companies are more likely to want to directly buy a fine tuned model for their needs though.

You should make an eval and express what you want AI systems that you use to know.

#gentle-computing