Apr 20, 2025 · 1 minute read
plenty of room for more mini eggs
They say you can’t make it this far without some regrets. Well, I’m here to tell you that this Easter, I have a big one. I really should have picked up that 1kg bag of mini eggs in Waitrose back in February. I had plenty of space in the suitcases for another bag!! A nice round 5kg of egg goodness would have been just the ticket!
Anyway, a quiet but nice birthday. I finally visited The Toy Department out in Fairfield, confirming that people will indeed spend far too much money on old Transformers, and Maeryn even bought me some Lego! Plus, I now have the complete DC One Million Omnibus, which I think pretty much rounds out my Morrison DC collection.
In parenting news, I have finally caved and bought a Fire tablet on sale to be a wifi-less device to help with potty times. Loaded with all the Hey Duggee and Bluey that you can fit on there, I expect…
Mar 23, 2025 · 2 minute read
two two two
full spectrum toddler powers!
Obviously, the big news of the week is that our tiny little dictator, I mean, lovely little Maeryn, is now 2 years old! We celebrated with cake, balloons, presents, and the all-important Sunday Roast. Well, she’s older now, so it’s time for her to really understand her heritage. Next week: Dennis Potter. Or maybe she just plays with her new Little People Barbie Dream House for the moment…
The Atlantic had an exposé this week about how Meta used LibGen to train Llama models, along with a little search bar to see if a book or author is present in LibGen (and thus the text(s) were likely used to train a bunch of LLMs). I realized late in the day that I would likely be in the database, and lo…I have 3 entries, including the German version of my PyTorch book. I am mostly fine with this, and I’m more amused that some of my writing has gone into teaching Meta’s LLMs how to write PyTorch code, PyTorch being a Meta-owned project. Of course, it’s easy for me to say that, being both a wizened anti-copyright person who came of age during the Copyleft Wars of the 90s and somebody who doesn’t make their main income by writing books. I can see exactly where others are coming from, but I also don’t want to restrict us to a world where only OpenAI and Anthropic have the money to build and research models because nobody else can afford the usage fees on Common Crawl.
(also, I note that the AI narration that The Atlantic sticks on the article was almost certainly powered by copyrighted content too…so…y’know)
Mar 15, 2025 · 10 minute read
embeddings
search
chunking
litany of failures
Okay, so this month we’re going to talk about something that doesn’t really work. You’ve seen all those amazing arXiv papers with their fancy new discoveries? Bah, that’s easy. The true apex of research is talking about the things that failed. So welcome to a litany of failure!
Anyway, this is an idea I’ve been wanting to try out for at least a year. I have entries in my notes file that go back to 2023, and really I just wanted to get it out of my head, using my co-worker, Claude Sonnet, to determine if there was any promise in the technique.
Welcome, then, to ‘Gist’, an attempt to improve the search relevance of chunking in vector search!
(if you’d like a précis of searching with vectors, head over here and then come back. The tl;dr: use a model to create vectors of your documents, and then make vectors out of a query and select the ‘closest’ document to the query vector)
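If it helps to see that tl;dr in code, here’s a minimal sketch using sentence-transformers and the arctic-embed-xs model that shows up later in this post. The exact Hugging Face id and the toy documents are my assumptions, not anything load-bearing:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed model id for Snowflake's arctic-embed-xs embedding model.
model = SentenceTransformer("Snowflake/snowflake-arctic-embed-xs")

docs = [
    "The BT Tower is a communications tower in Fitzrovia, London, owned by BT Group.",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
]

# Turn the documents and the query into vectors, then pick the 'closest' document.
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode("communications tower in London", normalize_embeddings=True)

scores = doc_vecs @ query_vec  # cosine similarity, since the vectors are normalized
print(docs[int(np.argmax(scores))])
```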
Problems With Embeddings & Chunking
When you’re making an embedding of a document using a ‘standard’ embedding model, one of the issues you’ll run into is that most of these models have a context length of 512 tokens or about 400 words. This has given rise to the cottage industry of ‘chunking’, where a document is split somehow into small chunks and each of them is vectorized. This way, your searches can dig deep into the documents and hopefully get really good results as opposed to just searching across the first 400 words.
However…
Let’s consider a dumb-but-common chunking method - splitting on sentences. Here’s a simple example:
> The BT Tower is a grade II listed communications tower in Fitzrovia, London, England, owned by BT Group. It has also been known as the GPO Tower, the Post Office Tower, and the Telecom Tower. The main structure is 581 feet (177 m) high, with aerial rigging bringing the total height to 620 feet (189 m).
This gives us three chunks:
- The BT Tower is a grade II listed communications tower in Fitzrovia, London, England, owned by BT Group.
- It has also been known as the GPO Tower, the Post Office Tower, and the Telecom Tower.
- The main structure is 581 feet (177 m) high, with aerial rigging bringing the total height to 620 feet (189 m).
If you were embedding these three chunks, and you were also embedding lots of other documents in the same way, it’s possible that your search is going to run into problems. If you have a query like `post office tower height`, you’re going to want that last chunk to score very highly. But that sentence, stripped from the rest of the paragraph, has no link to the concept of the tower whatsoever, and so neither does the embedding. Instead, what you’re likely to get back is all the chunks across your search index that mention height. Terrible!
The easiest fix to this, and one that would likely work well in this particular case, is to split on paragraphs instead of sentences, so the embedding would have the context of the tower and the height in the same vector. But imagine a longer document, and you can see that you are likely to start missing context clues in your chunks, which will have a big impact on your search results.
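To make the two splitting strategies concrete, here’s roughly what a naive sentence splitter versus a paragraph splitter looks like; just illustrative string-bashing, not a proper chunking library:

```python
import re

def sentence_chunks(text: str) -> list[str]:
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    # Abbreviations and other real-world text will trip this up, which is part
    # of why chunking has become a cottage industry.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def paragraph_chunks(text: str) -> list[str]:
    # Paragraph splitter: break on blank lines, keeping more context per chunk.
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

Running `sentence_chunks` over the BT Tower paragraph gives exactly the three chunks above; `paragraph_chunks` would keep it as a single chunk.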
But what to do…what to do?
The Idea
Admittedly, this is a pretty dumb idea, and it’s likely somebody has already done it before, but hear me out: what if the system could carry a CliffsNotes version of the document with it for every chunk? That way, the search engine can be on page 23 of 113, but still have a general idea of what the chunk is talking about by relating it back to the notes. That should help boost the appropriate relevance when searching.
Turning that into an actual plan is even dumber: we get an LLM like Llama, Gemini, etc. to generate a 400-word summary of the document, taking advantage of their wide context windows to summarize the entire document. We then embed that, and, here comes the magic, we average this vector with every single chunk vector in the document.
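In vector terms, the whole ‘plan’ boils down to a few lines of numpy. A minimal sketch, assuming you already have the summary text from whichever LLM and the chunks from whichever splitter; the `gist_vectors` helper and the model id are mine for illustration, not the actual script:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed Hugging Face id for the arctic-embed-xs model used in the results below.
model = SentenceTransformer("Snowflake/snowflake-arctic-embed-xs")

def gist_vectors(summary: str, chunks: list[str]) -> np.ndarray:
    # Embed the LLM-generated summary once per document...
    summary_vec = model.encode(summary, normalize_embeddings=True)
    # ...and every chunk as normal...
    chunk_vecs = model.encode(chunks, normalize_embeddings=True)
    # ...then average each chunk vector with the summary vector, so every
    # chunk carries a 'CliffsNotes' impression of the whole document.
    averaged = (chunk_vecs + summary_vec) / 2.0
    # Re-normalize so cosine similarity against query vectors still behaves.
    return averaged / np.linalg.norm(averaged, axis=1, keepdims=True)
```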
It’s so stupid. And yet…compels me though…
A Brief PoC
So yes, I’ve had the idea rattling around in my head for a while, but never really enough time to sit down and do an evaluation. But then Claude 3.7 came out and I thought I’d use this as a chance to test that out and get this idea out of my brain.
Firstly, an evaluation dataset. I could have used MSMARCO, but I distrust it, knowing that a lot of the quality judgements in it are just plain wrong. Plus, every embedding model is trained on the MSMARCO data, so it’s not a fair test anymore (in my, and lots of others’ opinion). Instead, I downloaded 1000 random pages from Wikipedia and got Llama-3.3-70bn to generate possible search query terms for each page.
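For the curious, generating those queries is roughly this shape. The endpoint, model id, and prompt below are illustrative stand-ins (assuming an OpenAI-compatible server fronting the Llama model), not the ones actually used:

```python
from openai import OpenAI

# Hypothetical OpenAI-compatible endpoint serving Llama-3.3-70B.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def generate_queries(page_text: str, n: int = 3) -> list[str]:
    # Ask the model for a handful of plausible search queries for this page.
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} short search queries, one per line, that this "
                f"Wikipedia page should be the best answer for:\n\n{page_text[:6000]}"
            ),
        }],
    )
    return [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]
```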
Now at this point, various IR people are yelling at me saying that that’s not fair either, as you can’t guarantee that the search terms I generate for one document are completely separate from any other document in the set…and that’s a good point, but this is not supposed to be a rigorous examination. It’s “does this even make sense to pursue?”
Anyhow, I now have documents, queries, and a mapping of what query goes with what document. Next up, a script that creates a summary of each document, again using Llama-3.3-70bn, breaks the documents up into paragraph-level chunks, and then produces two embeddings per chunk: one being just the chunk itself, and the other being the chunk embedding averaged with the summary embedding.
Having got all that sorted, we finally embed the queries, identify the top-scoring chunks (and importantly, their document id), and write that out as a series of run files, which we then score against the qrels (the answer set) using ranx, giving us a nice set of summary tables and plots. (This is the “draw the rest of the owl” part)
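The ranx end of that is pleasantly small; something like this, with the qrels and run contents obviously being tiny made-up stand-ins rather than my actual data:

```python
from ranx import Qrels, Run, evaluate

# The answer set: which document actually answers each query.
qrels = Qrels({
    "q_1": {"doc_bt_tower": 1},
    "q_2": {"doc_some_other_page": 1},
})

# What the embeddings retrieved for each query, with their similarity scores.
run = Run({
    "q_1": {"doc_bt_tower": 0.91, "doc_eiffel_tower": 0.47},
    "q_2": {"doc_some_other_page": 0.66, "doc_bt_tower": 0.12},
})

print(evaluate(qrels, run, ["ndcg@1", "ndcg@5", "ndcg@10", "mrr", "map", "recall@100"]))
```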
Results
In my best Peter Snow voice, this is just a bit of fun. I used Snowflake’s arctic-embed-xs model for embedding, as it’s very small, quick, and capable, and tested Gist in a paragraph-splitting scenario. Have some tables and graphs. What we’re tracking here is NDCG, which scores between 0 and 1, putting higher weight on the correct documents appearing towards the start of the ranking list.
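For anyone who wants that a little more formally, one common form of the definition (for a single query, with rel_i being the relevance of the document at rank i) is:

$$
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
$$

where IDCG@k is the DCG of the ideal ordering; dividing by it is what pins the score between 0 and 1. With binary judgements like ours (one correct document per query), the different gain variants you’ll see elsewhere all collapse to the same number.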
Paragraph Chunking & Averaged Summary Chunks
Here’s the data formatted as a markdown table:
| Embedding Type | ndcg@1 | ndcg@5 | ndcg@10 | mrr | map | recall@100 |
| --- | --- | --- | --- | --- | --- | --- |
| embedding_base | 0.298597 | 0.456463 | 0.496966 | 0.424646 | 0.424646 | 0.727856 |
| embedding_average | 0.536072 | 0.681063 | 0.698795 | 0.645529 | 0.645529 | 0.864128 |
A Wild Sava Approaches
Having seen the big jump in NDCG scores, I was cautiously excited, so I broke cover and told my co-worker, Sava, about the idea. He was initially suspicious about my eval scores, which made sense because I had accidentally sent him scores based on sentence chunking rather than paragraphs, where the NDCG difference is even more pronounced. It was much more reasonable when I corrected for that, but he did point out I was missing another baseline: what are the NDCG scores if you just use the summary and don’t have any chunks at all? A great idea from our Principal Research Scientist, and I collected the results:
Here’s the data formatted as markdown:
| Embedding Type | ndcg@1 | ndcg@5 | ndcg@10 | mrr | map | recall@100 |
| --- | --- | --- | --- | --- | --- | --- |
| embedding_doc_summary | 0.789178 | 0.853623 | 0.861771 | 0.841532 | 0.841532 | 0.984168 |
BOBBINS. I must confess I was somewhat crushed by this, and I still haven’t told him (until he reads this). I even went off and repeated the experiment with HuggingFace’s fineweb dataset…and…yes, as you can see, pretty much the same result.
| Model | ndcg@1 | ndcg@5 | ndcg@10 | mrr | map | recall@100 |
| --- | --- | --- | --- | --- | --- | --- |
| embedding_base | 0.2793 | 0.418632 | 0.456080 | 0.391381 | 0.391381 | 0.6630 |
| embedding_average | 0.5111 | 0.668388 | 0.691684 | 0.631249 | 0.631249 | 0.8803 |
| embedding_doc_summary | 0.7092 | 0.797029 | 0.809260 | 0.781517 | 0.781517 | 0.9753 |
Conclusion
All that work and I might as well have just used the summary. So is this a complete washout? Not entirely, and this is how even ‘failed’ research can still be useful. For one thing, instead of going to the trouble of doing all that chunking, it’s possible that for a variety of search applications we could just not bother and use a summary instead. That way, we get good results and we don’t have to store all those chunk vectors in the database.
Also, there are some use cases where the averaged chunks could still be useful. In the currently common retrieval augmented generation pattern, chunks from various documents are sent to a large language model in order to answer a question. Even though today’s frontier models are powerful, they can still find themselves being confused by lots of different documents being sent to them in one go, so you want to send relevant information and not drown it in distracting text. If you’re trying to limit yourself to 5 or so docs to send, this technique gets you a much bigger chance of having the correct answer sent to the model…
There’s more that could be tested. We probably should have a better set of generated queries that can account for potential ‘bleed’ between queries and other documents by judging the relevance of each query against other documents in the system (I have a version of this in progress as I type), as well as trying a bunch of other ideas. Should we try appending the summary in text form to the chunk text and then vectorizing that new chunk? Just add the title and maybe some important keywords? With those ideas, you do have to worry about total token counts again, but maybe if it’s kept concise and uses a longer-context embedding model, it would be easier to handle. Could we tailor the summary prompt to produce something more like an Anki card to see if that helps? Should we add a hyperparameter to multiply against the summary vector before it’s added and averaged with the chunk vector? Maybe we give up on the idea of the summary altogether and instead use a sliding window of previous and following chunks to try and keep context that way? So many different directions that could be run into the ground. There’s still so much left on the table in the world of IR and embeddings. So, ‘failure’, yes, but as failures go, not a bad one.
Mar 9, 2025 · 2 minute read
all the diseases
Finally coming to the end of the flu. Still have the hacking cough and getting through a box of tissues a day, but it’s better than the start of the week.
Having said all that, still not a lot to talk about. I have finally done my taxes, although with the current state of the IRS, I’m not exactly convinced my refund will be coming to me any time soon. It’ll be nice once it does arrive though, as it’ll move me a big step closer to Weird Financial Goal.
I am still trying to work out Maeryn’s birthday cake. The basic contours are set: chocolate cake plus a raspberry component. But do I make it in the Berlin mold, which would be fancy, but I feel like it doesn’t have enough cake and makes a ganache layer much harder? Or I could use Kyoto and have a ganache filling and raspberries on top. It feels like a more substantial cake size, but not overwhelming for toddler hands (I am fully aware that I’m overthinking this, and it could just be a box cake in a sheet pan and she’ll shove it into her face just the same, but let me have this!). Don’t worry, I still have two weeks left to figure it out…
First tech blog of the year has most of the first draft done. Unfortunately, it requires a vast amount of changes in the second draft, turning it from “look at this cool new technique” into more of “look at this idea that didn’t actually work out”, which is a little less fun to write up. I need to do a few more experiments to convince myself one way or another, but I do feel it’s important to write up the things that fail as well as the things that succeed. So stay tuned for failure!
Feb 23, 2025 · 2 minute read
Apple Juice FTW
Adventures in the UK
How can you pack even more into a four-day trip to the UK? Have you considered spending Saturday afternoon and evening in A&E? A sick baby needed medical attention. And while I think we could probably write a few volumes on how 111 could be better set up for providing updates, and perhaps the US concept of Urgent Care clinics might be a way to alleviate some of the pressures on A&E departments, we walked into the Horton, were seen quickly, and thankfully, Maeryn perked up once her blood sugar level was helped by apple juice. And then we just walked out, and having now lived in the US for almost 14 years…I’ll confess even I found it strange not to have handed over a credit card at any point. Maeryn is back to her usual “you will read all the books now, daddy” self, which is wonderful.
Anyway, the concert was great — there may be a separate post coming to unpack all that in the next week or so. It was my first time at Troxy, too; a lovely 1930s cinema that retains a lot of its former glory…unlike what they did to the Oxford Road Odeon in Manchester (still bitter about the loss of that fantastic Screen One). Now, I’m thinking of other venues in London I have known and loved…The Luminaire being the saddest loss, I think, even above the delights of The Astoria (come on, just look at the mirrorball in a venue for 200 people and tell me you wouldn’t fall in love with that place). A packed trip of highs and lows…but next time, hopefully we’ll stay long enough to get over the jetlag before we get more…
And now back in the US as things continue to fall apart. But I have 4 kilograms of Mini Eggs, so things are not quite at their worst just yet. The snow is melting, we’re all house safe, and I’m putting together thoughts for Maeryn’s first homemade birthday cake.
No kings, no tyrants.