Why We Won't Achieve AGI Until Memory Is A Core Architectural Component
The Hard Part of Thinking Is Knowing What To Think About
Last time, I described a serious challenge faced by large language models like ChatGPT: they are trained on finished texts, without any visibility into the incremental and iterative process of outlining, fleshing out, and revising that is necessary to create a sophisticated piece of writing, design, code, strategic plan, or other complex artifact[1]. This is something we’ll need to address in order to achieve human-level general intelligence.
Today, I’m going to explore another set of challenges, under the general heading of “memory”, that also stand in the way of AIs undertaking deep work. I’ll argue that just as the current wave of progress in LLMs was spurred by deeply integrating a rich short-term memory (in the form of “attention”), the next step will require deeply integrating an equally rich long-term memory… and that current approaches, such as vector databases, won’t suffice because they are not integrated into the core of the neural network architecture and training process.
Memory Is Central To Getting Things Done In The Real World
We’ve all had the experience of being stymied by a failure of memory, such as being late to an appointment because we couldn’t find our keys. Even so, I think it’s easy to underestimate the degree to which memory is central to almost everything we do. Most of the time, we’re able to take our memory for granted the way a fish takes water for granted.
For example, suppose your cable box dies, leaving you without TV or Internet access. By some miracle, the cable company can send a technician the following afternoon, between 2:00 and 4:00 PM. You need to prepare to survive without connectivity until then; you also need to arrange to be home to meet the technician. How do you go about planning for that? Your mental process might look something like this (numbers are references to the notes below):
Hmm, someone will need to be here to meet the technician. I have that planning session that I’d rather not miss (1). Can Alice (your wife) be here? Let me check her calendar (2) … nope, she has a client meeting. I guess Bob can cover for me at the planning session (3); I’ll make a note to let him know (4).
Wait, the last time we had a service technician come to the house in the middle of the afternoon, they had a hard time parking (5). I’ll ask Alice to take the car to work so our spot will be free (6).
Ugh, the new episode of <current popular show> drops tonight (7) and we wanted to watch it. People will definitely be discussing spoilers at work tomorrow. I guess I can solve this by streaming from my laptop, tethering the laptop to my phone for Internet access, and then connecting to the TV via HDMI (8).
Here is a partial list of ways in which memory is used during this brief scenario:
1. There’s a planning session tomorrow afternoon: this information comes from your prospective memory – remembering actions you will need to perform in the future.
2. Checking your wife’s calendar: here, you rely on external memory – information sources located outside of your head.
3. Bob can cover for me at the planning session: there are at least two steps here. First, you need to know who will be at the session; second, you need to know that of the attendees, Bob is a good choice to cover for you. These are both instances of semantic memory – the component of long-term memory that stores facts about the world.
4. Making a note to talk to Bob: assuming you write this down in a to-do list, you are updating a form of external memory.
5. The last time a service technician came to the house: this is an example of episodic memory, which records specific experiences (“episodes”) in your life. Also note that this memory was retrieved through a fairly subtle chain of associations: a repair person will be coming, they will presumably drive, therefore they will need to park, parking under these circumstances has been a problem in the past. This is only tenuously connected to your direct goal of “fix the cable box”.
6. Asking Alice to take the car to work: to arrive at this solution requires several implicit references to semantic and/or episodic memory – knowing that you have a reserved parking space, that if you are home to meet the technician then it will be occupied by your car, that Alice normally takes public transportation to work but could take the car instead, and so forth.
7. New episode tonight: prospective memory.
8. Using the laptop to watch TV without working cable: an example of procedural memory, the form of long-term memory which records learned skills and procedures.
Memory – multiple forms of both long-term and external memory – is central to every step of this thought process. Imagine opening up ChatGPT and typing “my cable box died and the technician is coming tomorrow afternoon, what should I do?”; it would be preposterous to expect it to be of any help, because it doesn’t have any access to any of the relevant information.
The Role of Memory In Performing Complex Tasks
Tasks that come up in the workplace often rely heavily on memory. Think about the difference between how you might hand a task to a newly hired co-worker, vs. someone you’ve been working with for ten years. Even if the new hire has an equal amount of career experience, you’d need to give them far more context, and allow a lot of extra time for them to do research and ask questions. That difference is entirely due to the information the ten-year veteran has in their memory.
In my last post, I talked about how complex tasks are carried out in a tangled series of steps, jumping from subtask to subtask as the shape of the overall plan evolves. At each step, the relevant facts must be brought into play. These might include:
The goals and detailed requirements of the project.
The (constantly evolving) plan for how the project will be completed. For instance, an outline or design sketch, along with any number of to-dos, open questions, unresolved difficulties, and other mental notes.
Information about the current subtask.
Information about the overall work done so far.
Information about the broader context. When writing code, this would include a vast amount of information about other subsystems in the overall codebase. More generally, information about company goals, procedures, projects, resources, coworkers, and myriad details of “how we do things here” – everything that distinguishes an experienced staffer from a new hire.
Often, your memory will spontaneously volunteer facts that you wouldn’t have known to go looking for. “The last time we had a service technician come to the house in the middle of the afternoon, they had a hard time parking” and “there’s a new TV episode that we really want to watch tonight” have that flavor: your long-term memory will associatively retrieve those facts without you having to “ask” for them. This is important, because it’s unlikely that you would have known to go digging for those particular pieces of information.
When I am writing code, I am constantly shuffling information between my short- and long-term memory, managing external memories such as a to-do list and a detailed project plan, and looking things up in the existing codebase and associated documentation. To think about how well an LLM might be able to undertake similar work, let’s review the memory facilities they have available. Briefly, current LLM-based systems can store information in the model weights; the token buffer; a vector database; or “plugin” tools. I’ll discuss each in turn.
Model Weights
LLMs “learn” a large amount of information during their training process. This information is somehow embedded in the model’s weights, i.e. the strengths of the connections between its simulated neurons. If we train an AI to do software engineering, then general knowledge of algorithms, data structures, and programming languages will reside in its weights.
Once the model has been trained, the weights do not change. Thus, they are not a potential home for any of the project-specific knowledge we’ve been discussing. They can only be used for general knowledge that can be baked in in advance of the model being used.
Token Buffer
“Token buffer” is the technical term for the chat transcript of words which the model is processing. It holds the initial instructions that we provide to the model, as well as all of the model’s output.
LLMs seem to be pretty good at retrieving information from the token buffer as needed. However, it is not clear how far this can scale. Until recently, the token buffer was typically limited to a few thousand words. Newer systems are claimed to support far larger buffers, but they may not be able to process that information very effectively. I’m not aware of any impressive examples of models managing large amounts of information in a token buffer. One recent paper, Lost in the Middle: How Language Models Use Long Contexts, explicitly casts doubt:
We analyze language model performance on two tasks that require identifying relevant information within their input contexts: multi-document question answering and key-value retrieval. We find that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts. Furthermore, performance substantially decreases as the input context grows longer, even for explicitly long-context models. [emphasis added]
Another limitation of the token buffer is that there’s no way of managing the information it contains. Each new word generated by the model, regardless of importance, is added to the buffer. There is no mechanism for editing or removing information[2]. The token buffer can hold considerably more than a human’s short-term memory, but it’s far less flexible.
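To make that rigidity concrete, here is a minimal sketch of a token buffer as a fixed-size queue. The `TokenBuffer` class, its capacity, and the sample tokens are all hypothetical, chosen for illustration; real systems are far more elaborate. The only "forgetting" it models is the oldest tokens dropping off when the buffer fills.

```python
from collections import deque


class TokenBuffer:
    """Toy fixed-capacity token buffer (hypothetical, for illustration only)."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen discards its oldest item when a new one
        # is appended to a full buffer.
        self._tokens = deque(maxlen=capacity)

    def append(self, token: str) -> None:
        # Every generated token goes in, whether or not it matters.
        self._tokens.append(token)

    def contents(self) -> list[str]:
        return list(self._tokens)


buffer = TokenBuffer(capacity=5)
for token in ["IMPORTANT:", "deadline", "is", "Friday", "um", "well", "anyway"]:
    buffer.append(token)

# The filler words survive; the start of the important note does not.
print(buffer.contents())
```

Note what is missing from the interface: there is no way to pin, edit, or delete an entry, so the model cannot choose what to keep.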
Vector Databases
A vector database is a database that allows retrieving information according to conceptual similarity. (I’m going to gloss over the details of how this works, and why it’s called a “vector” database[3].)
For instance, suppose an e-commerce vendor wants to allow customers to chat with an LLM to find products. They might create a database entry for each product description. Then, if you ask the chatbot about “Nerf guns with high fire rate”[4], the system would perform the following steps:
Search the database for product descriptions which are conceptually similar to “Nerf guns with high fire rate”.
Pick the best few matches – say, 5 matches.
Invoke the LLM with a prompt looking something like this:
An e-commerce customer has asked for “Nerf guns with high fire rate”. Below are five product descriptions which may relate to that request. If any of these products seem relevant, generate a recommendation; otherwise, ask for more information.
Product description 1: …
Product description 2: …
In other words, to make the relevant-seeming database entries visible to the LLM, we textually insert them into the token buffer before the LLM begins processing.
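The flow above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any particular product works: `embed` is a crude bag-of-words stand-in for a learned embedding model, the product descriptions are invented, and the function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a learned embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query: str, entries: list[str], k: int) -> list[str]:
    """Return the k entries most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(
        entries,
        key=lambda entry: cosine_similarity(query_vec, embed(entry)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, matches: list[str]) -> str:
    """Paste the retrieved entries into the text handed to the LLM."""
    lines = [
        f'An e-commerce customer has asked for "{query}". '
        f"Below are {len(matches)} product descriptions which may relate to that request. "
        "If any of these products seem relevant, generate a recommendation; "
        "otherwise, ask for more information.",
    ]
    for number, description in enumerate(matches, start=1):
        lines.append(f"Product description {number}: {description}")
    return "\n".join(lines)


# Invented catalog entries, for illustration only.
products = [
    "Nerf blaster with motorized high fire rate",
    "Foam darts refill pack for Nerf guns",
    "Garden hose with adjustable spray nozzle",
    "Water guns with high capacity tank",
]
query = "Nerf guns with high fire rate"
matches = top_matches(query, products, k=2)
print(build_prompt(query, matches))
```

Even in this toy, the second-best match is a water gun: similarity over surface features pulls in entries that merely look related, and the LLM only ever sees whatever text ends up pasted into the prompt.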
People are trying out many different uses for vector databases. To create a chatbot that “knows about” some complex technical product, you can put the product documentation into a vector database, one paragraph at a time. To create a chatbot with a long-term memory, you can put the chat history into a vector database.
I don’t know how well this works in practice, but I am pretty skeptical that it will do much to address the ways in which memory is needed when performing complex tasks. Some things to consider:
The database structure is extremely simple. There is no way to group entries together, link them to one another, etc. Each entry is retrieved independently of all others.
Retrieval is all-or-nothing. If an entry appears to be relevant, its complete text is dumped (expensively) into the token buffer.
The neural net has no ability to manage the information in the database – it has no means for editing, restructuring, or deleting entries.
I doubt that vector similarity comparisons (the mechanism vector databases use to evaluate conceptual similarity) will be deft and flexible enough to support the sort of associative retrieval miracles that our brains manage on a routine basis. In the broken-cable-box example, I don’t see how the model could manage to select “the last time we had a service technician come to the house in the middle of the afternoon, they had a hard time parking” over “the last time we had a service technician come to the house, their name was Chip” or any number of similarly irrelevant facts.
I don’t think the basic mechanisms of human long-term memory suffer from any of these limitations.
Take all this with a grain of salt; again, I don’t have any practical experience with vector databases. But as an example, consider the “chatbot with a long-term memory” scenario above, where each chat response is added to the database. Suppose that we want the chatbot to be able to keep track of a to-do list. The append-only nature of the chat history makes this infeasible. Once “note to self: do X” has been added to the history, it can never be removed. The model could add another entry saying “note to self: X is now done” – but both notes would persist forever, cluttering the database and making it impossible to retrieve the list of outstanding to-do items within any finite token window.
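A minimal sketch of the to-do problem, assuming a plain Python list as the append-only store and simple keyword matching as a stand-in for similarity search (both are simplifications; the notes themselves are invented):

```python
# Append-only store: entries can be added but never edited or removed.
history: list[str] = []


def remember(note: str) -> None:
    history.append(note)


def retrieve(keyword: str) -> list[str]:
    """Stand-in for similarity search: return every entry mentioning the keyword."""
    return [note for note in history if keyword in note]


remember("note to self: book venue")
remember("note to self: email Bob")
remember("note to self: book venue is now done")

# The stale note and its "done" marker both come back; the caller has to
# reconcile them on every single retrieval.
print(retrieve("book venue"))
```

With three entries the reconciliation is trivial. After months of use, the list of outstanding items is buried under every note that was ever written and every note that later cancelled one.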
You could imagine writing some special code to help the model manage to-do lists. However, that would be a narrow solution to a very specific problem, which is exactly what modern deep learning architectures are supposed to get us away from. Such an approach would seem likely to result in systems that are just as brittle and failure-prone as older generations of software.
Plugins
Some LLMs are trained to be able to invoke other pieces of software, called “plugins”. This can include tools such as a calculator, but it can also include conventional databases, Internet search, or anything you want to give the model access to.
For instance, suppose you were to ask an LLM for yesterday’s closing price of Apple stock. Its training data would not have any recent information, but if the model were appropriately trained and equipped, it could issue a web search to look up that information.
If we give a model a set of plugins for adding, updating, and retrieving information from a conventional database, then in principle it could use this for memory-like tasks. For instance, it could use a database table to manage the aforementioned to-do list. However, the model would have to do this explicitly – it would need to know what database commands to issue at any given moment, meaning that it would have to somehow be trained to do this, in a fashion that generalizes to every use case we have in mind for the model. This seems like it would require some sort of major advance in training techniques; it leads back to the challenges I discussed in my last post, where models will somehow need to learn to generate a series of useful actions gradually leading up to a result, rather than attempting to just reel off the final result in a single pass.
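As a rough sketch of what such a plugin set might look like, here is a to-do table backed by an in-memory SQLite database. The tool names, the `call_tool` dispatcher, and the table schema are all hypothetical; the sequence of calls at the bottom is written by hand, standing in for decisions the model would have to make by itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE todos (id INTEGER PRIMARY KEY, text TEXT, done INTEGER DEFAULT 0)"
)


def add_todo(text: str) -> int:
    """Insert a new open item and return its id."""
    cursor = conn.execute("INSERT INTO todos (text) VALUES (?)", (text,))
    conn.commit()
    return cursor.lastrowid


def complete_todo(todo_id: int) -> None:
    """Mark an existing item as done."""
    conn.execute("UPDATE todos SET done = 1 WHERE id = ?", (todo_id,))
    conn.commit()


def list_open_todos() -> list[str]:
    """Return the text of every item not yet done, oldest first."""
    rows = conn.execute("SELECT text FROM todos WHERE done = 0 ORDER BY id")
    return [row[0] for row in rows]


# Hypothetical registry of tools exposed to the model.
TOOLS = {
    "add_todo": add_todo,
    "complete_todo": complete_todo,
    "list_open_todos": list_open_todos,
}


def call_tool(name: str, **arguments):
    """Dispatch a tool call of the kind a model might emit."""
    return TOOLS[name](**arguments)


# Hand-written calls; a model would need to know to issue each of these,
# with the right arguments, at the right moment.
first_id = call_tool("add_todo", text="tell Bob to cover the planning session")
call_tool("add_todo", text="ask Alice to take the car")
call_tool("complete_todo", todo_id=first_id)
print(call_tool("list_open_todos"))
```

The database side is the easy part. The hard part is everything this sketch hard-codes: deciding that a note is worth recording, remembering the id later, and knowing when to consult the list.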
Summarizing Limitations of LLM Memory
I don’t think that current mechanisms for equipping LLMs with memory are up to the task of supporting complex thought. In brief:
Model weights are static; they cannot fluidly incorporate new information.
Token buffers appear to be limited in size[5], and there is no mechanism for modifying or removing data from the buffer.
Vector databases are limited to a simple structure, and there is no native ability for the model to modify or remove data.
Plugins are only useful to the extent that the model has been trained to use them; I think it would require a significant breakthrough to integrate plugins into an LLM as deeply as long-term memory is integrated into human thought.
It should also be noted that when information is retrieved from either a vector database or a plugin, it must be inserted into the LLM’s token buffer, and incur the same processing cost as any other text that the model generates. If your memory worked like that, facts and ideas wouldn’t enter directly into your thought stream as a gestalt; instead, it would be like an assistant reading them out to you, one word at a time, every single time you need to remember anything at all. Trying to get any serious work done that way would be maddeningly inefficient.
Long-Term Memory Is A Deep Challenge
From a high level, cogitation can be loosely modeled as follows:
Select the appropriate information to be acted on at the current moment.
Process that information, resulting in some new information (a “thought”) and/or an action to be carried out.
Of course this is not the only way to model cogitation, and it glosses over many aspects, but it highlights something important: your thought processes are only as good as the information you’re retrieving to think about. The fact that an idea can be “obvious in hindsight” illustrates this: often, the hard part in solving a problem is pulling together the right facts, and once you have done so, the solution is indeed obvious[6].
The key design insight leading to the current generation of LLMs was the “attention mechanism” that retrieves the token buffer entries that are most relevant to predicting the next word. This was a breakthrough in information selection, analogous to short-term memory, and is deeply baked into the model’s system for processing information. To achieve human-level intelligence, we will need an equivalent breakthrough – possibly multiple breakthroughs – for long-term memory. Like “attention”, and unlike current vector databases, long-term memory will need to be tied directly into the model’s information processing architecture and training process[7]. For a number of reasons, this may turn out to be even harder than it was for short-term memory:
Long-term memory is far larger. The retrieval process will need to consider a much larger number of entries, without a correspondingly greater expense. And it will need to be much more selective, without filtering out important information. Instead of selecting the correct handful of facts out of a collection of a few thousand, we may need to select a few out of a few million, meaning that we need to be 1000x better at selecting the right entries.
For short-term memory, current LLMs have gotten away without needing a mechanism for updating or deleting information; they just let everything pile up in the token buffer. This certainly will not work for long-term memory.
If I spend two years writing a complex novel, the way in which I manage my long-term memory at the beginning may affect my ability to tie things together at the end. We’ll need to find effective ways of training a network when choices sometimes take a long time to play out.
The human brain makes use of multiple forms of long-term memory, including external memories; effective AGI may need to do the same. Effective use of external memories requires a whole additional suite of skills which I haven’t touched on, such as remembering where to look for a particular piece of information, or leveraging your visual system to quickly scan a page or jump to the important portion of a diagram.
The human brain contains about 100 trillion synapses (connections between neurons), and these connections have a complex structure – two adjacent neurons can have connections going in very different directions. This might be central to our ability to encode complex information retrieval patterns[8]. It doesn’t strike me as a foregone conclusion that we will be able to efficiently replicate this using current non-neural-net-based approaches, such as retrieving entries from a vector database using a simple metric like cosine similarity.
GPT-4 reportedly has about 1.8 trillion parameters, the closest analog to synapses – roughly 55x fewer than the brain. These connections have a much more regular – i.e. rigid – structure, and cannot encode new memories on the fly. It’s a design optimized for static, baked-in knowledge, and may not be of much help as we look for ways to accommodate fluid, evolving memory.
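For readers who want a picture of the attention-style selection discussed above, here is a minimal single-query sketch in plain Python. It is a simplification under stated assumptions: real models learn the query, key, and value vectors and run many such lookups in parallel, while the tiny hand-picked vectors here are purely illustrative.

```python
import math


def attention(query, keys, values):
    """Scaled dot-product attention for one query over a list of entries."""
    scale = math.sqrt(len(query))
    # One relevance score per stored entry: the query is compared to every key.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Softmax turns scores into weights that sum to 1 (shifted for stability).
    max_score = max(scores)
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The result is a weighted blend of the stored values.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attention([3.0, 0.0], keys, values)
print(weights, output)
```

The loop that computes `scores` touches every entry, which is workable for a few thousand tokens. Applied naively to millions of long-term memories it becomes the "correspondingly greater expense" described in the list above, which is part of why a different mechanism may be needed.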
For AIs to undertake deep work, they’ll need fluid long-term memory. We’re one or more breakthroughs away from achieving that. Next time, I’ll explore the question of how long it might take to get there – how far off is human-level general intelligence?
[1] At least, an iterative process is necessary for humans to create sophisticated results. But I am inclined to believe that this is fundamental, and not some limitation specific to human cognition. Plotting a novel or writing a tricky piece of code is a complex optimization problem, and solutions to such problems do not emerge in a single step.
[2] Well, when the buffer fills up, the oldest tokens will fall off the back. However, there is no controlling this process; tokens are dropped in the order they were generated. Hence, it’s still not a useful way to selectively retain important information.
[3] OK fine: many machine learning / AI systems represent concepts using a sequence of numbers. You train a neural network to translate some input, such as a chunk of text, into a sequence of numbers that somehow embody a fuzzy conceptual representation of the main idea in that input. In mathematics, a sequence of numbers is called a vector, so what you’ve done is built a system that, when given a chunk of text, returns a vector that represents the concept contained in that text. That vector is known as an “embedding” of the concept.
A vector database is a database where each entry is labeled with a vector. Internally, the database is structured such that you can hand it a vector, and ask to retrieve all of the entries whose vectors are similar. Vector embeddings have the nice – if bewildering – property that similar concepts yield similar numbers. As a result, you can use a vector database to retrieve records that are conceptually similar to some piece of input that you’re working with.
[4] Asking for a friend.
[5] As noted earlier: while there has been a flurry of recent papers and models supporting large token windows, I am dubious that this works well (at least, well enough to address the sorts of memory tasks I’m focusing on here), and I haven’t seen any evidence to the contrary.
[6] For example, I recently listened to a podcast where a researcher at Microsoft described the following scenario. He presented an LLM with a set of sales data for an e-commerce site, in which sales were basically flat in each month from October through February, except for a bump in December. He explained that the store ran an ad in early December, and asked whether the ad was likely responsible for the increase in sales. The model concluded that this was in fact the likely explanation; it completely failed to take into account the fact that December is a holiday month, and traditionally sees a broad increase in retail spending. The difficulty here is not understanding that a historical sales bump every December could explain the bump this December; the hard part is figuring out that, among all known facts, the historical December bump is an important fact to retrieve here.
[7] A recent paper, Focused Transformer: Contrastive Training for Context Scaling, describes work in this direction – integrating data retrieval directly into an LLM. However, it does not go very far toward addressing most of the requirements I discuss here.
[8] Here’s something interesting I heard in a recent episode of the Dwarkesh Podcast:
One of the theories about how well we remember stuff in what circumstances is actually called predictive utility theory. And it suggests that the probability of retrieval of a particular item in a given situation actually does correspond with basically a model of to what extent the brain predicts it will be useful.
I wouldn’t be surprised if AI long-term memory also winds up needing a way to train a model to predict which information will be useful. That is a reasonable way of describing “attention” (LLM short-term memory).