Tiered Memory Architecture: RAM, SSD, and Tape for AI
Computers don’t keep everything in RAM. That would be impossibly expensive.
Instead, they use tiered storage:
- RAM: Fast, expensive, limited capacity (active data)
- SSD: Medium speed, medium cost (frequently accessed data)
- HDD/Tape: Slow, cheap, massive capacity (archives)
Data moves between tiers based on access patterns. Hot data stays in RAM. Cold data gets archived to disk.
AI memory should work the same way.
If you load every conversation, every decision, and every fact into your AI’s context window, you’ll run out of tokens fast. And even if you don’t, the model will drown in noise.
The solution: tiered memory architecture.
In this guide, I’ll explain:
- Why flat memory doesn’t scale (lived experience)
- How tiered memory works (working/warm/cold)
- When to move data between tiers
- How to implement this yourself
Let’s start with the problem.
The Flat Memory Problem
Most AI agents use flat memory: one giant MEMORY.md file with everything in it.
This works great… for about two weeks.
Then:
- MEMORY.md grows to 20,000 tokens
- You hit context window limits
- The model starts ignoring facts buried deep in the file
- Response times slow down
- Costs skyrocket
Real example from my OpenClaw agent:
- Week 1: 3,000 tokens → Fast, sharp, perfect
- Month 1: 15,000 tokens → Slower, occasional misses
- Month 2: 40,000 tokens → Hallucinations, context overflow
- Month 3: 65,000 tokens → Broke entirely (couldn’t load context)
I had to delete half the file. But which half? I didn’t have a systematic way to decide.
This is where tiered memory comes in.
Tiered Memory Architecture (The Three Layers)
Here’s the model I use:
| Tier | Analogy | Loaded | Cost | Contents | Access Pattern |
|---|---|---|---|---|---|
| Working Memory | RAM | Always (every prompt) | High (token cost) | Core identity, active projects, critical rules | Read: every call |
| Warm Memory | SSD | On-demand (retrieved when relevant) | Low (vector lookup) | Recent conversations, frequently accessed facts | Read: when queried |
| Cold Storage | Tape | Explicit retrieval only | Minimal (storage cost) | Full logs, completed projects, old decisions | Read: rarely |
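To make the table concrete, here's a minimal sketch of how a memory item and its tier might be represented in code. The names (MemoryTier, MemoryItem) and fields are illustrative, not from any particular library:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum

class MemoryTier(Enum):
    WORKING = "working"  # always in the prompt
    WARM = "warm"        # vector DB, retrieved on demand
    COLD = "cold"        # flat files / S3, explicit retrieval only

@dataclass
class MemoryItem:
    text: str
    tier: MemoryTier = MemoryTier.WORKING
    relevance: float = 1.0  # decays over time (see Relevance Scoring below)
    access_count: int = 0
    last_accessed: datetime = field(default_factory=datetime.now)
```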
Working Memory (Always Loaded)
Target size: 3,000-5,000 tokens
Contents:
- Core identity (AGENTS.md: who you are, personality, rules)
- Active project context (what you’re working on right now)
- Critical preferences (never send emails without asking, etc.)
- Recent conversation summary (last 24 hours)
Why always loaded? This is your agent’s identity and immediate context. Without it, the agent doesn’t know who it is or what it’s doing.
How to maintain:
- Nightly consolidation (Dream Routine)
- Ruthless pruning (if it’s not critical, demote it)
- Token budget enforcement (never exceed 5K; see the sketch below)
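To illustrate the token-budget rule, here's a minimal sketch, assuming each working-memory entry is a dict with a text and a relevance score. The ~4-characters-per-token estimate is a rough heuristic; in practice you'd count with a real tokenizer such as tiktoken.

```python
def enforce_token_budget(entries, budget=5000):
    """Keep the highest-relevance entries within the token budget.

    Returns (kept, demoted): demoted entries should move to warm
    memory, not be deleted.
    """
    ranked = sorted(entries, key=lambda e: e["relevance"], reverse=True)
    kept, demoted, used = [], [], 0
    for entry in ranked:
        tokens = len(entry["text"]) // 4  # rough estimate: ~4 chars per token
        if used + tokens <= budget:
            kept.append(entry)
            used += tokens
        else:
            demoted.append(entry)
    return kept, demoted
```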
Warm Memory (Retrieved On-Demand)
Target size: Unlimited (stored in vector DB)
Contents:
- Conversations from the past 30 days
- Frequently accessed facts (insurance policies, project details, preferences)
- Decisions and lessons learned
- Tool configurations and snippets
Why not always loaded? It’s too big (50,000+ tokens). Loading it all would exceed context windows and cost a fortune.
How to access:
- Semantic search (vector database like Qdrant, Pinecone, Weaviate)
- Query when needed: “Find all mentions of insurance policies”
- Top 5-10 results get loaded into the current context
How to maintain:
- Auto-archive memories older than 7 days (move from working → warm)
- Index embeddings for fast retrieval
- Decay score for infrequently accessed memories
Cold Storage (Archives)
Target size: Unlimited (flat files or S3)
Contents:
- Full conversation logs (raw transcripts)
- Completed project archives
- Old decisions no longer relevant
- Historical data for auditing
Why not searchable? Cold storage is for compliance and debugging, not daily use. It isn’t embedded or indexed like warm memory; you only access it when explicitly needed.
How to access:
- Manual retrieval (grep, file search)
- Explicit queries (“Show me logs from January 15”)
How to maintain:
- Archive daily logs after 30 days
- Compress old files (gzip, tar); a Python sketch follows below
- Backup to cheap storage (S3, external drive)
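Here's what the compress-and-archive step might look like in Python, using only the standard library. The directory names match the file layout from Option 1 below; treat them as placeholders for your own paths.

```python
import gzip
import shutil
from datetime import datetime, timedelta
from pathlib import Path

def archive_to_cold(warm_dir="memory/archive",
                    cold_dir="memory/archive-old", days=30):
    """Gzip warm-memory files older than `days` and move them to cold storage."""
    cutoff = datetime.now() - timedelta(days=days)
    Path(cold_dir).mkdir(parents=True, exist_ok=True)
    for path in Path(warm_dir).glob("*.md"):
        if datetime.fromtimestamp(path.stat().st_mtime) < cutoff:
            target = Path(cold_dir) / (path.name + ".gz")
            with open(path, "rb") as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            path.unlink()  # remove the uncompressed original
```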
Data Flow: Moving Between Tiers
Data moves between tiers based on access patterns and age.
Promotion (Cold → Warm → Working)
Scenario 1: Semantic Search Retrieves Old Memory
You ask: “What did I decide about using Postgres vs SQLite?”
- Query searches warm memory (vector DB)
- Finds relevant decision from 3 weeks ago
- That memory gets loaded into current context (temporarily promoted to working memory)
- Relevance score increases (accessed recently → stays warm longer)
Scenario 2: User Explicitly References Old Data
You say: “Remember that insurance policy I mentioned in January?”
- System searches cold storage (archived logs)
- Finds the reference in memory/2026-01-15.md
- Extracts relevant facts
- Adds to warm memory (now searchable)
- Loads into working memory for current session
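A hedged sketch of that cold → warm → working path: grep_cold_storage is a naive stand-in for whatever cold-storage search you use, and add_to_warm_memory is the vector-indexing function defined in Option 2 below.

```python
import gzip
from pathlib import Path

def grep_cold_storage(term, cold_dir="memory/archive-old"):
    """Naive scan of gzipped archives for a term (grep-style, no index)."""
    hits = []
    for path in Path(cold_dir).glob("*.gz"):
        with gzip.open(path, "rt", errors="ignore") as f:
            text = f.read()
        if term.lower() in text.lower():
            hits.append((path.stem, text))
    return hits

def promote_from_cold(term, working_context):
    """Cold -> warm -> working: find it, re-index it, load it this session."""
    for idx, (name, text) in enumerate(grep_cold_storage(term)):
        add_to_warm_memory(text, {"id": idx, "source": name})  # now searchable
        working_context.append(text)  # loaded into the current prompt
```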
Demotion (Working → Warm → Cold)
Scenario 1: Completed Project
You finish building a Hugo site for investedges.com.
- Project context is in working memory (active)
- After 7 days of inactivity, it moves to warm memory (searchable but not auto-loaded)
- After 30 days, it moves to cold storage (archived)
Scenario 2: Low-Relevance Facts Decay
A memory: “User was debugging CSS grid on Feb 10”
- Week 1: Working memory (recent, might be relevant)
- Week 2: Warm memory (not referenced again → demoted)
- Month 1: Cold storage (old, no longer relevant)
How to Implement Tiered Memory (Practical Guide)
Option 1: File-Based (Simple)
Working Memory: MEMORY.md (3,000 tokens, curated)
Warm Memory: memory/archive/ (daily logs, searchable with rg)
Cold Storage: memory/archive-old/ (gzipped files, not searched)
Nightly script:
```bash
#!/bin/bash
# Move working memory to warm after 7 days.
# -maxdepth 1 stops find from re-matching files already inside archive/
find memory/ -maxdepth 1 -name "*.md" -mtime +7 -exec mv {} memory/archive/ \;

# Compress warm memory and move it to cold after 30 days
find memory/archive/ -maxdepth 1 -name "*.md" -mtime +30 | while read -r f; do
  gzip "$f"
  mv "$f.gz" memory/archive-old/
done
```
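Run it from cron (or a systemd timer) so it fires during the same nightly window as your consolidation routine; the tiers then rotate automatically without manual pruning.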
Pros: Simple, no dependencies
Cons: No semantic search, manual pruning required
Option 2: Vector Database (Recommended)
Working Memory: MEMORY.md (loaded into every prompt)
Warm Memory: Qdrant / Pinecone (vector embeddings)
Cold Storage: S3 / local archive (full logs)
Setup:
```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
import openai  # pre-1.0 SDK; Embedding.create below is the legacy API

client = QdrantClient("localhost", port=6333)

# One-time setup: ada-002 embeddings are 1536-dimensional
if not client.collection_exists("warm_memory"):
    client.create_collection(
        collection_name="warm_memory",
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    )

def embed(text):
    """Embed text with OpenAI's ada-002 model."""
    return openai.Embedding.create(
        model="text-embedding-ada-002",
        input=text,
    )["data"][0]["embedding"]

def add_to_warm_memory(text, metadata):
    """Index a memory so it can be retrieved by meaning later."""
    client.upsert(
        collection_name="warm_memory",
        points=[PointStruct(
            id=metadata["id"],
            vector=embed(text),
            payload={"text": text, **metadata},
        )],
    )

def search_warm_memory(query, limit=5):
    """Return the top-N memories most semantically similar to the query."""
    results = client.search(
        collection_name="warm_memory",
        query_vector=embed(query),
        limit=limit,
    )
    return [r.payload["text"] for r in results]
```
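Usage is a one-liner on each side: index a memory once, then retrieve it by meaning rather than exact wording. The text and id below are made-up examples.

```python
add_to_warm_memory(
    "Decided on Postgres over SQLite: we need concurrent writers.",
    {"id": 1, "topic": "database-choice"},
)
for hit in search_warm_memory("what database did we pick?"):
    print(hit)  # matches on meaning, not exact keywords
```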
Pros: Semantic search, scalable
Cons: Requires vector DB setup
Option 3: MyDeepBrain (Coming Soon)
We’re building tiered memory natively:
- Automatic tier management
- Nightly consolidation
- Semantic search built-in
- No manual scripting
Join the waitlist to get early access.
Relevance Scoring (When to Move Data)
Not all memories are created equal. You need a relevance score to decide what stays in working memory.
Relevance Formula
```
relevance_score = (recency_weight × recency)
                + (frequency_weight × access_frequency)
                + (importance_weight × user_rating)
```
Example weights:
- recency_weight = 0.5 (recent matters)
- frequency_weight = 0.3 (often-used matters)
- importance_weight = 0.2 (user can manually pin critical facts)
Scores:
- 0.9 - 1.0: Stay in working memory
- 0.5 - 0.9: Demote to warm memory
- 0.0 - 0.5: Archive to cold storage
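Putting the formula and thresholds together, a minimal scorer might look like this. It assumes a memory dict with last_accessed (a datetime), access_count, and a 0-1 user_rating; the normalizations (a 30-day recency window, frequency saturating at 10 accesses) are assumptions you'd tune, while the weights and cutoffs are the ones above.

```python
from datetime import datetime

def relevance_score(memory, now=None):
    """Score 0-1 from recency, access frequency, and user importance."""
    now = now or datetime.now()
    days_old = (now - memory["last_accessed"]).days
    recency = max(0.0, 1 - days_old / 30)              # linear falloff over 30 days
    frequency = min(1.0, memory["access_count"] / 10)  # saturates at 10 accesses
    return 0.5 * recency + 0.3 * frequency + 0.2 * memory["user_rating"]

def assign_tier(score):
    if score >= 0.9:
        return "working"
    if score >= 0.5:
        return "warm"
    return "cold"
```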
Decay Mechanism
Memories fade over time unless reinforced:
```python
def decay_relevance(memory, days_since_last_access):
    decay_rate = 0.05  # 5% per day
    memory["relevance"] *= (1 - decay_rate) ** days_since_last_access
    return memory
```
Example:
- Memory starts at relevance = 1.0
- After 7 days with no access: relevance = 0.70
- After 14 days: relevance = 0.49 → demoted to warm
- After 30 days: relevance = 0.21 → archived to cold
Real-World Example: My Insurance Policy Memory
I track 12 insurance policies in my AI’s memory. Here’s how they’re tiered:
Working Memory (Always Loaded)
```markdown
# Active Insurance Policies
- HDFC Life: Renews April 15, 2026 (next renewal <60 days)
- ICICI Pru: Renews June 22, 2026
```
Why working? I reference these weekly (payment reminders, renewal tracking).
Warm Memory (Searchable)
```json
{
  "policy": "LIC Jeevan Anand",
  "renewal_date": "2027-01-10",
  "premium": "₹25,000/year",
  "last_accessed": "2026-01-20",
  "relevance": 0.6
}
```
Why warm? Not imminently relevant, but I might query it.
Cold Storage (Archived)
`memory/archive-old/2024-insurance-policy-comparison.md.gz`
Why cold? Historical decision log from 2 years ago. Only needed for auditing.
Common Mistakes in Tiered Memory
Mistake 1: Too Much in Working Memory
If your working memory exceeds 5,000 tokens, you’re not being selective enough.
Fix: Move anything not referenced in 7 days to warm.
Mistake 2: No Promotion Mechanism
If you never promote old memories back to working, you lose valuable context.
Fix: When searching warm memory, temporarily promote top results.
Mistake 3: Deleting Cold Storage
“I’ll never need this again”: famous last words.
Fix: Archive, don’t delete. Storage is cheap.
Mistake 4: No Decay Logic
All memories stay equally relevant forever → working memory bloats.
Fix: Implement decay for unreinforced memories (anywhere from 5-10% per week up to the 5% per day used in the decay example above, depending on how aggressively you want to demote).
Advanced: Multi-Dimensional Tiers
You can tier by type, not just age:
| Type | Tier | Rationale |
|---|---|---|
| Core identity | Working | Permanent, never demoted |
| Active projects | Working → Warm after completion | Temporary but critical |
| Preferences | Working (if recent) / Warm (if old) | Can change over time |
| Decisions | Warm → Cold after 6 months | Historical record |
| Transient tasks | Never stored | Ephemeral, no long-term value |
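In code, type rules can simply take precedence over the score-based assignment. This sketch reuses assign_tier from the relevance section above; the type names mirror the table and are illustrative.

```python
def route_by_type(memory, score):
    """Type-specific rules override the pure relevance score."""
    mem_type = memory.get("type")
    if mem_type == "core_identity":
        return "working"  # permanent, never demoted
    if mem_type == "transient_task":
        return None       # ephemeral, never stored
    if mem_type == "decision" and memory.get("age_days", 0) > 180:
        return "cold"     # historical record after 6 months
    return assign_tier(score)  # otherwise fall back to the score
```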
Key Takeaways
- Flat memory doesn’t scale beyond a few weeks
- Tiered memory = working/warm/cold (like RAM/SSD/tape)
- Working memory = 3-5K tokens (always loaded)
- Warm memory = unlimited (searchable via vector DB)
- Cold storage = archives (compliance, not daily use)
- Relevance scoring + decay decides when to move data
- Promotion is critical: don’t lose old context forever
Tiered memory isn’t optional for long-running agents. It’s the only way to scale.
Want tiered memory without the complexity? MyDeepBrain handles working/warm/cold tiers automatically. Join the waitlist.