Tiered Memory Architecture: RAM, SSD, and Tape for AI
Computers don’t keep everything in RAM. That would be impossibly expensive.
Instead, they use tiered storage:
- RAM: Fast, expensive, limited capacity (active data)
- SSD: Medium speed, medium cost (frequently accessed data)
- HDD/Tape: Slow, cheap, massive capacity (archives)
Data moves between tiers based on access patterns. Hot data stays in RAM. Cold data gets archived to disk.
AI memory should work the same way.
If you load every conversation, every decision, and every fact into your AI’s context window, you’ll run out of tokens fast. And even if you don’t, the model will drown in noise.
The solution: tiered memory architecture.
In this guide, I’ll explain:
- Why flat memory doesn’t scale (lived experience)
- How tiered memory works (working/warm/cold)
- When to move data between tiers
- How to implement this yourself
Let’s start with the problem.
The Flat Memory Problem
Most AI agents use flat memory: one giant MEMORY.md file with everything in it.
This works great… for about two weeks.
Then:
- MEMORY.md grows to 20,000 tokens
- You hit context window limits
- The model starts ignoring facts buried deep in the file
- Response times slow down
- Costs skyrocket
Real example from my OpenClaw agent:
- Week 1: 3,000 tokens → Fast, sharp, perfect
- Month 1: 15,000 tokens → Slower, occasional misses
- Month 2: 40,000 tokens → Hallucinations, context overflow
- Month 3: 65,000 tokens → Broke entirely (couldn’t load context)
I had to delete half the file. But which half? I didn’t have a systematic way to decide.
This is where tiered memory comes in.
Tiered Memory Architecture (The Three Layers)
Here’s the model I use:
| Tier | Analogy | Loaded | Cost | Contents | Access Pattern |
|---|---|---|---|---|---|
| Working Memory | RAM | Always (every prompt) | High (token cost) | Core identity, active projects, critical rules | Read: every call |
| Warm Memory | SSD | On-demand (retrieved when relevant) | Low (vector lookup) | Recent conversations, frequently accessed facts | Read: when queried |
| Cold Storage | Tape | Explicit retrieval only | Minimal (storage cost) | Full logs, completed projects, old decisions | Read: rarely |
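To make the table concrete, here's a minimal sketch of how a memory item and its tier might be represented in code. The names (MemoryTier, MemoryItem) and fields are illustrative, not from any particular library:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum

class MemoryTier(Enum):
    WORKING = "working"  # always in the prompt
    WARM = "warm"        # vector DB, retrieved on demand
    COLD = "cold"        # flat files / S3, explicit retrieval only

@dataclass
class MemoryItem:
    text: str
    tier: MemoryTier = MemoryTier.WORKING
    relevance: float = 1.0  # decays over time (see Relevance Scoring below)
    access_count: int = 0
    last_accessed: datetime = field(default_factory=datetime.now)
```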
Working Memory (Always Loaded)
Target size: 3,000-5,000 tokens
Contents:
- Core identity (AGENTS.md: who you are, personality, rules)
- Active project context (what you’re working on right now)
- Critical preferences (never send emails without asking, etc.)
- Recent conversation summary (last 24 hours)
Why always loaded? This is your agent’s identity and immediate context. Without it, the agent doesn’t know who it is or what it’s doing.
How to maintain:
- Nightly consolidation (Dream Routine)
- Ruthless pruning (if it’s not critical, demote it)
- Token budget enforcement (never exceed 5K; see the sketch below)
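To illustrate the token-budget rule, here's a minimal sketch, assuming each working-memory entry is a dict with a text and a relevance score. The ~4-characters-per-token estimate is a rough heuristic; in practice you'd count with a real tokenizer such as tiktoken.

```python
def enforce_token_budget(entries, budget=5000):
    """Keep the highest-relevance entries within the token budget.

    Returns (kept, demoted): demoted entries should move to warm
    memory, not be deleted.
    """
    ranked = sorted(entries, key=lambda e: e["relevance"], reverse=True)
    kept, demoted, used = [], [], 0
    for entry in ranked:
        tokens = len(entry["text"]) // 4  # rough estimate: ~4 chars per token
        if used + tokens <= budget:
            kept.append(entry)
            used += tokens
        else:
            demoted.append(entry)
    return kept, demoted
```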
Warm Memory (Retrieved On-Demand)
Target size: Unlimited (stored in vector DB)
Contents:
- Conversations from the past 30 days
- Frequently accessed facts (insurance policies, project details, preferences)
- Decisions and lessons learned
- Tool configurations and snippets
Why not always loaded? It’s too big (50,000+ tokens). Loading it all would exceed context windows and cost a fortune.
How to access:
- Semantic search (vector database like Qdrant, Pinecone, Weaviate)
- Query when needed: “Find all mentions of insurance policies”
- Top 5-10 results get loaded into the current context
How to maintain:
- Auto-archive memories older than 7 days (move from working → warm)
- Index embeddings for fast retrieval
- Decay score for infrequently accessed memories
Cold Storage (Archives)
Target size: Unlimited (flat files or S3)
Contents:
- Full conversation logs (raw transcripts)
- Completed project archives
- Old decisions no longer relevant
- Historical data for auditing
Why not searchable? Cold storage is for compliance and debugging, not daily use. It isn’t embedded or indexed like warm memory; you only access it when explicitly needed.
How to access:
- Manual retrieval (grep, file search)
- Explicit queries (“Show me logs from January 15”)
How to maintain:
- Archive daily logs after 30 days
- Compress old files (gzip, tar); a Python sketch follows below
- Backup to cheap storage (S3, external drive)
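Here's what the compress-and-archive step might look like in Python, using only the standard library. The directory names match the file layout from Option 1 below; treat them as placeholders for your own paths.

```python
import gzip
import shutil
from datetime import datetime, timedelta
from pathlib import Path

def archive_to_cold(warm_dir="memory/archive",
                    cold_dir="memory/archive-old", days=30):
    """Gzip warm-memory files older than `days` and move them to cold storage."""
    cutoff = datetime.now() - timedelta(days=days)
    Path(cold_dir).mkdir(parents=True, exist_ok=True)
    for path in Path(warm_dir).glob("*.md"):
        if datetime.fromtimestamp(path.stat().st_mtime) < cutoff:
            target = Path(cold_dir) / (path.name + ".gz")
            with open(path, "rb") as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            path.unlink()  # remove the uncompressed original
```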
Data Flow: Moving Between Tiers
Data moves between tiers based on access patterns and age.
Promotion (Cold → Warm → Working)
Scenario 1: Semantic Search Retrieves Old Memory
You ask: “What did I decide about using Postgres vs SQLite?”
- Query searches warm memory (vector DB)
- Finds relevant decision from 3 weeks ago
- That memory gets loaded into current context (temporarily promoted to working memory)
- Relevance score increases (accessed recently → stays warm longer)
Scenario 2: User Explicitly References Old Data
You say: “Remember that insurance policy I mentioned in January?”
- System searches cold storage (archived logs)
- Finds the reference in memory/2026-01-15.md
- Extracts relevant facts
- Adds to warm memory (now searchable)
- Loads into working memory for current session
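A hedged sketch of that cold → warm → working path: grep_cold_storage is a naive stand-in for whatever cold-storage search you use, and add_to_warm_memory is the vector-indexing function defined in Option 2 below.

```python
import gzip
from pathlib import Path

def grep_cold_storage(term, cold_dir="memory/archive-old"):
    """Naive scan of gzipped archives for a term (grep-style, no index)."""
    hits = []
    for path in Path(cold_dir).glob("*.gz"):
        with gzip.open(path, "rt", errors="ignore") as f:
            text = f.read()
        if term.lower() in text.lower():
            hits.append((path.stem, text))
    return hits

def promote_from_cold(term, working_context):
    """Cold -> warm -> working: find it, re-index it, load it this session."""
    for idx, (name, text) in enumerate(grep_cold_storage(term)):
        add_to_warm_memory(text, {"id": idx, "source": name})  # now searchable
        working_context.append(text)  # loaded into the current prompt
```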
Demotion (Working → Warm → Cold)
Scenario 1: Completed Project
You finish building a Hugo site for investedges.com.
- Project context is in working memory (active)
- After 7 days of inactivity, it moves to warm memory (searchable but not auto-loaded)
- After 30 days, it moves to cold storage (archived)
Scenario 2: Low-Relevance Facts Decay
A memory: “User was debugging CSS grid on Feb 10”
- Week 1: Working memory (recent, might be relevant)
- Week 2: Warm memory (not referenced again → demoted)
- Month 1: Cold storage (old, no longer relevant)
How to Implement Tiered Memory (Practical Guide)
Option 1: File-Based (Simple)
Working Memory: MEMORY.md (3,000 tokens, curated)
Warm Memory: memory/archive/ (daily logs, searchable with rg)
Cold Storage: memory/archive-old/ (gzipped files, not searched)
Nightly script:
```bash
#!/bin/bash
# Move working memory to warm after 7 days.
# -maxdepth 1 stops find from re-matching files already inside archive/
find memory/ -maxdepth 1 -name "*.md" -mtime +7 -exec mv {} memory/archive/ \;

# Compress warm memory and move it to cold after 30 days
find memory/archive/ -maxdepth 1 -name "*.md" -mtime +30 | while read -r f; do
  gzip "$f"
  mv "$f.gz" memory/archive-old/
done
```
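Run it from cron (or a systemd timer) so it fires during the same nightly window as your consolidation routine; the tiers then rotate automatically without manual pruning.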
Pros: Simple, no dependencies
Cons: No semantic search, manual pruning required
Option 2: Vector Database (Recommended)
Working Memory: MEMORY.md (loaded into every prompt)
Warm Memory: Qdrant / Pinecone (vector embeddings)
Cold Storage: S3 / local archive (full logs)
Setup:
```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
import openai  # pre-1.0 SDK; Embedding.create below is the legacy API

client = QdrantClient("localhost", port=6333)

# One-time setup: ada-002 embeddings are 1536-dimensional
if not client.collection_exists("warm_memory"):
    client.create_collection(
        collection_name="warm_memory",
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    )

def embed(text):
    """Embed text with OpenAI's ada-002 model."""
    return openai.Embedding.create(
        model="text-embedding-ada-002",
        input=text,
    )["data"][0]["embedding"]

def add_to_warm_memory(text, metadata):
    """Index a memory so it can be retrieved by meaning later."""
    client.upsert(
        collection_name="warm_memory",
        points=[PointStruct(
            id=metadata["id"],
            vector=embed(text),
            payload={"text": text, **metadata},
        )],
    )

def search_warm_memory(query, limit=5):
    """Return the top-N memories most semantically similar to the query."""
    results = client.search(
        collection_name="warm_memory",
        query_vector=embed(query),
        limit=limit,
    )
    return [r.payload["text"] for r in results]
```
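Usage is a one-liner on each side: index a memory once, then retrieve it by meaning rather than exact wording. The text and id below are made-up examples.

```python
add_to_warm_memory(
    "Decided on Postgres over SQLite: we need concurrent writers.",
    {"id": 1, "topic": "database-choice"},
)
for hit in search_warm_memory("what database did we pick?"):
    print(hit)  # matches on meaning, not exact keywords
```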
Pros: Semantic search, scalable
Cons: Requires vector DB setup
Option 3: MyDeepBrain (Coming Soon)
We’re building tiered memory natively:
- Automatic tier management
- Nightly consolidation
- Semantic search built-in
- No manual scripting
Join the waitlist to get early access.
Relevance Scoring (When to Move Data)
Not all memories are created equal. You need a relevance score to decide what stays in working memory.
Relevance Formula
```
relevance_score = (recency_weight × recency)
                + (frequency_weight × access_frequency)
                + (importance_weight × user_rating)
```
Example weights:
- recency_weight = 0.5 (recent matters)
- frequency_weight = 0.3 (often-used matters)
- importance_weight = 0.2 (user can manually pin critical facts)
Scores:
- 0.9 - 1.0: Stay in working memory
- 0.5 - 0.9: Demote to warm memory
- 0.0 - 0.5: Archive to cold storage
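Putting the formula and thresholds together, a minimal scorer might look like this. It assumes a memory dict with last_accessed (a datetime), access_count, and a 0-1 user_rating; the normalizations (a 30-day recency window, frequency saturating at 10 accesses) are assumptions you'd tune, while the weights and cutoffs are the ones above.

```python
from datetime import datetime

def relevance_score(memory, now=None):
    """Score 0-1 from recency, access frequency, and user importance."""
    now = now or datetime.now()
    days_old = (now - memory["last_accessed"]).days
    recency = max(0.0, 1 - days_old / 30)              # linear falloff over 30 days
    frequency = min(1.0, memory["access_count"] / 10)  # saturates at 10 accesses
    return 0.5 * recency + 0.3 * frequency + 0.2 * memory["user_rating"]

def assign_tier(score):
    if score >= 0.9:
        return "working"
    if score >= 0.5:
        return "warm"
    return "cold"
```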
Decay Mechanism
Memories fade over time unless reinforced:
```python
def decay_relevance(memory, days_since_last_access):
    decay_rate = 0.05  # 5% per day
    memory["relevance"] *= (1 - decay_rate) ** days_since_last_access
    return memory
```
Example:
- Memory starts at relevance = 1.0
- After 7 days with no access: relevance = 0.70
- After 14 days: relevance = 0.49 → demoted to warm
- After 30 days: relevance = 0.21 → archived to cold
Real-World Example: My Insurance Policy Memory
I track 12 insurance policies in my AI’s memory. Here’s how they’re tiered:
Working Memory (Always Loaded)
```markdown
# Active Insurance Policies
- HDFC Life: Renews April 15, 2026 (next renewal <60 days)
- ICICI Pru: Renews June 22, 2026
```
Why working? I reference these weekly (payment reminders, renewal tracking).
Warm Memory (Searchable)
```json
{
  "policy": "LIC Jeevan Anand",
  "renewal_date": "2027-01-10",
  "premium": "₹25,000/year",
  "last_accessed": "2026-01-20",
  "relevance": 0.6
}
```
Why warm? Not imminently relevant, but I might query it.
Cold Storage (Archived)
`memory/archive-old/2024-insurance-policy-comparison.md.gz`
Why cold? Historical decision log from 2 years ago. Only needed for auditing.
Common Mistakes in Tiered Memory
Mistake 1: Too Much in Working Memory
If your working memory exceeds 5,000 tokens, you’re not being selective enough.
Fix: Move anything not referenced in 7 days to warm.
Mistake 2: No Promotion Mechanism
If you never promote old memories back to working, you lose valuable context.
Fix: When searching warm memory, temporarily promote top results.
Mistake 3: Deleting Cold Storage
“I’ll never need this again”: famous last words.
Fix: Archive, don’t delete. Storage is cheap.
Mistake 4: No Decay Logic
All memories stay equally relevant forever → working memory bloats.
Fix: Implement decay for unreinforced memories (anywhere from 5-10% per week up to the 5% per day used in the decay example above, depending on how aggressively you want to demote).
Advanced: Multi-Dimensional Tiers
You can tier by type, not just age:
| Type | Tier | Rationale |
|---|---|---|
| Core identity | Working | Permanent, never demoted |
| Active projects | Working → Warm after completion | Temporary but critical |
| Preferences | Working (if recent) / Warm (if old) | Can change over time |
| Decisions | Warm → Cold after 6 months | Historical record |
| Transient tasks | Never stored | Ephemeral, no long-term value |
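In code, type rules can simply take precedence over the score-based assignment. This sketch reuses assign_tier from the relevance section above; the type names mirror the table and are illustrative.

```python
def route_by_type(memory, score):
    """Type-specific rules override the pure relevance score."""
    mem_type = memory.get("type")
    if mem_type == "core_identity":
        return "working"  # permanent, never demoted
    if mem_type == "transient_task":
        return None       # ephemeral, never stored
    if mem_type == "decision" and memory.get("age_days", 0) > 180:
        return "cold"     # historical record after 6 months
    return assign_tier(score)  # otherwise fall back to the score
```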
Key Takeaways
- Flat memory doesn’t scale beyond a few weeks
- Tiered memory = working/warm/cold (like RAM/SSD/tape)
- Working memory = 3-5K tokens (always loaded)
- Warm memory = unlimited (searchable via vector DB)
- Cold storage = archives (compliance, not daily use)
- Relevance scoring + decay decides when to move data
- Promotion is critical: don’t lose old context forever
Tiered memory isn’t optional for long-running agents. It’s the only way to scale.
Want tiered memory without the complexity? MyDeepBrain handles working/warm/cold tiers automatically. Join the waitlist.