The Strangler Fig Pattern in Practice: How We Migrated a Hospital from MUMPS to SQL in 4 Years
Gonzalo Monzón
Founder & Lead Architect
In 2004, Martin Fowler published a short essay about the Strangler Fig Application — a pattern inspired by the strangler figs of tropical rainforests that slowly wrap around their host tree, gradually replacing it until the original tree dies. The idea: instead of the catastrophic "big bang" rewrite, you incrementally replace a legacy system piece by piece while both systems coexist. It's an elegant metaphor. It's also, in most articles you'll find, entirely theoretical.
This is not a theoretical article. Between 2013 and 2017, I executed this pattern against one of the most hostile environments imaginable: a Hospital Information System (HIS) running on InterSystems Caché/MUMPS — 8 servers across multiple hospital centers, serving thousands of healthcare workers, where access to patient data couldn't tolerate a single second of downtime. Here's exactly how we did it.
What We Were Strangling
The Beast: InterSystems Caché with MUMPS
MUMPS (Massachusetts General Hospital Utility Multi-Programming System) is a programming language and database system designed in 1966. By 2013, it was 47 years old and still running critical hospital infrastructure worldwide — including the US Department of Veterans Affairs (VA), which runs the largest MUMPS installation on the planet.
InterSystems Caché is the commercial evolution of MUMPS. It adds ObjectScript (an OOP layer), a web server, and bolted-on SQL support. But underneath, the data model is unchanged: globals — hierarchical, multidimensional key-value trees.
// A patient record in Caché/MUMPS:
^PATIENT("12345","NAME") = "García López, María"
^PATIENT("12345","DOB") = "19580315"
^PATIENT("12345","ALLERGIES",1) = "Penicillin"
^PATIENT("12345","VISITS","20130415","DEPT") = "ER"
^PATIENT("12345","VISITS","20130415","DIAG",1) = "J06.9"
^PATIENT("12345","VISITS","20130415","LAB","GLUCOSE") = "127"
// Traversal: $ORDER returns the next sibling key
// $ORDER(^PATIENT("12345","VISITS","20130415","LAB","")) → "GLUCOSE"
There are no tables. No columns. No foreign keys. No schema. The "schema" is implicit in the application code — thousands of MUMPS routines that know which keys to read and write. If you want to query "all patients with glucose > 200 in the last month," you can't write a SQL query. You have to walk the tree.
The Scale
- 8 Caché servers across different hospital and clinic centers in the Xarxa Santa Tecla network (Tarragona, Spain)
- ~15 years of accumulated data — this was the same system I had helped deploy during the Y2K migration in the late 90s when I was on the vendor side (Valen Computer)
- Thousands of daily users — doctors, nurses, administrators, lab technicians
- HL7 and DICOM integrations — radiology (RIS/PACS), laboratory, pharmacy, admissions
- 24/7 uptime requirement — emergency rooms don't close
Why I Was the Right (and Only) Person for This
I had a unique advantage that no external consultant could match: I had helped build the system I was now replacing.
In the late 90s, working for Valen Computer, I had done the Y2K migration for many of these hospitals — including migrating them to Caché. I knew why every piece of data was structured the way it was. I had written some of the routines. I had maintained the system for years, supporting 250+ clients on 24/7 rotation. When two reputable Barcelona consulting firms tried to build BI solutions on top of this data and failed (they treated Caché as standard SQL and got garbage), I was the one who figured out how to extract meaningful data.
This is the uncomfortable truth about legacy migrations: the person most qualified to kill the legacy system is the person who built it.
The Architecture of Coexistence
The naive approach to replacing a legacy system is the "big bang" — freeze the old system, develop the new one for 18-24 months, switch over on a weekend. This fails almost every time, and it fails catastrophically in healthcare:
- The new system can't replicate 15 years of implicit business rules
- Data migration is never clean — edge cases everywhere
- Users revolt because their workflows change overnight
- If anything goes wrong, you have no fallback
Our strategy was the Strangler Fig: both systems would run simultaneously for as long as necessary, with real-time bidirectional data synchronization, and functionality would be migrated module by module. At any point, if something failed, the old system was still fully operational.
The Target Architecture
┌────────────────────────────────────────────────────────┐
│                       NEW SYSTEM                       │
│  ┌──────────┐   ┌──────────┐   ┌────────────────────┐  │
│  │ Angular  │ → │ REST API │ → │ SQL Server         │  │
│  │ Frontend │   │ (Bottle) │   │ (Normalized Schema)│  │
│  └──────────┘   └──────────┘   └────────┬───────────┘  │
│                                         │              │
│                               ┌─────────▼──────────┐   │
│                               │    EVENTS TABLE    │   │
│                               │ (SQL → Cache sync) │   │
│                               └─────────┬──────────┘   │
├─────────────────────────────────────────┼──────────────┤
│            PYTHON CDC ENGINE            │              │
│  ┌──────────────────────────────────────┼───────────┐  │
│  │  ┌──────────┐          ┌─────────────▼────────┐  │  │
│  │  │ Journal  │          │ Event Consumer       │  │  │
│  │  │ Reader   │          │ (SQL→Cache writes)   │  │  │
│  │  │ (Cache→  │          │ Uses MUMPS primitive │  │  │
│  │  │  SQL)    │          │ mappings via C++     │  │  │
│  │  └────┬─────┘          └──────────────────────┘  │  │
│  └───────┼──────────────────────────────────────────┘  │
├──────────┼─────────────────────────────────────────────┤
│          │                LEGACY SYSTEM                │
│  ┌───────▼──────────────────────────────────────────┐  │
│  │            InterSystems Caché (MUMPS)            │  │
│  │         Mirror/Journaling → Event Stream         │  │
│  │      8 Servers × Multiple Hospital Centers       │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘
Step 1: Speaking MUMPS from Python
The first challenge was accessing Caché data from Python at native speed. InterSystems provided a C++ bridge (callin/callout interface), but it exposed raw Caché API calls — not exactly Pythonic.
I implemented the core MUMPS traversal primitives as Python functions:
# Simplified example of $ORDER implementation
def cache_order(global_name, *subscripts):
    """
    Equivalent to MUMPS $ORDER — returns the next sibling key
    at the given subscript level. This is the fundamental
    traversal operation for globals (multidimensional B-trees).
    """
    result = cache_connection.call_class_method(
        "CacheTraversal", "Order",
        global_name, *subscripts
    )
    return result if result != "" else None

def cache_get(global_name, *subscripts):
    """Equivalent to MUMPS $GET — returns value at exact key."""
    return cache_connection.call_class_method(
        "CacheTraversal", "Get",
        global_name, *subscripts
    )

# Walk all patients:
patient_id = ""
while True:
    patient_id = cache_order("^PATIENT", patient_id)
    if patient_id is None:
        break
    name = cache_get("^PATIENT", patient_id, "NAME")
    # ... process patient
The key insight: every Caché class method call goes through the C++ bridge at native speed. Python is just the orchestration layer. The data access is as fast as native MUMPS — milliseconds per traversal. This gave me the best of both worlds: MUMPS speed for data access, Python flexibility for API design.
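To make the earlier point about "walking the tree" concrete, here's roughly what the "all patients with glucose > 200" query costs with these primitives. This is a sketch, not the production bridge: the two primitives are simulated over an in-memory dict so it runs standalone, but the traversal shape — every patient, every visit, no index — is exactly what you face against real globals.

```python
# In-memory stand-in for Caché globals, so this sketch runs without a
# Caché connection: keys are subscript tuples, values are node values.
GLOBALS = {
    ("^PATIENT", "12345", "NAME"): "García López, María",
    ("^PATIENT", "12345", "VISITS", "20130415", "LAB", "GLUCOSE"): "127",
    ("^PATIENT", "67890", "VISITS", "20130416", "LAB", "GLUCOSE"): "243",
}

def cache_get(*subscripts):
    """Stand-in for $GET: the value at an exact key, or '' if absent."""
    return GLOBALS.get(subscripts, "")

def cache_order(*subscripts):
    """Stand-in for $ORDER: the next sibling key after the last
    subscript, at that subscript level (sorted-string collation)."""
    *prefix, current = subscripts
    depth = len(prefix)
    siblings = sorted({k[depth] for k in GLOBALS
                       if len(k) > depth and k[:depth] == tuple(prefix)})
    for key in siblings:
        if key > current:
            return key
    return None

def patients_with_high_glucose(threshold, since):
    """The 'glucose > threshold in the last month' query: no index
    exists, so we walk every patient, then every visit after `since`."""
    hits = []
    pid = ""
    while (pid := cache_order("^PATIENT", pid)) is not None:
        visit = since  # YYYYMMDD keys collate chronologically
        while (visit := cache_order("^PATIENT", pid, "VISITS", visit)) is not None:
            raw = cache_get("^PATIENT", pid, "VISITS", visit, "LAB", "GLUCOSE")
            if raw and float(raw) > threshold:
                hits.append((pid, visit, float(raw)))
    return hits

print(patients_with_high_glucose(200, "20130301"))
# → [('67890', '20130416', 243.0)]
```

The point of the sketch: the query is O(everything). On the real system, the C++ bridge made each hop fast, but nothing made the number of hops smaller — which is exactly why the SQL Server side of the migration existed.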
Step 2: REST API as Translation Layer
With the primitives in place, I built a REST API using Bottle — a Python micro-framework that's essentially a single file. The API translated between the hierarchical MUMPS world and the flat JSON world:
@app.route('/api/patients/<patient_id>')
def get_patient(patient_id):
    """
    Behind this clean REST endpoint, we're traversing
    MUMPS globals and assembling a normalized JSON response.
    """
    return {
        "id": patient_id,
        "name": cache_get("^PATIENT", patient_id, "NAME"),
        "dob": cache_get("^PATIENT", patient_id, "DOB"),
        "allergies": get_all_allergies(patient_id),
        "recent_visits": get_visits(patient_id, limit=10),
    }

def get_all_allergies(patient_id):
    """Walk the allergy subtree using $ORDER."""
    allergies = []
    idx = ""
    while True:
        idx = cache_order("^PATIENT", patient_id, "ALLERGIES", idx)
        if idx is None:
            break
        allergies.append(cache_get("^PATIENT", patient_id, "ALLERGIES", idx))
    return allergies
The API was deliberately read-heavy and write-light. Reads (patient lookups, visit lists, lab results) were segmented and fast — you never fetched "all patients," only specific records or filtered sets for a given day. Writes went through a different path (more on that in Step 3).
This API layer enabled Mirth Connect (the standard HL7/DICOM integration engine in healthcare) to communicate with the legacy system through standard HTTP — turning a sealed black box into an interoperable service.
Step 3: Bidirectional CDC — The Hard Part
This is where most migration projects fail. They can build the new system. They can read from the old system. But they can't make both systems reflect each other's changes in real-time while both are in production.
I needed bidirectional Change Data Capture (CDC): every write in the legacy system must appear in the new system, and every write in the new system must appear in the legacy system. Both systems needed to behave as if they were the only system — users shouldn't know (or care) which one they were using.
Legacy → New: Hijacking the Mirror
Caché has a built-in mirroring/replication system for high availability. It works by streaming a journal — a sequential log of every data operation (sets, kills, transactions) — from the primary server to the mirror. This journal is Caché's equivalent of PostgreSQL's WAL (Write-Ahead Log).
I "hooked into" this mechanism — not the mirror itself, but the journal stream. A Python process read the journal entries, filtered for operations on specific globals (patient data, visits, lab results, administrative records), and translated each operation into the corresponding SQL INSERT/UPDATE/DELETE against the SQL Server schema.
# Conceptual flow (simplified):
# 1. Caché journal entry: SET ^PATIENT("12345","ALLERGIES",3) = "Latex"
# 2. Python parser identifies: global=PATIENT, id=12345, field=ALLERGIES, idx=3
# 3. Maps to SQL: INSERT INTO allergies (patient_id, allergy, seq) VALUES (12345, 'Latex', 3)
# 4. Executes against SQL Server via pyodbc
The beauty of reading from the journal rather than polling the database: zero load on the production Caché server. The journal was already being generated for mirroring purposes. I just added another consumer.
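The routing step of that parser can be sketched like this. Two caveats: the journal itself is a proprietary binary format decoded through InterSystems tooling, so assume entries arrive already decoded as (operation, global reference, value) triples; and the function names, the regex, and the `allergies` table are illustrative stand-ins — the production engine had one mapping per clinical domain.

```python
import csv
import re

def parse_global_ref(global_ref):
    """Split ^NAME("a","b",3) into ("NAME", ["a", "b", "3"]).
    Quoted subscripts may contain commas ("García López, María"),
    so we lean on csv parsing instead of a naive split(',')."""
    m = re.match(r'\^(\w+)\((.*)\)$', global_ref)
    if m is None:
        return None
    name, subs_raw = m.groups()
    return name, next(csv.reader([subs_raw]))

def journal_to_sql(op, global_ref, value=None):
    """Map one decoded journal operation onto a parameterized SQL
    statement for the new schema. Only the allergy node from the
    article's example is routed here; unmapped globals return None."""
    parsed = parse_global_ref(global_ref)
    if parsed is None:
        return None
    name, subs = parsed
    if name == "PATIENT" and len(subs) == 3 and subs[1] == "ALLERGIES":
        patient_id, _, seq = subs
        if op == "SET":
            return ("INSERT INTO allergies (patient_id, allergy, seq) "
                    "VALUES (?, ?, ?)", (patient_id, value, int(seq)))
        if op == "KILL":
            return ("DELETE FROM allergies WHERE patient_id = ? AND seq = ?",
                    (patient_id, int(seq)))
    return None  # global not replicated

sql, params = journal_to_sql("SET", '^PATIENT("12345","ALLERGIES",3)', "Latex")
# params → ('12345', 'Latex', 3)
```

The dispatch-table shape matters more than the details: every global you replicate needs an explicit, tested mapping, and everything else is deliberately ignored rather than guessed at.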
New → Legacy: SQL Triggers + Event Table
The reverse direction used a different mechanism. When the new Angular frontend saved data to SQL Server, a trigger captured the change and inserted a record into an events table:
-- SQL Server trigger (conceptual)
CREATE TRIGGER trg_patient_update ON patients AFTER INSERT, UPDATE
AS
BEGIN
    INSERT INTO sync_events (entity, entity_id, operation, payload, created_at)
    SELECT 'PATIENT', i.id, 'UPDATE',
           (SELECT i.* FOR JSON PATH), GETDATE()
    FROM inserted i
END
A separate Python process consumed this events table, and using the MUMPS primitive mappings (the same C++ bridge from Step 1), wrote the changes back into Caché exactly as the legacy application would have. This was critical: the writes had to pass through the same validation logic that the Delphi/MUMPS application enforced, or the data would be inconsistent.
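In sketch form, the consumer's translation step looked something like this. The JSON payload shape matches the trigger above; `cache_set` (the write-side counterpart of `cache_get`), the uppercase field-name mapping, and the `id` convention are illustrative stand-ins for the real per-entity maps.

```python
import json

def event_to_sets(entity, entity_id, payload_json):
    """Translate one sync_events row into Caché SET operations:
    (global_name, subscripts, value) triples. The real engine used
    per-entity mappings; uppercasing the field name stands in for that."""
    ops = []
    for key, val in json.loads(payload_json).items():
        if key == "id":
            continue  # the row id becomes the first subscript, not a node
        ops.append(("^" + entity, (str(entity_id), key.upper()), str(val)))
    return ops

def replay(ops, cache_set):
    """Apply the SETs through the C++ bridge write primitive
    (`cache_set` here is a hypothetical name for that primitive)."""
    for global_name, subscripts, value in ops:
        cache_set(global_name, *subscripts, value=value)

ops = event_to_sets("PATIENT", "12345",
                    '{"id": "12345", "name": "García López, María"}')
# ops → [("^PATIENT", ("12345", "NAME"), "García López, María")]
```

Note what this sketch deliberately leaves out: the validation pass. As the paragraph above says, the real writes had to go through the same rules the Delphi/MUMPS application enforced — a raw SET that bypasses them produces globals the legacy routines can't read.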
The Latency: 100ms
The full round-trip — from a change in one system to its reflection in the other — took 100 milliseconds. This was fast enough that it was effectively invisible to human users:
- A doctor modifies a patient record in the old terminal → 100ms later, it's visible in the new Angular interface
- A nurse updates triage information in the new system → 100ms later, it's reflected in the legacy terminal
- The window for a data conflict (two people modifying the same field within 100ms) was statistically zero in a clinical workflow — doctors don't update the same record simultaneously down to the tenth of a second
100ms was our safety mechanism. At that latency, "eventual consistency" was effectively "immediate consistency" for all practical purposes.
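The back-of-the-envelope math behind that claim, with illustrative numbers (the write rates below are assumptions for the sketch, not measured hospital figures): model writes to a single record as a Poisson process and ask how often two land inside one sync window.

```python
import math

def collision_probability(writes_per_hour, window_s=0.1):
    """P(at least one pair of independent writes to the same record
    lands within one `window_s` sync window, over one hour), under a
    Poisson arrival model. Illustrative, not a measured figure."""
    rate_per_s = writes_per_hour / 3600.0
    # each write needs a clear window after it; multiply the misses
    p_clear_per_write = math.exp(-rate_per_s * window_s)
    return 1.0 - p_clear_per_write ** writes_per_hour

# Even an unusually busy record — six writes an hour — collides in
# roughly 0.1% of hours at a 100 ms window:
print(collision_probability(6))
```

The same model shows the trade-off mentioned in the lessons section: stretch the window to 10 seconds and the collision probability grows by two orders of magnitude, which is the point where you'd need an explicit conflict-resolution strategy instead of latency.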
The Data: 40TB of Medical History
While the CDC engine handled real-time sync, the historical data migration was its own project. 40 terabytes of accumulated medical records needed to be cleaned, transformed, and loaded into the new SQL Server schema.
If you've never cleaned medical data, here's what you're dealing with:
- Free-text clinical notes with inconsistent encoding (Latin-1, UTF-8, sometimes raw bytes)
- PDFs embedded in globals — radiology reports, consent forms, discharge summaries stored as base64 strings inside MUMPS nodes
- Scanned images — handwritten notes, old paper records digitized decades ago
- Decades of accumulated data with schema evolution — the same global key might mean different things depending on when the record was created
- Duplicate records — patients who moved, changed insurance, or were registered at different centers with slight name variations
I used Pentaho Data Integration (Spoon/PDI) for the ETL orchestration — hundreds of transformation diagrams, each handling a specific data domain (demographics, visits, lab results, radiology, pharmacy, billing). The Pentaho jobs ran incrementally: initial full load, then nightly deltas, constantly reconciling with the real-time CDC stream.
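For a taste of the encoding cleanup, here's the triage applied to free-text fields, in minimal sketch form — the real pipeline additionally flagged fallback rows for manual review rather than trusting the heuristic blindly.

```python
def decode_clinical_text(raw: bytes) -> str:
    """Best-effort decode of a free-text field of unknown encoding.
    Strict UTF-8 first — it rejects most Latin-1 high-byte sequences —
    then Latin-1, which maps every byte to *some* character, so the
    fallback can never raise and no bytes are silently dropped."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

# The same name stored by two different client versions:
assert decode_clinical_text("García".encode("utf-8")) == "García"
assert decode_clinical_text("García".encode("latin-1")) == "García"
```

The heuristic is imperfect — Latin-1 text that happens to form valid UTF-8 slips through — which is exactly why "cleaned, transformed, and loaded" took hundreds of Pentaho transformations rather than one.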
The 4-Year Timeline
| Year | Milestone | Systems Active |
|---|---|---|
| 2013 | CDC engine operational, BI layer on SQL Server, historical migration begins | Caché (primary) + SQL Server (read-only) |
| 2014 | First Angular modules in production (non-critical: reporting, scheduling). Bidirectional sync active | Both active, Caché still primary for clinical |
| 2015 | Clinical modules migrated progressively — admissions, pharmacy, outpatient. Users on both systems | Both active, shared load |
| 2016 | ER, lab, and remaining critical modules migrated. Caché usage declining | SQL Server primary, Caché secondary |
| 2017 | Final validation. 8 Caché servers shut down permanently | SQL Server only |
The key discipline: never rush a module migration. Each module ran in parallel for weeks, sometimes months, before the legacy version was retired. Users were given the choice of which system to use during the transition, which provided a natural quality signal — if they kept going back to the old terminal, something was missing in the new interface.
The Human Cost: 3.5 Years of 24/7 On-Call
Here's the part that architecture diagrams don't show.
For 3.5 years, I was the single point of failure for the synchronization between both systems. Nobody else understood the CDC engine, the MUMPS primitive mappings, the journal parsing, and the SQL schema well enough to troubleshoot a sync failure at 3 AM.
When I say 24/7 on-call, I mean: if the journal parser encountered a MUMPS global structure it hadn't seen before (because some ancient routine was triggered by an unusual clinical scenario), I was the one who had to diagnose it, map it, and deploy the fix — while the hospital was operational and depending on both systems being in sync.
I was the architect, the builder, and the firefighter. The Bus Factor was 1. This is the antipattern that every engineering blog warns against, and it's also the reality of most successful legacy migrations in critical infrastructure: the person who built the bridge is the only one who can repair it while traffic is crossing.
The day the Caché servers were finally shut down, the bridge was no longer needed. And 15 days later, I was laid off — having successfully made myself obsolete. The system I built to replace myself worked perfectly. It still does.
Lessons for Your Migration
1. The person who knows the legacy wins. The two consulting firms that failed before me tried to treat Caché as "just another database." They connected via ODBC and ran SQL queries against a multidimensional key-value store. The results were meaningless. Understanding the legacy system's primitives — how $ORDER traverses globals, how journaling works, how the application logic is embedded in the data structure — is not optional. It's the foundation.
2. Bidirectional sync is non-negotiable for critical systems. A one-way migration (old → new) means you can't roll back. In healthcare, you can't afford "we'll fix it Monday." Both systems must be fully operational at all times. The CDC engine is the most complex component to build, but it's what makes the Strangler Fig viable in high-stakes environments.
3. Latency is your conflict resolution strategy. At 100ms sync latency, we didn't need a conflict resolution algorithm. The probability of two clinicians modifying the same patient field within 100ms was effectively zero. This is a trade-off: if your latency were 10 seconds, you'd need last-write-wins or operational transforms. Know your domain's tolerance.
4. Migrate the data, not the schema. Don't try to replicate the legacy data model in the new system. MUMPS globals are trees; SQL tables are flat. The CDC engine's job is to translate between representations, not to preserve the old structure. Your new system should have clean, normalized schemas that make sense independently of the legacy.
5. Let users choose. During the transition, users could work in either system. This was terrifying for project managers ("what if they never switch?") but invaluable for quality assurance. When nurses voluntarily switched to the new system, we knew it was ready. When they didn't, we knew what was missing. Your users are your best QA team — let them vote with their workflow.
6. Plan for the Bus Factor. My biggest mistake was being a team of one for 3.5 years. The CDC engine should have been documented and taught to at least two other engineers. I paid for this with years of on-call burnout. If you're building a synchronization layer for a critical system, staff it like the critical system it is.
7. Success means making yourself obsolete. The best possible outcome of a Strangler Fig migration is that one day, someone pulls the plug on the old system and nothing happens. No alarms. No data loss. No frantic calls. Just silence. That's not failure — that's the definition of success. Even if, as I learned, the organization may not need you anymore once your bridge is no longer needed.
About the Author
Gonzalo Monzón
Founder & Lead Architect
Gonzalo Monzón is a Senior Solutions Architect & AI Engineer with over 26 years building mission-critical systems in Healthcare, Industrial Automation, and enterprise AI. Founder of Cadences Lab, he specializes in bridging legacy infrastructure with cutting-edge technology.