Make Better Calls on Technical Debt in Core Systems Without Slowing Delivery
Managing technical debt in core systems requires balancing immediate delivery needs with long-term code health. This article draws on expert perspectives to outline seven practical strategies for identifying when to tackle debt and when to defer it. These approaches help engineering teams make disciplined decisions that maintain velocity without accumulating crippling legacy burden.
Fix Customer Blockers Early
One note: I am a founder, not a CTO, but in a lean company these calls land on me, so I will answer from how we make them.
The way I decide which technical debt gets paid now versus deferred is to ask whether the debt is blocking a customer outcome or compounding work, or whether it is just ugly. Ugly can wait. Blocking cannot.
The instinct among engineers is to fix debt because it offends them, and the instinct among founders is to defer all of it to ship features. Both are wrong because neither asks what the debt actually costs. We run Eprezto with a lean team of under ten people, so we cannot refactor everything, which forces the question.
The rule is the same one we use for any commitment: ask whether it will still matter in twelve months, then run a dependency test. Debt that sits under work that compounds, or that slows down something customers feel, gets paid now because the cost grows every week you wait. Debt that is isolated and not in anyone's path gets postponed, openly, not forgotten.
The clearest lesson came from our carrier integrations. Each carrier connects differently and carries its own maintenance, and that maintenance load is a form of debt. We reduced the carriers from eight to between four and five, which was effectively a decision to retire debt rather than keep refactoring around it. Removing the source beat maintaining it, and it freed engineering for higher-value work.
The lesson about timing is that the cost of waiting is not constant. Debt under compounding work gets more expensive fast, so deferring it is a decision that quietly grows the bill. Debt off the critical path stays cheap to defer.
The honest part is that we have postponed things we should have fixed, and the tell was always when the debt started slowing a customer-facing outcome.
My advice is to judge debt by what it blocks, pay down what sits under compounding or customer-facing work now, and defer the rest on purpose, not by neglect.
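The rule above can be sketched as a small triage function. This is a minimal illustration, not how any team actually tracks debt: the `DebtItem` fields, the `Decision` names, and the example items are all hypothetical, and reducing each question to a boolean is a simplifying assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PAY_NOW = "pay now"
    DEFER_ON_PURPOSE = "defer on purpose"


@dataclass
class DebtItem:
    # All fields are hypothetical labels for the questions in the text.
    name: str
    blocks_customer_outcome: bool  # customers feel the slowdown or failure
    under_compounding_work: bool   # other work keeps building on top of it


def triage(item: DebtItem) -> Decision:
    """Pay now if the debt blocks customers or sits under compounding work."""
    if item.blocks_customer_outcome or item.under_compounding_work:
        return Decision.PAY_NOW
    # "Just ugly" debt is deferred openly, so it should still be recorded.
    return Decision.DEFER_ON_PURPOSE


backlog = [
    DebtItem("checkout retries", blocks_customer_outcome=True, under_compounding_work=False),
    DebtItem("shared data model", blocks_customer_outcome=False, under_compounding_work=True),
    DebtItem("old admin styling", blocks_customer_outcome=False, under_compounding_work=False),
]
decisions = {item.name: triage(item) for item in backlog}
```

Keeping the deferred items in the same list as the urgent ones is one way to make deferral deliberate: the item stays visible and can be re-triaged when its answers change.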

Structure Deferrals With Risk Controls
To decide how to attack technical debt, I look at what it is doing to our current work, not just at the fact that it exists. If something in the core system is bogging down current projects, increasing release risk, needing frequent manual fixes, or making developers reluctant to touch it, that's a clear sign it needs immediate attention. If the code is messy but stable, contained, and not involved in any upcoming product initiatives, it's probably safe to defer. When deciding which debts to attack first, I look at how often the area changes, how much damage a failure could do, and whether it's blocking any business opportunities.
Debt in authentication, permissions, billing, data integrity, deployment, reporting, or critical integrations is generally more urgent than debt in lower-impact code.
Any decision to defer has to be structured. I'd rather have specific actions to manage the risk than just say "we'll sort it out later." That means tests around the current functionality, alerts on functionality that is costly when it fails, clear boundaries to prevent new issues from spreading, and a definite trigger to revisit the area.
One lesson came when I decided not to replace a troublesome core module and to focus on reducing its risk instead. There was a temptation to go for a full replacement to get a more streamlined system, but that would have added complexity while the old system still needed support. The smarter approach was to document what the module was responsible for, add tests for the critical functions, put an interface around the unstable part, and then replace only the workflow that was changing the most.
What I learned is that the question "refactor or replace" can be misleading. Refactor when the basic design is sound but the implementation needs work. Replace when the overall architecture is no longer fit for the business. Defer when the problem area can be kept isolated. The mistake is not delaying technical debt; it is delaying it until it becomes so overwhelming that the entire delivery system breaks.
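One way to make a deferral structured is to record it with the four controls named above and refuse to treat it as complete while any is missing. The sketch below assumes a hypothetical `Deferral` record; the field names and the example values are illustrative, not taken from any real tracker.

```python
from dataclasses import dataclass, field


@dataclass
class Deferral:
    # Hypothetical record of a deliberately deferred piece of debt.
    area: str
    tests: list[str] = field(default_factory=list)   # tests around current functionality
    alerts: list[str] = field(default_factory=list)  # alerts on costly failures
    boundary: str = ""                               # how new issues are kept from spreading
    revisit_trigger: str = ""                        # the definite condition for checking back

    def missing_controls(self) -> list[str]:
        """Return the names of the risk controls that have not been filled in."""
        missing = []
        if not self.tests:
            missing.append("tests")
        if not self.alerts:
            missing.append("alerts")
        if not self.boundary:
            missing.append("boundary")
        if not self.revisit_trigger:
            missing.append("revisit_trigger")
        return missing


vague = Deferral(area="invoice export")
structured = Deferral(
    area="invoice export",
    tests=["export totals match ledger"],
    alerts=["export job failure"],
    boundary="new report types go through a separate adapter",
    revisit_trigger="next change to tax rules",
)
```

Here `vague.missing_controls()` lists all four controls, while `structured.missing_controls()` is empty, which is the difference between "we'll sort it out later" and a managed deferral.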

Refactor Once Code Constrains Direction
One thing I keep an eye out for is legacy code that pushes the team to seek approval from the past. Some forms of debt are unsightly but subdued, so I refrain from intervening. The most problematic kind emerges when each new feature must accommodate a shortcut adopted during the product's earlier, smaller days. At this stage, the code isn't merely lingering in disarray but is discreetly influencing the directions we dare to pursue next.
A case in point for us involved a minor text-routing component in our humanization pipeline. Intended as a stopgap solution, it started dictating which drafts received advanced tone refinement and which ones bypassed it. As it remained functional, it was tempting to overlook. My takeaway was that refactoring is not prompted by engineers' embarrassment over the code's state. It occurs when the code begins to subtly constrain our product decisions, often without our awareness.

Clarify Behavior Before Big Changes
We learned a strong lesson by refactoring a pricing and reporting layer instead of replacing it. The system was old but the real problem was hidden logic that people did not fully understand. We identified the highest risk parts and added tests to capture the current behavior. Then we refactored the system while keeping delivery work running.
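Capturing current behavior before refactoring is often done with characterization tests: record what the old code returns for a set of inputs, then check that the refactored code returns the same values. The sketch below is an assumption-laden illustration. `legacy_price`, its hidden bulk discount, and `refactored_price` are hypothetical stand-ins, not the system described here.

```python
def legacy_price(quantity: int, unit_price: float) -> float:
    """Hypothetical legacy rule containing logic nobody documented."""
    total = quantity * unit_price
    if quantity >= 10:  # hidden bulk discount
        total *= 0.9
    return round(total, 2)


def refactored_price(quantity: int, unit_price: float) -> float:
    """Hypothetical refactor that names the discount but keeps the behavior."""
    bulk_discount = 0.9 if quantity >= 10 else 1.0
    return round(quantity * unit_price * bulk_discount, 2)


def record_behavior(func, cases):
    """Snapshot the outputs of func for each input case."""
    return {case: func(*case) for case in cases}


def diff_behavior(func, snapshot):
    """Return the cases where func no longer matches the snapshot."""
    return {
        case: (expected, func(*case))
        for case, expected in snapshot.items()
        if func(*case) != expected
    }


# Cases chosen around the suspected boundary in the hidden logic.
CASES = [(1, 2.5), (9, 2.5), (10, 2.5), (11, 2.5), (100, 0.99)]
snapshot = record_behavior(legacy_price, CASES)
differences = diff_behavior(refactored_price, snapshot)
```

An empty `differences` means the refactor preserved the recorded behavior for those cases. It says nothing about inputs outside the snapshot, so the case list should grow as hidden rules are discovered.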
We learned that timing improves when we reduce confusion before we try bigger changes. A full replacement would have used energy without clarifying what the business depends on. Refactoring first showed those dependencies clearly and made future changes easier. We saw that better understanding leads to better decisions and smoother change.

Stabilize Process First Then Modernize
One decision that taught us about timing was postponing the replacement of an old scheduling layer. We kept it even though the system was hard to maintain, because the business process around it was messier than the code. Replacing it too early would have modernized confusion instead of solving the real problem. We first refactored inputs and improved ownership so the system became easier to understand.
We focused on operational discipline before architecture work. Once roles were clearer and exceptions were measured, the problem became easier to isolate. The lesson was that replacement is often a technical reset, but timing matters. We cleaned up decision rights first, then modernized what remained in the system.

Halt Harmful Automation Now
At Distribute, my framework for deciding whether to tackle technical debt immediately or defer it usually comes down to who is feeling the pain. If a messy backend system just makes our own engineering process slower, we can often live with it for a while to keep feature delivery moving. But if the technical debt is actively causing external damage, specifically scaling mistakes that hurt our users, we stop the line.
We learned this when we had to completely refactor our core distribution engine. We originally built our platform with a frictionless, straight-through processing pipeline, where our AI handled all the routing and personalization automatically with a single click. But generative AI produces volume incredibly fast, which means it scales edge-case errors just as fast. Analytics showed the AI was frequently mishandling formatting, leaving raw corporate markers like "Inc." or "LLC" attached to prospect names in live outreach sequences. Firing off thousands of those unpolished outputs was instantly triggering hard bounces and tanking sender domain reputations before anyone realized there was an issue.
The easiest way to defer the real technical debt would have been to just patch the algorithm on the fly so we wouldn't interrupt our delivery speed. Instead, we decided to go in and intentionally break our own continuous delivery pipeline. We ripped out the fully automated one-click launch feature entirely. In its place, we refactored the data flow to route all AI-generated outputs into a mandatory holding queue. We inserted a hard manual review node right before launch, requiring a human to review the batch and catch the weird edge cases the algorithms missed.
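The holding-queue idea can be sketched as follows. This is a minimal illustration under stated assumptions, not Distribute's implementation: the `Draft` and `HoldingQueue` names are hypothetical, and the automatic suffix check is an added convenience for the reviewer rather than something the text describes.

```python
import re
from dataclasses import dataclass

# Matches the corporate markers mentioned in the text as whole words.
SUFFIX_PATTERN = re.compile(r"\b(?:Inc|LLC)\b", re.IGNORECASE)


@dataclass
class Draft:
    prospect_name: str
    body: str
    approved: bool = False


class HoldingQueue:
    """Holds AI-generated drafts until a human approves them."""

    def __init__(self) -> None:
        self._pending: list[Draft] = []

    def submit(self, draft: Draft) -> None:
        self._pending.append(draft)

    def flagged(self) -> list[Draft]:
        """Drafts whose prospect name still carries a corporate marker."""
        return [d for d in self._pending if SUFFIX_PATTERN.search(d.prospect_name)]

    def approve(self, draft: Draft) -> None:
        draft.approved = True  # set only by the human reviewer

    def release(self) -> list[Draft]:
        """Return approved drafts and keep everything else in the queue."""
        ready = [d for d in self._pending if d.approved]
        self._pending = [d for d in self._pending if not d.approved]
        return ready


queue = HoldingQueue()
bad = Draft("Acme Inc.", "Hi Acme Inc., ...")
good = Draft("Vincent Lee", "Hi Vincent, ...")
queue.submit(bad)
queue.submit(good)

released_before_review = queue.release()  # nothing is approved yet
queue.approve(good)
released_after_review = queue.release()
```

The design choice is that `release` never sends anything by default: a draft leaves the queue only after an explicit approval, so a burst of bad output waits instead of going out.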
Adding that structural checkpoint dropped our daily hard bounce rates to almost zero. The lesson I took away about timing is that when automated systems start compounding bad data, you cannot postpone the fix. Sometimes the most urgent refactor isn't about optimizing your architecture for more speed, but completely rebuilding it to force a deliberate pause.

Rewrite If Drag Cripples Velocity
I had 23 engineers writing a profit maximization engine for the world's largest quick service restaurants, but our old mobile app was absolutely hogging our resources. It took three days of coding to fix UI states any time we released new functionality. This forced me into an impossible dilemma: should I throw our AI roadmap in the dumpster for several months to perform a full overhaul of our mobile code?
I manage technical debt by one objective metric: what it costs us in friction. If an old legacy system costs our engineers 20% of their time just to keep the lights on, we rip it out and put it to rest. If it's sitting in a quiet little place not slowing us down at all, we leave it alone.
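The friction metric can be written down as a simple ratio of maintenance time to total engineering time. In this sketch the function names and the hour figures are hypothetical; only the 20% threshold comes from the text.

```python
def maintenance_share(maintenance_hours: float, total_hours: float) -> float:
    """Fraction of engineering time spent keeping a legacy system running."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return maintenance_hours / total_hours


def should_rewrite(maintenance_hours: float, total_hours: float,
                   threshold: float = 0.20) -> bool:
    """True once friction reaches the threshold (20% in the text)."""
    return maintenance_share(maintenance_hours, total_hours) >= threshold


# Hypothetical weekly figures for two systems.
noisy_app = should_rewrite(maintenance_hours=8, total_hours=40)
quiet_service = should_rewrite(maintenance_hours=2, total_hours=40)
```

With these made-up numbers the first system sits exactly at the threshold and qualifies for a rewrite, while the second is left alone. How the maintenance hours are measured is the hard part, and it is outside this sketch.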
We delayed a pricing feature release for four weeks to entirely rewrite the legacy mobile component in React Native Web and typed the whole thing out, which meant shutting the whole AI roadmap down. It seemed extremely risky at the time; however, within a few weeks, code regressions dropped by 99% and our load times dropped from 4 hours to 45 minutes. We pay down tech debt not for beautiful code but for velocity.

