Hidden Technical Debt in Machine Learning Systems

Vipul Vaibhaw
5 min read · Feb 15, 2021

With all the advances in Machine Learning, we have seen wide adoption of it in production systems. The paper explores several ML-specific risk factors to account for in system design. These include boundary erosion, configuration issues, changes in the external world and a variety of system-level anti-patterns.

The rapidly growing ML field has made training, developing and deploying ML models easier. This has led to more adoption of ML in business. However, maintaining a system over time is difficult and expensive.

This dichotomy is known as technical debt, a term adopted from software engineering. It helps us reason about the long-term costs incurred. Technical debt might be paid down by refactoring code, improving unit tests, deleting dead code, reducing dependencies, tightening APIs and improving documentation. Deferring these debts often results in compounding costs.

The paper analyses the similar costs incurred by ML systems.

Complex Models Erode Boundaries

Traditional software engineering practice has shown the value of creating abstractions. They help in creating a modular and maintainable code base.

However, building abstractions for machine learning systems is difficult. Here are some ways in which technical debt of ML systems might increase.

Entanglement — ML systems often mix signals together, entangling them and making isolation of improvements impossible. Simply put, if you change the input distribution, it has a cascading impact on the learned weights. No inputs are ever really independent. Hence, if you Change Anything, you Change Everything (the CACE principle). It is not just inputs: the same story holds for hyperparameters, learning settings, sampling methods, convergence thresholds, data selection and essentially every other possible tweak.

One possible strategy is to isolate models and serve ensembles. Another is to focus on detecting changes in prediction behaviour as they occur.
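The second strategy can be as simple as monitoring aggregate prediction statistics across model versions. Below is a minimal sketch (the function name, inputs and threshold are all hypothetical, not from the paper): because of CACE, even an unrelated tweak can shift predictions globally, so we compare mean prediction behaviour before and after a change.

```python
import numpy as np

def prediction_shift(old_preds, new_preds, threshold=0.05):
    """Flag a model update whose mean prediction moved more than `threshold`.

    A crude change detector: compares aggregate prediction behaviour
    before and after a change, rather than trying to isolate the cause.
    """
    old_preds = np.asarray(old_preds, dtype=float)
    new_preds = np.asarray(new_preds, dtype=float)
    shift = abs(old_preds.mean() - new_preds.mean())
    return shift, bool(shift > threshold)

# A "small" tweak that noticeably shifted predictions triggers an alert.
shift, alert = prediction_shift([0.2, 0.4, 0.6], [0.4, 0.6, 0.8])
```

In a real system one would compare full distributions (e.g. with a divergence measure) per slice of traffic, not just the global mean.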

Correction Cascades — There are often situations in which a model m for problem A1 exists, but a solution for a slightly different problem A2 is required. It becomes very tempting to learn a model m2 that takes the output of m and applies a small learned correction. It is a fast way to solve the problem.

However, we have now created a dependency on m, making it significantly more expensive to analyse improvements to that model in the future. This often creates a deadlock, as an improvement to model m might lead to system-level detriments.
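The shape of a correction cascade can be sketched in a few lines (both models here are toy stand-ins, not anything from the paper): m2 is cheap to build, but every future change to m now silently changes m2 as well.

```python
def base_model(x):
    """Model m for problem A1 (stand-in: a simple linear scorer)."""
    return 2.0 * x + 1.0

def corrected_model(x, bias=-0.5):
    """Model m2 for problem A2: a small learned correction on top of m.

    Fast to ship, but m2 now depends on base_model: improving m changes
    m2's behaviour too, and must be re-validated against *both* problems,
    which is the deadlock the paper describes.
    """
    return base_model(x) + bias
```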

Undeclared Consumers — Often, a prediction from a machine learning model m is made widely accessible, either at runtime or by writing to files or logs that may be consumed by other systems. Without proper access controls, some of these consumers may be undeclared, silently using the outputs of a given model as input to another system (refer to Visibility Debt). These debts are harder to detect, and they create a hidden tight coupling of model m to other parts of the stack.
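One way to avoid this debt is to make consumers declare themselves before they can read predictions. The sketch below (class and method names are my own illustration, not from the paper) shows the idea: a registration step turns the hidden coupling into an explicit, auditable list.

```python
class PredictionService:
    """Serve model outputs only to consumers that registered first.

    A lightweight access-control sketch: forcing consumers to declare
    themselves makes the coupling between model m and the rest of the
    stack visible, instead of letting systems silently scrape its logs.
    """

    def __init__(self):
        self._consumers = set()

    def register(self, consumer_name):
        self._consumers.add(consumer_name)

    def predict(self, consumer_name, x):
        if consumer_name not in self._consumers:
            raise PermissionError(f"undeclared consumer: {consumer_name}")
        return 2.0 * x + 1.0  # stand-in for model m
```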

Data Dependencies Cost more than Code Dependencies

Code dependencies can be identified via static analysis by compilers and linkers. Without similar tooling for data dependencies, it can be inappropriately easy to build large data dependency chains that are difficult to untangle.

Unstable Data Dependencies — It is often convenient to consume signals as input features that are produced by other systems. However, some input signals are unstable, meaning that they qualitatively or quantitatively change over time. A versioning strategy can be used to overcome this debt, but it has its own costs.
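A versioning strategy can look like the sketch below (the signal, its calibrations and the version names are invented for illustration): our model pins an explicit version of the upstream signal, so an upstream recalibration cannot silently change our inputs.

```python
# A signal produced by another team changes semantics over time; pinning
# to a frozen version decouples our model from upstream recalibration.
SIGNAL_VERSIONS = {
    "v1": lambda raw: raw / 100.0,         # original calibration
    "v2": lambda raw: (raw - 5.0) / 90.0,  # upstream recalibrated later
}

def get_feature(raw, version="v1"):
    """Consume the upstream signal at an explicitly pinned version."""
    return SIGNAL_VERSIONS[version](raw)
```

The cost the paper alludes to is visible here too: someone must maintain the frozen versions and eventually migrate consumers off them.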

Underutilised Data Dependencies — Just as unused package imports still make the code dependent on those packages, a model may depend on input signals that contribute little to its performance. However, changes in those input signals can still degrade the model.

Underutilised data dependencies can creep into a model in several ways.

  1. Legacy Features
  2. Bundled Features
  3. Features which contribute to very small improvements
  4. Correlated Features
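A standard way to detect such features is a leave-one-feature-out evaluation. The sketch below fakes the evaluation with a toy lookup table (the feature names and scores are invented); in practice each subset would be retrained and scored on held-out data.

```python
def evaluate(features_used):
    """Stand-in scorer: pretend accuracy for a given feature subset.

    A toy table makes the effect visible: 'legacy_id' adds almost nothing.
    """
    scores = {
        frozenset({"price", "clicks", "legacy_id"}): 0.902,
        frozenset({"price", "clicks"}): 0.901,
        frozenset({"price", "legacy_id"}): 0.85,
        frozenset({"clicks", "legacy_id"}): 0.80,
    }
    return scores[frozenset(features_used)]

def expendable_features(all_features, tolerance=0.005):
    """Leave-one-feature-out: flag features whose removal barely hurts."""
    base = evaluate(all_features)
    return [f for f in sorted(all_features)
            if base - evaluate(all_features - {f}) <= tolerance]
```

Regularly removing the flagged features pays down the debt before a silent upstream change in one of them can hurt the model.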

ML-System Anti-Patterns

Glue Code — ML researchers tend to develop general-purpose solutions as self-contained packages. Using generic packages often results in a glue-code system design pattern, in which massive amounts of supporting code are written to get data into and out of general-purpose packages. An important strategy for combating glue code is to wrap black-box packages into common APIs. This allows supporting infrastructure to be more reusable and reduces the cost of changing packages.
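The wrapping strategy is essentially the adapter pattern. A minimal sketch, with a stand-in for a third-party package (its `run_inference` interface is invented here): the rest of the system codes against `Predictor`, so swapping the underlying package only means writing a new adapter.

```python
class Predictor:
    """Common API that the rest of the system codes against."""

    def predict(self, x):
        raise NotImplementedError

class FakePackageA:
    """Stand-in for a black-box third-party model (hypothetical API)."""

    def run_inference(self, batch):
        return [2 * v for v in batch]

class PackageAWrapper(Predictor):
    """Adapter that hides package A's batch-oriented interface."""

    def __init__(self, package_a_model):
        self._model = package_a_model

    def predict(self, x):
        # Package A wants a list in and returns a list out.
        return self._model.run_inference([x])[0]
```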

Pipeline Jungles — A special case of glue code, pipeline jungles often appear in data preparation. Glue code and pipeline jungles are symptomatic of integration issues that may have a root cause in overly separated "research" and "engineering" roles.

Dead Experimental Codepaths — A common consequence of glue code or pipeline jungles is that it becomes increasingly attractive in the short term to perform experiments with alternative methods by implementing experimental codepaths as conditional branches within the main production code.

For any individual change, the cost of experimenting in this manner is relatively low: none of the surrounding infrastructure needs to be reworked. However, over time, these accumulated codepaths can create a growing debt due to the increasing difficulty of maintaining backward compatibility and an exponential increase in cyclomatic complexity.
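The exponential blow-up is easy to see concretely. In this sketch (the flags and adjustments are invented), each experimental branch left in production doubles the number of configurations that must keep working together:

```python
def score(x, use_exp_a=False, use_exp_b=False, use_exp_c=False):
    """Production scorer with experimental branches left in place.

    Each leftover flag doubles the number of codepaths:
    3 flags already means 2**3 = 8 interacting configurations to test.
    """
    s = 2.0 * x
    if use_exp_a:
        s += 0.1   # experiment A: additive offset
    if use_exp_b:
        s *= 1.05  # experiment B: multiplicative boost
    if use_exp_c:
        s -= 0.2   # experiment C: penalty term
    return s

n_flags = 3
n_codepaths = 2 ** n_flags
```

Deleting dead experimental branches once an experiment concludes is the corresponding debt payment.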

Conclusions

Technical debt is a useful metaphor, but it unfortunately does not provide a strict metric that can be tracked over time. Simply noting that a team is still able to move quickly is not in itself evidence of low debt or good practices, since the full cost of debt becomes apparent only over time. Indeed, moving quickly often introduces technical debt.

The paper points the reader towards the areas of maintainable ML, including better abstractions, testing methodologies, and design patterns.

There are a lot of points which the blog doesn't cover. Read the paper to find out!
