Why deep learning needs its own debugging tools

June 20, 2026

Deep learning now sits inside many software systems, from the apps on a phone to systems that drive cars and read medical scans. These systems are software, and like any other software they contain faults. What makes them different is that a trained model gives almost no help in finding those faults. A conventional program fails at a line you can point to, and a debugger can stop there and show you the state. A deep neural network has no such line. It is a large collection of weights, organized into layers, that together encode behavior learned from data, and when it behaves wrongly the cause is spread across that structure where the usual tools of debugging have nothing to grip.

The trouble starts with the symptoms. A model that has a bug often looks, from the outside, like a model that is simply training slowly or learning a hard task. The loss may fail to fall, the accuracy may stall, or the predictions may be confidently wrong. None of these tells a developer what to change, and a single symptom can have many causes: a mis-set learning rate, a layer that has stopped passing signal, data that was scaled the wrong way, or a loss function that does not match the task. Over the past decade our lab has built a body of work that treats a trained model as an artifact a software engineer can inspect, and that turns these vague symptoms into located faults with concrete repairs. This page collects that work.

The first step was to make a fault visible at all. DeepLocalize (ICSE 2021) observes a model while it trains rather than after it fails. It records the values that pass between layers, learns what healthy training looks like, and uses the history of those values to report both that a model is faulty and which layer or hyperparameter is responsible. In effect it brings dynamic program analysis, a long-standing software engineering technique, to the inside of a learning model.
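To make the idea of watching values between layers concrete, here is a minimal sketch in plain Python. It is not DeepLocalize's algorithm: the toy "layers" are scalar multipliers, the function names `forward` and `first_faulty_layer` are hypothetical, and the only check is for non-finite values. It shows only the general shape of the approach, which is to record each layer's output and report the first layer where something goes wrong.

```python
import math


def forward(weights, inputs):
    """Run a toy chain of layers and record the output of every layer.

    Each "layer" here just multiplies its input by one scalar weight.
    This is an illustrative stand-in for a real network, not a real model.
    """
    recorded = []
    values = list(inputs)
    for weight in weights:
        values = [weight * v for v in values]
        recorded.append(values)
    return recorded


def first_faulty_layer(layer_outputs):
    """Return the index of the first layer whose output contains NaN or
    infinity, or None if every recorded value is finite."""
    for index, values in enumerate(layer_outputs):
        if any(math.isnan(v) or math.isinf(v) for v in values):
            return index
    return None


# A healthy chain: all values stay finite, so no layer is reported.
healthy = forward([2.0, 0.5], [1.0, -1.0])
print(first_faulty_layer(healthy))

# A chain whose third layer overflows to infinity.
exploding = forward([2.0, 1e200, 1e200], [1.0, -1.0])
print(first_faulty_layer(exploding))
```

A real monitor would run during training, keep a history across iterations, and check far more than overflow, but the record-then-inspect structure is the same.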

Locating a fault is useful, but a developer still has to decide what to do about it. DeepDiagnosis (ICSE 2022) carries the idea further. As the model trains it watches for eight distinct kinds of error condition, and when one appears it reports the symptom together with a message describing a concrete repair. Across a large benchmark of buggy models it diagnosed faults more accurately than earlier tools, and it did so for kinds of model those tools could not handle.
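The pairing of a symptom with a suggested repair can be sketched as a small set of rules over the training history. The two conditions below are illustrative assumptions chosen for this example; they are not the eight conditions DeepDiagnosis checks, and the function name `diagnose`, its thresholds, and its messages are hypothetical.

```python
import math


def diagnose(loss_history, window=5, min_improvement=1e-3):
    """Inspect per-epoch loss values and return (symptom, suggestion),
    or None when neither illustrative rule fires.

    The rules and messages are assumptions made for this sketch.
    """
    # Rule 1: any NaN or infinite loss value.
    if any(math.isnan(v) or math.isinf(v) for v in loss_history):
        return ("non-finite loss",
                "check the learning rate and how the input data is scaled")

    # Rule 2: the loss has barely moved over the last `window` epochs.
    if len(loss_history) >= window:
        recent = loss_history[-window:]
        if recent[0] - recent[-1] < min_improvement:
            return ("loss not decreasing",
                    "review the learning rate and whether the loss "
                    "function matches the task")

    return None


print(diagnose([1.0, 0.8, 0.6, 0.4, 0.2]))    # steady progress, no report
print(diagnose([1.0, 1.0, 1.0, 1.0, 1.0]))    # stalled loss
print(diagnose([1.0, 0.8, float("nan")]))     # loss became NaN
```

The point of the design is that each rule carries its own repair message, so the developer sees what to try next rather than only a flag that something is wrong.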

Tools like these are only as good as the repairs they know to suggest, so part of the work was to study how developers fix these models by hand. Repairing Deep Neural Networks (ICSE 2020) examined more than 400 real fixes that developers had posted publicly, organized them into patterns, and set out which patterns are hardest to automate and which would help the most. That map of real repairs is what lets an automated tool aim at the fixes that matter.

The same way of thinking extends to models that work very differently from a standard classifier. µPRL (ICSE 2025) turns to reinforcement-learning agents, the kind used to control a robot or a vehicle, and asks a testing question: would a team’s tests actually catch a real bug? It mines real faults from reinforcement-learning code, turns them into mutation operators, and lets a test suite be measured against faults that resemble the genuine ones. IRepair (FSE 2025) turns to large language models, where retraining the whole model to fix one problem tends to damage abilities that were working. Borrowing the idea of program slicing, IRepair concentrates a repair on the small part of a model most responsible for an error, and in a toxicity-reduction setting it repaired errors more effectively than the standard approach while disturbing the rest of the model far less.
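The testing question that µPRL asks can be illustrated with a toy mutation-testing loop. Everything in this sketch is an assumption made for illustration: the function under test, the two hand-written mutants, and the two tests are hypothetical and are not µPRL's mined operators. The sketch shows only how a mutation score separates a test suite that catches seeded faults from one that does not.

```python
def discounted_return(rewards, gamma=0.9):
    """Reference implementation: sum of rewards discounted by gamma per step."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def mutant_ignores_discount(rewards, gamma=0.9):
    """Hypothetical mutant: drops the discount factor entirely."""
    return float(sum(rewards))


def mutant_wrong_order(rewards, gamma=0.9):
    """Hypothetical mutant: accumulates forward instead of backward."""
    total = 0.0
    for reward in rewards:
        total = reward + gamma * total
    return total


def weak_test(fn):
    """Passes for any implementation that handles a single reward."""
    return fn([1.0], 0.9) == 1.0


def strong_test(fn):
    """Checks a multi-step episode where discounting and order both matter."""
    return abs(fn([1.0, 0.0, 2.0], 0.5) - 1.5) < 1e-9


def mutation_score(tests, mutants):
    """Fraction of mutants for which at least one test fails (is 'killed')."""
    killed = sum(1 for mutant in mutants
                 if any(not test(mutant) for test in tests))
    return killed / len(mutants)


mutants = [mutant_ignores_discount, mutant_wrong_order]
print(mutation_score([weak_test], mutants))               # weak suite
print(mutation_score([weak_test, strong_test], mutants))  # stronger suite
```

Both tests pass on the reference implementation, yet only the second suite detects the seeded faults. Drawing the mutants from real faults, as the text describes, is what makes such a score meaningful.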

This research is supported by our NSF project on Fault Localization for Deep Learning, carried out with Mohammad Wardat at Oakland University. The project continues the work of locating and explaining faults in deep learning systems. The thread running through all of it is a single goal: to give the people who build deep learning the same kind of dependable tools for finding, understanding, and repairing faults that software engineers have long had for ordinary code.

This work is part of our research on Modular and Dependable AI. For the complete record, see our list of papers.