Paper Explained #4: Debugging Distributed Systems in Production: How COCA Solved the Logs vs Code Gap
Paper: “COCA: Generative Root Cause Analysis for Distributed Systems with Code Knowledge” (Full Paper)
It’s 3 AM on a Tuesday. Your phone explodes with alerts. The MapReduce cluster that processes millions of jobs daily just died. Hard.
You drag yourself to your laptop, eyes barely open. Open JIRA. The ticket reads:
“Job failed immediately after submission. Cannot complete work. URGENT.”
Attached logs:
10:52:35 INFO: Job submitted successfully
10:52:36 INFO: Application created
10:52:36 INFO: Submitted application to ResourceManager
10:52:36 INFO: Cleaning up the staging area
10:52:36 ERROR: Job failed to run
You stare at the screen. “Cleaning up the staging area.” What does that even mean? Why did it clean up? What happened between line 3 and line 4?
The logs don’t tell you. The user doesn’t know. And your manager expects a fix by morning standup.
Welcome to debugging distributed systems. Where the bugs hide across multiple machines, the logs tell you symptoms but not causes, and finding root causes feels like detective work without evidence.
The 5-Hour Debugging Marathon Nobody Wants
Here’s what typically happens next.
You open the codebase. Search for “cleanup” or “staging area.” Get 47 matches across 12 files. Start reading each one. Try to mentally trace which code path could have led to this log message.
Two hours in, you’ve narrowed it down to 5 possible locations. You add more logging (because the existing logs weren’t enough). Redeploy. Wait for the bug to reproduce. It doesn’t. Network timing is different now.
Your senior developer wakes up, sees the mess, joins in. Another three hours of joint debugging. Finally, someone spots it:
“Wait. I think there’s a race condition. The client asks the server for job status before the server finishes setting up. Gets NULL back. Thinks the job was deleted. Starts cleanup while the server is still reading those files.”
Total time: 5 hours. One bug. And you still need to write the fix.
Why Traditional Tools Fall Short
The problem isn’t that developers are bad at debugging. The problem is that distributed systems bugs are fundamentally different beasts.
Debuggers are useless here. You can’t step through code that’s running on three different machines communicating over a network with varying latencies. There’s no “pause all servers and inspect state” button.
Logs are incomplete. They show you snapshots. “Job submitted.” “Cleanup started.” But the WHY is missing. Why did cleanup start? What code decided that? What was the execution flow?
Reproduction is a nightmare. Network timing matters. Machine A sent a request. Machine B was busy for 50 milliseconds. That 50ms delay caused the race condition. Good luck reproducing that exact timing in your dev environment.
Recently, teams started using AI tools like ChatGPT for help. You paste the logs and error message. ChatGPT responds:
“This appears to be a cleanup-related issue. Check your staging area management logic and ensure proper synchronization.”
Thanks, ChatGPT. Very insightful. Totally worth the API cost.
The real issue is that AI tools only see what you show them: the logs, the error message. They don’t see the code. They don’t understand the execution flow. They’re guessing based on text patterns, not analyzing the actual system behavior.
The Missing Piece: Code Context
Think about how a human expert debugs this.
They don’t just read logs. They:
- Find which code line printed that log
- Trace backwards through the code to see what ran before
- Understand the execution path across multiple services
- Identify where things went wrong in that path
The key insight: logs + code together reveal the story. Logs alone are just cryptic hints.
This is exactly what researchers at The Chinese University of Hong Kong and Sun Yat-sen University realized. They built COCA (Code Knowledge Enhanced Root Cause Analysis), a system that automatically does what expert debuggers do: combine runtime logs with source code analysis.
And it does this in 20 seconds instead of 5 hours.
How COCA Actually Works
COCA’s approach has four phases. Let me walk you through each using our MapReduce crash as an example.
Phase 1: Finding the Crime Scene
You have a log message: “Cleaning up the staging area /tmp/job_20250104”
Simple solution: search the codebase for this exact string. Problem: it doesn’t exist.
The actual code looks like this:
String jobPath = getStagingDirectory(jobId);
String message = "Cleaning up the staging area " + jobPath;
LOG.info(message);
Variables. Dynamic construction. Multiple possible branches depending on conditions. The runtime log and the code statement are completely different.
COCA solves this through what they call “logging statement restoration.” It uses static analysis to trace backwards from the LOG statement, following the data flow of that message variable through all possible branches.
For our example:
// Line 240
String message = "Cleaning up the staging area " + jobPath;

// But jobPath comes from:
if (isTemporary) {
jobPath = tempDir + "/" + jobId;
} else {
jobPath = permanentDir + "/" + jobId;
}
COCA creates template patterns:
- Template A: “Cleaning up the staging area /tmp/<*>”
- Template B: “Cleaning up the staging area /permanent/<*>”
Now when it sees the runtime log “Cleaning up the staging area /tmp/job_20250104”, it matches against Template A and knows: this came from line 240, temporary branch.
Boom. Crime scene located.
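To make the matching step concrete, here's a minimal sketch of template-to-log matching, assuming a <*> wildcard syntax. The TemplateMatcher class below is illustrative, not COCA's actual code; in the real system the templates come out of the static data-flow analysis, while here they're hard-coded:

import java.util.List;
import java.util.regex.Pattern;

// Minimal sketch: match a runtime log line against restored log templates.
// The "<*>" placeholder stands for any dynamic value (variables, paths, IDs).
public class TemplateMatcher {

    // Turn a template like "Cleaning up the staging area /tmp/<*>" into a regex:
    // literal parts are quoted, each placeholder becomes a wildcard.
    static Pattern toRegex(String template) {
        String[] parts = template.split(Pattern.quote("<*>"), -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            regex.append(Pattern.quote(parts[i]));
            if (i < parts.length - 1) {
                regex.append(".+"); // dynamic segment
            }
        }
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        List<String> templates = List.of(
            "Cleaning up the staging area /tmp/<*>",         // temporary branch
            "Cleaning up the staging area /permanent/<*>");  // permanent branch

        String runtimeLog = "Cleaning up the staging area /tmp/job_20250104";

        for (String template : templates) {
            if (toRegex(template).matcher(runtimeLog).matches()) {
                System.out.println("Matched: " + template);
            }
        }
        // Only the /tmp template matches, so the log came from the temporary branch.
    }
}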
Phase 2: Reconstructing the Timeline
Knowing line 240 printed the log is helpful but insufficient. You need to know what happened BEFORE line 240. What’s the sequence of events that led to cleanup?
In a single-machine program, this is straightforward. Build a call graph. Function A called Function B called Function C called the cleanup function.
In a distributed system, this breaks down completely.
Here’s why: your code makes network calls. The client code on Machine A calls a function that sends a request over the network to Machine B. From a code analysis perspective, this looks like:
Client: submitJob()
calls networkLibrary.send()
... and then what?
The connection to the server side is invisible. It happens at runtime through network protocols. Static code analysis tools just see a call to a network library, then nothing.
This is where COCA’s big innovation comes in: RPC (Remote Procedure Call) bridging.
Every distributed system defines its network APIs somewhere. For systems using gRPC (a popular framework), there’s a .proto file:
service JobService {
rpc SubmitJob(JobRequest) returns (JobResponse);
rpc GetJobStatus(StatusRequest) returns (StatusResponse);
}
COCA reads these definition files. Now it knows: “SubmitJob” is a network-callable function.
Then it searches the client code: where is SubmitJob being called? Found it in JobClient.java line 150.
Then it searches the server code: where is SubmitJob implemented? Found it in JobServiceImpl.java line 300.
Finally, it connects them:
JobClient.submit() line 150
[NETWORK CALL]
JobServiceImpl.submitJob() line 300
The invisible network gap is now visible.
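Conceptually, the bridging is a name-based join across three sources: the RPC names in the service definition, the client-side call sites, and the server-side implementations. Here's a hypothetical sketch; the class names and call-site locations are made up for illustration, not pulled from COCA:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of RPC bridging: connect client call sites to server
// implementations by the RPC method names declared in the service definition.
public class RpcBridger {

    record CodeLocation(String file, int line) {}

    public static void main(String[] args) {
        // RPC names parsed from the .proto service definition.
        List<String> rpcMethods = List.of("SubmitJob", "GetJobStatus");

        // Locations found by static analysis (illustrative values).
        Map<String, CodeLocation> clientCallSites = Map.of(
            "SubmitJob", new CodeLocation("JobClient.java", 150));
        Map<String, CodeLocation> serverImplementations = Map.of(
            "SubmitJob", new CodeLocation("JobServiceImpl.java", 300));

        // Bridge: add a cross-machine edge for each RPC that has both
        // a client call site and a server implementation.
        Map<CodeLocation, CodeLocation> networkEdges = new HashMap<>();
        for (String rpc : rpcMethods) {
            CodeLocation caller = clientCallSites.get(rpc);
            CodeLocation callee = serverImplementations.get(rpc);
            if (caller != null && callee != null) {
                networkEdges.put(caller, callee);
            }
        }

        networkEdges.forEach((caller, callee) -> System.out.println(
            caller.file() + ":" + caller.line()
                + "  --[NETWORK CALL]-->  " + callee.file() + ":" + callee.line()));
    }
}

The real implementation works on a full static call graph; this sketch only shows the join that turns an invisible network hop into an explicit edge.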
Applying this to our bug, COCA reconstructs:
JobClient.submit()
calls YarnRunner.submitJob()
[RPC CALL to ResourceManager]
ResourceManager.submitApplication()
... setup starts
[RPC CALL to ResourceManager]
ResourceManager.getApplicationStatus()
returns NULL (setup not done yet)
receives NULL
interprets as "job deleted"
calls JobSubmitter.cleanup()
prints "Cleaning up staging area"
deletes files
CRASH (ResourceManager still accessing those files)
The full execution path. Across multiple machines. Including the race condition.
Phase 3: The Information Overload Problem
At this point, COCA has reconstructed execution paths containing dozens of methods across multiple services. Let’s say 30 methods, each averaging 80 lines of code. That’s 2,400 lines.
You could dump all this into ChatGPT. Two problems:
- Cost: processing 2,400 lines repeatedly gets expensive fast
- Confusion: ChatGPT will struggle to identify what’s relevant vs boilerplate
COCA uses a two-step filtering approach.
First, it creates compact summaries for each method:
Method: submitJobInternal
Purpose: Submits job to cluster and manages staging area
Parameters: Job job, Cluster cluster
Returns: JobStatus
Just the signature and documentation. No implementation details.
Then it shows these summaries to the LLM along with the issue report and asks: “Which of these methods are most relevant for diagnosing this failure?”
The LLM might respond: “submitJobInternal, cleanup, and getApplicationStatus look relevant.”
Only then does COCA retrieve the full code for those specific methods. Instead of 2,400 lines, maybe 300 lines of highly relevant code.
Smart filtering. Lower cost. Better accuracy.
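Here's a rough sketch of that two-step filtering, assuming a hypothetical LlmClient interface and a prebuilt map from method names to full source. Both are stand-ins for whatever COCA actually uses:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of two-step filtering: give the LLM cheap method
// summaries first, then pull full source only for what it flags as relevant.
public class RelevanceFilter {

    record MethodSummary(String name, String purpose, String signature) {}

    // Stand-in for whatever LLM client is actually used.
    interface LlmClient {
        List<String> pickRelevantMethods(String issueReport, String summaries);
    }

    static String selectRelevantCode(LlmClient llm, String issueReport,
                                     List<MethodSummary> summaries,
                                     Map<String, String> fullSourceByMethod) {
        // Step 1: compact summaries only -- signature and purpose, no bodies.
        String summaryBlock = summaries.stream()
            .map(m -> m.name() + ": " + m.purpose() + " | " + m.signature())
            .collect(Collectors.joining("\n"));

        List<String> relevant = llm.pickRelevantMethods(issueReport, summaryBlock);

        // Step 2: retrieve full code only for the shortlisted methods,
        // e.g. a few hundred relevant lines instead of all 2,400.
        return relevant.stream()
            .map(fullSourceByMethod::get)
            .collect(Collectors.joining("\n\n"));
    }
}

The first LLM call sees only cheap summaries; the expensive full-code context is paid only for the shortlisted methods.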
Phase 4: The Final Diagnosis
Now COCA has everything:
- The original issue report
- The execution path showing what code ran
- The relevant code snippets
- Five similar historical bugs from JIRA
It constructs a detailed prompt and sends it to the LLM:
“Given this issue report, this execution sequence, and this code, identify the root cause and localize the faulty components.”
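Assembling that prompt is conceptually just stitching the four inputs together. A rough, illustrative sketch (section labels and wording are mine, not the paper's exact prompt):

// Illustrative prompt assembly from the four inputs COCA has gathered.
// The section headers and final instruction are placeholders, not the paper's prompt.
public class PromptBuilder {

    static String buildPrompt(String issueReport, String executionPath,
                              String codeSnippets, String similarIssues) {
        return String.join("\n\n",
            "## Issue Report\n" + issueReport,
            "## Reconstructed Execution Path\n" + executionPath,
            "## Relevant Code\n" + codeSnippets,
            "## Similar Historical Issues\n" + similarIssues,
            "Given the issue report, execution sequence, and code above, "
                + "identify the root cause and rank the likely faulty components.");
    }
}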
The LLM responds:
Root Cause Summary: “Race condition between client and server. The client submits a job to ResourceManager and immediately requests the application status. However, the ResourceManager has not completed setting up the application in the staging area and returns NULL. The client misinterprets this NULL response as indicating the application was removed and initiates cleanup of the staging area. This cleanup occurs while the ResourceManager is still accessing the staging area, causing the job to fail.”
Faulty Components (ranked by likelihood):
- JobSubmitter.cleanup() — premature cleanup logic
- ResourceManager.getApplicationReport() — NULL handling
- YarnRunner.submitJob() — timing assumptions
Time elapsed: 19 seconds.
Real-World Performance
The researchers tested COCA on 106 real bugs from five major Apache distributed systems: MapReduce, HDFS, HBase, Cassandra, and ZooKeeper. These aren’t toy examples. These are production systems with millions of lines of code, used by thousands of companies.
Compared to the best existing automated approach (a tool called RCACopilot):
- 28% better at identifying the exact faulty component
- 22% better at explaining WHY the failure occurred
- Consistent improvements across different underlying LLMs (GPT-4, Claude, Gemini)
One test case: HBASE-9821, a scanner ID collision bug. A human expert spent 3 hours analyzing logs and code to understand it. COCA diagnosed it in 20 seconds with a detailed explanation of the race condition in random number generation during rapid server restarts.
The JIRA discussion for that bug took a full day of back-and-forth between multiple developers. COCA gave the answer immediately.
The Limitations Worth Knowing
COCA isn’t magic. It has real constraints.
Java only. The implementation uses Java-specific analysis tools (Soot, Eclipse JDT). Python, Go, or C++ systems won’t work without significant retooling.
Static analysis struggles with dynamic code. If your code uses heavy reflection, dynamic proxies, or runtime class loading, COCA might miss some execution paths. It’s analyzing code without running it, so anything that’s decided at runtime is harder to track.
RPC bridging relies on naming patterns. It assumes your server implementation class is named something like “JobServiceImpl” if your interface is “JobService.” Custom naming schemes might break the matching.
Fixed execution depth. COCA traces 2 levels deep in the call graph by default. Some bugs might require going deeper, others might need less. There’s no adaptive depth selection yet.
Well-documented code helps. COCA uses method documentation to create those compact summaries. If your codebase has poor or missing docs, the filtering step becomes less effective.
What This Means for Distributed Systems Debugging
COCA represents a shift in how we can approach production debugging.
Traditional approach: wait for failure, manually dig through logs and code, reproduce if possible, fix after hours of investigation.
COCA approach: wait for failure, automatically analyze logs plus code, get root cause and component localization in seconds, proceed directly to fix.
The speed difference matters. Not just for convenience, but for availability. Every hour spent debugging is an hour your system is broken. In high-stakes environments (financial systems, healthcare, critical infrastructure), that hour costs real money or impacts real people.
The accuracy difference matters too. Human experts have bad days. They miss things. They make assumptions. An automated system that systematically traces execution paths and analyzes code is far less prone to those particular failure modes.
Could this be integrated into CI/CD pipelines? Imagine: your integration tests fail, COCA automatically analyzes the failure, and your pull request gets a comment saying “Root cause: race condition in UserService.createAccount() when handling concurrent requests for the same email address.”
Could this work for live production incidents? System crashes, monitoring detects it, COCA analyzes immediately, on-call engineer gets paged with not just “system down” but “system down due to memory leak in CacheManager.evict(), responsible component identified.”
The Bigger Picture
Distributed systems keep getting more complex. Microservices architectures mean a single user request touches dozens of services. Serverless and edge computing add more nodes to the equation. Debugging will only get harder.
COCA’s core insight applies beyond just their specific implementation: combining runtime observability data with static code analysis gives you superpowers. Logs tell you what happened. Code tells you why it happened. Together, they tell the complete story.
The researchers open-sourced their implementation and dataset. That means other teams can build on this, extend it to other languages, improve the RPC (Remote Procedure Call) bridging for different frameworks, experiment with adaptive execution depth.
The Bottom Line
Next time you’re staring at cryptic logs at 3 AM, wondering why your distributed system decided to implode, remember: there’s a better way than manual code archaeology.
COCA proves that automatic root cause analysis for distributed systems is not just possible, but practical. Twenty seconds instead of five hours. Detailed explanations instead of vague guesses. Exact component localization instead of “somewhere in the job submission logic probably.”
The gap between “system crashed” and “here’s exactly why and where” just got a lot smaller.
And your 3 AM debugging sessions just got a lot shorter.
COCA was developed by researchers at The Chinese University of Hong Kong and Sun Yat-sen University, with complete technical details available in their research paper “COCA: Generative Root Cause Analysis for Distributed Systems with Code Knowledge.” The team has open-sourced both the implementation and their carefully annotated dataset of 106 real-world distributed system failures. As distributed systems grow more complex and LLM capabilities mature, automated root cause analysis tools like COCA are becoming essential infrastructure for maintaining reliability at scale.
