
Debugging a 32,767 Thread Leak in Apache Hadoop ABFS: How One Missing close() Call Crashed Production


TL;DR: Found and fixed a 3-year-old thread leak in Apache Hadoop's Azure ABFS connector causing Hive Metastores to crash at 32,767 threads. The bug affects any long-running Hadoop service using Azure Data Lake Storage with autothrottling enabled. Immediate workaround: disable autothrottling (see below). Fix: Merged into Apache Hadoop ✅

Reading time: 8 minutes

Calm Fridays always worry me

Introduction

It was late morning, going on afternoon, when the Slack notification sound told me a new case had landed in the queue: "Urgent: Production Hive Metastore Container Continuously Crashing." My customer's data platform was in crisis mode -- their Metastore would run for a few hours, then collapse with an OutOfMemoryError. Rinse and repeat. Historically, this Metastore had always been a little fragile due to its architecture, but the error logs looked a bit different this time.

java.lang.OutOfMemoryError: unable to create native thread. Thread count: 32,767 -- the JVM's hard limit. But the Metastore shouldn't need that many threads. Something was leaking, badly. What I found wasn't just their problem. It was a 3-year-old bug in Apache Hadoop affecting anyone using Azure Data Lake Storage. And the fix? A single missing close() call.

Here's how I traced a production outage to its source, built a reproduction test, and contributed the fix upstream to Apache Hadoop -- so you won't have to debug this one yourself.


⚠️ Are You Hitting This Right Now?

Symptoms:

  • Hive Metastore crashing with OutOfMemoryError: unable to create native thread
  • Thread count at or near 32,767
  • Thread dumps show thousands of abfs-timer-client-throttling-analyzer threads

Immediate Fix:

<property>
  <name>fs.azure.enable.autothrottling</name>
  <value>false</value>
</property>

Add to hive-site.xml and restart. This will keep you stable while you wait for the upstream fix.


The Discovery

Like I said, this particular HMS was always a little fragile -- when this type of issue occurs, the error logs usually indicate that the connection pool is being maxed out by multiple jobs hitting the metastore from all sides: Databricks, Starburst, clients connecting directly, and so on. However, being unable to create a new thread was something different that I hadn't encountered before, and no new changes had been introduced.

The thread dumps told a clear story, and it wasn't a good one. The abfs-timer-client-throttling-analyzer threads -- hundreds of them -- were piling up like cars in a traffic jam. Each one sitting idle, waiting, never cleaned up. Page after page of thread dumps, all showing the same pattern.

"abfs-timer-client-throttling-analyzer-write" #58711 - WAITING
"abfs-timer-client-throttling-analyzer-read"  #58710 - WAITING  
"abfs-timer-client-throttling-analyzer-write" #58705 - WAITING
"abfs-timer-client-throttling-analyzer-read"  #58704 - WAITING
...
[58,000+ more just like this]

(Thread dump simplified for readability - full output in production logs)

Just page after page of this. A quick Google search didn't turn up anything really helpful, but restarting and increasing the pod memory bought us a bit more time until the symptoms returned. What we did know was that the thread count seemed to spike when a large number of queries were running against the HMS.

What is ABFS Auto-Throttling?

The Azure Blob FileSystem (ABFS) connector acts as a storage layer for Hadoop clusters. To avoid sending tons of requests to Azure all at once and to stay within rate limits, this component throttles connections at the storage account level -- if one account is receiving more requests than the others, that account can be throttled without interfering with the other ones. You can read more about the functionality in HADOOP-18457, where the account-level throttling was initially implemented.

The problem, though, is that each of these requests creates a Java Timer object -- a way to schedule tasks on a background thread -- that is never cleaned up. So each time we make a read or write request to object storage in Azure, we spawn new threads that take up resources and are never released. Eventually, a long-running service like the Hive Metastore or a Spark job hits an OutOfMemoryError and the JVM crashes. If you look at the implementation before the fix, you can see new timers being spawned, but the method to get rid of them is never called when the work is done.
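
To make that concrete, here's a minimal, self-contained sketch of the pattern (an illustration only, not the actual Hadoop code; the class name, thread name, and 10-second period are stand-ins): a java.util.Timer is created with a repeating task and then abandoned without cancel(), so its background thread sticks around for the life of the JVM.

import java.util.Timer;
import java.util.TimerTask;

public class TimerLeakDemo {
    public static void main(String[] args) throws Exception {
        // A Timer starts its own background thread, named here like the ABFS analyzer threads.
        Timer timer = new Timer("abfs-timer-client-throttling-analyzer-read", true);
        timer.schedule(new TimerTask() {
            @Override public void run() { /* periodic throttling analysis would run here */ }
        }, 0, 10_000);

        // With a repeating task scheduled, the thread never exits on its own -- even if the
        // Timer reference is dropped. The only way to stop it is an explicit cancel().
        Thread.sleep(200);
        Thread.getAllStackTraces().keySet().stream()
                .map(Thread::getName)
                .filter(n -> n.startsWith("abfs-timer"))
                .forEach(System.out::println);  // the timer thread is alive and waiting

        // timer.cancel();  // <- the missing cleanup step this post is about
    }
}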

One quick workaround to keep things stable was to disable the auto-throttling analyzer functionality entirely by adding the following to the hive-site.xml configuration. This forces any throttling to happen on the Azure side, which may cause some queries to fail, but at least the metastore stays functional.

<property>
  <name>fs.azure.enable.autothrottling</name>
  <value>false</value>
</property>

Reproducing the Issue

Armed with a general idea of how this worked and where the issue was occurring, I took some time to put together tests to reproduce it. I built a GitHub repo that uses the AbfsClientThrottlingAnalyzer class to make requests against an ADLS container I set up and tracks the threads produced.

https://github.com/mattkduran/ABFSleaktest

After some trial and error, more errors, googling, using Claude, more googling, I got things put together in a suitable way. There it was. Reproducible. Undeniable. Every request was leaking two timer threads. Below is an example of repeating the same task 20 times with a 2-second delay between cycles:

$ java --add-opens java.base/java.lang=ALL-UNNAMED [...]
SYSTEM INFORMATION
========================================
Java Version: 23.0.2
Hadoop Version: 3.3.6
Test Time: 2025-07-07 19:50:23

============================================================
ABFS THREAD LEAK REPRODUCTION TEST
============================================================
Cycle    Total        ABFS Timer   Memory (MB)     
----------------------------------------------------------------
1        6->22        0->2         9.4          
2        22->24       2->4         9.4          
3        24->26       4->6         9.4          
4        26->28       6->8         9.5          
5        28->30       8->10        9.5          
...
18       54->55       34->36       9.9          
19       55->55       36->36       9.9          
20       55->55       36->36       9.9          

============================================================
FINAL ANALYSIS
============================================================
⚠️  THREAD LEAK DETECTED!
   36 ABFS timer threads were not cleaned up
   These threads will persist until JVM shutdown

Scaling this up to a more realistic scenario, and a small one at that, of 200 requests with no pause between them showed how quickly this can get out of hand:

Final: 324 total threads (+318), 304 ABFS timer threads (+304)

⚠️ MASSIVE THREAD LEAK DETECTED!
304 ABFS timer threads were not cleaned up
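
For reference, the thread counting behind those numbers is nothing fancy. It boils down to roughly the sketch below (a simplified stand-in for the actual ABFSleaktest harness; the class and helper names are placeholders, and the ADLS workload is elided): snapshot the live threads whose names carry the analyzer prefix before and after each cycle.

public class AbfsTimerThreadCount {
    // Count live threads whose names start with the ABFS throttling-analyzer prefix.
    static long countAnalyzerThreads() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().startsWith("abfs-timer-client-throttling-analyzer"))
                .count();
    }

    public static void main(String[] args) {
        long before = countAnalyzerThreads();
        // ... run one read/write cycle against ADLS here ...
        long after = countAnalyzerThreads();
        System.out.printf("ABFS timer threads: %d -> %d%n", before, after);
    }
}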

So what's the fix?

Obviously, relying on Azure to throttle clients was less than ideal, since that could fail queries that were otherwise valid. Revisiting the original GitHub PR showed that the analyzer's timer had no method to close out its thread and free the resources. After a bit of fumbling and reading the Java documentation for Timer, I came up with the following and included it in my reproduction:

/**
 * Closes the throttling analyzer and cleans up resources.
 */
public void close() {
    if (timer != null) {
        timer.cancel();
        timer.purge();
        timer = null;
    }
    
    if (blobMetrics != null) {
        blobMetrics.set(null);
    }
    
    LOG.debug("AbfsClientThrottlingAnalyzer for {} has been closed", name);
}

@VisibleForTesting
public boolean isClosed() {
    return timer == null;
}

If I invoked this method, I could immediately see the threads were cleaned up!

============================================================
TESTING FIXED ANALYZER DIRECTLY
============================================================
🔴 Simulating BROKEN behavior (no close() called):
  Result: 58 total threads (+3), 39 ABFS timer threads (+3) - LEAKED!

✅ Testing FIXED behavior (with close() called):
  Result: 58 total threads (+0), 39 ABFS timer threads (+0)
✅ FIX WORKS! No additional threads leaked when close() is called!

📋 SUMMARY:
  Without fix: +3 ABFS timer threads (LEAKED)
  With fix: +0 ABFS timer threads (NO LEAK)

Okay, I have a potential fix which requires a code change in Hadoop -- how do I even do that?

Getting in contact

Pushing aside my imposter syndrome, I reached out via the only piece of contact information I could think to use at this point -- the Hadoop common-dev mailing list.

https://lists.apache.org/list?common-dev@hadoop.apache.org:2025-7:ABFS%20thread

Thankfully, the open-source community is very welcoming to newcomers, especially people who have done their homework. I was able to get in touch with a couple of contributors who got me set up to actually contribute to the Hadoop project. Wait, what?

Contributing to Apache Hadoop

It sounds very intimidating, absolutely. However, I had already done a lot of work upfront and now it was just about getting these changes added to the project, built, and tested.

The fix required changes in three places:

  1. AbfsClientThrottlingAnalyzer.java - Add the close() method
  2. AbfsClient.java - Call close() when cleaning up (a rough sketch follows after this list)
  3. Add tests to verify timer cleanup
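
I won't reproduce the actual PR here, but for item 2 the shape of the change is roughly the following (a hypothetical sketch -- the readAnalyzer and writeAnalyzer field names are placeholders, and the real wiring in AbfsClient may differ):

// Hypothetical sketch inside AbfsClient, which already implements Closeable.
@Override
public void close() throws IOException {
    // Cancel the throttling analyzers' timers during client shutdown.
    if (readAnalyzer != null) {
        readAnalyzer.close();   // cancels the "...-read" timer thread
    }
    if (writeAnalyzer != null) {
        writeAnalyzer.close();  // cancels the "...-write" timer thread
    }
    // ... existing AbfsClient cleanup continues here ...
}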

After a couple of rounds of reviews, things were ready to go! Right now the fix is pending unrelated code fixes before it can be merged, but once it's merged and released I'll add an update here. You can follow the progress here if the mood strikes you:

https://github.com/apache/hadoop/pull/7852

Impact & Lessons Learned

This bug has existed since 2022 (HADOOP-18457) and affects every organization using:

  • Azure Data Lake Storage (ADLS)
  • Apache Hive Metastore on Azure
  • Long-running Spark jobs with ADLS
  • Databricks on Azure
  • Any Hadoop-based system with ABFS autothrottling enabled

If you're seeing mysterious OOM errors with tens of thousands of threads, check for abfs-timer-client-throttling-analyzer in your thread dumps. You're hitting this bug.

What I Learned:

1. Resource cleanup is non-optional
A single missing close() call can bring down production systems. Java's Timer runs its tasks on a background thread, and even a daemon thread isn't "automatically cleaned up" while the JVM is running. You have to explicitly cancel timers.

2. Reproduction is key
Building ABFSleaktest was crucial. Without a reproducible test case, I couldn't have:

  • Verified the bug existed
  • Tested my fix
  • Given Apache reviewers confidence in the solution

3. Contributing upstream matters
The workaround (disabling autothrottling) kept my customer stable, but fixing it upstream helps everyone.

4. The Apache review process works
The feedback on my PR made the code better. The community's thoroughness ensures quality, even if it adds time to the process.

Conclusion

What started as a Friday afternoon production crisis turned into a deeper investigation of resource management in distributed systems. The fix was simple -- a few lines to cancel and clean up Timer threads -- but finding it required:

  • Understanding distributed system fundamentals
  • Patient debugging with thread dumps
  • Building a minimal reproduction
  • Contributing to open source

The customer's Hive Metastore is stable now. More importantly, the fix will hopefully be merged soon into Apache Hadoop, so teams facing this issue can upgrade rather than work around it.

Resources:

If you're hitting this bug in production, the immediate workaround is to disable autothrottling:

<property>
  <name>fs.azure.enable.autothrottling</name>
  <value>false</value>
</property>

Have you encountered similar resource leaks in production? I'd love to hear about them. You can find me on LinkedIn or GitHub.

A single missing close() call can bring down a production service. Resource cleanup isn't optional -- it's essential.