How we analyzed and fixed a Golang memory leak

Sharva Pathak

Engineering

At Glean, we’re building a modern, cloud-only architecture to solve hard enterprise search and knowledge management problems. Performance and resource cost optimization are critical and often lead us to interesting technical challenges. Debugging one of them led to a discovery that may be useful to others facing similar issues.

At Glean, we use Golang for a moderately memory-intensive service, and we use Google Cloud Platform (GCP) for most of our deployment. We run this service as an App Engine flexible environment instance using a custom runtime image that includes Go 1.15. We saw the following behavior in this service:

The memory would slowly ramp up and reach the limit (we were using 3GB as the App Engine resource limit in this case), and then the instance would get killed, likely because it was exceeding the memory limit. On the memory graph, the steady ramp-up smelled like a memory leak.

Not an application memory leak

Thankfully, a memory leak in the application is not hard to debug in our case, since we have access to continuous profiling data from Cloud Profiler. In the past, we have seen unclosed gRPC connections cause such issues, and those were easy to debug with the continuous profiler: the flame graph in the profiler UI would clearly show heavy usage at a specific call site. That was not happening this time. One interesting thing the profiles did reveal was that the average heap size (i.e. the memory held by in-use objects) was around 1.5GB, roughly half the memory footprint App Engine was reporting. This suggested that the rest of the memory was being held somewhere in the Go runtime rather than by live application objects. Our immediate next thought was memory fragmentation, a plausible suspect because Go’s garbage collector does not compact the heap.

Not a fragmentation case

Luckily, it wasn’t too hard to conclude that fragmentation was not the culprit either. We added a background goroutine that periodically logs runtime.MemStats. An upper bound on fragmented memory can be obtained by subtracting HeapAlloc from HeapInuse. As the Go documentation puts it, “HeapInuse minus HeapAlloc estimates the amount of memory that has been dedicated to particular size classes, but is not currently being used.” This amount was fairly small, ~3MB in our case.

It was also interesting that the values for HeapReleased were fairly large. Digging further, we came across a Golang issue thread describing similar behavior.

Golang / container environment interaction issue

The working theory in the Golang issue thread is that Go started using MADV_FREE as the default on Linux in Go 1.12. With MADV_FREE, freed memory is not necessarily returned to the OS immediately; the OS can choose to reclaim it when it comes under memory pressure. Containers, however, are essentially just processes running under separate cgroups. The OS as a whole might therefore not feel any memory pressure and never reclaim the memory, even as the container hits its memory limit and gets killed.

Fortunately, there’s a Golang debug flag to flip this behavior and use MADV_DONTNEED instead: set the GODEBUG environment variable to “madvdontneed=1”. In fact, Go 1.16 reverted to MADV_DONTNEED as the default. After this change, the memory graph looks much better and holds steady at around 2G.
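Since the Go runtime reads GODEBUG when the process starts, the variable has to be in the environment before the binary launches (for example, in the container or deployment configuration) rather than set from inside the program. A small startup check like the sketch below can confirm that the setting actually reached the process. The helper name madvDontNeedEnabled is made up for illustration and is not part of any Go API.

```go
package main

import (
	"log"
	"os"
	"strings"
)

// madvDontNeedEnabled (hypothetical helper) reports whether a GODEBUG value,
// which is a comma-separated list of name=value pairs, has madvdontneed=1.
func madvDontNeedEnabled(godebug string) bool {
	for _, kv := range strings.Split(godebug, ",") {
		if strings.TrimSpace(kv) == "madvdontneed=1" {
			return true
		}
	}
	return false
}

func main() {
	if madvDontNeedEnabled(os.Getenv("GODEBUG")) {
		log.Print("GODEBUG contains madvdontneed=1")
	} else {
		log.Print("GODEBUG does not contain madvdontneed=1; on Go 1.12 to 1.15 the runtime may use MADV_FREE")
	}
}
```

The check only inspects the environment variable; it does not verify which madvise mode the runtime ended up using.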

Key takeaways

pprof and flame graphs are useful for analyzing application memory leaks. A continuous profiler lets you compare multiple snapshots of the profile and quickly figure out the cause of a leak. Cloud Profiler is a handy tool for GCP workloads.
MemStats logging can help analyze potential causes at a higher level. In particular, “HeapInuse minus HeapAlloc” can be used as an upper bound when estimating the amount of memory wasted to fragmentation.
If you are running any Go version from 1.12 through 1.15 within containers, you likely want to set madvdontneed=1 in GODEBUG. :-)
