Python: server is gradually running out of memory
This is the story of how relying on copy-on-write (CoW) in Python can lead to memory leaks when building multi-process applications. If you don't know what CoW is, read about it here first, or you won't find anything useful below. So, let's get to the point.
Once we fork a process, we expect all the memory allocated in the parent to be shared with the child until the child writes into it. When the child process writes into the shared memory, only the affected memory pages are copied. This lets us efficiently build pre-fork servers with dozens of workers, allocating everything once in the parent process.
At hands.ru we use the same model with a uWSGI server. In a recent article I described how we built our search engine and moved the search index into the master process to get rid of the memory explosion that happened when uWSGI forked the child processes. But we missed one thing. Look at the memory utilisation graph of our server below:
Memory was leaking somewhere, and the OOM Killer was coming and killing the workers. After digging into the problem we found that we were not alone. Instagram had faced the same thing and named the problem “copy-on-read”. They allocated many objects in the parent process and forked multiple children. The child processes did not write anything into that memory, yet memory kept leaking. Eventually they noticed that memory was being copied every time a child process read from the shared memory. In their case the reader was the garbage collector, which occasionally visits every object.
To understand it better, let's demonstrate it with a simple example. First we need a function to track the process's memory:
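The original helper isn't preserved in this text. Below is a minimal sketch of what such a function might look like, assuming Linux: it sums the private (unshared) memory of the process from /proc/self/smaps_rollup, which is the kind of number printed in the output further down.

```python
def private_memory_mb() -> int:
    """Private (unshared) memory of the current process, in Mb.

    A minimal Linux-only sketch: copy-on-write pages still shared with
    the parent are not counted, only Private_Clean + Private_Dirty.
    """
    total_kb = 0
    with open("/proc/self/smaps_rollup") as f:
        for line in f:
            if line.startswith(("Private_Clean:", "Private_Dirty:")):
                total_kb += int(line.split()[1])  # the values are in kB
    return total_kb // 1024
```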
And now let's use it in the following example:
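The original script isn't preserved either, so this is a rough, hypothetical reconstruction that reuses the private_memory_mb helper sketched above; the dictionary size and key scheme are arbitrary, only the shape of the output matches the post.

```python
import os

# Hypothetical reconstruction: build a large dict, fork, and read the
# dict from the child while watching its private memory grow.
data = {str(i): str(i) for i in range(3_000_000)}
print("Memory before forking: %d Mb" % private_memory_mb())

pid = os.fork()
if pid:
    # parent: wait for the child, then measure; once the child is gone
    # its pages are no longer shared, so they count as private again
    os.waitpid(pid, 0)
    print("Memory at parent process after forking: %d Mb" % private_memory_mb())
else:
    # child: merely reading the dict is enough to trigger copy-on-write,
    # because every value we touch gets its reference count bumped
    print("Memory at child process after forking: %d Mb" % private_memory_mb())
    for i in range(1_500_000):
        _ = data[str(i)]
    print("Memory at child process after reading the half of data: %d Mb" % private_memory_mb())
    for i in range(1_500_000, 3_000_000):
        _ = data[str(i)]
    print("Memory at child process after reading the rest of data: %d Mb" % private_memory_mb())
    os._exit(0)
```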
Above we create a huge dictionary and fork the process. Then we track what happens to the memory while the child process reads from the dictionary. Because of CoW, nothing should happen. But here is the result:
Memory before forking: 212 Mb
Memory at parent process after forking: 212 Mb
Memory at child process after forking: 2 Mb
Memory at child process after reading the half of data: 83 Mb
Memory at child process after reading the rest of data: 157 Mb
See it? Our dictionary uses 212 Mb, including the bare Python interpreter. After forking, the child process takes only 2 Mb. Perfect, CoW works. But then we start reading from the dictionary and watch the memory consumption grow.
It happens because in CPython every object is a PyObject structure that carries a reference count. Once the reference count drops to zero, the object can be cleaned up. So when you read an object, you increment its reference counter, and for the kernel that is the same as writing into the object: the memory page containing the reference count gets copied. In our case it meant that every time a worker process read any part of the search index, the page holding the PyObject header of the key it touched was copied. Over time the workers had read almost every object in the index, and we were watching the memory leak.
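You can watch the counter move with nothing but reads using sys.getrefcount; this is just a toy illustration, not code from our project:

```python
import sys

obj = ["just", "an", "example"]
print(sys.getrefcount(obj))  # e.g. 2: the obj variable plus the call argument
alias = obj                  # merely "reading" obj into another name...
print(sys.getrefcount(obj))  # ...has written a new value into the PyObject header
```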
What can we do about it?
Reload-on-rss
uWSGI supports a reload-on-rss option: once the physical unshared memory of a worker grows above the specified value, the worker is reloaded. This is a popular way to keep workers from being killed by the OOM Killer, but it is a crutch that only temporarily hides the problem.
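For example, in a uWSGI ini config it might look like this (the 200 Mb threshold is an arbitrary number for illustration, tune it for your application):

```ini
[uwsgi]
; recycle a worker once its resident memory grows above 200 Mb
reload-on-rss = 200
```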
Move it into a different service
We could choose a different language that is not affected by the copy-on-read problem and move the memory-hungry part of our project into a separate service. But another language increases the entropy of the project and takes a lot of time. We'd like to have only Python everywhere in our back-end, which lets our small team move as fast as possible.
Use byte arrays
As noticed earlier, only the page holding the head of the PyObject is copied. So if we find a way to replace our dict with a data structure that uses only one PyObject, the problem will disappear. We decided to replace our dicts with a simple binary search index structure. Let's see how it works. Suppose we have a dictionary:
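The original snippet isn't preserved here; judging by the packed string below, it was presumably something like:

```python
data = {
    "foo": "aaa",
    "bar": "bbbb",
    "baz": "ccccc",
}
```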
We can construct a plain index from it using offsets, that is, recording from which offset we should read the data for a certain key:
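Reconstructed from the packed string below, the flat layout is roughly:

```python
keys = "foobarbaz"       # key offsets: 0 for "foo", 3 for "bar", 6 for "baz"
values = "aaabbbbccccc"  # data offsets: 0 for "aaa", 3 for "bbbb", 7 for "ccccc"
```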
It means that each key gets a number: 0 for "foo", 1 for "bar" and so on. As the next step we can pack everything into one string, like this:
“3036foobarbaz037aaabbbbccccc”
The first 4 bytes hold the number of items in the index; in our simplified example that number is 3. Next we write the offset of every key, then the keys themselves, then the offsets of the data, and finally the data itself. We can even compress it.
To use this index we read the whole thing into a bytes object, or mmap it when the index is really huge. You can download the code of our implementation here. And here is a usage example:
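The library's exact API isn't shown in this text, so the snippet below is a self-contained sketch of the same idea rather than the real implementation; the FlatIndex name and its methods are invented for illustration.

```python
import struct


class FlatIndex:
    """A minimal sketch of the structure described above, not the real
    implementation linked in this post. The whole index lives in a single
    bytes object; offsets are unpacked from it on demand, so lookups never
    touch more than one shared PyObject header."""

    def __init__(self, blob: bytes):
        self._blob = blob
        (self._n,) = struct.unpack_from("<I", blob, 0)
        # layout: count | n+1 key offsets | keys | n+1 data offsets | data
        self._keys_at = 4 + 4 * (self._n + 1)
        (keys_len,) = struct.unpack_from("<I", blob, self._keys_at - 4)
        self._data_offsets_at = self._keys_at + keys_len
        self._data_at = self._data_offsets_at + 4 * (self._n + 1)

    @classmethod
    def build(cls, data: dict) -> "FlatIndex":
        # sort by the encoded key so that lookups can binary search raw bytes
        pairs = sorted((k.encode(), v.encode()) for k, v in data.items())
        key_offs, data_offs = [0], [0]
        for k, v in pairs:
            key_offs.append(key_offs[-1] + len(k))
            data_offs.append(data_offs[-1] + len(v))
        blob = struct.pack("<I", len(pairs))
        blob += struct.pack("<%dI" % len(key_offs), *key_offs)
        blob += b"".join(k for k, _ in pairs)
        blob += struct.pack("<%dI" % len(data_offs), *data_offs)
        blob += b"".join(v for _, v in pairs)
        return cls(blob)

    def _span(self, table_at: int, section_at: int, i: int) -> slice:
        lo, hi = struct.unpack_from("<2I", self._blob, table_at + 4 * i)
        return slice(section_at + lo, section_at + hi)

    def _key(self, i: int) -> bytes:
        return self._blob[self._span(4, self._keys_at, i)]

    def __getitem__(self, key: str) -> str:
        target = key.encode()
        lo, hi = 0, self._n              # binary search over the sorted keys
        while lo < hi:
            mid = (lo + hi) // 2
            if self._key(mid) < target:
                lo = mid + 1
            else:
                hi = mid
        if lo == self._n or self._key(lo) != target:
            raise KeyError(key)
        return self._blob[self._span(self._data_offsets_at, self._data_at, lo)].decode()


index = FlatIndex.build({"foo": "aaa", "bar": "bbbb", "baz": "ccccc"})
print(index["bar"])   # -> bbbb
```

The important property is that no matter how many keys a worker looks up, it only ever reads from the single blob object, so at most one PyObject header can trigger a copy-on-write; the same lookup code also works if the blob is an mmap object instead of bytes.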
After the release we saw a pretty flat picture of memory utilisation:
Shared memory and other options
There are also off-the-shelf solutions like shared memory dicts. We've tested them and they work fine; use one if it suits you.
Furthermore, we could use Cython, which seems more complicated than the solutions in this article. Or we could simply buy more capacious hardware, which at a reasonable scale is cheaper than paying for optimisation, at least up to a point.
Download the fork-aware dictionary implementation here.