gRPC is a high-performance RPC framework from Google that helps bridge the gap between components in large-scale systems. It provides a programming paradigm in which code written in different languages can work together as if it were written in a single language. The framework is promising given the current microservices trend.
I have recently been working on a project that uses gRPC. In the project, one component, the gRPC client written in Python, communicates with another, the gRPC server, once a day like a daily cron job. During each round, the client issues multiple calls.
It worked well on the first round. One day later, when the second round was scheduled, the first call failed with an unavailable exception.
What is more, the subsequent calls were totally OK.
Fortunately, the issue can be reproduced with the simple hello world example, so I have put a small project based on that example on GitHub to explain the details. The demo program wraps an RPC client in object-oriented code.
Check out the buggy code and reproduce the bug with the following steps.
- Start the server with
- Start the client with `python client.py` in another terminal; a normal greeting should be printed.
- Kill the server started in step 1 with `Ctrl + C` and then restart it within 30s. When the next round is scheduled, the exception should be raised.
The issue is actually not rare. A quick search turns up an official GitHub issue, 11043, that tracks the problem and describes exactly what I ran into. That did not make sense at first, because the issue was fixed by PR 11154 and the fix was definitely present in our environment. Still, we can be sure of at least two things.
- The issue is caused by disconnected communication channel.
- The reconnection mechanism did not work as expected.
Every time an instance of `grpc._channel.Channel` is initialized, a subscription that takes a callback parameter is started to monitor the connection status. Multiple callbacks are allowed, and every time the state of the connection changes, all of them are called. According to a comment in the code, this is a workaround that will be removed once c-core supports retry. The subscription takes the following actions.
- Create a thread to poll the connection status if one does not exist yet.
- Poll the state of the connection.
- If there is no callback left, quit the polling thread.
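The three actions above can be modeled in a few lines of plain Python. This is only a toy sketch, not gRPC's actual implementation: the class name `ConnectivityMonitor`, the `poll_state` callable and the polling interval are all assumptions made for illustration.

```python
import threading
import time


class ConnectivityMonitor:
    """Toy model (hypothetical, not gRPC code) of a callback subscription
    backed by a lazily started polling thread."""

    def __init__(self, poll_state, interval=0.01):
        self._poll_state = poll_state  # assumed callable returning the current state
        self._interval = interval
        self._lock = threading.Lock()
        self._callbacks = []
        self._thread = None
        self._last_state = None

    def subscribe(self, callback):
        with self._lock:
            self._callbacks.append(callback)
            # Create the polling thread only if it does not exist yet.
            if self._thread is None:
                self._thread = threading.Thread(target=self._poll, daemon=True)
                self._thread.start()

    def unsubscribe(self, callback):
        with self._lock:
            self._callbacks.remove(callback)

    def _poll(self):
        while True:
            with self._lock:
                if not self._callbacks:
                    # No callback left: the polling thread quits.
                    self._thread = None
                    return
                callbacks = list(self._callbacks)
            state = self._poll_state()
            if state != self._last_state:
                # Notify every subscriber about the state change.
                self._last_state = state
                for callback in callbacks:
                    callback(state)
            time.sleep(self._interval)
```

Once the last callback is removed, nothing watches the connection anymore, which is the situation the rest of this post is about.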
All the callbacks attached to the `grpc._channel.Channel` instance are removed when the object is disposed of by the Python garbage collector, because the class provides a custom `__del__` method that detaches all of them. That means connection state polling stops, which could cause our bug. We can therefore make an assumption: the channel instance we created was somehow destroyed by the GC unexpectedly, so nothing monitors the connection anymore.
Let’s reexamine the buggy code.
The buggy code creates a channel instance and passes it to the client stub of type `GreeterStub`, which is generated by the protocol buffer compiler. If we can show that nothing references the channel instance after some point, we get closer to the truth.
Next, let's check the generated code of the stub class.
There is no explicit reference to the channel instance in this class, but that is not enough to rule out an implicit reference. The `grpc._channel.Channel.unary_unary` method requires more investigation.
We can see that, although the whole channel instance is passed in, the method references only some attributes of the channel instance and not the channel itself. That means that once the client stub is initialized, the channel instance becomes an orphaned Python object. So everything makes sense now.
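The effect can be reproduced without gRPC at all. The sketch below uses hypothetical stand-ins (`FakeChannel`, `FakeStub`, `BuggyClient`) that only mimic the behavior described above: the stub captures an attribute of the channel, and the channel's `__del__` ends monitoring. A weak reference is added purely to observe when the channel disappears.

```python
import gc
import weakref


class FakeChannel:
    """Hypothetical stand-in for a channel whose __del__ stops monitoring."""

    def __init__(self, status):
        self._status = status    # shared dict, so the effect of __del__ stays visible
        self._core = object()    # stand-in for the underlying core channel
        status["monitored"] = True

    def unary_unary(self, method):
        core = self._core        # only an attribute is captured, not `self`

        def call(request):
            return {"core": core, "method": method, "request": request}

        return call

    def __del__(self):
        self._status["monitored"] = False


class FakeStub:
    """Mimics a generated stub: keeps the callables, not the channel."""

    def __init__(self, channel):
        self.SayHello = channel.unary_unary("/helloworld.Greeter/SayHello")


class BuggyClient:
    def __init__(self, status):
        channel = FakeChannel(status)             # referenced by a local variable only
        self._stub = FakeStub(channel)
        self.channel_ref = weakref.ref(channel)   # for observation only
```

After `client = BuggyClient({})` returns, `client.channel_ref()` is `None` on CPython even though `client._stub.SayHello(...)` can still be called: the channel is already gone and its `__del__` has run.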
The solution is simple: prevent the channel instance from being collected.
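A minimal sketch of that idea, again with a hypothetical `FakeChannel` in place of a real gRPC channel: the only change is that the client stores the channel on `self`, so the channel lives exactly as long as the client does. In real code the channel would come from something like `grpc.insecure_channel(...)` and be stored the same way.

```python
import gc
import weakref


class FakeChannel:
    """Hypothetical stand-in for a gRPC channel; __del__ ends monitoring."""

    def __init__(self, status):
        self._status = status
        self._core = object()
        status["monitored"] = True

    def unary_unary(self, method):
        core = self._core  # only an attribute is captured, not the channel
        return lambda request: {"core": core, "method": method, "request": request}

    def __del__(self):
        self._status["monitored"] = False


class FixedClient:
    def __init__(self, status):
        # Keeping the channel as an instance attribute keeps it reachable
        # for as long as the client itself is alive.
        self._channel = FakeChannel(status)
        self.say_hello = self._channel.unary_unary("/helloworld.Greeter/SayHello")
        self.channel_ref = weakref.ref(self._channel)  # for observation only
```

With this version the channel survives garbage collection until the client is dropped, so the monitoring in this model stays active between rounds.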
To be honest, the behavior is quite surprising. `__del__` is evil most of the time and should be avoided unless you really need it. We should not blame the authors of gRPC, though, because without this mechanism resources would leak, and I believe this is the best solution so far.
This is confusing for beginners because it is not documented anywhere. Almost all the code samples, even the unit test cases in the gRPC code base, use a procedural Python style in which the bug does not occur. A warning should have been documented, which is why I wrote this post.