GoVPP Performance
GoVPP performance has been measured using the perf-bench example binary on an Ubuntu 16.04 LTS virtual machine running on a laptop with Intel® Core™ i7-6820HQ CPU @ 2.70GHz and 16 GB of RAM. The virtual machine has been assigned all CPU cores and 8 GB of RAM.
The benchmark application can measure the performance of both the synchronous and the asynchronous GoVPP API. The results are compared with those of a benchmark application written in C that uses the same vppapiclient calls as GoVPP. As the results show, the asynchronous API is recommended when speed matters:
(rr/second = requests + replies per second)
Test | API | Scenario | Performance |
1 | async | C application (vppapiclient) - control ping | 762 000 rr/second |
2 | async | GoVPP - control ping | 251 878 rr/second |
3 | async | GoVPP - l2fib add | 245 560 rr/second |
4 | async | GoVPP - interface dump | 210 305 rr/second |
5 | sync | C application - control ping | 2 340 rr/second |
6 | sync | GoVPP - control ping | 107 rr/second |
As the results show, the C application is roughly three times faster than GoVPP with the asynchronous API (tests 1 and 2), and the gap grows to roughly 22 times with the synchronous API (tests 5 and 6). To identify the differences and bottlenecks, we did the following:
- Profiled the execution of tests 2 (GoVPP async API) and 6 (GoVPP sync API).
- Measured a bare Go application that calls the vppapiclient library directly, without any further processing in Go, with the following results:
Test | Scenario | Performance |
7 | Go async - no encode & decode, callback once per 100 replies | 761 836 rr/second |
8 | Go async - no encode & decode | 554 215 rr/second |
9 | Go async - with encode & decode, no message passing | 284 283 rr/second |
10 | Go async - with encode & decode, with message passing | 250 861 rr/second |
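To see how much each processing step costs, the relative drop between consecutive tests can be computed directly from the numbers above. This is a small arithmetic sketch; the `dropPercent` helper is an illustrative name, and the row labels paraphrase the test descriptions.

```go
package main

import "fmt"

// dropPercent returns how much lower `after` is than `before`, in percent.
func dropPercent(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	// rr/second values copied from tests 7-10 above.
	results := []struct {
		name string
		rate float64
	}{
		{"7 (callback once per 100 replies)", 761836},
		{"8 (no encode & decode)", 554215},
		{"9 (with encode & decode)", 284283},
		{"10 (with message passing)", 250861},
	}
	for i := 1; i < len(results); i++ {
		prev, cur := results[i-1], results[i]
		fmt.Printf("%s -> %s: %.1f%% lower\n",
			prev.name, cur.name, dropPercent(prev.rate, cur.rate))
	}
}
```

Each step is compared with the one before it, so the percentages describe the incremental cost of per-reply callbacks, of encoding and decoding, and of message passing, in that order.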
Discussion
- Comparing tests 1 and 8 shows that Go cannot reach the performance of the C application when the callback is invoked for every reply. The profiling results (GoVPP async API) show that most of the time is spent by the Go runtime on calling the message reply callback from the C world. We therefore ran test 7, which buffers the replies and calls the reply callback only once per 100 replies. The results confirmed the theory: with buffered replies, Go matches the performance of the C application (761 836 vs. 762 000 rr/second).
- The encoder and decoder that convert binary API messages to and from the Go bindings use reflection, so we expected this step to consume a lot of resources. To quantify the cost, we ran tests 8 (no encoding and decoding of the messages) and 9 (with encoding and decoding of the messages). The results show that encoding and decoding reduce throughput by roughly 50% (from 554 215 to 284 283 rr/second).
- As the results of tests 2, 3 and 4 show, the more complex the reply message, the more time is required to decode it. This is not entirely true for requests, since request encoding is much faster than reply decoding (this can also be seen in the profiling results).
- The difference between tests 9 and 10 (about 12%) shows the performance drop caused by passing the reply message over a Go channel between the thread that receives the replies from VPP and the thread running the main goroutine.
- There is a big performance drop between the C application and GoVPP with the synchronous API (tests 5 and 6). The profiling results (GoVPP sync API) show that this is caused by the Go runtime, probably by continuous sleeps and wakeups, or by the async thread created by the vppapiclient library that receives the replies from VPP.
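The buffering idea behind test 7 can be sketched in plain Go. This is only an illustration of the batching logic, under the assumption that replies are collected before the expensive callback runs; in a real setup the buffering would have to happen before the C-to-Go boundary is crossed. The `Reply` and `Batcher` types are hypothetical and not part of GoVPP.

```go
package main

import "fmt"

// Reply is a hypothetical stand-in for a raw reply received from VPP.
type Reply []byte

// Batcher collects replies and hands them to the callback in groups,
// so the callback runs once per batch instead of once per reply.
type Batcher struct {
	size     int
	buf      []Reply
	callback func([]Reply)
}

// NewBatcher creates a Batcher that flushes after size replies.
func NewBatcher(size int, callback func([]Reply)) *Batcher {
	return &Batcher{size: size, buf: make([]Reply, 0, size), callback: callback}
}

// Add buffers one reply and flushes when the batch is full.
func (b *Batcher) Add(r Reply) {
	b.buf = append(b.buf, r)
	if len(b.buf) >= b.size {
		b.Flush()
	}
}

// Flush delivers any buffered replies. The slice passed to the callback
// is reused afterwards, so the callback must not retain it.
func (b *Batcher) Flush() {
	if len(b.buf) == 0 {
		return
	}
	b.callback(b.buf)
	b.buf = b.buf[:0]
}

func main() {
	calls, total := 0, 0
	b := NewBatcher(100, func(rs []Reply) {
		calls++
		total += len(rs)
	})
	for i := 0; i < 250; i++ {
		b.Add(Reply{byte(i)})
	}
	b.Flush() // deliver the final partial batch
	fmt.Printf("%d callbacks for %d replies\n", calls, total) // prints "3 callbacks for 250 replies"
}
```

The explicit `Flush` matters: without it, the last partial batch would stay in the buffer when fewer replies arrive than the batch size.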
Possible Performance Improvements
- Get rid of reflection in message decoding. Either generate custom encoder and decoder code for each binary API message, or allow the user to provide custom encoder and decoder functions. This could be combined with the current approach for less frequently used binary API messages.
- Buffer replies from VPP when multiple replies are expected before calling the callback from the C world.
- Get rid of the async thread created by the vppapiclient library that receives the replies from VPP, since it probably does not fit the concept of goroutines very well. A blocking read from a goroutine may perform much better.