== GoVPP Performance ==
 
GoVPP performance has been measured using the [https://gerrit.fd.io/r/gitweb?p=govpp.git;a=blob;f=examples/cmd/perf-bench/perf-bench.go;h=e414b204fcadfc9965ac231adce9f495145a8e05;hb=HEAD perf-bench example binary] on an Ubuntu 16.04 LTS virtual machine running on a laptop with an Intel® Core™ i7-6820HQ CPU @ 2.70 GHz and 16 GB of RAM. The virtual machine was assigned 4 CPU cores and 8 GB of RAM.
  
The benchmark application can measure the performance of both the synchronous and asynchronous GoVPP APIs. The results are compared with those of a benchmark application written in C that uses the same vppapiclient calls as GoVPP. As the results below show, the asynchronous API is recommended wherever speed is a concern:
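The gap between the two modes is largely a matter of pipelining: the synchronous API pays a full request-reply round trip per message, while the asynchronous API keeps many requests in flight at once. A minimal sketch of the two benchmarked modes (sendRequest and receiveReply are hypothetical placeholders, not the actual GoVPP API):

<pre>
package main

// Minimal sketch of the two benchmarked modes. sendRequest and
// receiveReply are hypothetical placeholders for "hand a request to
// VPP" and "block until a reply arrives"; they are not GoVPP calls.

func sendRequest(i int) {}
func receiveReply()     {}

// syncBench: one request in flight at a time; every iteration pays
// the full request-reply round-trip latency.
func syncBench(n int) {
	for i := 0; i < n; i++ {
		sendRequest(i)
		receiveReply()
	}
}

// asyncBench: requests are pipelined; a separate goroutine drains the
// replies, so the round-trip latency is amortized over many requests.
func asyncBench(n int) {
	done := make(chan struct{})
	go func() {
		for i := 0; i < n; i++ {
			receiveReply()
		}
		close(done)
	}()
	for i := 0; i < n; i++ {
		sendRequest(i)
	}
	<-done
}

func main() {
	syncBench(1000)
	asyncBench(1000)
}
</pre>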
(rr/second = requests + replies per second)
  
 
{| class="wikitable"
|+Asynchronous API
|-
|1
|C application (vppapiclient) - control ping
|762 000 rr/second
|-
|2
|GoVPP - control ping
|251 878 rr/second
|-
|3
|GoVPP - l2fib add
|245 560 rr/second
|-
|4
|GoVPP - interface dump
|210 305 rr/second
|}
  
{| class="wikitable"
|+Synchronous API
|-
|5
|C application - control ping
|2 340 rr/second
|-
|6
|GoVPP - control ping
|107 rr/second
|}
  
As the results show, there is quite a big difference between the performance of the C application and GoVPP, and the difference is even larger for the synchronous API. In order to identify the differences and bottlenecks, we did the following:
* performed profiling of the execution of tests 2 ([https://wiki.fd.io/images/f/fe/Async-prof.pdf GoVPP async API]) and 6 ([https://wiki.fd.io/images/d/dc/Sync-prof.pdf GoVPP sync API]);
* performed measurements from a plain Go application that called the vppapiclient library directly, without any further processing in Go, with the following results:
  
 
{| class="wikitable"
|+Plain Go
|-
|7
|Go async - no encode & decode, callback once per 100 replies
|761 836 rr/second
|-
|8
|Go async - no encode & decode
|554 215 rr/second
|-
|9
|Go async - with encode & decode, no message passing
|284 283 rr/second
|-
|10
|Go async - with encode & decode, with message passing
|250 861 rr/second
|}
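The only difference between tests 9 and 10 is whether each decoded reply is consumed directly on the receiving thread or handed to another goroutine through a Go channel. The following self-contained micro-benchmark (illustrative only, not GoVPP code) reproduces the two hand-off styles:

<pre>
package main

import (
	"fmt"
	"time"
)

// reply stands in for a decoded binary API message.
type reply struct{ id uint32 }

//go:noinline
func consume(r reply) {}

func main() {
	const n = 1000000

	// (a) direct hand-off: the receiving thread consumes each reply
	// itself (the shape of test 9).
	start := time.Now()
	for i := 0; i < n; i++ {
		consume(reply{id: uint32(i)})
	}
	fmt.Println("direct: ", time.Since(start))

	// (b) hand-off via a Go channel to another goroutine
	// (the shape of test 10).
	ch := make(chan reply, 100)
	done := make(chan struct{})
	go func() {
		for r := range ch {
			consume(r)
		}
		close(done)
	}()
	start = time.Now()
	for i := 0; i < n; i++ {
		ch <- reply{id: uint32(i)}
	}
	close(ch)
	<-done
	fmt.Println("channel:", time.Since(start))
}
</pre>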
== Discussion ==
* During all tests, the benchmark application utilized more than 100% of a CPU core (it ran in 2 threads), whereas VPP utilized only 30-40%.
* By comparing tests 1 and 8, we can see that the performance of the C application cannot be reached from Go. The profiling results ([https://wiki.fd.io/images/f/fe/Async-prof.pdf GoVPP async API]) show that most of the time is spent in the Go runtime on the message reply callback invoked from the C world, so we performed test 7, which calls the reply callback only once per 100 replies (buffering the replies). The results confirmed the theory and show that with buffered replies we can reach the performance of the C application.
* Since the encoder & decoder that translates binary API messages between their binary form and the Go bindings uses reflection, we knew that this process would consume quite a lot of resources. To find the exact numbers, we performed tests 8 (no encoding & decoding of the messages) and 9 (with encoding & decoding of the messages). The results show that encoding & decoding roughly halves the performance; see the sketch below this list.
* As the results of tests 2, 3 and 4 show, the more complex the reply message, the more time is needed to decode it. This is not entirely true for the requests, since request encoding is much faster than reply decoding (this can also be seen in the profiling results).
* The difference between tests 9 and 10 shows the performance drop caused by passing the reply message via a Go channel between the thread that receives the replies from VPP and the thread running the main Go routine.
* There is a big performance drop between the C application and GoVPP on the synchronous API (tests 5 and 6). The profiling results ([https://wiki.fd.io/images/d/dc/Sync-prof.pdf GoVPP sync API]) show that this is caused by the Go runtime, probably by continuous sleeps and wake-ups, or by the asynchronous thread created by the vppapiclient library that receives the replies from VPP.
== Possible Performance Improvements ==
* Get rid of reflection in message encoding & decoding: either generate custom encoder & decoder code for each binary API message, or allow the user to provide custom encoder & decoder functions. This could be combined with the current reflection-based approach for less frequent binary API messages.
* Buffer replies from VPP when multiple replies are expected, before calling the callback from the C world (see the sketch after this list).
* Get rid of the asynchronous thread created by the vppapiclient library that receives the replies from VPP, since it probably does not fit the concept of goroutines very well; a blocking read from a goroutine may perform much better.
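A sketch of how the reply buffering from the second item could look (the types and names here are hypothetical; the real vppapiclient callback carries raw message data). Replies are collected on the receiving side, and the expensive C-to-Go callback fires once per batch rather than once per reply, which is exactly what test 7 simulated:

<pre>
package main

import "fmt"

// Sketch of reply buffering before the C-to-Go callback. The batcher
// type, reply type and batchSize are hypothetical illustrations.

const batchSize = 100

type reply []byte

// batcher accumulates replies on the receiving side and flushes them
// to the Go consumer in batches, so the costly C-to-Go transition is
// paid once per batchSize replies instead of once per reply.
type batcher struct {
	buf     []reply
	deliver func([]reply) // Go-side callback, invoked once per batch
}

// onReply would be invoked for every single reply received from VPP.
func (b *batcher) onReply(r reply) {
	b.buf = append(b.buf, r)
	if len(b.buf) >= batchSize {
		b.flush()
	}
}

// flush hands the whole batch over and resets the buffer. It must also
// be called when a reply sequence (e.g. a dump) ends, so that the tail
// of the sequence is not left sitting in the buffer.
func (b *batcher) flush() {
	if len(b.buf) == 0 {
		return
	}
	b.deliver(b.buf)
	b.buf = b.buf[:0]
}

func main() {
	total := 0
	b := &batcher{deliver: func(rs []reply) { total += len(rs) }}
	for i := 0; i < 250; i++ {
		b.onReply(reply{0})
	}
	b.flush() // deliver the remaining 50 replies
	fmt.Println("delivered:", total) // 250
}
</pre>

Such batching trades a small amount of per-reply latency for a large reduction in the number of C-to-Go transitions.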
