Difference between revisions of "CSIT/TestFailuresTracking"
Revision as of 10:07, 30 January 2023
Contents
- 1 CSIT Test Failure Classification
- 2 CSIT Test Fixing Priorities
- 3 Always Failing Tests
- 3.1 In Trending
- 3.1.1 (M) 3n-alt: Tests failing until 40Ge Interface comes up
- 3.1.2 (M) 3n-snr: All hwasync wireguard tests failing when trying to verify device
- 3.1.3 (M) 1n-aws: TRex mlrsearch fails to find NDR & PDR due to AWS rate limiting (5min total test duration)
- 3.1.4 (M) 3n-alt, 3n-snr: testpmd no traffic forwarded
- 3.2 not in trending
- 3.2.1 (H) 3n-icx: vpp hoststack QUIC vppecho tests failing
- 3.2.2 (M) all testbeds: vpp 9000B tests with vhostuser, memif, tunnels, avf
- 3.2.3 (M) 2n-clx, 2n-icx: all Geneve tests with 1024 tunnels fail
- 3.2.4 (L) 2n-clx, 2n-icx: nat44ed cps 16M sessions scale fail
- 3.2.5 (L) 2n-clx, 2n-icx: nat44det imix 1M sessions fails to create sessions
- 4 Sometimes failing tests
- 4.1 in trending - high frequency failures
- 4.2 in trending - lower frequency failures
- 4.2.1 (M) 3n-icx, 3n-snr: 1518B IPsec packets not passing
- 4.2.2 (M) all testbeds: mlrsearch fails to find NDR rate
- 4.2.3 (M) all testbeds: AF_XDP mlrsearch fails to find NDR rate
- 4.2.4 (L) all testbeds: vpp create avf interface failure in multi-core configs
- 4.2.5 (L) all testbeds: nat44det 4M and 16M scale 1 session not established
- 5 Fixed issues
- 5.1 (H) all testbeds: all DPDK tests did not run because DPDK failed to install meson
- 5.2 (L) 2n-dnv: nat44ed 1518B 64k sessions not establishing all sessions
- 5.3 (L) 2n-dnv, 3n-dnv: x557 auto-negotiating 1ge instead of 10ge
- 5.4 (M) 3n-snr: 25GE links randomly going down between snr/sut and icx/tg-trex
- 5.5 (H) 2n-clx, 2n-zn2: VPP RDMA tests no traffic forwarded
CSIT Test Failure Classification
All known CSIT failures are grouped and listed in the following order:
- Always failing followed by sometimes failing.
- Always failing tests:
- Most common use cases followed by less common.
- Sometimes failing tests:
- Most frequently failing followed by less frequently failing.
- High frequency 50%-100%
- Medium frequency 10%-50%
- Low frequency 0%-10%
- Within each sub-group: most common use cases followed by less common.
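The frequency bands above can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and because the listed bands share their edges (10% and 50%), the sketch assumes a boundary value falls into the higher band.

```python
def classify_failure_frequency(failed_runs: int, total_runs: int) -> str:
    """Map a failure rate to the high/medium/low bands listed above.

    Assumption (not stated on the page): a rate exactly on a band
    boundary (10% or 50%) is assigned to the higher band.
    """
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    if not 0 <= failed_runs <= total_runs:
        raise ValueError("failed_runs must be between 0 and total_runs")

    rate = failed_runs / total_runs
    if rate >= 0.5:
        return "high"
    if rate >= 0.1:
        return "medium"
    return "low"


if __name__ == "__main__":
    # 3 failures in 20 runs is a 15% failure rate.
    print(classify_failure_frequency(3, 20))
```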
CSIT Test Fixing Priorities
- Test fixing work priorities are defined as follows:
- (H)igh priority, most common use cases and most common test code.
- (M)edium priority, specific HW and pervasive test code issue.
- (L)ow priority, corner cases and external dependencies.
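Issue headings on this page carry the priority marker as a prefix, for example "(M) 3n-alt: Tests failing until 40Ge Interface comes up". A minimal sketch of splitting such a heading and ordering issues by priority might look like the following. The `Issue` type, the function names, and the regular expression are hypothetical helpers for illustration, not part of CSIT; the sketch also assumes every heading has the "(P) testbeds: title" shape.

```python
import re
from typing import Iterable, List, NamedTuple, Tuple

# Lower number sorts first: (H)igh, then (M)edium, then (L)ow.
PRIORITY_ORDER = {"H": 0, "M": 1, "L": 2}

# Assumed heading shape: "(P) testbed[, testbed...]: title"
HEADING_RE = re.compile(
    r"^\((?P<prio>[HML])\)\s+(?P<testbeds>[^:]+):\s+(?P<title>.+)$"
)


class Issue(NamedTuple):
    priority: str
    testbeds: Tuple[str, ...]
    title: str


def parse_heading(heading: str) -> Issue:
    """Split an issue heading into priority, testbeds and title."""
    match = HEADING_RE.match(heading.strip())
    if match is None:
        raise ValueError(f"unrecognised heading: {heading!r}")
    testbeds = tuple(t.strip() for t in match.group("testbeds").split(","))
    return Issue(match.group("prio"), testbeds, match.group("title"))


def sort_by_priority(headings: Iterable[str]) -> List[Issue]:
    """Return parsed issues ordered H, M, L (stable within a priority)."""
    issues = [parse_heading(h) for h in headings]
    return sorted(issues, key=lambda issue: PRIORITY_ORDER[issue.priority])
```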
Always Failing Tests
In Trending
(M) 3n-alt: Tests failing until 40Ge Interface comes up
- (M) 3n-alt: Tests failing until 40Ge Interface comes up
- work-to-fix: easy
- rca:
- test: First tests in order
- frequency: always
- testbed: 3n-alt
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-alt/155/log.html.gz#s1-s1-s1-s1-s1-t1
- ticket:
- note: In recent runs the first tests fail until the interface comes up. The number of failed tests varies (2-61 failed tests). The issue may be a result of Vratko running DPDK testpmd runs. For example, in #154 all tests passed.
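The example link above appears to encode the Jenkins instance, job name, build number, and a position inside the log. A minimal sketch that pulls those parts out of such a URL is shown below. The four-segment path layout is inferred only from the example link, and the function name and dictionary keys are hypothetical.

```python
from urllib.parse import urlparse


def parse_log_url(url: str) -> dict:
    """Split an s3-logs example link into its apparent components.

    Assumed path layout (inferred from the example on this page):
    /<jenkins instance>/<job name>/<build number>/<log file>
    """
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) != 4:
        raise ValueError(f"unexpected path layout: {parsed.path!r}")
    instance, job, build, filename = parts
    return {
        "instance": instance,
        "job": job,
        "build": int(build),
        "file": filename,
        # Fragment that points at a suite/test inside the log.
        "anchor": parsed.fragment,
    }


if __name__ == "__main__":
    info = parse_log_url(
        "https://s3-logs.fd.io/vex-yul-rot-jenkins-1/"
        "csit-vpp-perf-mrr-daily-master-3n-alt/155/"
        "log.html.gz#s1-s1-s1-s1-s1-t1"
    )
    print(info["job"], info["build"], info["anchor"])
```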
(M) 3n-snr: All hwasync wireguard tests failing when trying to verify device
- (M) 3n-snr: All hwasync wireguard tests failing when trying to verify device
(M) 1n-aws: TRex mlrsearch fails to find NDR & PDR due to AWS rate limiting (5min total test duration)
- (M) 1n-aws: TRex NDR PDR ALL IP4 scale and L2 scale tests failing with 50% packet loss
(M) 3n-alt, 3n-snr: testpmd no traffic forwarded
- (M) 3n-alt, 3n-snr: testpmd tests fail with no traffic
not in trending
(H) 3n-icx: vpp hoststack QUIC vppecho tests failing
- (H) 3n-icx: QUIC vppecho BPS tests failing on timeout when checking hoststack finished
(M) all testbeds: vpp 9000B tests with vhostuser, memif, tunnels, avf
- All tests with 9000B payload frames not forwarded over vhostuser interfaces.
- work-to-fix: hard
- rca: VPP code: 34839: dpdk: cleanup MTU handling
- test: 9000B - vhostuser
- frequency: always
- testbed: 2n-skx, 3n-skx, 2n-clx
- examples: 3n-skx vhostuser
- ticket: CSIT-1809
- note:
- All tests with 9000B payload frames not forwarded over memif interfaces.
- work-to-fix: hard
- rca: VPP code: 34839: dpdk: cleanup MTU handling
- test: 9000B - memif
- frequency: always
- testbed: 2n-skx, 3n-skx, 2n-clx
- examples: 2n-skx Memif
- ticket: CSIT-1808
- note:
- 9000B payload frames not forwarded over tunnels due to violating supported Max Frame Size (VxLAN, LISP, SRv6)
- work-to-fix: hard
- rca: VPP code: 34839: dpdk: cleanup MTU handling
- test: 9000B - IP4 tunnels VXLAN, IP4 tunnels LISP, SRv6
- frequency: always
- testbed: 2n-icx, 3n-icx
- examples: 2n-icx VXLAN, 3n-icx
- ticket: CSIT-1801
- note:
- (M) 3n-icx: 9000b ip4 ip6 l2 NDRPDR AVF tests are failing to forward traffic
- work-to-fix: hard
- rca:
- test: 9000B - IP4, IP6, l2 - base and scale
- frequency: always
- testbed: 3n-icx
- examples: 3n-icx ip4base
- ticket: CSIT-1885
- note:
- (M) 2n-clx, 2n-icx, 2n-zn2: DPDK testpmd 9000b tests on xxv710 nic are failing with no traffic
(M) 2n-clx, 2n-icx: all Geneve tests with 1024 tunnels fail
- (M) All Geneve L3 mode scale tests (1024 tunnels) are failing
(L) 2n-clx, 2n-icx: nat44ed cps 16M sessions scale fail
- (L) All NAT44-ED 16M sessions CPS scale tests fail while setting NAT44 address range.
- work-to-fix: hard
- rca: VPP crash, Failed to set NAT44 address range on host 10.30.51.44 (connections-per-second tests only)
- test: 64B-avf-ethip4tcp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c, 64B-avf-ethip4udp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c
- frequency: always
- testbeds: 2n-skx, 2n-clx, 2n-icx
- example: 2n-icx, 2n-clx
- ticket: CSIT-1799
- note:
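The scale numbers in the test names above fit together arithmetically: 262144 × 63 = 16515072. The sketch below extracts the three numbers from a test name and checks that product. Reading `h`, `p` and `s` as hosts, ports per host and sessions is an assumption based on the name alone, and the helper name is hypothetical.

```python
import re

# Assumed meaning of the name fields: h = hosts, p = ports per host,
# s = total sessions. This reading is inferred from the test name only.
SCALE_RE = re.compile(r"-h(?P<hosts>\d+)-p(?P<ports>\d+)-s(?P<sessions>\d+)-")


def check_session_scale(test_name: str):
    """Return (hosts, ports, sessions, consistent) for a NAT44 test name."""
    match = SCALE_RE.search(test_name)
    if match is None:
        raise ValueError(f"no h/p/s scale fields in {test_name!r}")
    hosts = int(match.group("hosts"))
    ports = int(match.group("ports"))
    sessions = int(match.group("sessions"))
    # The name is self-consistent when hosts * ports equals sessions.
    return hosts, ports, sessions, hosts * ports == sessions


if __name__ == "__main__":
    name = "64B-avf-ethip4tcp-nat44ed-h262144-p63-s16515072-cps-ndrpdr"
    print(check_session_scale(name))
```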
(L) 2n-clx, 2n-icx: nat44det imix 1M sessions fails to create sessions
- (L) 2n-clx, 2n-icx: All NAT44DET NDR PDR IMIX over 1M sessions BIDIR tests failing to create enough sessions
Sometimes failing tests
in trending - high frequency failures
(H) 2n-icx: NFV density VPP does not start in container
- (H) 2n-icx: NFV density tests break VPP which fails to start (re-opened)
- work-to-fix: hard
- rca:
- test: all subsequent
- frequency: medium
- testbed: 2n-icx
- example: 2n-icx mrr, 2n-icx ndrpdr
- ticket: CSIT-1881
- note: Once VPP breaks, all subsequent tests fail, and all subsequent builds keep failing until Peter gets the testbed working again. Although the failure occurs with medium frequency, each occurrence breaks all subsequent builds on the testbed, hence the [H] priority.
(M) 2n-clx: e810 mlrsearch tests packets forwarding in one direction
- (M) 2n-clx: half of the packets lost on PDR tests (re-opened)
(M) 3n-icx: wireguard 100 and 1000 tunnels mlrsearch tests failing with 2c and 4c
- (M) 3n-icx: Wireguard tests with 100 and more tunnels are failing PDR criteria
(M) 3n-tsh: vpp in VM not starting
- (M) 3n-tsh: VM tests failing to boot VM
in trending - lower frequency failures
(M) 3n-icx, 3n-snr: 1518B IPsec packets not passing
- (M) 3n-icx, 3n-skx, 3n-snr: all 1518B AVF crypto tests failed with no traffic, all IMIX AVF crypto with excessive packet loss
- work-to-fix: hard
- rca:
- test: all AVF crypto
- frequency: low
- testbed: 3n-skx, 3n-icx, 3n-snr
- example: 3n-icx daily, 3n-snr, 3n-icx weekly
- ticket: CSIT-1827
- note:
(M) all testbeds: mlrsearch fails to find NDR rate
- (M) 3n-tsh, 3n-alt, 2n-clx testbed (Taishan, Altra, Cascade-lake): NDR tests failing from time to time.
(M) all testbeds: AF_XDP mlrsearch fails to find NDR rate
- (M) all testbeds: AF-XDP - NDR tests failing from time to time
(L) all testbeds: vpp create avf interface failure in multi-core configs
- (L) multicore AVF tests are failing when trying to create interface
(L) all testbeds: nat44det 4M and 16M scale 1 session not established
- (L) Not all DET44 sessions have been established: 4128767 != 4128768
Fixed issues
(H) all testbeds: all DPDK tests did not run because DPDK failed to install meson
- (H) all testbeds: all DPDK tests did not run because required meson version was not installed
- work-to-fix: easy
- rca: upgraded meson from 0.49.2 to 0.64.1
- test: all
- frequency: always
- testbed: all
- example: 2n-clx
- ticket:
- note:
(L) 2n-dnv: nat44ed 1518B 64k sessions not establishing all sessions
- (L) 2n-dnv: sporadic 1518B tput tests failing to establish required sessions
(L) 2n-dnv, 3n-dnv: x557 auto-negotiating 1ge instead of 10ge
- (L) T-Rex STL runtime error
- work-to-fix: hard
- rca: VPP code - X557 speed_capability set 1GE instead of 10GE
- test: all tests
- frequency: high
- testbed: 2n-dnv and 3n-dnv
- example: 2n-dnv, 3n-dnv
- ticket: VPP-2010 | https://jira.fd.io/browse/VPP-2010
- note: TODO VPP to fix speed_capability.
- note: 2n-dnv and 3n-dnv are turned off as they are going to be decommissioned soon.
(M) 3n-snr: 25GE links randomly going down between snr/sut and icx/tg-trex
- (M) 3n-snr: 25GE interface between SUT and TG/TRex goes down randomly
(H) 2n-clx, 2n-zn2: VPP RDMA tests no traffic forwarded
- (H) 2n-clx, 2n-zn2: all RDMA tests failing with cli_inband clear runtime command
- work-to-fix: easy
- rca: for-loop initialization in scalar path
- test: all RDMA with CX556A NIC
- frequency: always
- testbed: 2n-clx, 2n-zn2
- example: 2n-clx, 2n-zn2, 2n-clx
- ticket: CSIT-1882
- fix: 37720: rdma: fix for-loop initialization in scalar path | https://gerrit.fd.io/r/c/vpp/+/37720