Difference between revisions of "CSIT/TestFailuresTracking"
From fd.io
				|  (→(M) 9000b all AVF tests are failing to forward traffic:  B.) |  (CSIT-1848 fixed but the failures are now attributed to CSIT-1904.) | ||
| Line 36: | Line 36: | ||
| * ticket: [https://jira.fd.io/browse/CSIT-1883 CSIT-1883] | * ticket: [https://jira.fd.io/browse/CSIT-1883 CSIT-1883] | ||
| − | ==== (M)  | + | ==== (M) DPDK 23.03 testpmd startup fails on some testbeds ==== | 
| − | * last update: 2023-06- | + | * last update: 2023-06-29 | 
| * work-to-fix: medium | * work-to-fix: medium | ||
| − | * rca: DUT-DUT link  | + | * rca: The DUT-DUT link is slow to go up. The same consequences as CSIT-1848 but affects more testbed+NIC combinations. Can be fixed by a different startup procedure. | 
| − | * test:  | + | * test: Testpmd (no vpp). Around half of tested+NIC combinations are affected. | 
| − | * frequency: always  | + | * frequency: always since 23.03.0 got released | 
| − | * testbed:  | + | * testbed: multiple | 
| − | * example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-mrr-weekly-master- | + | * example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-mrr-weekly-master-2n-tx2/131/log.html.gz#s1-s1-s1-s1-t1-k2-k4 | 
| − | * ticket: [https://jira.fd.io/browse/CSIT- | + | * ticket: [https://jira.fd.io/browse/CSIT-1904 CSIT-1904] | 
| − | + | ||
| ==== (M) 2n-spr: zero traffic on cx7 rdma ==== | ==== (M) 2n-spr: zero traffic on cx7 rdma ==== | ||
| Line 184: | Line 183: | ||
| * example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-icx/18/log.html.gz#s1-s1-s1-s1-s2-t4 2n-icx] | * example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-icx/18/log.html.gz#s1-s1-s1-s1-s2-t4 2n-icx] | ||
| * ticket: [https://jira.fd.io/browse/CSIT-1884 CSIT-1884] | * ticket: [https://jira.fd.io/browse/CSIT-1884 CSIT-1884] | ||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| == Occasional Failures == | == Occasional Failures == | ||
| Line 369: | Line 356: | ||
| * ticket: [https://jira.fd.io/browse/CSIT-1870 CSIT-1870] | * ticket: [https://jira.fd.io/browse/CSIT-1870 CSIT-1870] | ||
| * note: Fixed, probably by the CSIT MTU handling change. | * note: Fixed, probably by the CSIT MTU handling change. | ||
| + | |||
| + | ==== (M) 3n-alt: testpmd no traffic forwarded ==== | ||
| + | |||
| + | * last update: 2023-06-29 | ||
| + | * work-to-fix: medium | ||
| + | * rca: DUT-DUT link takes too long to come up on some testbeds. This happens *after* a test case with a DPDK app (not VPP even when using dpdk plugin), although multiple subsequent tests (even with VPP) may be affected. The real cause is probably in NIC firmware or driver, but CSIT can be better at detecting port status as a workaround. | ||
| + | * test: testpmd (also l3fwd but hidden by CSIT-1896) | ||
| + | * frequency: always (almost) | ||
| + | * testbed: 3n-alt, 3n-snr | ||
| + | * example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-mrr-weekly-master-3n-alt/65/log.html.gz#s1-s1-s1-s3-t1-k2-k4 | ||
| + | * ticket: [https://jira.fd.io/browse/CSIT-1848 CSIT-1848] | ||
| + | * note: The infra cause got fixed, but there still is CSIT-1904 with the same consequences. | ||
| ==== (L) 3n-alt: Tests failing until 40Ge Interface comes up ==== | ==== (L) 3n-alt: Tests failing until 40Ge Interface comes up ==== | ||
Revision as of 07:48, 29 June 2023
Contents
- 1 CSIT Test Failure Classification
- 2 CSIT Test Fixing Priorities
- 3 Current Failures
- 3.1 Deterministic Failures
- 3.1.1 In Trending
- 3.1.1.1 (M) 3n-snr: All hwasync wireguard tests failing when trying to verify device
- 3.1.1.2 (M) DPDK 23.03 testpmd startup fails on some testbeds
- 3.1.1.3 (M) 2n-spr: zero traffic on cx7 rdma
- 3.1.1.4 (M) 3n-icx, 3n-snr: first few swasync scheduler tests timing out in runtime stat
- 3.1.1.5 (L) 3n-icx: negative ipackets on TB38 AVF 4c l2patch
- 3.1.1.6 (L) 2n-tx2: af_xdp mrr failures
 
- 3.1.2 Not In Trending
- 3.1.2.1 (M) all testbeds: some 9000B tests
- 3.1.2.1.1 (M) tests with 9000B payload frames not forwarded over vhost interfaces
- 3.1.2.1.2 (M) tests with 9000B payload frames not forwarded over memif interfaces
- 3.1.2.1.3 (M) 9000B payload frames not forwarded over tunnels due to violating supported Max Frame Size (VxLAN, LISP, SRv6)
- 3.1.2.1.4 (M) 9000B all AVF tests are failing to forward traffic
- 3.1.2.1.5 (L) l3fwd error in 200Ge2P1Cx7Veat-Mlx5 test with 9000B
 
- 3.1.2.2 (M) 2n-clx, 2n-icx: all Geneve tests with 1024 tunnels fail
- 3.1.2.3 (L) 2n-clx, 2n-icx: nat44ed cps 16M sessions scale fail
- 3.1.2.4 (L) 2n-clx, 2n-icx: nat44det imix 1M sessions fails to create sessions
- 3.2 Occasional Failures
- 3.3 Rare Failures
- 3.3.1 In Trending
- 3.3.1.1 (M) 3n-icx, 3n-snr: 1518B IPsec packets not passing
- 3.3.1.2 (M) all testbeds: mlrsearch fails to find NDR rate
- 3.3.1.3 (M) all testbeds: AF_XDP mlrsearch fails to find NDR rate
- 3.3.1.4 (L) all testbeds: vpp create avf interface failure in multi-core configs
- 3.3.1.5 (L) all testbeds: nat44det 4M and 16M scale 1 session not established
- 4 Past Failures
- 4.1 (H) AVF suite setup fails if previous suite was also AVF
- 4.2 (M) 2n-spr 200Ge2P1Cx7Veat: TRex sees port line rate as 100 Gbps
- 4.3 (M) 2n-icx: interface down in nginx tests
- 4.4 (M) hoststack: ip4udpscale1cl10s-ldpreload-iperf3 times out waiting for strace
- 4.5 (M) 3n-tsh: vpp in VM starting too slowly
- 4.6 (M) 2n-clx, 2n-icx, 2n-zn2: DPDK testpmd 9000b tests on xxv710 nic are failing with no traffic
- 4.7 (M) 3n-alt: testpmd no traffic forwarded
- 4.8 (L) 3n-alt: Tests failing until 40Ge Interface comes up
- 4.9 (L) 1n-aws: TRex NDR PDR ALL IP4 scale and L2 scale tests failing with 50% packet loss
 
CSIT Test Failure Classification
All known CSIT failures are grouped and listed in the following order:
- Always failing followed by sometimes failing.
- Always failing tests:
  - Most common use cases followed by less common.
- Sometimes failing tests:
  - Most frequently failing followed by less frequently failing:
    - High frequency: 50%-100%.
    - Medium frequency: 10%-50%.
    - Low frequency: 0%-10%.
  - Within each sub-group: most common use cases followed by less common.
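The frequency bands for sometimes-failing tests can be illustrated with a minimal Python sketch. The function name and the handling of the shared boundaries (10% counted as medium, 50% counted as high) are assumptions made for illustration; the page itself does not say which band a boundary value belongs to.

```python
def classify_failure_frequency(failures: int, runs: int) -> str:
    """Map an observed failure count to the frequency bands listed above.

    Assumption (not stated on the page): a rate exactly on a boundary
    falls into the higher band.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= failures <= runs:
        raise ValueError("failures must be between 0 and runs")
    rate = failures / runs
    if rate >= 0.5:
        return "high"    # 50%-100%
    if rate >= 0.1:
        return "medium"  # 10%-50%
    return "low"         # 0%-10%
```

For example, a test that failed in 3 of the last 10 runs would be listed in the medium frequency group under this sketch.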
CSIT Test Fixing Priorities
Test fixing work priorities are defined as follows:
- (H)igh priority, most common use cases and most common test code.
- (M)edium priority, specific HW and pervasive test code issue.
- (L)ow priority, corner cases and external dependencies.
Current Failures
Deterministic Failures
In Trending
(M) 3n-snr: All hwasync wireguard tests failing when trying to verify device
- last update: before 2023-01-31
- work-to-fix: hard
- rca: Missing QAT driver. Symptom: Failed to bind PCI device 0000:f4:00.0 to c4xxx on host 10.30.51.93
- test: hwasync wireguard
- frequency: always
- testbed: 3n-snr
- example: 3n-snr
- ticket: CSIT-1883
(M) DPDK 23.03 testpmd startup fails on some testbeds
- last update: 2023-06-29
- work-to-fix: medium
- rca: The DUT-DUT link is slow to go up. The consequences are the same as for CSIT-1848, but more testbed+NIC combinations are affected. Can be fixed by a different startup procedure.
- test: Testpmd (no vpp). Around half of testbed+NIC combinations are affected.
- frequency: always since 23.03.0 got released
- testbed: multiple
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-mrr-weekly-master-2n-tx2/131/log.html.gz#s1-s1-s1-s1-t1-k2-k4
- ticket: CSIT-1904
(M) 2n-spr: zero traffic on cx7 rdma
- last update: 2023-06-22
- work-to-fix: medium
- rca: VPP reports "tx completion errors", more investigation ongoing.
- test: All tests on 2n-spr with 200Ge2P1Cx7Veat NIC and RDMA driver.
- frequency: always (since 2n-spr was set up)
- testbed: 2n-spr
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-spr/58/log.html.gz#s1-s1-s1-s5-s19-t1-k3-k7-k1-k1-k1-k8-k14-k1-k1-k1-k1
- ticket: CSIT-1906
- note: Also would affect 3n-alt with mlxc6 and rdma. Will probably be made invisible by removing rdma (except mlxc5) from jobspecs.
(M) 3n-icx, 3n-snr: first few swasync scheduler tests timing out in runtime stat
- last update: 2023-06-21
- work-to-fix: medium
- rca:
- test: first two tests on 3n-icx, first 8 (or on occasion 9) on 3n-snr.
- frequency: always (except the one test on 3n-snr), last good run was 2023-05-29, first bad was 2023-06-05.
- testbed: 3n-icx, 3n-snr
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/287/log.html.gz#s1-s1-s1-s1-s5-t1-k2-k14-k9-k10-k1-k1-k1-k12
- ticket: CSIT-1923
(L) 3n-icx: negative ipackets on TB38 AVF 4c l2patch
- last update: 2023-04-19
- work-to-fix:
- rca:
- test: TB38 AVF 4c e810cq, earlier l2patch, nowadays eth-l2xcbase
- frequency: always
- testbed: 3n-icx (only TB38, never TB37)
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/214/log.html.gz#s1-s1-s1-s5-s3-t3-k2-k9-k8-k13-k1-k2
- ticket: CSIT-1901
(L) 2n-tx2: af_xdp mrr failures
- last update: 2023-06-21
- work-to-fix:
- rca:
- test: 25Ge2P1Xxv710-Af-Xdp-Ethip4-Ip4Base-Mrr
- frequency: more than 80% on a subset of cases, 100% on (multicore) 2n-tx2
- testbed: 2n-tx2; to a lesser extent also 2n-clx and 2n-icx
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-clx/1349/log.html.gz#s1-s1-s1-s2-s60-t1
- ticket: CSIT-1922
Not In Trending
(M) all testbeds: some 9000B tests
- last update: 2023-02-09
- work-to-fix: hard
- rca: VPP code: 34839: dpdk: cleanup MTU handling. CSIT needs to rework how it sets MTU / max frame size (CSIT-1797). Some tests will continue failing due to missing support on the VPP side; we will open specific Jira tickets for those.
- test: see sub-items
- frequency: always
- testbed: all
- examples: see sub-items
- ticket: CSIT-1809
- gerrit: https://gerrit.fd.io/r/c/csit/+/37824
(M) tests with 9000B payload frames not forwarded over vhost interfaces
- last update: 2023-02-09
- work-to-fix: hard
- test: 9000B + vhostuser
- testbed: 2n-skx, 3n-skx, 2n-clx
- examples: 3n-skx vhostuser
- ticket: CSIT-1809
(M) tests with 9000B payload frames not forwarded over memif interfaces
- last update: 2023-02-09
- work-to-fix: hard
- test: 9000B + memif
- testbed: 2n-skx, 3n-skx, 2n-clx
- examples: 2n-skx Memif
- ticket: CSIT-1808
(M) 9000B payload frames not forwarded over tunnels due to violating supported Max Frame Size (VxLAN, LISP, SRv6)
- last update: 2023-02-09
- work-to-fix: medium
- test: 9000B + (IP4 tunnels VXLAN, IP4 tunnels LISP, Srv6, IpSec)
- testbed: 2n-icx, 3n-icx
- examples: 2n-icx VXLAN, 3n-icx
- ticket: CSIT-1801
(M) 9000B all AVF tests are failing to forward traffic
- last update: 2023-02-09
- work-to-fix: hard
- test: 9000B + AVF
- testbed: 3n-icx
- examples: 3n-icx ip4base
- ticket: CSIT-1885
(L) l3fwd error in 200Ge2P1Cx7Veat-Mlx5 test with 9000B
- last update: 2023-06-28
- work-to-fix: medium
- test: 9000B + Cx7 with DPDK DUT
- testbed: 2n-icx
- examples: [1]
- ticket: CSIT-1924
(M) 2n-clx, 2n-icx: all Geneve tests with 1024 tunnels fail
- last update: before 2023-01-31
- work-to-fix: hard
- rca: VPP crash, Failed to add IP neighbor on interface geneve_tunnel258
- test: avf-ethip4--ethip4udpgeneve-1024tun-ip4base 64B 1518B IMIX 1c 2c 4c
- frequency: always
- testbed: 2n-skx, 2n-clx, 2n-icx
- example: 2n-icx
- ticket: CSIT-1800
(L) 2n-clx, 2n-icx: nat44ed cps 16M sessions scale fail
- last update: before 2023-01-31
- work-to-fix: hard
- rca: VPP crash, Failed to set NAT44 address range on host 10.30.51.44 (connections-per-second tests only)
- test: 64B-avf-ethip4tcp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c, 64B-avf-ethip4udp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c
- frequency: always
- testbeds: 2n-skx, 2n-clx, 2n-icx
- example: 2n-icx, 2n-clx
- ticket: CSIT-1799
(L) 2n-clx, 2n-icx: nat44det imix 1M sessions fails to create sessions
- last update: before 2023-01-31
- work-to-fix: hard
- rca:
- test: IMIX over 1M sessions bidir
- frequency: always
- testbed: 2n-skx, 2n-clx, 2n-icx
- example: 2n-icx
- ticket: CSIT-1884
Occasional Failures
In Trending
(H) 2n-icx: NFV density VPP does not start in container
- last update: before 2023-01-31
- work-to-fix: hard
- rca:
- test: all subsequent
- frequency: medium
- testbed: 2n-icx
- example: 2n-icx mrr, 2n-icx ndrpdr
- ticket: CSIT-1881
- note: Once VPP breaks, all subsequent tests fail. Even all subsequent builds will keep failing until Peter gets the TB working again. Although the failure occurs with medium frequency, when it happens it breaks all subsequent builds on the TB, hence the (H) priority.
(M) 2n-clx: e810 mlrsearch tests packets forwarding in one direction
- last update: before 2023-01-31
- work-to-fix: hard
- rca:
- test: e810Cq ip4base, ip6base
- frequency: high
- testbed: 2n-clx
- example: 2n-clx
- ticket: CSIT-1864
(M) 3n-icx, 3n-snr: wireguard 100 and 1000 tunnels mlrsearch tests failing with 2c and 4c
- last update: before 2023-01-31
- work-to-fix: easy
- rca:
- test: wireguard 100 tunnels and more
- frequency: high
- testbed: 3n-icx, 3n-snr
- examples: 3n-icx
- ticket: CSIT-1886
Rare Failures
In Trending
(M) 3n-icx, 3n-snr: 1518B IPsec packets not passing
- last update: before 2023-01-31
- work-to-fix: hard
- rca:
- test: all AVF crypto
- frequency: low
- testbed: 3n-skx, 3n-icx, 3n-snr
- example: 3n-icx daily, 3n-snr, 3n-icx weekly
- ticket: CSIT-1827
(M) all testbeds: mlrsearch fails to find NDR rate
- last update: before 2023-06-22
- work-to-fix: hard
- rca: On 3n-tsh, the symptom is TRex reporting ierrors, only in one direction. Other testbeds may have a different symptom, but failures there are less frequent.
- test: Crypto, Ip4, L2, Srv6, Vm Vhost (all packet sizes, all core configurations affected)
- frequency: low
- testbed: 3n-tsh, 3n-alt, 2n-clx
- example: [2]
- ticket: CSIT-1804
(M) all testbeds: AF_XDP mlrsearch fails to find NDR rate
- last update: before 2023-01-31
- work-to-fix: hard
- rca:
- test: af-xdp multicore tests
- frequency: low
- testbed: 2n-clx, 2n-skx, 2n-tx2, 2n-icx
- example: 2n-skx, 2n-clx
- ticket: CSIT-1802
- note: This is mainly observed in iterative and coverage jobs. The frequency is very low, roughly 1 out of 100.
(L) all testbeds: vpp create avf interface failure in multi-core configs
- last update: 2023-02-06
- work-to-fix: hard
- rca: issue in Intel FVL driver
- test: multicore AVF
- frequency: low
- testbed: all testbeds
- example: 2n-clx, 3n-icx
- ticket: CSIT-1782
- note: A long standing issue without a final permanent fix.
(L) all testbeds: nat44det 4M and 16M scale 1 session not established
- last update: 2023-02-14
- work-to-fix: hard
- rca: unknown
- test: nat44det udp 4m and 16m (64k is ok, 1m can fail but more rarely than bigger scales)
- frequency: low
- testbed: 2n-zn2, 2n-skx, 2n-icx, 2n-clx
- example: 2n-zn2, 2n-clx
- ticket: CSIT-1795
Past Failures
(H) AVF suite setup fails if previous suite was also AVF
- last update: 2023-05-15
- work-to-fix: low
- rca: After a recent change, CSIT attempts to bind an already bound driver.
- test: All AVF suites if the previous suite running on the NIC was also AVF.
- frequency: always since 2023-05-09
- testbed: all (if having AVF supported NIC)
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/265/log.html.gz#s1-s1-s1-s3-s2-k1-k5-k3-k1-k1-k1-k1-k1-k1-k1-k1-k1-k1-k1
- ticket: CSIT-1913
- note: proposed fix https://gerrit.fd.io/r/c/csit/+/38831
(M) 2n-spr 200Ge2P1Cx7Veat: TRex sees port line rate as 100 Gbps
- last update: 2023-04-19
- work-to-fix: medium
- rca: TRex has a hard cap on perceived line rate.
- test: All tests on 2n-spr with 200Ge2P1Cx7Veat NIC.
- frequency: always (since 2n-spr was set up)
- testbed: 2n-spr
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-spr/15/log.html.gz#s1-s1-s1-s5-s19-t1-k2-k9-k9-k10-k1-k1-k1-k11
- ticket: CSIT-1905
- note: Fixed by bumping TRex to v3.03
(M) 2n-icx: interface down in nginx tests
- last update: 2023-06-21
- work-to-fix: medium
- rca: Likely an infra issue for TG with AB on NIC using ICE.
- test: All nginx tests except for xxv710 with dpdk driver.
- frequency: always since 2023-04-28
- testbed: 2n-icx
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-2n-icx/51/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k8-k3-k2-k1
- ticket: CSIT-1910
- note: Fixed by adding more logic to CSIT suite setup code for tests using AB.
(M) hoststack: ip4udpscale1cl10s-ldpreload-iperf3 times out waiting for strace
- last update: 2023-06-21
- work-to-fix: medium
- rca:
- test: Only the two ip4udpscale1cl10s-ldpreload-iperf3 tests
- frequency: always since 2023-03-04
- testbed: 3n-icx
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-3n-icx/41/log.html.gz#s1-s1-s1-s1-s8-t1-k2-k5-k12
- ticket: CSIT-1908
- note: Fixed in vpp by https://gerrit.fd.io/r/c/vpp/+/38906
(M) 3n-tsh: vpp in VM starting too slowly
- last update: before 2023-02-22
- work-to-fix: medium
- rca: perhaps related to numa, investigation continues
- test: 3n-tsh: sporadic VM vhost
- frequency: high
- testbed: 3n-tsh
- example: 3n-tsh, 3n-tsh
- ticket: CSIT-1877
- note: Applied a workaround that simply waits longer. May lead to a rare failure, but for now considered fixed.
(M) 2n-clx, 2n-icx, 2n-zn2: DPDK testpmd 9000b tests on xxv710 nic are failing with no traffic
- last update: 2023-06-28
- work-to-fix: medium
- rca: The DPDK app only attempts to set MTU once, but if the interface is down (CSIT-1848) the attempt fails. As a workaround, MTU could be set on the Linux interface before starting the DPDK app.
- test: DPDK testpmd 9000b
- frequency: always
- testbed: 2n-clx, 2n-icx, 2n-zn2
- example: 2n-clx, 2n-icx
- ticket: CSIT-1870
- note: Fixed, probably by the CSIT MTU handling change.
(M) 3n-alt: testpmd no traffic forwarded
- last update: 2023-06-29
- work-to-fix: medium
- rca: DUT-DUT link takes too long to come up on some testbeds. This happens *after* a test case with a DPDK app (not VPP even when using dpdk plugin), although multiple subsequent tests (even with VPP) may be affected. The real cause is probably in NIC firmware or driver, but CSIT can be better at detecting port status as a workaround.
- test: testpmd (also l3fwd but hidden by CSIT-1896)
- frequency: always (almost)
- testbed: 3n-alt, 3n-snr
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-mrr-weekly-master-3n-alt/65/log.html.gz#s1-s1-s1-s3-t1-k2-k4
- ticket: CSIT-1848
- note: The infra cause got fixed, but there still is CSIT-1904 with the same consequences.
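The workaround idea in the rca above (detecting port status better before sending traffic) can be sketched as a bounded polling loop. This is only an illustration, not the code CSIT uses: the function name, the `is_link_up` callback and the default timeout values are all hypothetical, and how the link state is actually read on a DUT is left to the caller.

```python
import time
from typing import Callable


def wait_for_link_up(
    is_link_up: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll a link-state probe until it reports up or the timeout expires.

    is_link_up is a hypothetical callback supplied by the caller; this
    sketch does not know how the port status is obtained.
    Returns True if the link came up in time, False otherwise.
    """
    deadline = clock() + timeout_s
    while True:
        if is_link_up():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller could fail the test early with a clear "link not up" message when the function returns False, rather than reporting zero forwarded traffic.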
(L) 3n-alt: Tests failing until 40Ge Interface comes up
- last update: 2023-06-22
- work-to-fix: medium
- rca: DUT-DUT link takes too long to come up due to CSIT-1848.
- test: first tests in order
- frequency: rare in recent times, but still not impossible
- testbed: 3n-alt (3n-snr link does not take that long)
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-alt/155/log.html.gz#s1-s1-s1-s1-s1-t1
- ticket: CSIT-1890
