Difference between revisions of "CSIT/TestFailuresTracking"

From fd.io
((M) tests with 9000B payload frames not forwarded over memif interfaces =: Add VPP-2091 as the other ticket.)
 
(104 intermediate revisions by 2 users not shown)
Line 1: Line 1:
== CSIT Test Failure Classification ==
+
= CSIT Test Failure Classification =
  
 
All known CSIT failures grouped and listed in the following order:
 
Line 12: Line 12:
 
** Within each sub-group: most common use cases followed by less common.
 
  
== CSIT Test Fixing Priorities ==
+
= CSIT Test Fixing Priorities =
  
* Test fixing work priorities defined as follows
+
Test fixing work priorities are defined as follows:
** (H)igh priority, most common use cases and most common test code.
+
* (H)igh priority, most common use cases and most common test code.
** (M)edium priority, specific HW and pervasive test code issue.
+
* (M)edium priority, specific HW and pervasive test code issue.
** (L)ow priority, corner cases and external dependencies.
+
* (L)ow priority, corner cases and external dependencies.
  
== Always Failing Tests ==
+
= Current Failures =
 +
 
 +
== Deterministic Failures ==
  
 
=== In Trending ===
 
  
==== (M) 3n-alt: Tests failing until 40Ge Interface comes up ====
+
==== (H) 3nb-spr hoststack: interface not up after first test ====
  
* (M) 3n-alt: Tests failing until 40Ge Interface comes up
+
* last update: 2024-02-07
** work-to-fix: easy
+
* work-to-fix: medium
** rca:
+
* rca: After the first test, HundredGigabitEthernetab/0/0 never goes up within the run. It is unclear which part of the test setup is missing; the tests work correctly on other testbeds.
** test: First tests in order
+
* test: All subsequent tests.
** frequency: always
+
* frequency: 100%
** testbed: 3n-alt
+
* testbed: 3nb-spr
** example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-alt/149/log.html.gz#s1-s1-s1-s1-s1-t1
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-3nb-spr/87/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k3-k7-k7
** ticket:  
+
* ticket: [https://jira.fd.io/browse/CSIT-1942 CSIT-1942]
** note: In the last 6 runs, the first tests fail until the interface comes up. The number of failed tests varies (2-61). The issue may result from ARM engineers using the testbed or from Vratko running DPDK testpmd runs.
+
  
==== (M) 3n-snr: All hwasync wireguard tests failing when trying to verify device ====
+
==== (H) Zero traffic reported in udpquic tests ====
  
* (M) 3n-snr: All hwasync wireguard tests failing when trying to verify device
+
* last update: 2024-02-07
** work-to-fix: hard
+
* work-to-fix: medium
** rca: Failed to bind PCI device 0000:f4:00.0 to c4xxx on host 10.30.51.93
+
* rca: There are errors when closing sessions. The current CSIT logic happens to report this as a passing test with zero traffic, which is wrong.
** test: hwasync wireguard
+
* test: All udpquic tests
** frequency: always
+
* frequency: 100%
** testbed: 3n-snr
+
* testbed: all running the tests
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-snr/90/log.html.gz#s1-s1-s1-s3-s1 3n-snr]
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-3n-icx/217/log.html.gz#s1-s1-s1-s1-s4-t1-k2-k4-k17
** ticket: [https://jira.fd.io/browse/CSIT-1883 CSIT-1883]
+
* ticket: [https://jira.fd.io/browse/CSIT-1935 CSIT-1935]
** note: missing QAT driver.
+
  
==== (M) 1n-aws: TRex mlrsearch fails to find NDR & PDR due to AWS rate limiting (5min total test duration) ====
+
==== (H) 3na-spr, 2n-zn2: VPP fails to start in first test cases if dpdk/mlx5 ====
  
* (M) 1n-aws: TRex NDR PDR ALL IP4 scale and L2 scale tests failing with 50% packet loss
+
* last update: 2024-02-07
** work-to-fix: hard
+
* work-to-fix: medium
** rca:
+
* rca: DUT1 fails to boot up in the first test case. If DUT2 is present, it fails to start in the second test case. Other test cases in the run are unaffected. This looks like an infra issue: ansible cleanup is doing something wrong, but it is hard to tell what.
** test: ip4scale2m
+
* test: First test case, unless it uses AVF driver.
** frequency: always
+
* frequency: 100%
** testbed: 1n-aws
+
* testbed: 3na-spr, 2n-zn2
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-trex-perf-ndrpdr-weekly-master-1n-aws/17/log.html.gz#s1-s1-s1-s1-s2-t1 1n-aws]
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-3na-spr/139/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k3-k2-k5-k1-k1-k1-k1
** ticket: [https://jira.fd.io/browse/CSIT-1876 CSIT-1876]
+
* ticket: [https://jira.fd.io/browse/CSIT-1939 CSIT-1939]
** note: The root cause can be shared environment in aws cloud.
+
  
==== (M) 3n-alt, 3n-snr: testpmd no traffic forwarded ====
+
==== (M) DPDK 23.03 testpmd startup fails on some testbeds ====
  
* (M) 3n-alt, 3n-snr: testpmd tests fail with no traffic
+
* last update: 2023-11-06
** work-to-fix: hard
+
* work-to-fix: medium
** rca:
+
* rca: The DUT-DUT link sometimes does not go up. The consequences are the same as in CSIT-1848, but more testbed+NIC combinations are affected. Can be fixed by a different startup procedure (better detection before restart).
** test: testpmd
+
* test: Testpmd (no vpp). Around half of testbed+NIC combinations are affected.
** frequency: always
+
* frequency: Low, ~1%, as in most testbeds the link does go up.
** testbed: 3n-alt, 3n-snr
+
* testbed: multiple
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-mrr-weekly-master-3n-alt/42/log.html.gz#s1-s1-s1-s1-t1 3n-alt], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2210-3n-snr/6/log.html.gz#s1-s1-s1-s1-t1 3n-snr], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2210-3n-snr/14/log.html.gz#s1-s1-s1-s1-t1 3n-snr]
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2310-2n-zn2/8/log.html.gz#s1-s1-s1-s1-t6-k2-k4
** ticket: [https://jira.fd.io/browse/CSIT-1848 CSIT-1848]
+
* ticket: [https://jira.fd.io/browse/CSIT-1904 CSIT-1904]
** note:
+
  
=== not in trending ===
+
==== (M) 2n-spr: zero traffic on cx7 rdma ====
  
==== (H) 3n-icx: vpp hoststack QUIC vppecho tests failing ====
+
* last update: 2023-06-22
 +
* work-to-fix: medium
 +
* rca: VPP reports "tx completion errors", more investigation ongoing.
 +
* test: All tests on 2n-spr with 200Ge2P1Cx7Veat NIC and RDMA driver.
 +
* frequency: always (since 2n-spr was set up)
 +
* testbed: 2n-spr
 +
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-spr/58/log.html.gz#s1-s1-s1-s5-s19-t1-k3-k7-k1-k1-k1-k8-k14-k1-k1-k1-k1
 +
* ticket: [https://jira.fd.io/browse/CSIT-1906 CSIT-1906]
 +
* note: Also would affect 3n-alt with mlxc6 and rdma. Will probably be made invisible by removing rdma (except mlxc5) from jobspecs.
  
* (H) 3n-icx: QUIC vppecho BPS tests failing on timeout when checking hoststack finished
+
==== (M) Lossy trials in nat udp mlx5 tests ====
** work-to-fix: easy
+
** rca:
+
** test: Quic vppecho BPS
+
** frequency: always
+
** testbed: 3n-skx, 3n-icx
+
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2210-3n-icx/17/log.html.gz#s1-s1-s1-s1-s5-t1 3n-icx]
+
** ticket: [https://jira.fd.io/browse/CSIT-1835 CSIT-1835]
+
** note:
+
  
==== (M) all testbeds: vpp 9000B tests with vhostuser, memif, tunnels, avf ====
+
* last update: 2024-02-07
 +
* work-to-fix:
 +
* rca: VPP counters suggest the packet is lost somewhere between TG on in-side [1] and VPP [2].
 +
* test: It is affecting only cx7 with mlx5 driver (not e810cq with avf driver), only udp tests (not tcp or other), and interestingly both ASTF (both cps and tput for nat44ed) and STL (det44) traffic profiles. It does not affect corresponding ASTF tests with ip4base routing.
 +
* frequency: depends on scale, 100% on high scale tests.
 +
* testbed: 2n-icx, 2n-spr.
 +
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-spr/45/log.html.gz#s1-s1-s1-s2-s49-t2-k2-k14-k14
 +
* ticket: [https://jira.fd.io/browse/CSIT-1929 CSIT-1929]
  
* All tests with 9000B payload frames not forwarded over vhostuser interfaces.
+
==== (L) 3n-icx: negative ipackets on TB38 AVF 4c l2patch ====
** work-to-fix: hard
+
** rca: VPP code: [https://gerrit.fd.io/r/c/vpp/+/34839 34839: dpdk: cleanup MTU handling]
+
** test: 9000B - vhostuser
+
** frequency: always
+
** testbed: 2n-skx, 3n-skx, 2n-clx
+
** examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2202-3n-skx/67/log.html.gz#s1-s1-s1-s1-s1 3n-skx vhostuser]
+
** ticket: [https://jira.fd.io/browse/CSIT-1809 CSIT-1809]
+
** note:
+
  
* All tests with 9000B payload frames not forwarded over memif interfaces.
+
* last update: 2023-04-19
** work-to-fix: hard
+
* work-to-fix:
** rca: VPP code: [https://gerrit.fd.io/r/c/vpp/+/34839 34839: dpdk: cleanup MTU handling]
+
* rca:
** test: 9000B - memif
+
* test: TB38 AVF 4c e810cq; earlier l2patch, nowadays eth-l2xcbase
** frequency: always
+
* frequency: always
** testbed: 2n-skx, 3n-skx, 2n-clx
+
* testbed: 3n-icx (only TB38, never TB37)
** examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2202-2n-skx/33/log.html.gz#s1-s1-s1-s1-s1 2n-skx Memif]
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/214/log.html.gz#s1-s1-s1-s5-s3-t3-k2-k9-k8-k13-k1-k2
** ticket: [https://jira.fd.io/browse/CSIT-1808 CSIT-1808]
+
* ticket: [https://jira.fd.io/browse/CSIT-1901 CSIT-1901]
** note:
+
  
* 9000B payload frames not forwarded over tunnels due to violating supported Max Frame Size (VxLAN, LISP, SRv6)
+
==== (L) 2n-tx2: af_xdp mrr failures ====
** work-to-fix: hard
+
** rca: VPP code: [https://gerrit.fd.io/r/c/vpp/+/34839 34839: dpdk: cleanup MTU handling]
+
** test: 9000B - IP4 tunnels VXLAN, IP4 tunnels LISP, Srv6
+
** frequency: always
+
** testbed: 2n-icx, 3n-icx
+
** examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-icx/10/log.html.gz 2n-icx VXLAN], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-3n-icx/22/log.html.gz#s1-s1-s1-s1-s1-t6 3n-icx]
+
** ticket: [https://jira.fd.io/browse/CSIT-1801 CSIT-1801]
+
** note:
+
  
* (M) 3n-icx: 9000b ip4 ip6 l2 NDRPDR AVF tests are failing to forward traffic
+
* last update: 2023-11-08
** work-to-fix: hard
+
* work-to-fix:
** rca:
+
* rca: Some workers see no traffic, "error af_xdp_device_input_refill_db: rx poll() failed: Bad address" in show hardware. More examination needed.
** test: 9000B - IP4, IP6, l2 - base and scale
+
* test: ip4 and ip6, base and scale
** frequency: always
+
* frequency: more than 80% on a subset of cases, 100% on (multicore) 2n-tx2
** testbed: 3n-icx
+
* testbed: 2n-tx2; to a lesser extent also 2n-clx and 2n-icx, where merely decreased performance (and failure in ndrpdr) is the more likely outcome.
** examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-3n-icx/13/log.html.gz#s1-s1-s1-s1-s1-t6 3n-icx ip4base]
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-spr/120/log.html.gz#s1-s1-s1-s2-s2-t2-k2-k10-k9-k14-k1-k1-k1-k1
** ticket: [https://jira.fd.io/browse/CSIT-1885 CSIT-1885]
+
* ticket: [https://jira.fd.io/browse/CSIT-1922 CSIT-1922]
** note:
+
  
* (M) 2n-clx, 2n-icx, 2n-zn2: DPDK testpmd 9000b tests on xxv710 nic are failing with no traffic
+
=== Not In Trending ===
** work-to-fix: hard
+
** rca:
+
** test: DPDK testpmd 9000b tests on xxv710 nic
+
** frequency: always
+
** testbed: 2n-clx, 2n-icx, 2n-zn2
+
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2210-2n-clx/1/log.html.gz#s1-s1-s1-s3-t6 2n-clx], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2210-2n-icx/3/log.html.gz#s1-s1-s1-s1-t6 2n-icx]
+
** ticket: [https://jira.fd.io/browse/CSIT-1870 CSIT-1870]
+
** note:
+
  
==== (M) 2n-clx, 2n-icx: all Geneve tests with 1024 tunnels fail ====
+
==== (M) tests with 9000B payload frames not forwarded over memif interfaces ====
  
* (M) All Geneve L3 mode scale tests (1024 tunnels) are failing
+
* last update: 2023-02-09
** work-to-fix: hard
+
* work-to-fix: hard
** rca: VPP crash, Failed to add IP neighbor on interface geneve_tunnel258
+
* test: 9000B + memif
** test: avf-ethip4--ethip4udpgeneve-1024tun-ip4base 64B 1518B IMIX 1c 2c 4c
+
* testbed: 2n-skx, 3n-skx, 2n-clx
** frequency: always
+
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2202-2n-skx/33/log.html.gz#s1-s1-s1-s1-s1 2n-skx Memif]
** testbed: 2n-skx, 2n-clx, 2n-icx
+
* ticket: [https://jira.fd.io/browse/CSIT-1808 CSIT-1808] VPP-2091
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-icx/10/log.html.gz#s1-s1-s1-s1-s1 2n-icx]
+
 
** ticket: [https://jira.fd.io/browse/CSIT-1800 CSIT-1800]
+
==== (M) IMIX 4c tests may fail PDR due to ~10% loss ====
** note:
+
 
 +
* last update: 2024-02-07
 +
* work-to-fix: hard
 +
* rca: Only seen in coverage tests so far, more data needed.
 +
* test: various high-performance, mostly mlx5
 +
* frequency: <1%
 +
* testbed: 2n-icx, 3n-icx, 3na-spr
 +
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-coverage-2310-2n-icx/1/log.html.gz#s1-s1-s1-s3-t9-k2-k5-k16-k9
 +
* ticket: [https://jira.fd.io/browse/CSIT-1943 CSIT-1943]
 +
 
 +
==== (L) l3fwd error in 200Ge2P1Cx7Veat-Mlx5 test with 9000B ====
 +
 
 +
* last update: 2023-06-28
 +
* work-to-fix: medium
 +
* test: 9000B + Cx7 with DPDK DUT
 +
* testbed: 2n-icx
 +
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-coverage-2306-2n-icx/2/log.html.gz#s1-s1-s1-s4-t6-k2-k4]
 +
* ticket: [https://jira.fd.io/browse/CSIT-1924 CSIT-1924]
  
 
==== (L) 2n-clx, 2n-icx: nat44ed cps 16M sessions scale fail ====
 
  
* (L) All NAT44-ED 16M sessions CPS scale tests fail while setting NAT44 address range.
+
* last update: 2023-07-12
** work-to-fix: hard
+
* work-to-fix: hard
** rca: VPP crash, Failed to set NAT44 address range on host 10.30.51.44 (connections-per-second tests only)
+
* rca: Ramp-up trial takes more than 5 minutes so sessions are timing out.
** test: 64B-avf-ethip4tcp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c, 64B-avf-ethip4udp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c
+
* test: 64B-avf-ethip4tcp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c, 64B-avf-ethip4udp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c
** frequency: always
+
* frequency: always
** testbeds: 2n-skx, 2n-clx, 2n-icx
+
* testbeds: 2n-skx, 2n-clx, 2n-icx
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-icx/18/log.html.gz#s1-s1-s1-s1-s11-t3 2n-icx], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-clx/9/log.html.gz#s1-s1-s1-s1-s11-t1 2n-clx]
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2306-2n-clx/8/log.html.gz#s1-s1-s1-s1-s14-t2-k2-k15-k1
** ticket: [https://jira.fd.io/browse/CSIT-1799 CSIT-1799]
+
* ticket: [https://jira.fd.io/browse/CSIT-1799 CSIT-1799]
** note:
+
  
 
==== (L) 2n-clx, 2n-icx: nat44det imix 1M sessions fails to create sessions ====
 
  
* (L) 2n-clx, 2n-icx: All NAT44DET NDR PDR IMIX over 1M sessions BIDIR tests failing to create enough sessions
+
* last update: 2023-07-12
** work-to-fix: hard
+
* work-to-fix: medium
** rca:
+
* rca: One possible cause is CSIT not counting ramp-up rate properly for IMIX (if multiple packets belong to the same session).
** test: IMIX over 1M sessions bidir
+
* test: IMIX over 1M sessions bidir
** frequency: always
+
* frequency: always
** testbed: 2n-skx, 2n-clx, 2n-icx
+
* testbed: 2n-skx, 2n-clx, 2n-icx
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-icx/18/log.html.gz#s1-s1-s1-s1-s2-t4 2n-icx]
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2306-2n-clx/7/log.html.gz#s1-s1-s1-s1-s2-t4-k2-k11-k1-k2
** ticket: [https://jira.fd.io/browse/CSIT-1884 CSIT-1884]
+
* ticket: [https://jira.fd.io/browse/CSIT-1884 CSIT-1884]
** note:
+
  
== Sometimes failing tests ==
+
== Occasional Failures ==
  
=== in trending - high frequency failures ===
+
=== In Trending ===
  
 
==== (H) 2n-icx: NFV density VPP does not start in container ====
 
  
* (H) 2n-icx: NFV density tests breaks VPP which fails to start (re-opened)
+
* last update: before 2023-01-31
** work-to-fix: hard
+
* work-to-fix: hard
** rca:
+
* rca:
** test: all subsequent
+
* test: all subsequent
** frequency: medium
+
* frequency: medium
** testbed: 2n-icx
+
* testbed: 2n-icx
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-weekly-master-2n-icx/53/console.log.gz 2n-icx mrr], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-icx/48/log.html.gz#s1-s1-s1-s5-s8-t1 2n-icx ndrpdr]
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-weekly-master-2n-icx/57/log.html.gz 2n-icx mrr], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-icx/48/log.html.gz#s1-s1-s1-s5-s8-t1 2n-icx ndrpdr]
** ticket: [https://jira.fd.io/browse/CSIT-1881 CSIT-1881]
+
* ticket: [https://jira.fd.io/browse/CSIT-1881 CSIT-1881]
** note: Once VPP breaks, all subsequent tests fail. Even all subsequent builds will be failing until Peter makes TB working again. Although it's failing with medium frequency when it happens it breaks all subsequent builds on the TB therefore [H] priority.
+
* note: Once VPP breaks, all subsequent tests fail. All subsequent builds also fail until Peter gets the TB working again. Although the failure occurs with medium frequency, when it happens it breaks all subsequent builds on the TB, hence the (H) priority.
  
==== (M) 2n-clx: e810 mlrsearch tests packets forwarding in one direction ====
+
==== (M) 3n-alt: high scale ipsec policy tests may crash VPP ====
  
* (M) 2n-clx: half of the packets lost on PDR tests (re-opened)
+
* last update: 2024-02-07
** work-to-fix: hard
+
* work-to-fix: hard
** rca:
+
* rca: VPP is crashing without creating a core dump.
** test: e810Cq ip4base, ip6base
+
* test: policy ipsec tests, large scale increases probability.
** frequency: high
+
* frequency: 15% and lower.
** testbed: 2n-clx
+
* testbed: 3n-alt, also seen rarely on 3n-icx.
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-clx/176/log.html.gz#s1-s1-s1-s2-s8-t1 2n-clx]
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-alt/357/log.html.gz#s1-s1-s1-s1-s15-t4-k3-k5-k1-k1
** ticket: [https://jira.fd.io/browse/CSIT-1864 CSIT-1864]
+
* ticket: [https://jira.fd.io/browse/CSIT-1938 CSIT-1938]
** note:
+
  
==== (M) 3n-icx: wireguard 100 and 1000 tunnels mlrsearch tests failing with 2c and 4c ====
+
==== (M) 3n-icx, 3n-snr: wireguard 100 and 1000 tunnels mlrsearch tests failing with 2c and 4c ====
  
* (M) 3n-icx: Wireguard tests with 100 and more tunnels are failing PDR criteria
+
* last update: before 2023-01-31
** work-to-fix: easy
+
* work-to-fix: easy
** rca:
+
* rca:
** test: wireguard 100 tunnels and more
+
* test: wireguard 100 tunnels and more
** frequency: high
+
* frequency: high
** testbed: 3n-icx
+
* testbed: 3n-icx, 3n-snr
** examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-icx/56/log.html.gz#s1-s1-s1-s3-s8-t4 3n-icx]
+
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-icx/56/log.html.gz#s1-s1-s1-s3-s8-t4 3n-icx]
** ticket: [https://jira.fd.io/browse/CSIT-1886 CSIT-1886]
+
* ticket: [https://jira.fd.io/browse/CSIT-1886 CSIT-1886]
** note:
+
 
+
==== (M) 3n-tsh: vpp in VM not starting ====
+
  
* (M) 3n-tsh: VM tests failing to boot VM
+
==== (M) e810cq sometimes reporting link down ====
** work-to-fix: easy
+
** rca:
+
** test: 3n-tsh: sporadic VM vhost
+
** frequency: high
+
** testbed: 3n-tsh
+
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-tsh/738/log.html.gz#s1-s1-s1-s7-s2-t1 3n-tsh], [https://jenkins.fd.io/view/csit/job/csit-vpp-perf-verify-master-3n-tsh/123/ 3n-tsh]
+
** ticket: [https://jira.fd.io/browse/CSIT-1877 CSIT-1877]
+
** note: 3n-alt testbed was fixed. 3n-tsh still failing. fixed: by rebuild initrd .37 on TB,
+
  
=== in trending - lower frequency failures ===
+
* last update: 2024-02-07
 +
* work-to-fix: hard
 +
* rca: The most common symptom is TRex complaining about link down. This probably also causes zero throughput in one direction in ASTF tests, or even device test failures. It is more frequent with more performant tests (L2) but has been seen affecting any test on occasion.
 +
* test: Any that uses Intel-E810CQ NIC.
 +
* frequency: <20%
 +
* testbed: all with the NIC.
 +
* examples: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-spr/45/log.html.gz#s1-s1-s1-s2-s24-t1-k2-k12-k14
 +
* ticket: [https://jira.fd.io/browse/CSIT-1936 CSIT-1936]
 +
 
 +
== Rare Failures ==
 +
 
 +
=== In Trending ===
  
 
==== (M) 3n-icx, 3n-snr: 1518B IPsec packets not passing ====
 
  
* (M) 3n-icx, 3n-skx, 3n-snr: all 1518B AVF crypto tests failed with no traffic, all IMIX AVF crypto with excessive packet loss
+
* last update: before 2023-01-31
** work-to-fix: hard
+
* work-to-fix: hard
** rca:
+
* rca:
** test: all AVF crypto
+
* test: all AVF crypto
** frequency: low
+
* frequency: low
** testbed: 3n-skx, 3n-icx, 3n-snr
+
* testbed: 3n-skx, 3n-icx, 3n-snr
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/197/log.html.gz#s1-s1-s1-s1-s4-t1 3n-icx daily], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-snr/32/log.html.gz#s1-s1-s1-s1-s4-t1 3n-snr], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-icx/56/log.html.gz#s1-s1-s1-s1-s4-t1 3n-icx weekly]
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/197/log.html.gz#s1-s1-s1-s1-s4-t1 3n-icx daily], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-snr/32/log.html.gz#s1-s1-s1-s1-s4-t1 3n-snr], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-icx/57/log.html.gz#s1-s1-s1-s1-s4-t1 3n-icx weekly]
** ticket: [https://jira.fd.io/browse/CSIT-1827 CSIT-1827]
+
* ticket: [https://jira.fd.io/browse/CSIT-1827 CSIT-1827]
** note:
+
  
 
==== (M) all testbeds: mlrsearch fails to find NDR rate ====
 
  
* (M) 3n-tsh, 3n-alt, 2n-clx testbed (Taishan, Altra, Cascade-lake): NDR tests failing from time to time.
+
* last update: before 2023-06-22
** work-to-fix: hard
+
* work-to-fix: hard
** rca:
+
* rca: On 3n-tsh, the symptom is TRex reporting ierrors, only in one direction. Other testbeds may have a different symptom, but failures there are less frequent.
** test: Crypto, Ip4, L2, Srv6, Vm Vhost (all packet sizes, all core configurations affected)
+
* test: Crypto, Ip4, L2, Srv6, Vm Vhost (all packet sizes, all core configurations affected)
** frequency: low
+
* frequency: low
** testbed: 3n-tsh, 3n-alt, 2n-clx
+
* testbed: 3n-tsh, 3n-alt, 2n-clx
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-icx/57/log.html.gz#s1-s1-s1-s2-s37-t2 2n-icx]
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-tsh/14/log.html.gz#s1-s1-s1-s5-s8-t2]
** ticket: [https://jira.fd.io/browse/CSIT-1804 CSIT-1804]
+
* ticket: [https://jira.fd.io/browse/CSIT-1804 CSIT-1804]
** note:
+
  
 
==== (M) all testbeds: AF_XDP mlrsearch fails to find NDR rate ====
 
  
* (M) all testbeds: AF-XDP - NDR tests failing from time to time
+
* last update: before 2023-01-31
** work-to-fix: hard
+
* work-to-fix: hard
** rca:
+
* rca:
** test: af-xdp multicore tests
+
* test: af-xdp multicore tests
** frequency: low
+
* frequency: low
** testbed: 2n-clx, 2n-skx, 2n-tx2, 2n-icx
+
* testbed: 2n-clx, 2n-skx, 2n-tx2, 2n-icx
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-skx/202/log.html.gz#s1-s1-s1-s2-s4-t3 2n-skx], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-clx/152/log.html.gz#s1-s1-s1-s5-s12-t3 2n-clx]
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-skx/202/log.html.gz#s1-s1-s1-s2-s4-t3 2n-skx], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-clx/152/log.html.gz#s1-s1-s1-s5-s12-t3 2n-clx]
** ticket: [https://jira.fd.io/browse/CSIT-1802 CSIT-1802]
+
* ticket: [https://jira.fd.io/browse/CSIT-1802 CSIT-1802]
** note: This is mainly observed in iterative and coverage. It's very low frequency ~ 1 out of 100
+
* note: This is mainly observed in iterative and coverage. It's very low frequency ~ 1 out of 100
  
 
==== (L) all testbeds: vpp create avf interface failure in multi-core configs ====
 
  
* (L) multicore AVF tests are failing when trying to create interface
+
* last update: 2023-02-06
** work-to-fix: hard
+
* work-to-fix: hard
** rca: issue in Intel FVL driver
+
* rca: issue in Intel FVL driver
** test: multicore AVF
+
* test: multicore AVF
** frequency: low
+
* frequency: low
** testbed: all testbeds
+
* testbed: all testbeds
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-clx/1257/log.html.gz#s1-s1-s1-s5-s24-t2 2n-clx], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/197/log.html.gz#s1-s1-s1-s5-s1-t3 3n-icx]
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-clx/1257/log.html.gz#s1-s1-s1-s5-s24-t2 2n-clx], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/197/log.html.gz#s1-s1-s1-s5-s1-t3 3n-icx]
** ticket: [https://jira.fd.io/browse/CSIT-1782 CSIT-1782]
+
* ticket: [https://jira.fd.io/browse/CSIT-1782 CSIT-1782]
** note: A long standing issue without a final permanent fix.
+
* note: A long standing issue without a final permanent fix.
  
 
==== (L) all testbeds: nat44det 4M and 16M scale 1 session not established ====
 
  
* (L) Not all DET44 sessions have been established: 4128767 != 4128768
+
* last update: 2023-02-14
** work-to-fix: hard
+
* work-to-fix: hard
** rca:
+
* rca: unknown
** test: nat44det udp 4m and 16m (64k and 1m are ok)
+
* test: nat44det udp 4m and 16m (64k is ok; 1m can fail, but more rarely than bigger scales)
** frequency: low
+
* frequency: low
** testbed: 2n-zn2, 2n-skx, 2n-icx, 2n-clx
+
* testbed: 2n-zn2, 2n-skx, 2n-icx, 2n-clx
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-icx/56/log.html.gz#s1-s1-s1-s2-s35-t3 2n-icx], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-clx/164/log.html.gz#s1-s1-s1-s2-s54-t1 2n-clx]
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-zn2/672/log.html.gz#s1-s1-s1-s2-s22-t3 2n-zn2], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-clx/1271/log.html.gz#s1-s1-s1-s2-s60-t1-k2-k11-k1-k2 2n-clx]
** ticket: [https://jira.fd.io/browse/CSIT-1795 CSIT-1795]
+
* ticket: [https://jira.fd.io/browse/CSIT-1795 CSIT-1795]
** note:
+
 
+
== Fixed issues ==
+
 
+
==== (H) all testbeds: all DPDK tests did not run because DPDK failed to install meson ====
+
 
+
* (H) all testbeds: all DPDK tests did not run because required meson version was not installed
+
** work-to-fix: easy
+
** rca: upgraded meson from 0.49.2 to 0.64.1
+
** test: all
+
** frequency: always
+
** testbed: all
+
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-mrr-weekly-master-2n-clx/175/log.html.gz 2n-clx]
+
** ticket:
+
** note:
+
 
+
==== (L) 2n-dnv: nat44ed 1518B 64k sessions not establishing all sessions ====
+
 
+
* (L) 2n-dnv: sporadic 1518B tput tests failing to establish required sessions
+
** work-to-fix: hard
+
** rca:
+
** test: 1518B tput
+
** frequency: low
+
** testbeds: 2n-dnv
+
** examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-dnv/1264/log.html.gz#s1-s1-s1-s1-s7-t4 2n-dnv]
+
** ticket: [https://jira.fd.io/browse/CSIT-1850 CSIT-1850]
+
** note: 2n-dnv and 3n-dnv are turned off as they are going to be decommissioned soon.
+
 
+
==== (L) 2n-dnv, 3n-dnv: x557 auto-negotiating 1ge instead of 10ge ====
+
 
+
* (L) T-Rex STL runtime error
+
** work-to-fix: hard
+
** rca: VPP code - X557 speed_capability set 1GE instead of 10GE
+
** test: all tests
+
** frequency: high
+
** testbed: 2n-dnv and 3n-dnv
+
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-dnv/1264/log.html.gz#s1-s1-s1-s1-s3-t1 2n-dnv], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-dnv/1274/log.html.gz#s1-s1-s1-s2-s1-t1 3n-dnv]
+
** ticket: [https://jira.fd.io/browse/VPP-2010 VPP-2010]
+
** note: TODO VPP to fix speed_capability.
+
** note: 2n-dnv and 3n-dnv are turned off as they are going to be decommissioned soon.
+
 
+
==== (M) 3n-snr: 25GE links randomly going down between snr/sut and icx/tg-trex ====
+
  
* (M) 3n-snr: 25GE interface between SUT and TG/TRex goes down randomly
+
==== (L) TRex may wrongly detect link bandwidth ====
** work-to-fix: hard
+
** rca: QSFP converter replaced.
+
** test: all subsequent
+
** frequency: high
+
** testbed: 3n-snr
+
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-snr/45/log.html.gz#s1-s1-s1-s3-s12-t1 3n-snr]
+
** ticket: [https://jira.fd.io/browse/CSIT-1871 CSIT-1871]
+
** note: Sometimes 'TwentyFiveGigabitEthernetec/0/0' goes down and all subsequent tests fail.
+
  
==== (H) 2n-clx, 2n-zn2: VPP RDMA tests no traffic forwarded ====
+
* last update: 2024-02-07
 +
* work-to-fix: hard
 +
* rca: Quite rare failure affecting unpredictable tests. Perhaps a less severe symptom of CSIT-1936.
 +
* test: No obvious pattern due to low frequency
 +
* frequency: <0.4%
 +
* testbed: recently seen on 3n-tsh and 3nb-spr
 +
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2310-3nb-spr/28/log.html.gz#s1-s1-s1-s1-s4-t1-k2-k13-k14
 +
* ticket: [https://jira.fd.io/browse/CSIT-1941 CSIT-1941]
  
* (H) 2n-clx, 2n-zn2: all RDMA tests failing with cli_inband clear runtime command
+
= Past Failures =
** work-to-fix: easy
+
** rca: for-loop initialization in scalar path
+
** test: all RDMA with CX556A NIC
+
** frequency: always
+
** testbed: 2n-clx, 2n-zn2
+
** example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-clx/1212/log.html.gz#s1-s1-s1-s1-s1-t1 2n-clx], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-zn2/639/log.html.gz#s1-s1-s1-s1-s1-t1 2n-zn2], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-clx/167/log.html.gz#s1-s1-s1-s2-s5-t1 2n-clx]
+
** ticket: [https://jira.fd.io/browse/CSIT-1882 CSIT-1882]
+
** fix: 37720: rdma: fix for-loop initialization in scalar path | https://gerrit.fd.io/r/c/vpp/+/37720
+

Latest revision as of 14:49, 7 February 2024

CSIT Test Failure Classification

All known CSIT failures grouped and listed in the following order:

  • Always failing followed by sometimes failing.
  • Always failing tests:
    • Most common use cases followed by less common.
  • Sometimes failing tests:
    • Most frequently failing followed by less frequently failing.
      • High frequency 50%-100%.
      • Medium frequency 10%-50%.
      • Low frequency 0%-10%.
    • Within each sub-group: most common use cases followed by less common.
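
The frequency bands above can be sketched as a small Python helper. This is only an illustration, not part of CSIT: the function name `classify_frequency` is hypothetical, and because the listed bands share their boundary values (10% and 50%), the sketch assumes a boundary value falls into the higher band.

```python
def classify_frequency(failure_rate_percent: float) -> str:
    """Map a failure rate in percent to a frequency band.

    Assumption (not stated on this page): a rate exactly on a shared
    boundary (10% or 50%) is assigned to the higher band.
    """
    if not 0.0 <= failure_rate_percent <= 100.0:
        raise ValueError("failure rate must be between 0 and 100 percent")
    if failure_rate_percent >= 50.0:
        return "high"
    if failure_rate_percent >= 10.0:
        return "medium"
    return "low"


if __name__ == "__main__":
    # Print the band for a few sample failure rates.
    for rate in (100.0, 50.0, 25.0, 10.0, 1.0):
        print(f"{rate:5.1f}% -> {classify_frequency(rate)}")
```

Under these assumptions, a test that fails in about 1 out of 100 runs lands in the low band.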

CSIT Test Fixing Priorities

Test fixing work priorities defined as follows:

  • (H)igh priority, most common use cases and most common test code.
  • (M)edium priority, specific HW and pervasive test code issue.
  • (L)ow priority, corner cases and external dependencies.

Current Failures

Deterministic Failures

In Trending

(H) 3nb-spr hoststack: interface not up after first test

(H) Zero traffic reported in udpquic tests

(H) 3na-spr, 2n-zn2: VPP fails to start in first test cases if dpdk/mlx5

(M) DPDK 23.03 testpmd startup fails on some testbeds

(M) 2n-spr: zero traffic on cx7 rdma

(M) Lossy trials in nat udp mlx5 tests

(L) 3n-icx: negative ipackets on TB38 AVF 4c l2patch

(L) 2n-tx2: af_xdp mrr failures

Not In Trending

(M) tests with 9000B payload frames not forwarded over memif interfaces

  • last update: 2023-02-09
  • work-to-fix: hard
  • test: 9000B + memif
  • testbed: 2n-skx, 3n-skx, 2n-clx
  • examples: 2n-skx Memif
  • ticket: CSIT-1808, VPP-2091

(M) IMIX 4c tests may fail PDR due to ~10% loss

(L) l3fwd error in 200Ge2P1Cx7Veat-Mlx5 test with 9000B

  • last update: 2023-06-28
  • work-to-fix: medium
  • test: 9000B + Cx7 with DPDK DUT
  • testbed: 2n-icx
  • examples: [1]
  • ticket: CSIT-1924

(L) 2n-clx, 2n-icx: nat44ed cps 16M sessions scale fail

(L) 2n-clx, 2n-icx: nat44det imix 1M sessions fails to create sessions

Occasional Failures

In Trending

(H) 2n-icx: NFV density VPP does not start in container

  • last update: before 2023-01-31
  • work-to-fix: hard
  • rca:
  • test: all subsequent
  • frequency: medium
  • testbed: 2n-icx
  • example: 2n-icx mrr, 2n-icx ndrpdr
  • ticket: CSIT-1881
  • note: Once VPP breaks, all subsequent tests fail, and all subsequent builds keep failing until Peter makes the testbed work again. Although the failure itself occurs with medium frequency, each occurrence breaks all subsequent builds on the testbed, hence the (H) priority.

(M) 3n-alt: high scale ipsec policy tests may crash VPP

(M) 3n-icx, 3n-snr: wireguard 100 and 1000 tunnels mlrsearch tests failing with 2c and 4c

  • last update: before 2023-01-31
  • work-to-fix: easy
  • rca:
  • test: wireguard 100 tunnels and more
  • frequency: high
  • testbed: 3n-icx, 3n-snr
  • examples: 3n-icx
  • ticket: CSIT-1886

(M) e810cq sometimes reporting link down

Rare Failures

In Trending

(M) 3n-icx, 3n-snr: 1518B IPsec packets not passing

(M) all testbeds: mlrsearch fails to find NDR rate

  • last update: before 2023-06-22
  • work-to-fix: hard
  • rca: On 3n-tsh, the symptom is TRex reporting ierrors, only in one direction. Other testbeds may have a different symptom, but failures there are less frequent.
  • test: Crypto, Ip4, L2, Srv6, Vm Vhost (all packet sizes, all core configurations affected)
  • frequency: low
  • testbed: 3n-tsh, 3n-alt, 2n-clx
  • example: [2]
  • ticket: CSIT-1804

(M) all testbeds: AF_XDP mlrsearch fails to find NDR rate

  • last update: before 2023-01-31
  • work-to-fix: hard
  • rca:
  • test: af-xdp multicore tests
  • frequency: low
  • testbed: 2n-clx, 2n-skx, 2n-tx2, 2n-icx
  • example: 2n-skx, 2n-clx
  • ticket: CSIT-1802
  • note: This is mainly observed in iterative and coverage jobs. The frequency is very low, about 1 in 100.

(L) all testbeds: vpp create avf interface failure in multi-core configs

  • last update: 2023-02-06
  • work-to-fix: hard
  • rca: issue in Intel FVL driver
  • test: multicore AVF
  • frequency: low
  • testbed: all testbeds
  • example: 2n-clx, 3n-icx
  • ticket: CSIT-1782
  • note: A long-standing issue without a permanent fix.

(L) all testbeds: nat44det 4M and 16M scale 1 session not established

  • last update: 2023-02-14
  • work-to-fix: hard
  • rca: unknown
  • test: nat44det udp 4m and 16m (64k is ok; 1m can fail, but more rarely than the bigger scales)
  • frequency: low
  • testbed: 2n-zn2, 2n-skx, 2n-icx, 2n-clx
  • example: 2n-zn2, 2n-clx
  • ticket: CSIT-1795

(L) TRex may wrongly detect link bandwidth

Past Failures