Difference between revisions of "CSIT/TestFailuresTracking"

From fd.io
(Add CSIT-1913.)
((L) 3n-snr: Increased heap size in ipsec policy tests prevents VPP from starting: Add example link.)
 
(89 intermediate revisions by the same user not shown)
Line 25: Line 25:
 
=== In Trending ===
 
=== In Trending ===
  
==== (H) AVF suite setup fails if previous suite was also AVF ====
+
==== (H) 3n spr: Unusable performance of ipsec tests with SHA_256_128 ====
  
* last update: 2023-05-15
+
* last update: 2024-08-14
* work-to-fix: low
+
* work-to-fix: low (if you are Damjan)
* rca: After a recent change, CSIT attempts to bind an already bound driver.
+
* rca: It seems the compiler emits wrong instructions; the VPP build system needs to be fixed.
* test: All AVF suites if the previous suite running on the NIC was also AVF.
+
* test: Three new ipsec tests that use SHA_256_128 as integrity algorithm.
* frequency: always since 2023-05-09
+
* frequency: 50%
* testbed: all (if having AVF supported NIC)
+
* testbed: 3na-spr, 3nb-spr
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/265/log.html.gz#s1-s1-s1-s3-s2-k1-k5-k3-k1-k1-k1-k1-k1-k1-k1-k1-k1-k1-k1
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-3na-spr/57/log.html.gz#s1-s1-s1-s1-s3-t1-k3-k7-k1-k1-k1-k8-k10-k1-k1-k1-k12]
* ticket: [https://jira.fd.io/browse/CSIT-1913 CSIT-1913]
+
* ticket: [https://jira.fd.io/browse/VPP-2118 VPP-2118]
 +
* note: Other Xeon testbeds are also affected, but performance is not bad enough to fail NDR. ARM is not affected at all.
  
==== (M) 2n-icx: interface down in nginx tests ====
+
==== (H) 3nb-spr hoststack: interface not up after first test ====
  
* last update: 2023-05-03
+
* last update: 2024-02-07
 
* work-to-fix: medium
 
* work-to-fix: medium
* rca: Likely an infra issue for TG with AB on NIC using ICE.
+
* rca: After the first test, HundredGigabitEthernetab/0/0 never goes up within the run. Not sure which part of the test setup is missing; the tests do work correctly on other testbeds.
* test: All nginx tests except for xxv710 with dpdk driver.
+
* test: All subsequent tests.
* frequency: always since 2023-04-28
+
* frequency: 100%
* testbed: 2n-icx
+
* testbed: 3nb-spr
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-2n-icx/51/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k8-k3-k2-k1
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-3nb-spr/87/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k3-k7-k7]
* ticket: [https://jira.fd.io/browse/CSIT-1910 CSIT-1910]
+
* ticket: [https://jira.fd.io/browse/CSIT-1942 CSIT-1942]
  
==== (M) hoststack: ip4udpscale1cl10s-ldpreload-iperf3 times out waiting for strace ====
+
==== (H) Zero traffic reported in udpquic tests ====
  
* last update: 2023-04-19
+
* last update: 2024-02-07
 
* work-to-fix: medium
 
* work-to-fix: medium
* rca:  
+
* rca: There are errors when closing sessions. The current CSIT logic happens to report this as a passing test with zero traffic, which is wrong.
* test: Only the two ip4udpscale1cl10s-ldpreload-iperf3 tests
+
* test: All udpquic tests
* frequency: always since 2023-03-04
+
* frequency: 100%
* testbed: 3n-icx
+
* testbed: all running the tests
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-3n-icx/41/log.html.gz#s1-s1-s1-s1-s8-t1-k2-k5-k12
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-3n-icx/217/log.html.gz#s1-s1-s1-s1-s4-t1-k2-k4-k17]
* ticket: [https://jira.fd.io/browse/CSIT-1908 CSIT-1908]
+
* ticket: [https://jira.fd.io/browse/CSIT-1935 CSIT-1935]
  
==== (M) 3n-snr: All hwasync wireguard tests failing when trying to verify device ====
+
==== (M) DPDK 23.03 testpmd startup fails on some testbeds ====
  
* last update: before 2023-01-31
+
* last update: 2023-11-06
* work-to-fix: hard
+
* rca: Missing QAT driver. Symptom: Failed to bind PCI device 0000:f4:00.0 to c4xxx on host 10.30.51.93
+
* test: hwasync wireguard
+
* frequency: always
+
* testbed: 3n-snr
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-snr/95/log.html.gz#s1-s1-s1-s3-s1 3n-snr]
+
* ticket: [https://jira.fd.io/browse/CSIT-1883 CSIT-1883]
+
 
+
==== (M) 1n-aws: TRex mlrsearch fails to find NDR & PDR due to AWS rate limiting (5min total test duration) ====
+
 
+
* last update: 2023-02-09
+
* work-to-fix: hard
+
* rca:
+
* test: ip4scale2m
+
* frequency: always
+
* testbed: 1n-aws
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-trex-perf-ndrpdr-weekly-master-1n-aws/18/log.html.gz#s1-s1-s1-s1-s2-t1 1n-aws]
+
* ticket: [https://jira.fd.io/browse/CSIT-1876 CSIT-1876]
+
* note: The root cause can be shared environment in aws cloud. We may need to use a smaller scale there.
+
 
+
==== (M) 3n-alt, 3n-snr: testpmd no traffic forwarded ====
+
 
+
* last update: 2023-02-09
+
 
* work-to-fix: medium
 
* work-to-fix: medium
* rca: DUT-DUT link takes too long to come up on some testbeds. This happens *after* a test case with a DPDK app (not VPP even when using dpdk plugin), although multiple subsequent tests (even with VPP) may be affected. The real cause is probably in NIC firmware or driver, but CSIT can be better at detecting port status as a workaround.
+
* rca: The DUT-DUT link sometimes does not go up. The same consequences as CSIT-1848 but affects more testbed+NIC combinations. Can be fixed by a different startup procedure (better detection before restart).
* test: testpmd (also l3fwd but hidden by CSIT-1896)
+
* test: Testpmd (no vpp). Around half of testbed+NIC combinations are affected.
* frequency: always (almost)
+
* frequency: Low, ~1%, as in most testbeds the link does go up.
* testbed: 3n-alt, 3n-snr
+
* testbed: multiple
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-mrr-weekly-master-3n-alt/42/log.html.gz#s1-s1-s1-s1-t1 3n-alt], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2210-3n-snr/6/log.html.gz#s1-s1-s1-s1-t1 3n-snr], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2210-3n-snr/14/log.html.gz#s1-s1-s1-s1-t1 3n-snr]
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2310-2n-zn2/8/log.html.gz#s1-s1-s1-s1-t6-k2-k4]
* ticket: [https://jira.fd.io/browse/CSIT-1848 CSIT-1848]
+
* ticket: [https://jira.fd.io/browse/CSIT-1904 CSIT-1904]
 
+
==== (M) 3n-alt: Tests failing until 40Ge Interface comes up ====
+
 
+
* last update: 2023-02-09
+
* work-to-fix: medium
+
* rca: DUT-DUT link takes too long to come up due to CSIT-1848.
+
* test: first tests in order
+
* frequency: always (almost, depends on run order)
+
* testbed: 3n-alt (3n-snr link does not take that long)
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-alt/155/log.html.gz#s1-s1-s1-s1-s1-t1
+
* ticket: [https://jira.fd.io/browse/CSIT-1890 CSIT-1890]
+
 
+
==== (M) 2n-spr 200Ge2P1Cx7Veat: TRex sees port line rate as 100 Gbps ====
+
 
+
* last update: 2023-04-19
+
* work-to-fix: medium
+
* rca: TRex has hard cap on perceived line rate.
+
* test: All tests on 2n-spr with 200Ge2P1Cx7Veat NIC.
+
* frequency: always (since 2n-spr was set up)
+
* testbed: 2n-spr
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-spr/15/log.html.gz#s1-s1-s1-s5-s19-t1-k2-k9-k9-k10-k1-k1-k1-k11
+
* ticket: [https://jira.fd.io/browse/CSIT-1905 CSIT-1905]
+
* note: Fix will be in TRex v3.03, possible workarounds being discussed.
+
  
 
==== (M) 2n-spr: zero traffic on cx7 rdma ====
 
==== (M) 2n-spr: zero traffic on cx7 rdma ====
  
* last update: 2023-04-19
+
* last update: 2023-06-22
 
* work-to-fix: medium
 
* work-to-fix: medium
 
* rca: VPP reports "tx completion errors", more investigation ongoing.
 
* rca: VPP reports "tx completion errors", more investigation ongoing.
Line 123: Line 78:
 
* frequency: always (since 2n-spr was set up)
 
* frequency: always (since 2n-spr was set up)
 
* testbed: 2n-spr
 
* testbed: 2n-spr
 +
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-spr/58/log.html.gz#s1-s1-s1-s5-s19-t1-k3-k7-k1-k1-k1-k8-k14-k1-k1-k1-k1]
 
* ticket: [https://jira.fd.io/browse/CSIT-1906 CSIT-1906]
 
* ticket: [https://jira.fd.io/browse/CSIT-1906 CSIT-1906]
* note: Currently not visible in trending as CSIT-1905 hits first.
+
* note: Also would affect 3n-alt with mlxc6 and rdma. Will probably be made invisible by removing rdma (except mlxc5) from jobspecs.
  
==== (M) 2n-clx: DPDK 23.03 link failures ====
+
==== (M) 3n-icx 3nb-spr: Failed to enable GTPU offload RX ====
  
* last update: 2023-04-17
+
* last update: 2024-08-14
* work-to-fix: medium
+
* work-to-fix: low
* rca: No link comes up on some NICs, investigation continues.
+
* rca: Retval is -1. More examination is needed to understand why.
* test: Testpmd (no vpp). Around half of testbed+NIC combinations are affected.
+
* test: any gtpuhw
* frequency: always since 23.03.0 got released
+
* frequency: 100%
* testbed: multiple
+
* testbed: all
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-mrr-weekly-master-2n-clx/189/log.html.gz#s1-s1-s1-s1-t1-k2-k4
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-3nb-spr/63/log.html.gz#s1-s1-s1-s3-s3-t1-k2-k8-k17-k1]
* ticket: [https://jira.fd.io/browse/CSIT-1904 CSIT-1904]
+
* ticket: [https://jira.fd.io/browse/CSIT-1953 CSIT-1953]
 +
 
 +
==== (M) Lossy trials in nat udp mlx5 tests ====
 +
 
 +
* last update: 2024-02-07
 +
* work-to-fix: hard
 +
* rca: VPP counters suggest the packet is lost somewhere between TG on in-side [1] and VPP [2].
 +
* test: It is affecting only cx7 with mlx5 driver (not e810cq with avf driver), only udp tests (not tcp or other), and interestingly both ASTF (both cps and tput for nat44ed) and STL (det44) traffic profiles. It does not affect corresponding ASTF tests with ip4base routing.
 +
* frequency: depends on scale, 100% on high scale tests.
 +
* testbed: 2n-icx, 2n-spr.
 +
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-spr/45/log.html.gz#s1-s1-s1-s2-s49-t2-k2-k14-k14]
 +
* ticket: [https://jira.fd.io/browse/CSIT-1929 CSIT-1929]
  
 
==== (L) 3n-icx: negative ipackets on TB38 AVF 4c l2patch ====
 
==== (L) 3n-icx: negative ipackets on TB38 AVF 4c l2patch ====
Line 145: Line 112:
 
* frequency: always
 
* frequency: always
 
* testbed: 3n-icx (only TB38, never TB37)
 
* testbed: 3n-icx (only TB38, never TB37)
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/214/log.html.gz#s1-s1-s1-s5-s3-t3-k2-k9-k8-k13-k1-k2
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/214/log.html.gz#s1-s1-s1-s5-s3-t3-k2-k9-k8-k13-k1-k2]
 
* ticket: [https://jira.fd.io/browse/CSIT-1901 CSIT-1901]
 
* ticket: [https://jira.fd.io/browse/CSIT-1901 CSIT-1901]
 +
 +
==== (L) 2n-tx2: af_xdp mrr failures ====
 +
 +
* last update: 2023-11-08
 +
* work-to-fix:
 +
* rca: Some workers see no traffic, "error af_xdp_device_input_refill_db: rx poll() failed: Bad address" in show hardware. More examination needed.
 +
* test: ip4 and ip6, base and scale
 +
* frequency: more than 80% on a subset of cases, 100% on (multicore) 2n-tx2
 +
* testbed: 2n-tx2; to a lesser extent also 2n-clx and 2n-icx, where decreased performance (and failure in ndrpdr) is the more likely outcome.
 +
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-spr/120/log.html.gz#s1-s1-s1-s2-s2-t2-k2-k10-k9-k14-k1-k1-k1-k1]
 +
* ticket: [https://jira.fd.io/browse/CSIT-1922 CSIT-1922]
  
 
=== Not In Trending ===
 
=== Not In Trending ===
  
==== (M) all testbeds: some vpp 9000B tests ====
+
==== (M) Combination of AVF and vhost drops all 9000B packets ====
  
* last update: 2023-02-09
+
* last update: 2024-08-14
* work-to-fix: hard
+
* work-to-fix: medium
* rca: VPP code: [https://gerrit.fd.io/r/c/vpp/+/34839 34839: dpdk: cleanup MTU handling]. CSIT needs to rework how it sets MTU / max frame rate (CSIT-1797). Some tests will continue failing due to missing support on VPP side, we will open specific Jira tickets for those.
+
* rca: Buffer alloc error is seen, not sure why that happens.
* test: see sub-items
+
* test: 9000B vhost
* frequency: always
+
* frequency: 100%
 
* testbed: all
 
* testbed: all
* examples: see sub-items
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2406-2n-icx/33/log.html.gz#s1-s1-s1-s1-s1-t6-k3-k7-k1-k1-k1-k8-k14-k1-k1-k1-k1]
* ticket: [https://jira.fd.io/browse/CSIT-1809 CSIT-1809]
+
* ticket: [https://jira.fd.io/browse/CSIT-1951 CSIT-1951]
* gerrit: https://gerrit.fd.io/r/c/csit/+/37824
+
* note: Sometimes VPP crashes, not sure if the cause is the same.
  
===== (M) tests with 9000B payload frames not forwarded over vhost interfaces =====
+
==== (M) 9000B tests with encap overhead and non-dpdk plugins see fragmented packets ====
  
* last update: 2023-02-09
+
* last update: 2024-08-14
* work-to-fix: hard
+
* work-to-fix: medium
* test: 9000B + vhostuser
+
* rca: Some internal MTU is at 9000 (not 9200). More examination needed to see if the issue is in VPP or CSIT.
* testbed: 2n-skx, 3n-skx, 2n-clx
+
* test: 9000B testcases for loadbalance, geneve, vxlan or SRv6
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2202-3n-skx/67/log.html.gz#s1-s1-s1-s1-s1 3n-skx vhostuser]
+
* frequency: 100%
* ticket: [https://jira.fd.io/browse/CSIT-1809 CSIT-1809]
+
* testbed: all
 +
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2406-3n-alt/22/log.html.gz#s1-s1-s1-s1-s1-t7-k3-k7-k1-k1-k1-k8-k14-k2-k1-k1-k1-k1]
 +
* ticket: [https://jira.fd.io/browse/CSIT-1950 CSIT-1950]
 +
* note: Some tests drop packets (affects all tests), some tests fragment packets (does not fail MRR).
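
The rca above (an internal MTU of 9000 instead of 9200) can be illustrated with a small size check. This is only a sketch under stated assumptions: it assumes 9000B is the frame size before encapsulation, it treats the limit as a maximum frame size (glossing over L2 versus L3 accounting), and it uses the standard 50-byte overhead of VXLAN over IPv4 as the example. The helper name is hypothetical and not part of CSIT or VPP; other encapsulations in the list have different overheads.

```python
# Standard VXLAN-over-IPv4 overhead: outer Ethernet + IPv4 + UDP + VXLAN headers.
VXLAN_IPV4_OVERHEAD = 14 + 20 + 8 + 8


def fits_after_encap(frame_size: int, overhead: int, limit: int) -> bool:
    """Return True if the encapsulated frame does not exceed the limit (hypothetical helper)."""
    return frame_size + overhead <= limit


# Size of a 9000B frame once the example overhead is added.
encapsulated = 9000 + VXLAN_IPV4_OVERHEAD

# With a 9000 limit the encapsulated frame is too large; with 9200 it fits.
fits_9000 = fits_after_encap(9000, VXLAN_IPV4_OVERHEAD, 9000)
fits_9200 = fits_after_encap(9000, VXLAN_IPV4_OVERHEAD, 9200)
```

Under these assumptions the encapsulated frame exceeds a 9000 limit but stays within 9200, which would be consistent with the drops or fragmentation described in the note.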
  
===== tests with 9000B payload frames not forwarded over memif interfaces =====
+
==== (M) IMIX 4c tests may fail PDR due to ~10% loss ====
  
* last update: 2023-02-09
+
* last update: 2024-02-07
 
* work-to-fix: hard
 
* work-to-fix: hard
* test: 9000B + memif
+
* rca: Only seen in coverage tests so far, more data needed.
* testbed: 2n-skx, 3n-skx, 2n-clx
+
* test: various high-performance, mostly mlx5
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2202-2n-skx/33/log.html.gz#s1-s1-s1-s1-s1 2n-skx Memif]
+
* frequency: <1%
* ticket: [https://jira.fd.io/browse/CSIT-1808 CSIT-1808]
+
* testbed: 2n-icx, 3n-icx, 3na-spr
 +
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-coverage-2310-2n-icx/1/log.html.gz#s1-s1-s1-s3-t9-k2-k5-k16-k9]
 +
* ticket: [https://jira.fd.io/browse/CSIT-1943 CSIT-1943]
  
===== 9000B payload frames not forwarded over tunnels due to violating supported Max Frame Size (VxLAN, LISP, SRv6) =====
+
==== (L) 3n-snr: Increased heap size in ipsec policy tests prevents VPP from starting ====
  
* last update: 2023-02-09
+
* last update: 2024-08-14
* work-to-fix: medium
+
* work-to-fix: low
* test: 9000B + (IP4 tunnels VXLAN, IP4 tunnels LISP, Srv6, IpSec)
+
* rca: Probably not enough huge pages.
* testbed: 2n-icx, 3n-icx
+
* test: ipsec policy large scale
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-icx/10/log.html.gz 2n-icx VXLAN], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-3n-icx/22/log.html.gz#s1-s1-s1-s1-s1-t6 3n-icx]
+
* frequency: 100%
* ticket: [https://jira.fd.io/browse/CSIT-1801 CSIT-1801]
+
* testbed: 3n-snr
 +
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-3n-snr/131/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k5-k1-k1-k1-k1]
 +
* ticket: [https://jira.fd.io/browse/CSIT-1966 CSIT-1966]
 +
* note: Not even in coverage. Found when investigating VPP-2121.
  
===== (M) 9000b all AVF tests are failing to forward traffic =====
+
==== (L) Some tests have too long ramp-up trials ====
  
* last update: 2023-02-09
+
* last update: 2024-08-14
* work-to-fix: hard
+
* work-to-fix: low
* test: 9000B + AVF
+
* rca: Ramp-up trial should have high enough rate to fit into session timeout.
* testbed: 3n-icx
+
* test: NAT ASTF TPUT at large scale
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-3n-icx/13/log.html.gz#s1-s1-s1-s1-s1-t6 3n-icx ip4base]
+
* frequency: 100%
* ticket: [https://jira.fd.io/browse/CSIT-1885 CSIT-1885]
+
* testbed: all
 +
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-2n-zn2/57/log.html.gz#s1-s1-s1-s2-s19-t3-k2-k12-k14]
 +
* ticket: [https://jira.fd.io/browse/CSIT-1961 CSIT-1961]
 +
* note: We removed the largest scale from most jobs, only a few stragglers remain on coverage.
  
==== (M) 2n-clx, 2n-icx, 2n-zn2: DPDK testpmd 9000b tests on xxv710 nic are failing with no traffic ====
+
==== (L) Memif crashes VPP in container with jumbo frames ====
  
* last update: 2023-02-09
+
* last update: 2024-08-14
* work-to-fix: medium
+
* work-to-fix: hard (bug in VPP, no easy access to logs)
* rca: The DPDK app only attempts to set MTU once, but if interface is down (CSIT-1848) it fails. As a workaround, MTU could be set on Linux interface before starting the DPDK app.
+
* test: any memif with 9000B
* test: DPDK testpmd 9000b
+
* testbed: all
* frequency: always
+
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2406-2n-clx/25/log.html.gz#s1-s1-s1-s1-s1-t6-k3]
* testbed: 2n-clx, 2n-icx, 2n-zn2
+
* ticket: [https://jira.fd.io/browse/VPP-2091 VPP-2091]
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2210-2n-clx/1/log.html.gz#s1-s1-s1-s3-t6 2n-clx], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2210-2n-icx/3/log.html.gz#s1-s1-s1-s1-t6 2n-icx]
+
* ticket: [https://jira.fd.io/browse/CSIT-1870 CSIT-1870]
+
* note: Vratko will fix, either in general workaround for CSIT-1848 or in a separate change.
+
  
==== (M) 2n-clx, 2n-icx: all Geneve tests with 1024 tunnels fail ====
+
==== (L) l3fwd error in 200Ge2P1Cx7Veat-Mlx5 test with 9000B ====
  
* last update: before 2023-01-31
+
* last update: 2023-06-28
* work-to-fix: hard
+
* work-to-fix: medium
* rca: VPP crash, Failed to add IP neighbor on interface geneve_tunnel258
+
* test: 9000B + Cx7 with DPDK DUT
* test: avf-ethip4--ethip4udpgeneve-1024tun-ip4base 64B 1518B IMIX 1c 2c 4c
+
* testbed: 2n-icx
* frequency: always
+
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-coverage-2306-2n-icx/2/log.html.gz#s1-s1-s1-s4-t6-k2-k4]
* testbed: 2n-skx, 2n-clx, 2n-icx
+
* ticket: [https://jira.fd.io/browse/CSIT-1924 CSIT-1924]
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-icx/10/log.html.gz#s1-s1-s1-s1-s1 2n-icx]
+
* ticket: [https://jira.fd.io/browse/CSIT-1800 CSIT-1800]
+
  
 
==== (L) 2n-clx, 2n-icx: nat44ed cps 16M sessions scale fail ====
 
==== (L) 2n-clx, 2n-icx: nat44ed cps 16M sessions scale fail ====
  
* last update: before 2023-01-31
+
* last update: 2023-07-12
 
* work-to-fix: hard
 
* work-to-fix: hard
* rca: VPP crash, Failed to set NAT44 address range on host 10.30.51.44 (connections-per-second tests only)
+
* rca: Ramp-up trial takes more than 5 minutes so sessions are timing out.
 
* test: 64B-avf-ethip4tcp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c, 64B-avf-ethip4udp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c
 
* test: 64B-avf-ethip4tcp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c, 64B-avf-ethip4udp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c
 
* frequency: always
 
* frequency: always
 
* testbeds: 2n-skx, 2n-clx, 2n-icx
 
* testbeds: 2n-skx, 2n-clx, 2n-icx
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-icx/18/log.html.gz#s1-s1-s1-s1-s11-t3 2n-icx], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-clx/9/log.html.gz#s1-s1-s1-s1-s11-t1 2n-clx]
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2306-2n-clx/8/log.html.gz#s1-s1-s1-s1-s14-t2-k2-k15-k1]
 
* ticket: [https://jira.fd.io/browse/CSIT-1799 CSIT-1799]
 
* ticket: [https://jira.fd.io/browse/CSIT-1799 CSIT-1799]
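
The ramp-up constraint from the rca above can be made concrete with a little arithmetic. The sketch below assumes that the h262144-p63-s16515072 part of the test name stands for hosts, ports per host and total sessions, and that the relevant session timeout is 300 seconds (taken from the "5 minutes" in the rca). The function name is hypothetical and not part of CSIT.

```python
def min_ramp_up_rate(sessions: int, timeout_s: float) -> float:
    """Lowest session creation rate [sessions/s] that completes ramp-up within timeout_s."""
    if timeout_s <= 0:
        raise ValueError("timeout_s must be positive")
    return sessions / timeout_s


# Assumed meaning of the test name fields: h = hosts, p = ports per host.
hosts, ports = 262144, 63
sessions = hosts * ports  # 16515072, matching the s16515072 suffix

# Assumed session timeout of 5 minutes.
required_rate = min_ramp_up_rate(sessions, 300.0)
```

Under these assumptions the ramp-up trial has to create roughly 55 thousand sessions per second; any lower rate means the earliest sessions expire before the ramp-up finishes.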
  
 
==== (L) 2n-clx, 2n-icx: nat44det imix 1M sessions fails to create sessions ====
 
==== (L) 2n-clx, 2n-icx: nat44det imix 1M sessions fails to create sessions ====
  
* last update: before 2023-01-31
+
* last update: 2023-07-12
* work-to-fix: hard
+
* work-to-fix: medium
* rca:
+
* rca: One possible cause is CSIT not counting ramp-up rate properly for IMIX (if multiple packets belong to the same session).
 
* test: IMIX over 1M sessions bidir
 
* test: IMIX over 1M sessions bidir
 
* frequency: always
 
* frequency: always
 
* testbed: 2n-skx, 2n-clx, 2n-icx
 
* testbed: 2n-skx, 2n-clx, 2n-icx
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-icx/18/log.html.gz#s1-s1-s1-s1-s2-t4 2n-icx]
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2306-2n-clx/7/log.html.gz#s1-s1-s1-s1-s2-t4-k2-k11-k1-k2]
 
* ticket: [https://jira.fd.io/browse/CSIT-1884 CSIT-1884]
 
* ticket: [https://jira.fd.io/browse/CSIT-1884 CSIT-1884]
  
Line 247: Line 231:
 
=== In Trending ===
 
=== In Trending ===
  
==== (H) 2n-icx: NFV density VPP does not start in container ====
+
==== (H) 3n-icxd: Various symptoms pointing to hardware (cable/nic/driver) issues ====
  
* last update: before 2023-01-31
+
* last update: 2024-08-14
 +
* work-to-fix: medium
 +
* rca: Unknown hardware instability. Not sure if on SUT or TG side.
 +
* test: any on 3n-icxd
 +
* frequency: medium
 +
* testbed: 3n-icxd
 +
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-3n-icxd/58/log.html.gz#s1-s1-s1-s1-s18-t1-k2-k11-k9-k13-k7-k2]
 +
* ticket: [https://jira.fd.io/browse/CSIT-1963 CSIT-1963]
 +
* note: May be multiple issues, e.g. one hardware (loose cable) and one software (driver like in CSIT-1936).
 +
 
 +
==== (M) 3n-icx hoststack: Udpquicscale tests sometimes fail with various symptoms ====
 +
 
 +
* last update: 2024-08-14
 
* work-to-fix: hard
 
* work-to-fix: hard
* rca:
+
* rca: Cause unclear as symptoms vary. Maybe memory corruption?
* test: all subsequent
+
* test: Udpquicscale
 
* frequency: medium
 
* frequency: medium
* testbed: 2n-icx
+
* testbed: 3n-icx
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-weekly-master-2n-icx/57/log.html.gz 2n-icx mrr], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-icx/48/log.html.gz#s1-s1-s1-s5-s8-t1 2n-icx ndrpdr]
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-3n-icx/110/log.html.gz#s1-s1-s1-s1-s7-t1-k3-k4-k1]
* ticket: [https://jira.fd.io/browse/CSIT-1881 CSIT-1881]
+
* ticket: [https://jira.fd.io/browse/CSIT-1962 CSIT-1962]
* note: Once VPP breaks, all subsequent tests fail. Even all subsequent builds will be failing until Peter makes TB working again. Although it's failing with medium frequency when it happens it breaks all subsequent builds on the TB therefore [H] priority.
+
* note: Not visible in trending as udpscale already suffers from CSIT-1935.
  
==== (M) 2n-clx: e810 mlrsearch tests packets forwarding in one direction ====
+
==== (M) 3n-alt: high scale ipsec policy tests may crash VPP ====
  
* last update: before 2023-01-31
+
* last update: 2024-02-07
 
* work-to-fix: hard
 
* work-to-fix: hard
* rca:
+
* rca: VPP is crashing without creating a core.
* test: e810Cq ip4base, ip6base
+
* test: policy ipsec tests, large scale increases probability.
* frequency: high
+
* frequency: 15% and lower.
* testbed: 2n-clx
+
* testbed: 3n-alt, also seen rarely on 3n-icx.
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-clx/176/log.html.gz#s1-s1-s1-s2-s8-t1 2n-clx]
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-alt/357/log.html.gz#s1-s1-s1-s1-s15-t4-k3-k5-k1-k1]
* ticket: [https://jira.fd.io/browse/CSIT-1864 CSIT-1864]
+
* ticket: [https://jira.fd.io/browse/CSIT-1938 CSIT-1938]
 +
 
 +
==== (M) 3nb-spr: Wireguardhw tests are likely to crash ====
 +
 
 +
* last update: before 2024-08-14
 +
* work-to-fix: hard
 +
* rca: More investigation needed.
 +
* test: any wireguardhw
 +
* frequency: low
 +
* testbed: 3nb-spr
 +
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-3nb-spr/67/log.html.gz#s1-s1-s1-s3-s5]
 +
* ticket: [https://jira.fd.io/browse/CSIT-1964 CSIT-1964]
 +
* note: Symptoms are similar to CSIT-1938 or CSIT-1886, but both of those are frequent in high scale, this is not.
  
 
==== (M) 3n-icx, 3n-snr: wireguard 100 and 1000 tunnels mlrsearch tests failing with 2c and 4c ====
 
==== (M) 3n-icx, 3n-snr: wireguard 100 and 1000 tunnels mlrsearch tests failing with 2c and 4c ====
Line 280: Line 288:
 
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-icx/56/log.html.gz#s1-s1-s1-s3-s8-t4 3n-icx]
 
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-icx/56/log.html.gz#s1-s1-s1-s3-s8-t4 3n-icx]
 
* ticket: [https://jira.fd.io/browse/CSIT-1886 CSIT-1886]
 
* ticket: [https://jira.fd.io/browse/CSIT-1886 CSIT-1886]
 
 
==== (M) 3n-tsh: vpp in VM starting too slowly ====
 
  
* last update: before 2023-02-22
+
==== (M) e810cq sometimes reporting link down ====
* work-to-fix: medium
+
* rca: perhaps related to numa, investigation continues
+
* test: 3n-tsh: sporadic VM vhost
+
* frequency: high
+
* testbed: 3n-tsh
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-tsh/738/log.html.gz#s1-s1-s1-s7-s2-t1 3n-tsh], [https://jenkins.fd.io/view/csit/job/csit-vpp-perf-verify-master-3n-tsh/123/ 3n-tsh]
+
* ticket: [https://jira.fd.io/browse/CSIT-1877 CSIT-1877]
+
  
== Rare Failures ==
+
* last update: 2024-02-07
 +
* work-to-fix: hard
 +
* rca: Mostly causing the failure symptom of TRex complaining about link down. Probably also causes zero throughput in one direction in ASTF tests or even device test failures. More frequent with more performant tests (L2) but seen affecting any test on occasion.
 +
* test: Any that uses Intel-E810CQ NIC.
 +
* frequency: <20%
 +
* testbed: all with the NIC.
 +
* examples: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-spr/45/log.html.gz#s1-s1-s1-s2-s24-t1-k2-k12-k14
 +
* ticket: [https://jira.fd.io/browse/CSIT-1936 CSIT-1936]
  
=== In Trending ===
+
=== Not In Trending ===
  
==== (L) 3n-alt: hugepage leak ====
+
==== (H) sw_interface_add_del_address: avf process node failed to reply in 5 seconds ====
  
* last update: 2023-05-02
+
* last update: 2024-08-14
 
* work-to-fix: medium
 
* work-to-fix: medium
* rca:  
+
* rca: AVF processes API commands asynchronously, not prepared to receive multiple messages at once.
* test: All vhost tests when number of free hugepages gets low
+
* test: ipsec policy, but any other test using async PAPI processing may be affected.
* frequency: happened only once (2023-04-24) so far, but multiple runs were affected until huge pages were freed manually
+
* frequency: high on 3n-snr, less frequent on Xeon testbeds
* testbed: 3n-alt
+
* testbed: all, but with different frequencies
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-alt/210/log.html.gz#s1-s1-s1-s6-s1-t1-k2-k9-k6-k1
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/vpp-csit-verify-perf-master-ubuntu2204-x86_64-3n-snr/49/csit_current/0/log.html.gz#s1-s1-s1-s1-s1-t1-k3-k4-k1]
* ticket: [https://jira.fd.io/browse/CSIT-1911 CSIT-1911]
+
* ticket: [https://jira.fd.io/browse/VPP-2121 VPP-2121]
 +
* note: Only coverage jobs are running ipsec policy with AVF driver.
 +
 
 +
== Rare Failures ==
 +
 
 +
=== In Trending ===
  
 
==== (M) 3n-icx, 3n-snr: 1518B IPsec packets not passing ====
 
==== (M) 3n-icx, 3n-snr: 1518B IPsec packets not passing ====
Line 320: Line 331:
 
==== (M) all testbeds: mlrsearch fails to find NDR rate ====
 
==== (M) all testbeds: mlrsearch fails to find NDR rate ====
  
* last update: before 2023-04-19
+
* last update: before 2023-06-22
 
* work-to-fix: hard
 
* work-to-fix: hard
* rca: One (not sure whether only) possible symptom is ierrors on TRex side. Not sure it is TRex error or VPP sending mangled packets.
+
* rca: On 3n-tsh, the symptom is TRex reporting ierrors, only in one direction. Other testbeds may have a different symptom, but failures there are less frequent.
 
* test: Crypto, Ip4, L2, Srv6, Vm Vhost (all packet sizes, all core configurations affected)
 
* test: Crypto, Ip4, L2, Srv6, Vm Vhost (all packet sizes, all core configurations affected)
 
* frequency: low
 
* frequency: low
 
* testbed: 3n-tsh, 3n-alt, 2n-clx
 
* testbed: 3n-tsh, 3n-alt, 2n-clx
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-icx/57/log.html.gz#s1-s1-s1-s2-s37-t2 2n-icx]
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-tsh/14/log.html.gz#s1-s1-s1-s5-s8-t2]
 
* ticket: [https://jira.fd.io/browse/CSIT-1804 CSIT-1804]
 
* ticket: [https://jira.fd.io/browse/CSIT-1804 CSIT-1804]
  
Line 341: Line 352:
 
* note: This is mainly observed in iterative and coverage. It's very low frequency ~ 1 out of 100
 
* note: This is mainly observed in iterative and coverage. It's very low frequency ~ 1 out of 100
  
==== (L) all testbeds: vpp create avf interface failure in multi-core configs ====
+
==== (L) 2n-zn2: Geneve sometimes loses one direction of traffic ====
  
* last update: 2023-02-06
+
* last update: 2024-08-14
 
* work-to-fix: hard
 
* work-to-fix: hard
* rca: issue in Intel FVL driver
+
* rca: More investigation needed to see the mechanism.
* test: multicore AVF
+
* test: any geneve
 +
* frequency: <1%
 +
* testbed: 2n-zn2
 +
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-2n-zn2/54/log.html.gz#s1-s1-s1-s3-s1-t1-k2-k9-k14]
 +
* ticket: [https://jira.fd.io/browse/CSIT-1960 CSIT-1960]
 +
* note: Very rare, recently seen only as a failure in report iterative (not weekly trending) and in one run as MRR regression.
 +
 
 +
==== (L) Rare VPP crash in nat avf tests ====
 +
 
 +
* last update: 2024-08-14
 +
* work-to-fix: medium
 +
* rca: clib_dlist_remove called by nat44_session_update_lru, but probably just a memory corruption symptom
 +
* test: any NAT test with AVF
 
* frequency: low
 
* frequency: low
* testbed: all testbeds
+
* testbed: any
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-clx/1257/log.html.gz#s1-s1-s1-s5-s24-t2 2n-clx], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/197/log.html.gz#s1-s1-s1-s5-s1-t3 3n-icx]
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-2n-clx/79/log.html.gz#s1-s1-s1-s2-s37-t2-k3-k4-k1]
* ticket: [https://jira.fd.io/browse/CSIT-1782 CSIT-1782]
+
* ticket: [https://jira.fd.io/browse/CSIT-1947 CSIT-1947]
* note: A long standing issue without a final permanent fix.
+
* note: More frequent in soak tests.
 +
 
 +
==== (L) ipsec hwasync fails with large scale and multiple queues ====
 +
 
 +
* last update: 2024-08-14
 +
* work-to-fix: hard
 +
* test: ipsec hwasync
 +
* frequency: low
 +
* testbed: all with QAT
 +
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2406-3n-icxd/58/log.html.gz#s1-s1-s1-s1-s1-t3-k3-k4-k1]
 +
* ticket: [https://jira.fd.io/browse/CSIT-1946 CSIT-1946]
 +
* note: Frequency decreased when CSIT changed RXQ ratio in 40824.
  
 
==== (L) all testbeds: nat44det 4M and 16M scale 1 session not established ====
Line 363: Line 397:
 
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-zn2/672/log.html.gz#s1-s1-s1-s2-s22-t3 2n-zn2], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-clx/1271/log.html.gz#s1-s1-s1-s2-s60-t1-k2-k11-k1-k2 2n-clx]
 
* ticket: [https://jira.fd.io/browse/CSIT-1795 CSIT-1795]
 +
 +
==== (L) TRex may wrongly detect link bandwidth ====
 +
 +
* last update: 2024-02-07
 +
* work-to-fix: hard
 +
* rca: Quite rare failure affecting unpredictable tests. Perhaps a less severe symptom of CSIT-1936.
 +
* test: No obvious pattern due to low frequency
 +
* frequency: <0.4%
 +
* testbed: recently seen on 3n-tsh and 3nb-spr
 +
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2310-3nb-spr/28/log.html.gz#s1-s1-s1-s1-s4-t1-k2-k13-k14]
 +
* ticket: [https://jira.fd.io/browse/CSIT-1941 CSIT-1941]
 +
 +
=== Not In Trending ===
 +
 +
==== (L) Occasional failure on 1518B CX5: Trex failed to send message ====
 +
 +
* last update: 2024-08-14
 +
* work-to-fix: hard
 +
* rca: More investigation needed.
 +
* test: Any 1518B on CX5, any driver.
 +
* frequency: low
 +
* testbed: 2n-clx
 +
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2406-2n-clx/20/log.html.gz#s1-s1-s1-s1-s1-t4-k2-k9-k26-k11]
 +
* ticket: [https://jira.fd.io/browse/CSIT-1965 CSIT-1965]
 +
* note: Only in coverage, and 2n-clx has been decommissioned, but it may reappear if we move CX5 NICs to another testbed.
  
 
= Past Failures =
  
==== (M) 2n-clx: 100Ge2P1Cx556A not recognized in mlx5-core tests ====
+
==== (H) 3na-spr, 2n-zn2: VPP fails to start in first test cases if dpdk/mlx5 ====
  
* last update: 2023-05-15
+
* last update: 2024-02-07
 
* work-to-fix: medium
* rca:  
+
* rca: DUT1 fails to boot up in first test case. If DUT2 is present, it fails to start in second test case. Other test cases in the run are unaffected. This looks like an infra issue, ansible cleanup is doing something wrong, hard to tell what.
* test: All with 100Ge2P1Cx556A and mlx5-core. At least TB27 is affected.
+
* test: First test case, unless it uses AVF driver.
* frequency: always since 2023-04-28
+
* frequency: 100%
* testbed: 2n-clx
+
* testbed: 3na-spr, 2n-zn2
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-clx/1317/log.html.gz#s1-s1-s1-s5-s7-t1-k2-k6-k3-k1-k1-k1-k1-k1-k1-k3-k1-k1
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-3na-spr/139/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k3-k2-k5-k1-k1-k1-k1]
* ticket: [https://jira.fd.io/browse/CSIT-1909 CSIT-1909]
+
* ticket: [https://jira.fd.io/browse/CSIT-1939 CSIT-1939]
 +
 
 +
==== (H) 2n-icx: NFV density VPP does not start in container ====
 +
 
 +
* last update: before 2023-01-31
 +
* work-to-fix: hard
 +
* rca:
 +
* test: all subsequent
 +
* frequency: medium
 +
* testbed: 2n-icx
 +
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-weekly-master-2n-icx/57/log.html.gz 2n-icx mrr], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-icx/48/log.html.gz#s1-s1-s1-s5-s8-t1 2n-icx ndrpdr]
 +
* ticket: [https://jira.fd.io/browse/CSIT-1881 CSIT-1881]
 +
* note: Once VPP breaks, all subsequent tests fail. All subsequent builds also fail until Peter makes the TB work again. Although the failure has medium frequency, when it happens it breaks all subsequent builds on the TB, hence the [H] priority.
 +
 
 +
==== (M) 3n-icx, 3n-snr: 1518B IPsec packets not passing ====
 +
 
 +
* last update: before 2023-01-31
 +
* work-to-fix: hard
 +
* rca:
 +
* test: all AVF crypto
 +
* frequency: low
 +
* testbed: 3n-skx, 3n-icx, 3n-snr
 +
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/197/log.html.gz#s1-s1-s1-s1-s4-t1 3n-icx daily], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-snr/32/log.html.gz#s1-s1-s1-s1-s4-t1 3n-snr], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-icx/57/log.html.gz#s1-s1-s1-s1-s4-t1 3n-icx weekly]
 +
* ticket: [https://jira.fd.io/browse/CSIT-1827 CSIT-1827]
 +
 
 +
==== (L) all testbeds: vpp create avf interface failure in multi-core configs ====
 +
 
 +
* last update: 2023-02-06
 +
* work-to-fix: hard
 +
* rca: issue in Intel FVL driver
 +
* test: multicore AVF
 +
* frequency: low
 +
* testbed: all testbeds
 +
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-clx/1257/log.html.gz#s1-s1-s1-s5-s24-t2 2n-clx], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/197/log.html.gz#s1-s1-s1-s5-s1-t3 3n-icx]
 +
* ticket: [https://jira.fd.io/browse/CSIT-1782 CSIT-1782]
 +
* note: A long standing issue without a final permanent fix.

Latest revision as of 14:33, 14 August 2024

CSIT Test Failure Classification

All known CSIT failures grouped and listed in the following order:

  • Always failing followed by sometimes failing.
  • Always failing tests:
    • Most common use cases followed by less common.
  • Sometimes failing tests:
    • Most frequently failing followed by less frequently failing.
      • High frequency: 50%-100%
      • Medium frequency: 10%-50%
      • Low frequency: 0%-10%
    • Within each sub-group: most common use cases followed by less common.
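
The frequency bands listed above can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and since the listed ranges share their end points, assigning exactly 10% to "medium" and exactly 50% to "high" is an assumption, not something the page specifies.

```python
def frequency_bucket(failure_rate_percent: float) -> str:
    """Map a failure rate in percent to a frequency label from this page.

    Assumption: boundary values go to the higher bucket
    (10% -> medium, 50% -> high); the page does not say which way they fall.
    """
    if not 0.0 <= failure_rate_percent <= 100.0:
        raise ValueError("failure rate must be between 0 and 100 percent")
    if failure_rate_percent >= 50.0:
        return "high"
    if failure_rate_percent >= 10.0:
        return "medium"
    return "low"


# Example rates taken from frequency fields used on this page.
for rate in (100.0, 15.0, 0.4):
    print(f"{rate}% -> {frequency_bucket(rate)}")
```

With this mapping, an entry reported as "15% and lower" would sort into the medium band, and "<0.4%" into the low band.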

CSIT Test Fixing Priorities

Test fixing work priorities defined as follows:

  • (H)igh priority, most common use cases and most common test code.
  • (M)edium priority, specific HW and pervasive test code issue.
  • (L)ow priority, corner cases and external dependencies.

Current Failures

Deterministic Failures

In Trending

(H) 3n spr: Unusable performance of ipsec tests with SHA_256_128

  • last update: 2024-08-14
  • work-to-fix: low (if you are Damjan)
  • rca: It seems the compiler emits wrong instructions; the VPP build system needs to be fixed.
  • test: Three new ipsec tests that use SHA_256_128 as integrity algorithm.
  • frequency: 50%
  • testbed: 3na-spr, 3nb-spr
  • example: [1]
  • ticket: VPP-2118
  • note: Other Xeon testbeds are also affected, but performance is not bad enough to fail NDR. ARM is not affected at all.

(H) 3nb-spr hoststack: interface not up after first test

  • last update: 2024-02-07
  • work-to-fix: medium
  • rca: After the first test, HundredGigabitEthernetab/0/0 never goes up within the run. Not sure which part of test setup is missing; the tests do work correctly on other testbeds.
  • test: All subsequent tests.
  • frequency: 100%
  • testbed: 3nb-spr
  • example: [2]
  • ticket: CSIT-1942

(H) Zero traffic reported in udpquic tests

  • last update: 2024-02-07
  • work-to-fix: medium
  • rca: There are errors when closing sessions. The current CSIT logic happens to report this as a passing test with zero traffic, which is wrong.
  • test: All udpquic tests
  • frequency: 100%
  • testbed: all running the tests
  • example: [3]
  • ticket: CSIT-1935

(M) DPDK 23.03 testpmd startup fails on some testbeds

  • last update: 2023-11-06
  • work-to-fix: medium
  • rca: The DUT-DUT link sometimes does not go up. The same consequences as CSIT-1848 but affects more testbed+NIC combinations. Can be fixed by a different startup procedure (better detection before restart).
  • test: Testpmd (no vpp). Around half of testbed+NIC combinations are affected.
  • frequency: Low, ~1%, as in most testbeds the link does go up.
  • testbed: multiple
  • example: [4]
  • ticket: CSIT-1904

(M) 2n-spr: zero traffic on cx7 rdma

  • last update: 2023-06-22
  • work-to-fix: medium
  • rca: VPP reports "tx completion errors", more investigation ongoing.
  • test: All tests on 2n-spr with 200Ge2P1Cx7Veat NIC and RDMA driver.
  • frequency: always (since 2n-spr was set up)
  • testbed: 2n-spr
  • example: [5]
  • ticket: CSIT-1906
  • note: Also would affect 3n-alt with mlxc6 and rdma. Will probably be made invisible by removing rdma (except mlxc5) from jobspecs.

(M) 3n-icx 3nb-spr: Failed to enable GTPU offload RX

  • last update: 2024-08-14
  • work-to-fix: low
  • rca: Retval is -1. More examination is needed to understand why.
  • test: any gtpuhw
  • frequency: 100%
  • testbed: all
  • example: [6]
  • ticket: CSIT-1950

(M) Lossy trials in nat udp mlx5 tests

  • last update: 2024-02-07
  • work-to-fix: hard
  • rca: VPP counters suggest the packet is lost somewhere between TG on in-side [1] and VPP [2].
  • test: It is affecting only cx7 with mlx5 driver (not e810cq with avf driver), only udp tests (not tcp or other), and interestingly both ASTF (both cps and tput for nat44ed) and STL (det44) traffic profiles. It does not affect corresponding ASTF tests with ip4base routing.
  • frequency: depends on scale, 100% on high scale tests.
  • testbed: 2n-icx, 2n-spr.
  • example: [7]
  • ticket: CSIT-1929

(L) 3n-icx: negative ipackets on TB38 AVF 4c l2patch

  • last update: 2023-04-19
  • work-to-fix:
  • rca:
  • test: TB38 AVF 4c e810cq, earlier l2patch nowadays eth-l2xcbase
  • frequency: always
  • testbed: 3n-icx (only TB38, never TB37)
  • example: [8]
  • ticket: CSIT-1901

(L) 2n-tx2: af_xdp mrr failures

  • last update: 2023-11-08
  • work-to-fix:
  • rca: Some workers see no traffic, "error af_xdp_device_input_refill_db: rx poll() failed: Bad address" in show hardware. More examination needed.
  • test: ip4 and ip6, base and scale
  • frequency: more than 80% on a subset of cases, 100% on (multicore) 2n-tx2
  • testbed: 2n-tx2; to a lesser extent also 2n-clx and 2n-icx, where decreased performance (and failure in ndrpdr) is the more likely outcome.
  • example: [9]
  • ticket: CSIT-1922

Not In Trending

(M) Combination of AVF and vhost drops all 9000B packets

  • last update: 2024-08-14
  • work-to-fix: medium
  • rca: Buffer alloc error is seen, not sure why that happens.
  • test: 9000B vhost
  • frequency: 100%
  • testbed: all
  • example: [10]
  • ticket: CSIT-1951
  • note: Sometimes VPP crashes, not sure if the cause is the same.

(M) 9000B tests with encap overhead and non-dpdk plugins see fragmented packets

  • last update: 2024-08-14
  • work-to-fix: medium
  • rca: Some internal MTU is at 9000 (not 9200). More examination needed to see if the issue is in VPP or CSIT.
  • test: 9000B testcases for loadbalance, geneve, vxlan or SRv6
  • frequency: 100%
  • testbed: all
  • example: [11]
  • ticket: CSIT-1950
  • note: Some tests drop packets (affects all tests), some tests fragment packets (does not fail MRR).

(M) IMIX 4c tests may fail PDR due to ~10% loss

  • last update: 2024-02-07
  • work-to-fix: hard
  • rca: Only seen in coverage tests so far, more data needed.
  • test: various high-performance, mostly mlx5
  • frequency: <1%
  • testbed: 2n-icx, 3n-icx, 3na-spr
  • example: [12]
  • ticket: CSIT-1943

(L) 3n-snr: Increased heap size in ipsec policy tests prevents VPP from starting

  • last update: 2024-08-14
  • work-to-fix: low
  • rca: Probably not enough huge pages.
  • test: ipsec policy large scale
  • frequency: 100%
  • testbed: 3n-snr
  • example: [13]
  • ticket: CSIT-1966
  • note: Not even in coverage. Found when investigating VPP-2121.

(L) Some tests have too long ramp-up trials

  • last update: 2024-08-14
  • work-to-fix: low
  • rca: Ramp-up trial should have high enough rate to fit into session timeout.
  • test: NAT ASTF TPUT at large scale
  • frequency: 100%
  • testbed: all
  • example: [14]
  • ticket: CSIT-1961
  • note: We removed the largest scale from most jobs; only a few stragglers remain in coverage.

(L) Memif crashes VPP in container with jumbo frames

  • last update: 2024-08-14
  • work-to-fix: hard (bug in VPP, no easy access to logs)
  • test: any memif with 9000B
  • testbed: all
  • examples: [15]
  • ticket: VPP-2091

(L) l3fwd error in 200Ge2P1Cx7Veat-Mlx5 test with 9000B

  • last update: 2023-06-28
  • work-to-fix: medium
  • test: 9000B + Cx7 with DPDK DUT
  • testbed: 2n-icx
  • examples: [16]
  • ticket: CSIT-1924

(L) 2n-clx, 2n-icx: nat44ed cps 16M sessions scale fail

  • last update: 2023-07-12
  • work-to-fix: hard
  • rca: Ramp-up trial takes more than 5 minutes so sessions are timing out.
  • test: 64B-avf-ethip4tcp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c, 64B-avf-ethip4udp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c
  • frequency: always
  • testbeds: 2n-skx, 2n-clx, 2n-icx
  • example: [17]
  • ticket: CSIT-1799
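
The ramp-up constraint in the rca above can be checked with simple arithmetic. The sketch below assumes the session count encoded in the test name (s16515072) and treats the 5 minutes from the rca as a 300 second session timeout; the function name is hypothetical and is not part of CSIT.

```python
import math


def min_ramp_up_rate(sessions: int, timeout_s: float) -> int:
    """Smallest whole sessions-per-second rate that creates all sessions
    before the first one reaches the timeout."""
    return math.ceil(sessions / timeout_s)


SESSIONS = 16_515_072  # from the test name: s16515072
TIMEOUT_S = 300.0      # assumption: the 5 minutes mentioned in the rca

print(min_ramp_up_rate(SESSIONS, TIMEOUT_S))  # 55051
```

Under these assumptions, a ramp-up trial running slower than roughly 55 thousand new sessions per second cannot finish before the earliest sessions start timing out.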

(L) 2n-clx, 2n-icx: nat44det imix 1M sessions fails to create sessions

  • last update: 2023-07-12
  • work-to-fix: medium
  • rca: One possible cause is CSIT not counting ramp-up rate properly for IMIX (if multiple packets belong to the same session).
  • test: IMIX over 1M sessions bidir
  • frequency: always
  • testbed: 2n-skx, 2n-clx, 2n-icx
  • example: [18]
  • ticket: CSIT-1884

Occasional Failures

In Trending

(H) 3n-icxd: Various symptoms pointing to hardware (cable/nic/driver) issues

  • last update: 2024-08-14
  • work-to-fix: medium
  • rca: Unknown hardware instability. Not sure if on SUT or TG side.
  • test: any on 3n-icxd
  • frequency: medium
  • testbed: 3n-icxd
  • example: [19]
  • ticket: CSIT-1963
  • note: May be multiple issues, e.g. one hardware (loose cable) and one software (driver like in CSIT-1936).

(M) 3n-icx hoststack: Udpquicscale tests sometimes fail with various symptoms

  • last update: 2024-08-14
  • work-to-fix: hard
  • rca: Cause unclear as symptoms vary. Maybe memory corruption?
  • test: Udpquicscale
  • frequency: medium
  • testbed: 3n-icx
  • example: [20]
  • ticket: CSIT-1962
  • note: Not visible in trending as udpscale already suffers from CSIT-1935.

(M) 3n-alt: high scale ipsec policy tests may crash VPP

  • last update: 2024-02-07
  • work-to-fix: hard
  • rca: VPP is crashing without creating a core.
  • test: policy ipsec tests, large scale increases probability.
  • frequency: 15% and lower.
  • testbed: 3n-alt, also seen rarely on 3n-icx.
  • example: [21]
  • ticket: CSIT-1938

(M) 3nb-spr: Wireguardhw tests are likely to crash

  • last update: before 2024-08-14
  • work-to-fix: hard
  • rca: More investigation needed.
  • test: any wireguardhw
  • frequency: low
  • testbed: 3nb-spr
  • examples: [22]
  • ticket: CSIT-1964
  • note: Symptoms are similar to CSIT-1938 or CSIT-1886, but both of those are frequent in high scale, this is not.

(M) 3n-icx, 3n-snr: wireguard 100 and 1000 tunnels mlrsearch tests failing with 2c and 4c

  • last update: before 2023-01-31
  • work-to-fix: easy
  • rca:
  • test: wireguard 100 tunnels and more
  • frequency: high
  • testbed: 3n-icx, 3n-snr
  • examples: 3n-icx
  • ticket: CSIT-1886

(M) e810cq sometimes reporting link down

Not In Trending

(H) sw_interface_add_del_address: avf process node failed to reply in 5 seconds

  • last update: 2024-08-14
  • work-to-fix: medium
  • rca: AVF processes API commands asynchronously, not prepared to receive multiple messages at once.
  • test: ipsec policy, but any other test using async PAPI processing may be affected.
  • frequency: high on 3n-snr, less frequent on Xeon testbeds
  • testbed: all, but with different frequencies
  • example: [23]
  • ticket: VPP-2121
  • note: Only coverage jobs are running ipsec policy with AVF driver.

Rare Failures

In Trending

(M) 3n-icx, 3n-snr: 1518B IPsec packets not passing

(M) all testbeds: mlrsearch fails to find NDR rate

  • last update: before 2023-06-22
  • work-to-fix: hard
  • rca: On 3n-tsh, the symptom is TRex reporting ierrors, only in one direction. Other testbeds may have a different symptom, but failures there are less frequent.
  • test: Crypto, Ip4, L2, Srv6, Vm Vhost (all packet sizes, all core configurations affected)
  • frequency: low
  • testbed: 3n-tsh, 3n-alt, 2n-clx
  • example: [24]
  • ticket: CSIT-1804

(M) all testbeds: AF_XDP mlrsearch fails to find NDR rate

  • last update: before 2023-01-31
  • work-to-fix: hard
  • rca:
  • test: af-xdp multicore tests
  • frequency: low
  • testbed: 2n-clx, 2n-skx, 2n-tx2, 2n-icx
  • example: 2n-skx, 2n-clx
  • ticket: CSIT-1802
  • note: This is mainly observed in iterative and coverage. The frequency is very low, about 1 out of 100.

(L) 2n-zn2: Geneve sometimes loses one direction of traffic

  • last update: 2024-08-14
  • work-to-fix: hard
  • rca: More investigation needed to see the mechanism.
  • test: any geneve
  • frequency: <1%
  • testbed: 2n-zn2
  • example: [25]
  • ticket: CSIT-1960
  • note: Very rare, recently seen only as a failure in report iterative (not weekly trending) and in one run as MRR regression.

(L) Rare VPP crash in nat avf tests

  • last update: 2024-08-14
  • work-to-fix: medium
  • rca: clib_dlist_remove called by nat44_session_update_lru, but probably just a memory corruption symptom
  • test: any NAT test with AVF
  • frequency: low
  • testbed: any
  • example: [26]
  • ticket: CSIT-1947
  • note: More frequent in soak tests.

(L) ipsec hwasync fails with large scale and multiple queues

  • last update: 2024-08-14
  • work-to-fix: hard
  • test: ipsec hwasync
  • frequency: low
  • testbed: all with QAT
  • example: [27]
  • ticket: CSIT-1946
  • note: Frequency decreased when CSIT changed RXQ ratio in 40824.

(L) all testbeds: nat44det 4M and 16M scale 1 session not established

  • last update: 2023-02-14
  • work-to-fix: hard
  • rca: unknown
  • test: nat44det udp 4m and 16m (64k is ok, 1m can fail but more rarely than bigger scales)
  • frequency: low
  • testbed: 2n-zn2, 2n-skx, 2n-icx, 2n-clx
  • example: 2n-zn2, 2n-clx
  • ticket: CSIT-1795

(L) TRex may wrongly detect link bandwidth

  • last update: 2024-02-07
  • work-to-fix: hard
  • rca: Quite rare failure affecting unpredictable tests. Perhaps a less severe symptom of CSIT-1936.
  • test: No obvious pattern due to low frequency
  • frequency: <0.4%
  • testbed: recently seen on 3n-tsh and 3nb-spr
  • example: [28]
  • ticket: CSIT-1941

Not In Trending

(L) Occasional failure on 1518B CX5: Trex failed to send message

  • last update: 2024-08-14
  • work-to-fix: hard
  • rca: More investigation needed.
  • test: Any 1518B on CX5, any driver.
  • frequency: low
  • testbed: 2n-clx
  • example: [29]
  • ticket: CSIT-1965
  • note: Only in coverage, and 2n-clx has been decommissioned, but it may reappear if we move CX5 NICs to another testbed.

Past Failures

(H) 3na-spr, 2n-zn2: VPP fails to start in first test cases if dpdk/mlx5

  • last update: 2024-02-07
  • work-to-fix: medium
  • rca: DUT1 fails to boot up in first test case. If DUT2 is present, it fails to start in second test case. Other test cases in the run are unaffected. This looks like an infra issue, ansible cleanup is doing something wrong, hard to tell what.
  • test: First test case, unless it uses AVF driver.
  • frequency: 100%
  • testbed: 3na-spr, 2n-zn2
  • example: [30]
  • ticket: CSIT-1939

(H) 2n-icx: NFV density VPP does not start in container

  • last update: before 2023-01-31
  • work-to-fix: hard
  • rca:
  • test: all subsequent
  • frequency: medium
  • testbed: 2n-icx
  • example: 2n-icx mrr, 2n-icx ndrpdr
  • ticket: CSIT-1881
  • note: Once VPP breaks, all subsequent tests fail. All subsequent builds also fail until Peter makes the TB work again. Although the failure has medium frequency, when it happens it breaks all subsequent builds on the TB, hence the [H] priority.

(M) 3n-icx, 3n-snr: 1518B IPsec packets not passing

(L) all testbeds: vpp create avf interface failure in multi-core configs

  • last update: 2023-02-06
  • work-to-fix: hard
  • rca: issue in Intel FVL driver
  • test: multicore AVF
  • frequency: low
  • testbed: all testbeds
  • example: 2n-clx, 3n-icx
  • ticket: CSIT-1782
  • note: A long standing issue without a final permanent fix.