Difference between revisions of "CSIT/TestFailuresTracking"

From fd.io
(CSIT-1877 is worked around enough to be moved to Past.)
((M) tests with 9000B payload frames not forwarded over memif interfaces =: Add VPP-2091 as the other ticket.)
 
(41 intermediate revisions by the same user not shown)
Line 25: Line 25:
 
=== In Trending ===
 
=== In Trending ===
  
==== (M) 3n-snr: All hwasync wireguard tests failing when trying to verify device ====
+
==== (H) 3nb-spr hoststack: interface not up after first test ====
  
* last update: before 2023-01-31
+
* last update: 2024-02-07
* work-to-fix: hard
+
* work-to-fix: medium
* rca: Missing QAT driver. Symptom: Failed to bind PCI device 0000:f4:00.0 to c4xxx on host 10.30.51.93
+
* rca: After the first test, HundredGigabitEthernetab/0/0 never goes up within the run. Not sure which part of test setup is missing; the tests do work correctly on other testbeds.
* test: hwasync wireguard
+
* test: All subsequent tests.
* frequency: always
+
* frequency: 100%
* testbed: 3n-snr
+
* testbed: 3nb-spr
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-snr/95/log.html.gz#s1-s1-s1-s3-s1 3n-snr]
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-3nb-spr/87/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k3-k7-k7
* ticket: [https://jira.fd.io/browse/CSIT-1883 CSIT-1883]
+
* ticket: [https://jira.fd.io/browse/CSIT-1942 CSIT-1942]
  
==== (M) 3n-alt, 3n-snr: testpmd no traffic forwarded ====
+
==== (H) Zero traffic reported in udpquic tests ====
  
* last update: 2023-02-09
+
* last update: 2024-02-07
 
* work-to-fix: medium
 
* work-to-fix: medium
* rca: DUT-DUT link takes too long to come up on some testbeds. This happens *after* a test case with a DPDK app (not VPP even when using dpdk plugin), although multiple subsequent tests (even with VPP) may be affected. The real cause is probably in NIC firmware or driver, but CSIT can be better at detecting port status as a workaround.
+
* rca: There are errors when closing sessions. The current CSIT logic happens to report this as a passing test with zero traffic, which is wrong.
* test: testpmd (also l3fwd but hidden by CSIT-1896)
+
* test: All udpquic tests
* frequency: always (almost)
+
* frequency: 100%
* testbed: 3n-alt, 3n-snr
+
* testbed: all running the tests
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-mrr-weekly-master-3n-alt/42/log.html.gz#s1-s1-s1-s1-t1 3n-alt], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2210-3n-snr/6/log.html.gz#s1-s1-s1-s1-t1 3n-snr], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2210-3n-snr/14/log.html.gz#s1-s1-s1-s1-t1 3n-snr]
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-3n-icx/217/log.html.gz#s1-s1-s1-s1-s4-t1-k2-k4-k17
* ticket: [https://jira.fd.io/browse/CSIT-1848 CSIT-1848]
+
* ticket: [https://jira.fd.io/browse/CSIT-1935 CSIT-1935]
 +
 
 +
==== (H) 3na-spr, 2n-zn2: VPP fails to start in first test cases if dpdk/mlx5 ====
 +
 
 +
* last update: 2024-02-07
 +
* work-to-fix: medium
 +
* rca: DUT1 fails to boot up in the first test case. If DUT2 is present, it fails to start in the second test case. Other test cases in the run are unaffected. This looks like an infra issue: ansible cleanup is probably doing something wrong, but it is hard to tell what.
 +
* test: First test case, unless it uses AVF driver.
 +
* frequency: 100%
 +
* testbed: 3na-spr, 2n-zn2
 +
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-3na-spr/139/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k3-k2-k5-k1-k1-k1-k1
 +
* ticket: [https://jira.fd.io/browse/CSIT-1939 CSIT-1939]
 +
 
 +
==== (M) DPDK 23.03 testpmd startup fails on some testbeds ====
 +
 
 +
* last update: 2023-11-06
 +
* work-to-fix: medium
 +
* rca: The DUT-DUT link sometimes does not go up. The consequences are the same as for CSIT-1848, but more testbed+NIC combinations are affected. Can be fixed by a different startup procedure (better detection before restart).
 +
* test: Testpmd (no vpp). Around half of testbed+NIC combinations are affected.
 +
* frequency: Low, ~1%, as in most testbeds the link does go up.
 +
* testbed: multiple
 +
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2310-2n-zn2/8/log.html.gz#s1-s1-s1-s1-t6-k2-k4
 +
* ticket: [https://jira.fd.io/browse/CSIT-1904 CSIT-1904]
  
 
==== (M) 2n-spr: zero traffic on cx7 rdma ====
 
==== (M) 2n-spr: zero traffic on cx7 rdma ====
Line 59: Line 81:
 
* note: Also would affect 3n-alt with mlxc6 and rdma. Will probably be made invisible by removing rdma (except mlxc5) from jobspecs.
 
* note: Also would affect 3n-alt with mlxc6 and rdma. Will probably be made invisible by removing rdma (except mlxc5) from jobspecs.
  
==== (M) 3n-icx, 3n-snr: first few swasync scheduler tests timing out in runtime stat ====
+
==== (M) Lossy trials in nat udp mlx5 tests ====
  
* last update: 2023-06-21
+
* last update: 2024-02-07
* work-to-fix: medium
+
* work-to-fix:
* rca:  
+
* rca: VPP counters suggest the packet is lost somewhere between TG on in-side [1] and VPP [2].
* test: first two tests on 2n-icx, first 8 (or on occasion 9) on 3n-snr.
+
* test: It is affecting only cx7 with mlx5 driver (not e810cq with avf driver), only udp tests (not tcp or other), and interestingly both ASTF (both cps and tput for nat44ed) and STL (det44) traffic profiles. It does not affect corresponding ASTF tests with ip4base routing.
* frequency: always (except the one test on 3n-snr), last good run was 2023-05-29, first bad was 2023-06-05.
+
* frequency: depends on scale, 100% on high scale tests.
* testbed: 3n-icx, 3n-snr
+
* testbed: 2n-icx, 2n-spr.
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/287/log.html.gz#s1-s1-s1-s1-s5-t1-k2-k14-k9-k10-k1-k1-k1-k12
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-spr/45/log.html.gz#s1-s1-s1-s2-s49-t2-k2-k14-k14
* ticket: [https://jira.fd.io/browse/CSIT-1923 CSIT-1923]
+
* ticket: [https://jira.fd.io/browse/CSIT-1929 CSIT-1929]
  
 
==== (L) 3n-icx: negative ipackets on TB38 AVF 4c l2patch ====
 
==== (L) 3n-icx: negative ipackets on TB38 AVF 4c l2patch ====
Line 81: Line 103:
 
* ticket: [https://jira.fd.io/browse/CSIT-1901 CSIT-1901]
 
* ticket: [https://jira.fd.io/browse/CSIT-1901 CSIT-1901]
  
==== (L) af_xdp mrr failures ====
+
==== (L) 2n-tx2: af_xdp mrr failures ====
  
* last update: 2023-06-21
+
* last update: 2023-11-08
 
* work-to-fix:
 
* work-to-fix:
* rca:
+
* rca: Some workers see no traffic, "error af_xdp_device_input_refill_db: rx poll() failed: Bad address" in show hardware. More examination needed.
* test: 25Ge2P1Xxv710-Af-Xdp-Ethip4-Ip4Base-Mrr
+
* test: ip4 and ip6, base and scale
 
* frequency: more than 80% on a subset of cases, 100% on (multicore) 2n-tx2
 
* frequency: more than 80% on a subset of cases, 100% on (multicore) 2n-tx2
* testbed: 2n-clx, 2n-icx, 2n-tx2
+
* testbed: 2n-tx2; to a lesser extent also 2n-clx and 2n-icx, where decreased performance (and failure in ndrpdr) is the more likely outcome.
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-clx/1349/log.html.gz#s1-s1-s1-s2-s60-t1
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-spr/120/log.html.gz#s1-s1-s1-s2-s2-t2-k2-k10-k9-k14-k1-k1-k1-k1
 
* ticket: [https://jira.fd.io/browse/CSIT-1922 CSIT-1922]
 
* ticket: [https://jira.fd.io/browse/CSIT-1922 CSIT-1922]
  
 
=== Not In Trending ===
 
=== Not In Trending ===
  
==== (M) all testbeds: some vpp 9000B tests ====
+
==== (M) tests with 9000B payload frames not forwarded over memif interfaces ====
 
+
* last update: 2023-02-09
+
* work-to-fix: hard
+
* rca: VPP code: [https://gerrit.fd.io/r/c/vpp/+/34839 34839: dpdk: cleanup MTU handling]. CSIT needs to rework how it sets MTU / max frame rate (CSIT-1797). Some tests will continue failing due to missing support on VPP side, we will open specific Jira tickets for those.
+
* test: see sub-items
+
* frequency: always
+
* testbed: all
+
* examples: see sub-items
+
* ticket: [https://jira.fd.io/browse/CSIT-1809 CSIT-1809]
+
* gerrit: https://gerrit.fd.io/r/c/csit/+/37824
+
 
+
===== (M) tests with 9000B payload frames not forwarded over vhost interfaces =====
+
 
+
* last update: 2023-02-09
+
* work-to-fix: hard
+
* test: 9000B + vhostuser
+
* testbed: 2n-skx, 3n-skx, 2n-clx
+
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2202-3n-skx/67/log.html.gz#s1-s1-s1-s1-s1 3n-skx vhostuser]
+
* ticket: [https://jira.fd.io/browse/CSIT-1809 CSIT-1809]
+
 
+
===== tests with 9000B payload frames not forwarded over memif interfaces =====
+
  
 
* last update: 2023-02-09
 
* last update: 2023-02-09
Line 122: Line 123:
 
* testbed: 2n-skx, 3n-skx, 2n-clx
 
* testbed: 2n-skx, 3n-skx, 2n-clx
 
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2202-2n-skx/33/log.html.gz#s1-s1-s1-s1-s1 2n-skx Memif]
 
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2202-2n-skx/33/log.html.gz#s1-s1-s1-s1-s1 2n-skx Memif]
* ticket: [https://jira.fd.io/browse/CSIT-1808 CSIT-1808]
+
* ticket: [https://jira.fd.io/browse/CSIT-1808 CSIT-1808] VPP-2091
  
===== 9000B payload frames not forwarded over tunnels due to violating supported Max Frame Size (VxLAN, LISP, SRv6) =====
+
==== (M) IMIX 4c tests may fail PDR due to ~10% loss ====
  
* last update: 2023-02-09
+
* last update: 2024-02-07
* work-to-fix: medium
+
* test: 9000B + (IP4 tunnels VXLAN, IP4 tunnels LISP, Srv6, IpSec)
+
* testbed: 2n-icx, 3n-icx
+
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-icx/10/log.html.gz 2n-icx VXLAN], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-3n-icx/22/log.html.gz#s1-s1-s1-s1-s1-t6 3n-icx]
+
* ticket: [https://jira.fd.io/browse/CSIT-1801 CSIT-1801]
+
 
+
===== (M) 9000b all AVF tests are failing to forward traffic =====
+
 
+
* last update: 2023-02-09
+
 
* work-to-fix: hard
 
* work-to-fix: hard
* test: 9000B + AVF
+
* rca: Only seen in coverage tests so far, more data needed.
* testbed: 3n-icx
+
* test: various high-performance, mostly mlx5
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-3n-icx/13/log.html.gz#s1-s1-s1-s1-s1-t6 3n-icx ip4base]
+
* frequency: <1%
* ticket: [https://jira.fd.io/browse/CSIT-1885 CSIT-1885]
+
* testbed: 2n-icx, 3n-icx, 3na-spr
 +
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-coverage-2310-2n-icx/1/log.html.gz#s1-s1-s1-s3-t9-k2-k5-k16-k9
 +
* ticket: [https://jira.fd.io/browse/CSIT-1943 CSIT-1943]
  
==== (M) 2n-clx, 2n-icx, 2n-zn2: DPDK testpmd 9000b tests on xxv710 nic are failing with no traffic ====
+
==== (L) l3fwd error in 200Ge2P1Cx7Veat-Mlx5 test with 9000B ====
  
* last update: 2023-02-09
+
* last update: 2023-06-28
 
* work-to-fix: medium
 
* work-to-fix: medium
* rca: The DPDK app only attempts to set MTU once, but if interface is down (CSIT-1848) it fails. As a workaround, MTU could be set on Linux interface before starting the DPDK app.
+
* test: 9000B + Cx7 with DPDK DUT
* test: DPDK testpmd 9000b
+
* testbed: 2n-icx
* frequency: always
+
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-coverage-2306-2n-icx/2/log.html.gz#s1-s1-s1-s4-t6-k2-k4]
* testbed: 2n-clx, 2n-icx, 2n-zn2
+
* ticket: [https://jira.fd.io/browse/CSIT-1924 CSIT-1924]
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2210-2n-clx/1/log.html.gz#s1-s1-s1-s3-t6 2n-clx], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2210-2n-icx/3/log.html.gz#s1-s1-s1-s1-t6 2n-icx]
+
* ticket: [https://jira.fd.io/browse/CSIT-1870 CSIT-1870]
+
* note: Vratko will fix, either in general workaround for CSIT-1848 or in a separate change.
+
 
+
==== (M) 2n-clx, 2n-icx: all Geneve tests with 1024 tunnels fail ====
+
 
+
* last update: before 2023-01-31
+
* work-to-fix: hard
+
* rca: VPP crash, Failed to add IP neighbor on interface geneve_tunnel258
+
* test: avf-ethip4--ethip4udpgeneve-1024tun-ip4base 64B 1518B IMIX 1c 2c 4c
+
* frequency: always
+
* testbed: 2n-skx, 2n-clx, 2n-icx
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-icx/10/log.html.gz#s1-s1-s1-s1-s1 2n-icx]
+
* ticket: [https://jira.fd.io/browse/CSIT-1800 CSIT-1800]
+
  
 
==== (L) 2n-clx, 2n-icx: nat44ed cps 16M sessions scale fail ====
 
==== (L) 2n-clx, 2n-icx: nat44ed cps 16M sessions scale fail ====
  
* last update: before 2023-01-31
+
* last update: 2023-07-12
 
* work-to-fix: hard
 
* work-to-fix: hard
* rca: VPP crash, Failed to set NAT44 address range on host 10.30.51.44 (connections-per-second tests only)
+
* rca: Ramp-up trial takes more than 5 minutes so sessions are timing out.
 
* test: 64B-avf-ethip4tcp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c, 64B-avf-ethip4udp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c
 
* test: 64B-avf-ethip4tcp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c, 64B-avf-ethip4udp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c
 
* frequency: always
 
* frequency: always
 
* testbeds: 2n-skx, 2n-clx, 2n-icx
 
* testbeds: 2n-skx, 2n-clx, 2n-icx
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-icx/18/log.html.gz#s1-s1-s1-s1-s11-t3 2n-icx], [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-clx/9/log.html.gz#s1-s1-s1-s1-s11-t1 2n-clx]
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2306-2n-clx/8/log.html.gz#s1-s1-s1-s1-s14-t2-k2-k15-k1
 
* ticket: [https://jira.fd.io/browse/CSIT-1799 CSIT-1799]
 
* ticket: [https://jira.fd.io/browse/CSIT-1799 CSIT-1799]
  
 
==== (L) 2n-clx, 2n-icx: nat44det imix 1M sessions fails to create sessions ====
 
==== (L) 2n-clx, 2n-icx: nat44det imix 1M sessions fails to create sessions ====
  
* last update: before 2023-01-31
+
* last update: 2023-07-12
* work-to-fix: hard
+
* work-to-fix: medium
* rca:
+
* rca: One possible cause is CSIT not counting ramp-up rate properly for IMIX (if multiple packets belong to the same session).
 
* test: IMIX over 1M sessions bidir
 
* test: IMIX over 1M sessions bidir
 
* frequency: always
 
* frequency: always
 
* testbed: 2n-skx, 2n-clx, 2n-icx
 
* testbed: 2n-skx, 2n-clx, 2n-icx
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2210-2n-icx/18/log.html.gz#s1-s1-s1-s1-s2-t4 2n-icx]
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2306-2n-clx/7/log.html.gz#s1-s1-s1-s1-s2-t4-k2-k11-k1-k2
 
* ticket: [https://jira.fd.io/browse/CSIT-1884 CSIT-1884]
 
* ticket: [https://jira.fd.io/browse/CSIT-1884 CSIT-1884]
 
==== (L) 2n-clx: DPDK 23.03 link failures ====
 
 
* last update: 2023-06-22
 
* work-to-fix: medium
 
* rca: No link comes up on some NICs, investigation continues.
 
* test: Testpmd (no vpp). Around half of testbed+NIC combinations are affected.
 
* frequency: always since 23.03.0 got released
 
* testbed: multiple
 
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-mrr-weekly-master-2n-clx/189/log.html.gz#s1-s1-s1-s1-t1-k2-k4
 
* ticket: [https://jira.fd.io/browse/CSIT-1904 CSIT-1904]
 
* note: The affected combinations were removed from jobspecs.
 
  
 
== Occasional Failures ==
 
== Occasional Failures ==
Line 215: Line 183:
 
* note: Once VPP breaks, all subsequent tests fail. Even all subsequent builds will be failing until Peter makes TB working again. Although it's failing with medium frequency when it happens it breaks all subsequent builds on the TB therefore [H] priority.
 
* note: Once VPP breaks, all subsequent tests fail. Even all subsequent builds will be failing until Peter makes TB working again. Although it's failing with medium frequency when it happens it breaks all subsequent builds on the TB therefore [H] priority.
  
==== (M) 2n-clx: e810 mlrsearch tests packets forwarding in one direction ====
+
==== (M) 3n-alt: high scale ipsec policy tests may crash VPP ====
  
* last update: before 2023-01-31
+
* last update: 2024-02-07
 
* work-to-fix: hard
 
* work-to-fix: hard
* rca:
+
* rca: VPP crashes without creating a core file.
* test: e810Cq ip4base, ip6base
+
* test: policy ipsec tests, large scale increases probability.
* frequency: high
+
* frequency: 15% and lower.
* testbed: 2n-clx
+
* testbed: 3n-alt, also seen rarely on 3n-icx.
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-clx/176/log.html.gz#s1-s1-s1-s2-s8-t1 2n-clx]
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-alt/357/log.html.gz#s1-s1-s1-s1-s15-t4-k3-k5-k1-k1
* ticket: [https://jira.fd.io/browse/CSIT-1864 CSIT-1864]
+
* ticket: [https://jira.fd.io/browse/CSIT-1938 CSIT-1938]
  
 
==== (M) 3n-icx, 3n-snr: wireguard 100 and 1000 tunnels mlrsearch tests failing with 2c and 4c ====
 
==== (M) 3n-icx, 3n-snr: wireguard 100 and 1000 tunnels mlrsearch tests failing with 2c and 4c ====
Line 236: Line 204:
 
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-icx/56/log.html.gz#s1-s1-s1-s3-s8-t4 3n-icx]
 
* examples: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-3n-icx/56/log.html.gz#s1-s1-s1-s3-s8-t4 3n-icx]
 
* ticket: [https://jira.fd.io/browse/CSIT-1886 CSIT-1886]
 
* ticket: [https://jira.fd.io/browse/CSIT-1886 CSIT-1886]
 +
 +
==== (M) e810cq sometimes reporting link down ====
 +
 +
* last update: 2024-02-07
 +
* work-to-fix: hard
 +
* rca: The most common symptom is TRex complaining about link down. Probably also causes zero throughput in one direction in ASTF tests or even device test failures. More frequent with more performant tests (L2) but seen affecting any test on occasion.
 +
* test: Any that uses Intel-E810CQ NIC.
 +
* frequency: <20%
 +
* testbed: all with the NIC.
 +
* examples: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-spr/45/log.html.gz#s1-s1-s1-s2-s24-t1-k2-k12-k14
 +
* ticket: [https://jira.fd.io/browse/CSIT-1936 CSIT-1936]
  
 
== Rare Failures ==
 
== Rare Failures ==
Line 298: Line 277:
 
* ticket: [https://jira.fd.io/browse/CSIT-1795 CSIT-1795]
 
* ticket: [https://jira.fd.io/browse/CSIT-1795 CSIT-1795]
  
==== (L) 3n-alt: Tests failing until 40Ge Interface comes up ====
+
==== (L) TRex may wrongly detect link bandwidth ====
  
* last update: 2023-06-22
+
* last update: 2024-02-07
* work-to-fix: medium
+
* work-to-fix: hard
* rca: DUT-DUT link takes too long to come up due to CSIT-1848.
+
* rca: Quite rare failure affecting unpredictable tests. Perhaps a less severe symptom of CSIT-1936.
* test: first tests in order
+
* test: No obvious pattern due to low frequency
* frequency: rare in recent times, but still not impossible
+
* frequency: <0.4%
* testbed: 3n-alt (3n-snr link does not take that long)
+
* testbed: recently seen on 3n-tsh and 3nb-spr
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-alt/155/log.html.gz#s1-s1-s1-s1-s1-t1
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2310-3nb-spr/28/log.html.gz#s1-s1-s1-s1-s4-t1-k2-k13-k14
* ticket: [https://jira.fd.io/browse/CSIT-1890 CSIT-1890]
+
* ticket: [https://jira.fd.io/browse/CSIT-1941 CSIT-1941]
  
 
= Past Failures =
 
= Past Failures =
 
==== (H) AVF suite setup fails if previous suite was also AVF ====
 
 
* last update: 2023-05-15
 
* work-to-fix: low
 
* rca: After a recent change, CSIT attempts to bind an already bound driver.
 
* test: All AVF suites if the previous suite running on the NIC was also AVF.
 
* frequency: always since 2023-05-09
 
* testbed: all (if having AVF supported NIC)
 
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/265/log.html.gz#s1-s1-s1-s3-s2-k1-k5-k3-k1-k1-k1-k1-k1-k1-k1-k1-k1-k1-k1
 
* ticket: [https://jira.fd.io/browse/CSIT-1913 CSIT-1913]
 
* note: proposed fix https://gerrit.fd.io/r/c/csit/+/38831
 
 
==== (M) 2n-spr 200Ge2P1Cx7Veat: TRex sees port line rate as 100 Gbps ====
 
 
* last update: 2023-04-19
 
* work-to-fix: medium
 
* rca: TRex has hard cap on perceived line rate.
 
* test: All tests on 2n-spr with 200Ge2P1Cx7Veat NIC.
 
* frequency: always (since 2n-spr was set up)
 
* testbed: 2n-spr
 
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-spr/15/log.html.gz#s1-s1-s1-s5-s19-t1-k2-k9-k9-k10-k1-k1-k1-k11
 
* ticket: [https://jira.fd.io/browse/CSIT-1905 CSIT-1905]
 
* note: Fixed by bumping TRex to v3.03
 
 
==== (M) 2n-icx: interface down in nginx tests ====
 
 
* last update: 2023-06-21
 
* work-to-fix: medium
 
* rca: Likely an infra issue for TG with AB on NIC using ICE.
 
* test: All nginx tests except for xxv710 with dpdk driver.
 
* frequency: always since 2023-04-28
 
* testbed: 2n-icx
 
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-2n-icx/51/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k8-k3-k2-k1
 
* ticket: [https://jira.fd.io/browse/CSIT-1910 CSIT-1910]
 
* note: Fixed by adding more logic to CSIT suite setup code for tests using AB.
 
 
==== (M) hoststack: ip4udpscale1cl10s-ldpreload-iperf3 times out waiting for strace ====
 
 
* last update: 2023-06-21
 
* work-to-fix: medium
 
* rca:
 
* test: Only the two ip4udpscale1cl10s-ldpreload-iperf3 tests
 
* frequency: always since 2023-03-04
 
* testbed: 3n-icx
 
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-3n-icx/41/log.html.gz#s1-s1-s1-s1-s8-t1-k2-k5-k12
 
* ticket: [https://jira.fd.io/browse/CSIT-1908 CSIT-1908]
 
* note: Fixed in vpp by https://gerrit.fd.io/r/c/vpp/+/38906
 
 
==== (L) 1n-aws: TRex NDR PDR ALL IP4 scale and L2 scale tests failing with 50% packet loss ====
 
 
* last update: 2023-06-21
 
* work-to-fix: hard
 
* rca: Perhaps AWS limits the number of usable IPv4 addresses.
 
* test: ip4scale2m
 
* frequency: always
 
* testbed: 1n-aws
 
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-trex-perf-ndrpdr-weekly-master-1n-aws/18/log.html.gz#s1-s1-s1-s1-s2-t1 1n-aws]
 
* ticket: [https://jira.fd.io/browse/CSIT-1876 CSIT-1876]
 
* note: The issue is still present, but the affected scales are no longer used in trending.
 
 
==== (M) 3n-tsh: vpp in VM starting too slowly ====
 
 
* last update: before 2023-02-22
 
* work-to-fix: medium
 
* rca: perhaps related to numa, investigation continues
 
* test: 3n-tsh: sporadic VM vhost
 
* frequency: high
 
* testbed: 3n-tsh
 
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-tsh/738/log.html.gz#s1-s1-s1-s7-s2-t1 3n-tsh], [https://jenkins.fd.io/view/csit/job/csit-vpp-perf-verify-master-3n-tsh/123/ 3n-tsh]
 
* ticket: [https://jira.fd.io/browse/CSIT-1877 CSIT-1877]
 
* note: Applied a workaround that simply waits longer. May lead to a rare failure, but for now considered fixed.
 

Latest revision as of 14:49, 7 February 2024


CSIT Test Failure Classification

All known CSIT failures grouped and listed in the following order:

  • Always failing followed by sometimes failing.
  • Always failing tests:
    • Most common use cases followed by less common.
  • Sometimes failing tests:
    • Most frequently failing followed by less frequently failing.
      • High frequency: 50%-100%
      • Medium frequency: 10%-50%
      • Low frequency: 0%-10%
    • Within each sub-group: most common use cases followed by less common.
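
The frequency bands above can be expressed as a small helper. This is only an illustrative sketch, not part of CSIT: the function name `frequency_band` is hypothetical, and because the listed bands share their boundary values (10% and 50%), the sketch assumes a boundary value belongs to the higher band.

```python
def frequency_band(failure_rate_percent: float) -> str:
    """Map a failure rate in percent (0-100) to a band name from the list above.

    Assumption: exactly 10% counts as "medium" and exactly 50% as "high",
    since the page does not say which band owns the shared boundary.
    """
    if not 0 <= failure_rate_percent <= 100:
        raise ValueError("failure rate must be between 0 and 100 percent")
    if failure_rate_percent >= 50:
        return "high"
    if failure_rate_percent >= 10:
        return "medium"
    return "low"


if __name__ == "__main__":
    # Sample rates chosen for illustration only.
    for rate in (100, 15, 0.4):
        print(f"{rate}% -> {frequency_band(rate)}")
```

Under these assumptions, an entry listed with "frequency: 100%" falls into the high band, while one listed with "frequency: <0.4%" falls into the low band.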

CSIT Test Fixing Priorities

Test fixing work priorities are defined as follows:

  • (H)igh priority, most common use cases and most common test code.
  • (M)edium priority, specific HW and pervasive test code issue.
  • (L)ow priority, corner cases and external dependencies.
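
Entry headings on this page carry the priority as a "(H)", "(M)" or "(L)" prefix. As a rough sketch of how such headings could be processed, the snippet below extracts the prefix and sorts headings by priority. The names `parse_heading` and `sort_by_priority` are hypothetical helpers invented for this illustration, not existing CSIT tooling, and headings without a prefix are assumed to sort last.

```python
import re

# Lower number means the entry should be looked at first.
PRIORITY_ORDER = {"H": 0, "M": 1, "L": 2}

# Matches headings such as "(H) Zero traffic reported in udpquic tests".
HEADING_RE = re.compile(r"^\((H|M|L)\)\s+(.*)$")


def parse_heading(heading):
    """Return (priority, title); priority is None when no prefix is present."""
    text = heading.strip()
    match = HEADING_RE.match(text)
    if match is None:
        return None, text
    return match.group(1), match.group(2)


def sort_by_priority(headings):
    """Sort headings H, then M, then L; unprefixed headings go last.

    sorted() is stable, so the original order is kept within one priority.
    """
    def key(heading):
        priority, _ = parse_heading(heading)
        return PRIORITY_ORDER.get(priority, len(PRIORITY_ORDER))

    return sorted(headings, key=key)


if __name__ == "__main__":
    sample = [
        "(L) 2n-tx2: af_xdp mrr failures",
        "(H) Zero traffic reported in udpquic tests",
        "(M) 2n-spr: zero traffic on cx7 rdma",
    ]
    for heading in sort_by_priority(sample):
        print(heading)
```

Because the sort is stable, any ordering already applied within a priority level (for example by frequency) is preserved.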

Current Failures

Deterministic Failures

In Trending

(H) 3nb-spr hoststack: interface not up after first test

(H) Zero traffic reported in udpquic tests

(H) 3na-spr, 2n-zn2: VPP fails to start in first test cases if dpdk/mlx5

(M) DPDK 23.03 testpmd startup fails on some testbeds

(M) 2n-spr: zero traffic on cx7 rdma

(M) Lossy trials in nat udp mlx5 tests

(L) 3n-icx: negative ipackets on TB38 AVF 4c l2patch

(L) 2n-tx2: af_xdp mrr failures

Not In Trending

(M) tests with 9000B payload frames not forwarded over memif interfaces

  • last update: 2023-02-09
  • work-to-fix: hard
  • test: 9000B + memif
  • testbed: 2n-skx, 3n-skx, 2n-clx
  • examples: 2n-skx Memif
  • ticket: CSIT-1808 VPP-2091

(M) IMIX 4c tests may fail PDR due to ~10% loss

(L) l3fwd error in 200Ge2P1Cx7Veat-Mlx5 test with 9000B

  • last update: 2023-06-28
  • work-to-fix: medium
  • test: 9000B + Cx7 with DPDK DUT
  • testbed: 2n-icx
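  • examples: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-coverage-2306-2n-icx/2/log.html.gz#s1-s1-s1-s4-t6-k2-k4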
  • ticket: CSIT-1924

(L) 2n-clx, 2n-icx: nat44ed cps 16M sessions scale fail

(L) 2n-clx, 2n-icx: nat44det imix 1M sessions fails to create sessions

Occasional Failures

In Trending

(H) 2n-icx: NFV density VPP does not start in container

  • last update: before 2023-01-31
  • work-to-fix: hard
  • rca:
  • test: all subsequent
  • frequency: medium
  • testbed: 2n-icx
  • example: 2n-icx mrr, 2n-icx ndrpdr
  • ticket: CSIT-1881
  • note: Once VPP breaks, all subsequent tests fail, and all subsequent builds keep failing until Peter gets the TB working again. Although the failure occurs with medium frequency, it breaks all subsequent builds on the TB when it happens, hence the [H] priority.

(M) 3n-alt: high scale ipsec policy tests may crash VPP

(M) 3n-icx, 3n-snr: wireguard 100 and 1000 tunnels mlrsearch tests failing with 2c and 4c

  • last update: before 2023-01-31
  • work-to-fix: easy
  • rca:
  • test: wireguard 100 tunnels and more
  • frequency: high
  • testbed: 3n-icx, 3n-snr
  • examples: 3n-icx
  • ticket: CSIT-1886

(M) e810cq sometimes reporting link down

Rare Failures

In Trending

(M) 3n-icx, 3n-snr: 1518B IPsec packets not passing

(M) all testbeds: mlrsearch fails to find NDR rate

  • last update: before 2023-06-22
  • work-to-fix: hard
  • rca: On 3n-tsh, the symptom is TRex reporting ierrors, only in one direction. Other testbeds may have a different symptom, but failures there are less frequent.
  • test: Crypto, Ip4, L2, Srv6, Vm Vhost (all packet sizes, all core configurations affected)
  • frequency: low
  • testbed: 3n-tsh, 3n-alt, 2n-clx
  • example: [2]
  • ticket: CSIT-1804

(M) all testbeds: AF_XDP mlrsearch fails to find NDR rate

  • last update: before 2023-01-31
  • work-to-fix: hard
  • rca:
  • test: af-xdp multicore tests
  • frequency: low
  • testbed: 2n-clx, 2n-skx, 2n-tx2, 2n-icx
  • example: 2n-skx, 2n-clx
  • ticket: CSIT-1802
  • note: This is mainly observed in iterative and coverage jobs. The frequency is very low, about 1 out of 100.

(L) all testbeds: vpp create avf interface failure in multi-core configs

  • last update: 2023-02-06
  • work-to-fix: hard
  • rca: issue in Intel FVL driver
  • test: multicore AVF
  • frequency: low
  • testbed: all testbeds
  • example: 2n-clx, 3n-icx
  • ticket: CSIT-1782
  • note: A long standing issue without a final permanent fix.

(L) all testbeds: nat44det 4M and 16M scale 1 session not established

  • last update: 2023-02-14
  • work-to-fix: hard
  • rca: unknown
  • test: nat44det udp 4m and 16m (64k is ok, 1m can fail but more rarely than bigger scales)
  • frequency: low
  • testbed: 2n-zn2, 2n-skx, 2n-icx, 2n-clx
  • example: 2n-zn2, 2n-clx
  • ticket: CSIT-1795

(L) TRex may wrongly detect link bandwidth

Past Failures