Difference between revisions of "CSIT/TestFailuresTracking"
From fd.io
(→In Trending: Add CSIT-1960.) |
(→Not In Trending: ===== -> ====.) |
||
Line 163: | Line 163: | ||
* note: Some tests drop packets (affects all tests), some tests fragment packets (does not fail MRR). | * note: Some tests drop packets (affects all tests), some tests fragment packets (does not fail MRR). | ||
− | ==== (M) tests with 9000B payload frames not forwarded over memif interfaces | + | ==== (M) tests with 9000B payload frames not forwarded over memif interfaces ==== |
* last update: 2023-02-09 | * last update: 2023-02-09 | ||
Line 172: | Line 172: | ||
* ticket: [https://jira.fd.io/browse/CSIT-1808 CSIT-1808] VPP-2091 | * ticket: [https://jira.fd.io/browse/CSIT-1808 CSIT-1808] VPP-2091 | ||
− | ==== (M) IMIX 4c tests may fail PDR due to ~10% loss | + | ==== (M) IMIX 4c tests may fail PDR due to ~10% loss ==== |
* last update: 2024-02-07 | * last update: 2024-02-07 | ||
Line 183: | Line 183: | ||
* ticket: [https://jira.fd.io/browse/CSIT-1943 CSIT-1943] | * ticket: [https://jira.fd.io/browse/CSIT-1943 CSIT-1943] | ||
− | ==== (L) Memif crashes VPP in container with jumbo frames | + | ==== (L) Memif crashes VPP in container with jumbo frames ==== |
* last update: 2024-08-14 | * last update: 2024-08-14 | ||
Line 192: | Line 192: | ||
* ticket: [https://jira.fd.io/browse/VPP-2091 VPP-2091] | * ticket: [https://jira.fd.io/browse/VPP-2091 VPP-2091] | ||
− | ==== (L) l3fwd error in 200Ge2P1Cx7Veat-Mlx5 test with 9000B | + | ==== (L) l3fwd error in 200Ge2P1Cx7Veat-Mlx5 test with 9000B ==== |
* last update: 2023-06-28 | * last update: 2023-06-28 |
Revision as of 12:07, 14 August 2024
Contents
- 1 CSIT Test Failure Classification
- 2 CSIT Test Fixing Priorities
- 3 Current Failures
- 3.1 Deterministic Failures
- 3.1.1 In Trending
- 3.1.1.1 (H) 3n spr: Unusable performance of ipsec tests with SHA_256_128
- 3.1.1.2 (H) 3nb-spr hoststack: interface not up after first test
- 3.1.1.3 (H) Zero traffic reported in udpquic tests
- 3.1.1.4 (H) 3na-spr, 2n-zn2: VPP fails to start in first test cases if dpdk/mlx5
- 3.1.1.5 (M) DPDK 23.03 testpmd startup fails on some testbeds
- 3.1.1.6 (M) 2n-spr: zero traffic on cx7 rdma
- 3.1.1.7 (M) 3n-icx 3nb-spr: Failed to enable GTPU offload RX
- 3.1.1.8 (M) Lossy trials in nat udp mlx5 tests
- 3.1.1.9 (L) 3n-icx: negative ipackets on TB38 AVF 4c l2patch
- 3.1.1.10 (L) 2n-tx2: af_xdp mrr failures
- 3.1.2 Not In Trending
- 3.1.2.1 (M) Combination of AVF and vhost drops all 9000B packets
- 3.1.2.2 (M) 9000B tests with encap overhead and non-dpdk plugins see fragmented packets
- 3.1.2.3 (M) tests with 9000B payload frames not forwarded over memif interfaces
- 3.1.2.4 (M) IMIX 4c tests may fail PDR due to ~10% loss
- 3.1.2.5 (L) Memif crashes VPP in container with jumbo frames
- 3.1.2.6 (L) l3fwd error in 200Ge2P1Cx7Veat-Mlx5 test with 9000B
- 3.1.2.7 (L) 2n-clx, 2n-icx: nat44ed cps 16M sessions scale fail
- 3.1.2.8 (L) 2n-clx, 2n-icx: nat44det imix 1M sessions fails to create sessions
- 3.2 Occasional Failures
- 3.3 Rare Failures
- 3.3.1 In Trending
- 3.3.1.1 (M) 3n-icx, 3n-snr: 1518B IPsec packets not passing
- 3.3.1.2 (M) all testbeds: mlrsearch fails to find NDR rate
- 3.3.1.3 (M) all testbeds: AF_XDP mlrsearch fails to find NDR rate
- 3.3.1.4 (L) 2n-zn2: Geneve sometimes loses one direction of traffic
- 3.3.1.5 (L) Rare VPP crash in nat avf tests
- 3.3.1.6 (L) ipsec hwasync fails with large scale and multiple queues
- 3.3.1.7 (L) all testbeds: vpp create avf interface failure in multi-core configs
- 3.3.1.8 (L) all testbeds: nat44det 4M and 16M scale 1 session not established
- 3.3.1.9 (L) TRex may wrongly detect link bandwidth
- 4 Past Failures
CSIT Test Failure Classification
All known CSIT failures are grouped and listed in the following order:
- Always failing followed by sometimes failing.
- Always failing tests:
- Most common use cases followed by less common.
- Sometimes failing tests:
- Most frequently failing followed by less frequently failing.
- High frequency: 50%-100%.
- Medium frequency: 10%-50%.
- Low frequency: 0%-10%.
- Within each sub-group: most common use cases followed by less common.
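The frequency bands above can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and the page does not say which band the boundary values (exactly 10% and 50%) belong to, so placing them in the higher band is an assumption.

```python
def classify_frequency(failure_rate_percent: float) -> str:
    """Map an observed failure rate (in percent) to a frequency band."""
    if not 0.0 <= failure_rate_percent <= 100.0:
        raise ValueError("failure rate must be between 0 and 100 percent")
    # Assumption: boundary values fall into the higher band.
    if failure_rate_percent >= 50.0:
        return "high"
    if failure_rate_percent >= 10.0:
        return "medium"
    return "low"


if __name__ == "__main__":
    for rate in (100.0, 15.0, 0.4):
        print(f"{rate}% -> {classify_frequency(rate)}")
```

Entries that list a qualitative frequency ("always", "low") would need a separate mapping; only numeric rates are handled here.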
CSIT Test Fixing Priorities
Test fixing work priorities are defined as follows:
- (H)igh priority, most common use cases and most common test code.
- (M)edium priority, specific HW and pervasive test code issue.
- (L)ow priority, corner cases and external dependencies.
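Each failure heading on this page starts with its priority marker, so a list of headings can be ordered mechanically. The sketch below assumes headings of the form "(H) title"; the helper names are hypothetical and not part of CSIT.

```python
import re
from typing import List, Tuple

# Lower number sorts first: High, then Medium, then Low.
PRIORITY_ORDER = {"H": 0, "M": 1, "L": 2}
HEADING_RE = re.compile(r"^\((H|M|L)\)\s+(.*)$")


def parse_heading(heading: str) -> Tuple[str, str]:
    """Split '(H) some title' into ('H', 'some title')."""
    match = HEADING_RE.match(heading.strip())
    if match is None:
        raise ValueError(f"heading has no (H)/(M)/(L) marker: {heading!r}")
    return match.group(1), match.group(2)


def sort_by_priority(headings: List[str]) -> List[str]:
    """Return headings ordered H, M, L; sorted() is stable, so the
    original order is kept within each priority."""
    return sorted(headings, key=lambda h: PRIORITY_ORDER[parse_heading(h)[0]])


if __name__ == "__main__":
    sample = [
        "(L) 2n-tx2: af_xdp mrr failures",
        "(H) Zero traffic reported in udpquic tests",
        "(M) 2n-spr: zero traffic on cx7 rdma",
    ]
    for heading in sort_by_priority(sample):
        print(heading)
```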
Current Failures
Deterministic Failures
In Trending
(H) 3n spr: Unusable performance of ipsec tests with SHA_256_128
- last update: 2024-08-14
- work-to-fix: low (if you are Damjan)
- rca: It seems the compiler emits wrong instructions; the VPP build system needs to be fixed.
- test: Three new ipsec tests that use SHA_256_128 as integrity algorithm.
- frequency: 50%
- testbed: 3na-spr, 3nb-spr
- example: [1]
- ticket: VPP-2118
- note: Other Xeon testbeds are also affected, but performance is not bad enough to fail NDR. ARM is not affected at all.
(H) 3nb-spr hoststack: interface not up after first test
- last update: 2024-02-07
- work-to-fix: medium
- rca: After the first test, HundredGigabitEthernetab/0/0 never goes up within the run. Not sure which part of the test setup is missing; the tests do work correctly on other testbeds.
- test: All subsequent tests.
- frequency: 100%
- testbed: 3nb-spr
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-3nb-spr/87/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k3-k7-k7
- ticket: CSIT-1942
(H) Zero traffic reported in udpquic tests
- last update: 2024-02-07
- work-to-fix: medium
- rca: There are errors when closing sessions. The current CSIT logic happens to report this as a passing test with zero traffic, which is wrong.
- test: All udpquic tests
- frequency: 100%
- testbed: all running the tests
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-3n-icx/217/log.html.gz#s1-s1-s1-s1-s4-t1-k2-k4-k17
- ticket: CSIT-1935
(H) 3na-spr, 2n-zn2: VPP fails to start in first test cases if dpdk/mlx5
- last update: 2024-02-07
- work-to-fix: medium
- rca: DUT1 fails to boot up in the first test case. If DUT2 is present, it fails to start in the second test case. Other test cases in the run are unaffected. This looks like an infra issue: ansible cleanup is doing something wrong, but it is hard to tell what.
- test: First test case, unless it uses AVF driver.
- frequency: 100%
- testbed: 3na-spr, 2n-zn2
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-3na-spr/139/log.html.gz#s1-s1-s1-s1-s1-t1-k2-k3-k2-k5-k1-k1-k1-k1
- ticket: CSIT-1939
(M) DPDK 23.03 testpmd startup fails on some testbeds
- last update: 2023-11-06
- work-to-fix: medium
- rca: The DUT-DUT link sometimes does not go up. The same consequences as CSIT-1848 but affects more testbed+NIC combinations. Can be fixed by a different startup procedure (better detection before restart).
- test: Testpmd (no VPP). Around half of testbed+NIC combinations are affected.
- frequency: Low, ~1%, as in most testbeds the link does go up.
- testbed: multiple
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-iterative-2310-2n-zn2/8/log.html.gz#s1-s1-s1-s1-t6-k2-k4
- ticket: CSIT-1904
(M) 2n-spr: zero traffic on cx7 rdma
- last update: 2023-06-22
- work-to-fix: medium
- rca: VPP reports "tx completion errors", more investigation ongoing.
- test: All tests on 2n-spr with 200Ge2P1Cx7Veat NIC and RDMA driver.
- frequency: always (since 2n-spr was set up)
- testbed: 2n-spr
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-spr/58/log.html.gz#s1-s1-s1-s5-s19-t1-k3-k7-k1-k1-k1-k8-k14-k1-k1-k1-k1
- ticket: CSIT-1906
- note: Also would affect 3n-alt with mlxc6 and rdma. Will probably be made invisible by removing rdma (except mlxc5) from jobspecs.
(M) 3n-icx 3nb-spr: Failed to enable GTPU offload RX
- last update: 2024-08-14
- work-to-fix: low
- rca: Retval is -1. More examination is needed to understand why.
- test: any gtpuhw
- frequency: 100%
- testbed: all
- example: [2]
- ticket: CSIT-1950
(M) Lossy trials in nat udp mlx5 tests
- last update: 2024-02-07
- work-to-fix: hard
- rca: VPP counters suggest the packet is lost somewhere between TG on in-side [1] and VPP [2].
- test: It is affecting only cx7 with mlx5 driver (not e810cq with avf driver), only udp tests (not tcp or other), and interestingly both ASTF (both cps and tput for nat44ed) and STL (det44) traffic profiles. It does not affect corresponding ASTF tests with ip4base routing.
- frequency: depends on scale, 100% on high scale tests.
- testbed: 2n-icx, 2n-spr.
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-spr/45/log.html.gz#s1-s1-s1-s2-s49-t2-k2-k14-k14
- ticket: CSIT-1929
(L) 3n-icx: negative ipackets on TB38 AVF 4c l2patch
- last update: 2023-04-19
- work-to-fix:
- rca:
- test: TB38 AVF 4c e810cq; earlier l2patch, nowadays eth-l2xcbase
- frequency: always
- testbed: 3n-icx (only TB38, never TB37)
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-icx/214/log.html.gz#s1-s1-s1-s5-s3-t3-k2-k9-k8-k13-k1-k2
- ticket: CSIT-1901
(L) 2n-tx2: af_xdp mrr failures
- last update: 2023-11-08
- work-to-fix:
- rca: Some workers see no traffic, "error af_xdp_device_input_refill_db: rx poll() failed: Bad address" in show hardware. More examination needed.
- test: ip4 and ip6, base and scale
- frequency: more than 80% on a subset of cases, 100% on (multicore) 2n-tx2
- testbed: 2n-tx2; to a lesser extent also 2n-clx and 2n-icx, where decreased performance (and failure in ndrpdr) is the more likely outcome.
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-spr/120/log.html.gz#s1-s1-s1-s2-s2-t2-k2-k10-k9-k14-k1-k1-k1-k1
- ticket: CSIT-1922
Not In Trending
(M) Combination of AVF and vhost drops all 9000B packets
- last update: 2024-08-14
- work-to-fix: medium
- rca: Buffer alloc error is seen, not sure why that happens.
- test: 9000B vhost
- frequency: 100%
- testbed: all
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2406-2n-icx/33/log.html.gz#s1-s1-s1-s1-s1-t6-k3-k7-k1-k1-k1-k8-k14-k1-k1-k1-k1
- ticket: CSIT-1951
- note: Sometimes VPP crashes, not sure if the cause is the same.
(M) 9000B tests with encap overhead and non-dpdk plugins see fragmented packets
- last update: 2024-08-14
- work-to-fix: medium
- rca: Some internal MTU is at 9000 (not 9200). More examination needed to see if the issue is in VPP or CSIT.
- test: 9000B testcases for loadbalance, geneve, vxlan or SRv6
- frequency: 100%
- testbed: all
- example: [3]
- ticket: CSIT-1950
- note: Some tests drop packets (affects all tests), some tests fragment packets (does not fail MRR).
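To illustrate why an internal MTU of 9000 rather than 9200 matters, the sketch below checks whether a frame still fits after encapsulation. It assumes plain VXLAN over IPv4 with 50 bytes of outer headers (Ethernet 14 + IPv4 20 + UDP 8 + VXLAN 8) and compares sizes at a single layer. How VPP actually accounts for MTU (L2 versus L3) is not modeled, and the function name is hypothetical.

```python
# Assumption: VXLAN over IPv4 without VLAN tags or IP options.
VXLAN_IPV4_OVERHEAD = 14 + 20 + 8 + 8  # outer Ethernet + IPv4 + UDP + VXLAN


def fits_without_fragmentation(frame_size: int, encap_overhead: int, mtu: int) -> bool:
    """Return True if the encapsulated frame does not exceed the MTU.

    Simplification: frame size, overhead, and MTU are all treated as
    sizes at the same layer.
    """
    return frame_size + encap_overhead <= mtu


if __name__ == "__main__":
    for mtu in (9000, 9200):
        ok = fits_without_fragmentation(9000, VXLAN_IPV4_OVERHEAD, mtu)
        print(f"9000B frame + {VXLAN_IPV4_OVERHEAD}B overhead, MTU {mtu}: fits={ok}")
```

Under these assumptions a 9000B frame grows to 9050B, which exceeds an MTU of 9000 but fits within 9200. That is consistent with the fragmentation or drops described in this entry.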
(M) tests with 9000B payload frames not forwarded over memif interfaces
- last update: 2023-02-09
- work-to-fix: hard
- test: 9000B + memif
- testbed: 2n-skx, 3n-skx, 2n-clx
- examples: 2n-skx Memif
- ticket: CSIT-1808 VPP-2091
(M) IMIX 4c tests may fail PDR due to ~10% loss
- last update: 2024-02-07
- work-to-fix: hard
- rca: Only seen in coverage tests so far, more data needed.
- test: various high-performance, mostly mlx5
- frequency: <1%
- testbed: 2n-icx, 3n-icx, 3na-spr
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-report-coverage-2310-2n-icx/1/log.html.gz#s1-s1-s1-s3-t9-k2-k5-k16-k9
- ticket: CSIT-1943
(L) Memif crashes VPP in container with jumbo frames
- last update: 2024-08-14
- work-to-fix: hard (bug in VPP, no easy access to logs)
- test: any memif with 9000B
- testbed: all
- examples: [4]
- ticket: VPP-2091
(L) l3fwd error in 200Ge2P1Cx7Veat-Mlx5 test with 9000B
- last update: 2023-06-28
- work-to-fix: medium
- test: 9000B + Cx7 with DPDK DUT
- testbed: 2n-icx
- examples: [5]
- ticket: CSIT-1924
(L) 2n-clx, 2n-icx: nat44ed cps 16M sessions scale fail
- last update: 2023-07-12
- work-to-fix: hard
- rca: Ramp-up trial takes more than 5 minutes so sessions are timing out.
- test: 64B-avf-ethip4tcp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c, 64B-avf-ethip4udp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c
- frequency: always
- testbeds: 2n-skx, 2n-clx, 2n-icx
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2306-2n-clx/8/log.html.gz#s1-s1-s1-s1-s14-t2-k2-k15-k1
- ticket: CSIT-1799
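A rough calculation shows the session creation rate the ramp-up trial would need in order to finish within 5 minutes. The sketch reads the test name fields h262144 and p63 as hosts and ports per host, which is an interpretation of the naming; their product does match the s16515072 field. Treating 5 minutes as a hard limit is also an assumption based on the rca line above.

```python
import math

HOSTS = 262144            # from "h262144" in the test name (assumed meaning: hosts)
PORTS_PER_HOST = 63       # from "p63" in the test name (assumed meaning: ports per host)
RAMP_UP_LIMIT_S = 5 * 60  # assumption: sessions start timing out after 5 minutes


def total_sessions(hosts: int, ports_per_host: int) -> int:
    """Number of sessions created by the test."""
    return hosts * ports_per_host


def min_ramp_up_rate(sessions: int, limit_s: int) -> int:
    """Smallest whole sessions-per-second rate that creates all sessions in time."""
    if limit_s <= 0:
        raise ValueError("limit must be positive")
    return math.ceil(sessions / limit_s)


if __name__ == "__main__":
    sessions = total_sessions(HOSTS, PORTS_PER_HOST)
    print(f"sessions: {sessions}")
    print(f"required rate: {min_ramp_up_rate(sessions, RAMP_UP_LIMIT_S)} sessions/s")
```

Under these assumptions, any ramp-up rate below roughly 55 thousand sessions per second cannot establish all 16M sessions before the earliest ones expire.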
(L) 2n-clx, 2n-icx: nat44det imix 1M sessions fails to create sessions
- last update: 2023-07-12
- work-to-fix: medium
- rca: One possible cause is CSIT not counting ramp-up rate properly for IMIX (if multiple packets belong to the same session).
- test: IMIX over 1M sessions bidir
- frequency: always
- testbed: 2n-skx, 2n-clx, 2n-icx
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-coverage-2306-2n-clx/7/log.html.gz#s1-s1-s1-s1-s2-t4-k2-k11-k1-k2
- ticket: CSIT-1884
Occasional Failures
In Trending
(H) 2n-icx: NFV density VPP does not start in container
- last update: before 2023-01-31
- work-to-fix: hard
- rca:
- test: all subsequent
- frequency: medium
- testbed: 2n-icx
- example: 2n-icx mrr, 2n-icx ndrpdr
- ticket: CSIT-1881
- note: Once VPP breaks, all subsequent tests fail. Even all subsequent builds will keep failing until Peter makes the TB work again. Although the failure occurs with medium frequency, when it happens it breaks all subsequent builds on the TB, hence the (H) priority.
(M) 3n-alt: high scale ipsec policy tests may crash VPP
- last update: 2024-02-07
- work-to-fix: hard
- rca: VPP is crashing without creating a core file.
- test: policy ipsec tests, large scale increases probability.
- frequency: 15% and lower.
- testbed: 3n-alt, also seen rarely on 3n-icx.
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-alt/357/log.html.gz#s1-s1-s1-s1-s15-t4-k3-k5-k1-k1
- ticket: CSIT-1938
(M) 3n-icx, 3n-snr: wireguard 100 and 1000 tunnels mlrsearch tests failing with 2c and 4c
- last update: before 2023-01-31
- work-to-fix: easy
- rca:
- test: wireguard 100 tunnels and more
- frequency: high
- testbed: 3n-icx, 3n-snr
- examples: 3n-icx
- ticket: CSIT-1886
(M) e810cq sometimes reporting link down
- last update: 2024-02-07
- work-to-fix: hard
- rca: Mostly shows up as TRex complaining about link down. Probably also causes zero throughput in one direction in ASTF tests or even device test failures. More frequent with more performant tests (L2) but seen affecting any test on occasion.
- test: Any that uses Intel-E810CQ NIC.
- frequency: <20%
- testbed: all with the NIC.
- examples: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-ndrpdr-weekly-master-2n-spr/45/log.html.gz#s1-s1-s1-s2-s24-t1-k2-k12-k14
- ticket: CSIT-1936
Rare Failures
In Trending
(M) 3n-icx, 3n-snr: 1518B IPsec packets not passing
- last update: before 2023-01-31
- work-to-fix: hard
- rca:
- test: all AVF crypto
- frequency: low
- testbed: 3n-skx, 3n-icx, 3n-snr
- example: 3n-icx daily, 3n-snr, 3n-icx weekly
- ticket: CSIT-1827
(M) all testbeds: mlrsearch fails to find NDR rate
- last update: before 2023-06-22
- work-to-fix: hard
- rca: On 3n-tsh, the symptom is TRex reporting ierrors, only in one direction. Other testbeds may have a different symptom, but failures there are less frequent.
- test: Crypto, Ip4, L2, Srv6, Vm Vhost (all packet sizes, all core configurations affected)
- frequency: low
- testbed: 3n-tsh, 3n-alt, 2n-clx
- example: [6]
- ticket: CSIT-1804
(M) all testbeds: AF_XDP mlrsearch fails to find NDR rate
- last update: before 2023-01-31
- work-to-fix: hard
- rca:
- test: af-xdp multicore tests
- frequency: low
- testbed: 2n-clx, 2n-skx, 2n-tx2, 2n-icx
- example: 2n-skx, 2n-clx
- ticket: CSIT-1802
- note: This is mainly observed in iterative and coverage jobs. The frequency is very low, roughly 1 out of 100.
(L) 2n-zn2: Geneve sometimes loses one direction of traffic
- last update: 2024-08-14
- work-to-fix: hard
- rca: More investigation needed to see the mechanism.
- test: any geneve
- frequency: <1%
- testbed: 2n-zn2
- example: [7]
- ticket: CSIT-1960
- note: Very rare, recently seen only as a failure in report iterative (not weekly trending) and in one run as MRR regression.
(L) Rare VPP crash in nat avf tests
- last update: 2024-08-14
- work-to-fix: medium
- rca: clib_dlist_remove called by nat44_session_update_lru, but probably just a memory corruption symptom
- test: any NAT test with AVF
- frequency: low
- testbed: any
- example: [8]
- ticket: CSIT-1947
- note: More frequent in soak tests.
(L) ipsec hwasync fails with large scale and multiple queues
- last update: 2024-08-14
- work-to-fix: hard
- test: ipsec hwasync
- frequency: low
- testbed: all with QAT
- example: [9]
- ticket: CSIT-1946
- note: Frequency decreased when CSIT changed RXQ ratio in 40824.
(L) all testbeds: vpp create avf interface failure in multi-core configs
- last update: 2023-02-06
- work-to-fix: hard
- rca: issue in Intel FVL driver
- test: multicore AVF
- frequency: low
- testbed: all testbeds
- example: 2n-clx, 3n-icx
- ticket: CSIT-1782
- note: A long standing issue without a final permanent fix.
(L) all testbeds: nat44det 4M and 16M scale 1 session not established
- last update: 2023-02-14
- work-to-fix: hard
- rca: unknown
- test: nat44det udp 4m and 16m (64k is ok, 1m can fail but more rarely than bigger scales)
- frequency: low
- testbed: 2n-zn2, 2n-skx, 2n-icx, 2n-clx
- example: 2n-zn2, 2n-clx
- ticket: CSIT-1795
(L) TRex may wrongly detect link bandwidth
- last update: 2024-02-07
- work-to-fix: hard
- rca: Quite rare failure affecting unpredictable tests. Perhaps a less severe symptom of CSIT-1936.
- test: No obvious pattern due to low frequency
- frequency: <0.4%
- testbed: recently seen on 3n-tsh and 3nb-spr
- example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2310-3nb-spr/28/log.html.gz#s1-s1-s1-s1-s4-t1-k2-k13-k14
- ticket: CSIT-1941