Difference between revisions of "CSIT/TestFailuresTracking"


Revision as of 11:32, 14 August 2024

CSIT Test Failure Classification

All known CSIT failures are grouped and listed in the following order:

  • Always failing followed by sometimes failing.
  • Always failing tests:
    • Most common use cases followed by less common.
  • Sometimes failing tests:
    • Most frequently failing followed by less frequently failing.
      • High frequency: 50%-100%.
      • Medium frequency: 10%-50%.
      • Low frequency: 0%-10%.
    • Within each sub-group: most common use cases followed by less common.
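
The frequency buckets above can be sketched as a small helper. This is only an illustration, not part of any CSIT tooling: the function name classify_frequency and its arguments are hypothetical, and because the listed ranges share their endpoints, the sketch assumes that exactly 50% counts as high and exactly 10% counts as medium.

```python
def classify_frequency(failures: int, runs: int) -> str:
    """Return the frequency bucket for `failures` out of `runs` test runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= failures <= runs:
        raise ValueError("failures must be between 0 and runs")
    # Integer comparisons avoid floating-point rounding at the boundaries.
    # Assumption: a rate of exactly 50% is "high", exactly 10% is "medium".
    if failures * 2 >= runs:
        return "high"
    if failures * 10 >= runs:
        return "medium"
    return "low"
```

For example, a test that fails about once in 100 runs falls into the low bucket under these assumptions.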

CSIT Test Fixing Priorities

Test fixing work priorities are defined as follows:

  • (H)igh priority, most common use cases and most common test code.
  • (M)edium priority, specific HW and pervasive test code issues.
  • (L)ow priority, corner cases and external dependencies.
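
As a rough sketch of how the priority markers could be used, the snippet below reads the (H), (M) or (L) prefix from an entry title and orders titles from high to low priority. The names PRIORITY_RANK, priority_of and sort_by_priority are hypothetical, and the assumption that every title starts with such a marker is made only for this illustration.

```python
import re

# Lower rank sorts first: (H)igh, then (M)edium, then (L)ow.
PRIORITY_RANK = {"H": 0, "M": 1, "L": 2}


def priority_of(title: str) -> str:
    """Extract the priority letter from a title such as '(H) some failure'."""
    match = re.match(r"\((H|M|L)\)", title)
    if match is None:
        raise ValueError(f"no priority marker in title: {title!r}")
    return match.group(1)


def sort_by_priority(titles: list[str]) -> list[str]:
    """Order titles by priority; sorted() is stable, so ties keep their order."""
    return sorted(titles, key=lambda t: PRIORITY_RANK[priority_of(t)])
```

Because the sort is stable, entries with the same priority keep whatever order they already had, for instance most common use cases first.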

Current Failures

Deterministic Failures

In Trending

(H) 3nb-spr hoststack: interface not up after first test

(H) Zero traffic reported in udpquic tests

(H) 3na-spr, 2n-zn2: VPP fails to start in first test cases if dpdk/mlx5

(M) DPDK 23.03 testpmd startup fails on some testbeds

(M) 2n-spr: zero traffic on cx7 rdma

(M) Lossy trials in nat udp mlx5 tests

(L) 3n-icx: negative ipackets on TB38 AVF 4c l2patch

(L) 2n-tx2: af_xdp mrr failures

Not In Trending

(M) tests with 9000B payload frames not forwarded over memif interfaces

  • last update: 2023-02-09
  • work-to-fix: hard
  • test: 9000B + memif
  • testbed: 2n-skx, 3n-skx, 2n-clx
  • examples: 2n-skx Memif
  • ticket: CSIT-1808 VPP-2091

(M) IMIX 4c tests may fail PDR due to ~10% loss

(L) Memif crashes VPP in container with jumbo frames

  • last update: 2024-08-14
  • work-to-fix: hard (bug in VPP, no easy access to logs)
  • test: any memif with 9000B
  • testbed: all
  • examples: [1]
  • ticket: VPP-2091

(L) l3fwd error in 200Ge2P1Cx7Veat-Mlx5 test with 9000B

  • last update: 2023-06-28
  • work-to-fix: medium
  • test: 9000B + Cx7 with DPDK DUT
  • testbed: 2n-icx
  • examples: [2]
  • ticket: CSIT-1924

(L) 2n-clx, 2n-icx: nat44ed cps 16M sessions scale fail

(L) 2n-clx, 2n-icx: nat44det imix 1M sessions fails to create sessions

Occasional Failures

In Trending

(H) 2n-icx: NFV density VPP does not start in container

  • last update: before 2023-01-31
  • work-to-fix: hard
  • rca:
  • test: all subsequent
  • frequency: medium
  • testbed: 2n-icx
  • example: 2n-icx mrr, 2n-icx ndrpdr
  • ticket: CSIT-1881
  • note: Once VPP breaks, all subsequent tests fail. All subsequent builds also fail until Peter gets the testbed working again. Although the failure occurs with medium frequency, when it happens it breaks all subsequent builds on the testbed, hence the (H) priority.

(M) 3n-alt: high scale ipsec policy tests may crash VPP

(M) 3n-icx, 3n-snr: wireguard 100 and 1000 tunnels mlrsearch tests failing with 2c and 4c

  • last update: before 2023-01-31
  • work-to-fix: easy
  • rca:
  • test: wireguard 100 tunnels and more
  • frequency: high
  • testbed: 3n-icx, 3n-snr
  • examples: 3n-icx
  • ticket: CSIT-1886

(M) e810cq sometimes reporting link down

Rare Failures

In Trending

(M) 3n-icx, 3n-snr: 1518B IPsec packets not passing

(M) all testbeds: mlrsearch fails to find NDR rate

  • last update: before 2023-06-22
  • work-to-fix: hard
  • rca: On 3n-tsh, the symptom is TRex reporting ierrors, only in one direction. Other testbeds may have a different symptom, but failures there are less frequent.
  • test: Crypto, Ip4, L2, Srv6, Vm Vhost (all packet sizes, all core configurations affected)
  • frequency: low
  • testbed: 3n-tsh, 3n-alt, 2n-clx
  • example: [3]
  • ticket: CSIT-1804

(M) all testbeds: AF_XDP mlrsearch fails to find NDR rate

  • last update: before 2023-01-31
  • work-to-fix: hard
  • rca:
  • test: af-xdp multicore tests
  • frequency: low
  • testbed: 2n-clx, 2n-skx, 2n-tx2, 2n-icx
  • example: 2n-skx, 2n-clx
  • ticket: CSIT-1802
  • note: This is mainly observed in iterative and coverage jobs. The frequency is very low, roughly 1 out of 100.

(L) Rare VPP crash in nat avf tests

  • last update: 2024-08-14
  • work-to-fix: medium
  • rca: clib_dlist_remove called by nat44_session_update_lru, but probably just a memory corruption symptom
  • test: any NAT test with AVF
  • frequency: low
  • testbed: any
  • example: [4]
  • ticket: CSIT-1947
  • note: More frequent in soak tests.

(L) ipsec hwasync fails with large scale and multiple queues

  • last update: 2024-08-14
  • work-to-fix: hard
  • test: ipsec hwasync
  • frequency: low
  • testbed: all with QAT
  • example: [5]
  • ticket: CSIT-1946
  • note: Frequency decreased when CSIT changed the RXQ ratio in 40824.

(L) all testbeds: vpp create avf interface failure in multi-core configs

  • last update: 2023-02-06
  • work-to-fix: hard
  • rca: issue in Intel FVL driver
  • test: multicore AVF
  • frequency: low
  • testbed: all testbeds
  • example: 2n-clx, 3n-icx
  • ticket: CSIT-1782
  • note: A long-standing issue without a final permanent fix.

(L) all testbeds: nat44det 4M and 16M scale 1 session not established

  • last update: 2023-02-14
  • work-to-fix: hard
  • rca: unknown
  • test: nat44det udp 4m and 16m (64k is ok, 1m can fail but more rarely than the bigger scales)
  • frequency: low
  • testbed: 2n-zn2, 2n-skx, 2n-icx, 2n-clx
  • example: 2n-zn2, 2n-clx
  • ticket: CSIT-1795

(L) TRex may wrongly detect link bandwidth

Past Failures