Difference between revisions of "CSIT/TestFailuresTracking"

From fd.io
((M) 2n-zn2: All 4c RDMA tests are failing)
(In Trending: Add CSIT-1908.)
(8 intermediate revisions by the same user not shown)
Line 25: Line 25:
 
=== In Trending ===
 
  
==== (M) 2n-zn2: All 4c RDMA tests are failing ====
+
==== (H) 2n-icx: some E810-CQDA2 ports not found by PCI address ====
  
* last update: before 2023-02-22
+
* last update: 2023-04-19
* work-to-fix: medium?
+
* work-to-fix: medium
* rca: VPP change 38242 causes a crash, stack trace looks the same (if it exists), does not happen with debug VPP build. More specifics not known yet.
+
* rca: Infra issue with some ports.
* test: only on RDMA, only on 2n-zn2 (not on 2n-clx), higher core count is affected more
+
* test: All tests that happen to reserve TB212 or TB213.
* frequency: 4c almost always, 2c sometimes, 1c rarely.
+
* frequency: always since 2023-04-04
* testbed: 2n-zn2
+
* testbed: 2n-icx (TB212 and TB213)
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-zn2/683/log.html.gz#s1-s1-s1-s2-s1-t2 2n-zn2]
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-icx/256/log.html.gz#s1-s1-s1-k1-k10
* ticket: [https://jira.fd.io/browse/VPP-2070 VPP-2070]
+
* ticket: [https://jira.fd.io/browse/CSIT-1907 CSIT-1907]
 +
* note: The affected testbeds fail before starting the first test case.
 +
 
 +
==== (M) hoststack: ip4udpscale1cl10s-ldpreload-iperf3 times out waiting for strace ====
 +
 
 +
* last update: 2023-04-19
 +
* work-to-fix: medium
 +
* rca:
 +
* test: Only the two ip4udpscale1cl10s-ldpreload-iperf3 tests
 +
* frequency: always since 2023-03-04
 +
* testbed: 3n-icx
 +
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-hoststack-daily-master-3n-icx/41/log.html.gz#s1-s1-s1-s1-s8-t1-k2-k5-k12
 +
* ticket: [https://jira.fd.io/browse/CSIT-1908 CSIT-1908]
  
 
==== (M) 3n-snr: All hwasync wireguard tests failing when trying to verify device ====
 
Line 80: Line 92:
 
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-alt/155/log.html.gz#s1-s1-s1-s1-s1-t1
 
 
* ticket: [https://jira.fd.io/browse/CSIT-1890 CSIT-1890]
 
 +
 +
==== (M) 2n-spr 200Ge2P1Cx7Veat: TRex sees port line rate as 100 Gbps ====
 +
 +
* last update: 2023-04-19
 +
* work-to-fix: medium
 +
* rca: TRex has hard cap on perceived line rate.
 +
* test: All tests on 2n-spr with 200Ge2P1Cx7Veat NIC.
 +
* frequency: always (since 2n-spr was set up)
 +
* testbed: 2n-spr
 +
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-spr/15/log.html.gz#s1-s1-s1-s5-s19-t1-k2-k9-k9-k10-k1-k1-k1-k11
 +
* ticket: [https://jira.fd.io/browse/CSIT-1905 CSIT-1905]
 +
* note: Fix will be in TRex v3.03, possible workarounds being discussed.
 +
 +
==== (M) 2n-spr: zero traffic on cx7 rdma ====
 +
 +
* last update: 2023-04-19
 +
* work-to-fix: medium
 +
* rca: VPP reports "tx completion errors", more investigation ongoing.
 +
* test: All tests on 2n-spr with 200Ge2P1Cx7Veat NIC and RDMA driver.
 +
* frequency: always (since 2n-spr was set up)
 +
* testbed: 2n-spr
 +
* ticket: [https://jira.fd.io/browse/CSIT-1906 CSIT-1906]
 +
* note: Currently not visible in trending as CSIT-1905 hits first.
 +
 +
==== (M) 2n-clx: DPDK 23.03 link failures ====
 +
 +
* last update: 2023-04-17
 +
* work-to-fix: medium
 +
* rca: No link comes up on some NICs, investigation continues.
 +
* test: Testpmd (no vpp). Around half of tested+NIC combinations are affected.
 +
* frequency: always since 23.03.0 got released
 +
* testbed: multiple
 +
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-dpdk-perf-mrr-weekly-master-2n-clx/189/log.html.gz#s1-s1-s1-s1-t1-k2-k4
 +
* ticket: [https://jira.fd.io/browse/CSIT-1904 CSIT-1904]
  
 
==== (L) 3n-icx: negative ipackets on TB38 AVF 4c l2patch ====
 
  
* last update: 2023-02-22
+
* last update: 2023-04-19
 
* work-to-fix:
 
 
* rca:
 
* test: TB38 AVF 4c l2patch e810cq
+
* test: TB38 AVF 4c e810cq; earlier l2patch, nowadays eth-l2xcbase
 
* frequency: always
 
 
* testbed: 3n-icx (only TB38, never TB37)
 
Line 253: Line 299:
 
==== (M) all testbeds: mlrsearch fails to find NDR rate ====
 
  
* last update: before 2023-01-31
+
* last update: before 2023-04-19
 
* work-to-fix: hard
 
* rca:
+
* rca: One possible symptom (perhaps not the only one) is ierrors on the TRex side. It is not clear whether this is a TRex error or VPP sending mangled packets.
 
* test: Crypto, Ip4, L2, Srv6, Vm Vhost (all packet sizes, all core configurations affected)
 
 
* frequency: low
 
Line 299: Line 345:
 
= Past Failures =
 
  
==== (M) csit-dpdk-perf-mrr-weekly-master-3n-snr fails due to a missing symlink ====
+
==== (M) 2n-zn2: All 4c RDMA tests are failing ====
  
* last update: 2023-02-14
+
* last update: before 2023-04-19
* rca: Missing file in CSIT git (probably an oversight).
+
* work-to-fix: medium?
* test: all (robot does not even start)
+
* rca: VPP change 38242 causes a crash, stack trace looks the same (if it exists), does not happen with debug VPP build. More specifics not known yet.
* testbed: 3n-snr
+
* test: only on RDMA, only on 2n-zn2 (not on 2n-clx), higher core count is affected more
* frequency: always
+
* frequency: 4c almost always, 2c sometimes, 1c rarely.
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-3n-alt/155/log.html.gz#s1-s1-s1-s1-s1-t1
+
* testbed: 2n-zn2
* ticket: [https://jira.fd.io/browse/CSIT-1894 CSIT-1894]
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-zn2/683/log.html.gz#s1-s1-s1-s2-s1-t2 2n-zn2]
* gerrit: https://gerrit.fd.io/r/c/csit/+/38239
+
* ticket: [https://jira.fd.io/browse/VPP-2070 VPP-2070]
* note: Fix verified by https://jenkins.fd.io/view/csit/job/csit-dpdk-perf-mrr-weekly-master-3n-snr/26/
+
* note: Fixed in VPP: https://gerrit.fd.io/r/c/vpp/+/38527
 
+
==== (H) 3n-icx: vpp hoststack QUIC vppecho tests failing ====
+
 
+
* last update: 2023-02-14
+
* test: Quic vppecho BPS
+
* frequency: always
+
* testbed: 3n-skx, 3n-icx
+
* example: [https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2210-3n-icx/17/log.html.gz#s1-s1-s1-s1-s5-t1 3n-icx]
+
* ticket: [https://jira.fd.io/browse/CSIT-1835 CSIT-1835]
+
* gerrit: https://gerrit.fd.io/r/c/csit/+/38085
+
* note: Fix verified since https://jenkins.fd.io/view/csit/job/csit-vpp-perf-hoststack-daily-master-3n-icx/2/
+
 
+
==== (M) wrong MAC address on lf_2n_clx_testbed27.yaml ====
+
 
+
* last update: 2023-02-14
+
* rca: typo in topology yaml file
+
* test: mlx5 relying on MAC. Affected: memif, vhost, l2bd. Not affected: ip4, ip6, dot1q, other L2.
+
* testbed: 2n-clx, only the first testbed out of three in lab
+
* frequency: always, unless other 2n-clx testbed is reserved
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-mrr-daily-master-2n-clx/1270/log.html.gz#s1-s1-s1-s1-s3-t1-k2-k9-k1-k1-k1-k21
+
* ticket: [https://jira.fd.io/browse/CSIT-1893 CSIT-1893]
+
* gerrit: https://gerrit.fd.io/r/c/csit/+/38239
+
* note: Fix verified since https://jenkins.fd.io/view/csit/job/csit-vpp-perf-mrr-daily-master-2n-clx/1271/
+
 
+
==== (M) wrong MAC address on lf_3n_icx_testbed37.yaml ====
+
 
+
* last update: 2023-02-21
+
* work-to-fix: easy
+
* rca: typo in topology yaml file
+
* test: tests using 100Ge2P1E810Cq on that testbed with dpdk plugin; AVF is not affected (as that has its own MAC addresses on VFs)
+
* testbed: 3n-icx, only the first testbed out of three in lab
+
* frequency: always, unless other 3n-icx testbed is reserved
+
* example: https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2302-3n-icx/5/log.html.gz#s1-s1-s1-s5-s7-t1-k2-k5-k4
+
* ticket: [https://jira.fd.io/browse/CSIT-1898 CSIT-1898]
+
* gerrit: https://gerrit.fd.io/r/c/csit/+/38239
+
* note: Fix verified since https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-report-iterative-2302-3n-icx/15/log.html.gz#s1-s1-s1-s4-s2
+

Revision as of 13:49, 19 April 2023


CSIT Test Failure Classification

All known CSIT failures are grouped and listed in the following order:

  • Always failing followed by sometimes failing.
  • Always failing tests:
    • Most common use cases followed by less common.
  • Sometimes failing tests:
    • Most frequently failing followed by less frequently failing.
      • High frequency 50%-100%
      • Medium frequency 10%-50%
      • Low frequency 0%-10%
    • Within each sub-group: most common use cases followed by less common.
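As a rough illustration of the frequency bands listed above, the following minimal sketch maps an observed failure rate to one of the three labels. The function name and the handling of boundary values are assumptions: the page gives overlapping ranges (50% and 10% each appear in two bands), so this sketch assigns a boundary value to the higher band.

```python
def frequency_bucket(failure_rate: float) -> str:
    """Map a failure rate in [0.0, 1.0] to "high", "medium" or "low".

    Hypothetical helper, not part of CSIT.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0.0 and 1.0")
    # Assumption: a rate exactly on a boundary goes to the higher band.
    if failure_rate >= 0.5:
        return "high"
    if failure_rate >= 0.1:
        return "medium"
    return "low"
```

For example, a test failing in 3 out of 10 runs would be labelled "medium" under this reading.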

CSIT Test Fixing Priorities

Test fixing work priorities are defined as follows:

  • (H)igh priority, most common use cases and most common test code.
  • (M)edium priority, specific HW and pervasive test code issue.
  • (L)ow priority, corner cases and external dependencies.
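Each failure title on this page starts with one of these priority letters in parentheses. The sketch below shows one way such a title could be split into its priority and the remaining text; the function name and the regular expression are illustrative assumptions, not an existing CSIT tool.

```python
import re
from typing import Tuple

# Priority letters as defined in the list above.
PRIORITIES = {"H": "high", "M": "medium", "L": "low"}

# Hypothetical pattern: "(X) rest of title", where X is H, M or L.
_TITLE = re.compile(r"^\((?P<prio>[HML])\)\s*(?P<text>.+)$")


def parse_title(title: str) -> Tuple[str, str]:
    """Return (priority, text) for a title such as "(M) 2n-spr: ..."."""
    match = _TITLE.match(title.strip())
    if match is None:
        raise ValueError(f"no priority marker in title: {title!r}")
    return PRIORITIES[match.group("prio")], match.group("text")
```

Titles without a marker are rejected rather than given a default priority, since the page does not define one.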

Current Failures

Deterministic Failures

In Trending

(H) 2n-icx: some E810-CQDA2 ports not found by PCI address

(M) hoststack: ip4udpscale1cl10s-ldpreload-iperf3 times out waiting for strace

(M) 3n-snr: All hwasync wireguard tests failing when trying to verify device

  • last update: before 2023-01-31
  • work-to-fix: hard
  • rca: Missing QAT driver. Symptom: Failed to bind PCI device 0000:f4:00.0 to c4xxx on host 10.30.51.93
  • test: hwasync wireguard
  • frequency: always
  • testbed: 3n-snr
  • example: 3n-snr
  • ticket: CSIT-1883

(M) 1n-aws: TRex mlrsearch fails to find NDR & PDR due to AWS rate limiting (5min total test duration)

  • last update: 2023-02-09
  • work-to-fix: hard
  • rca:
  • test: ip4scale2m
  • frequency: always
  • testbed: 1n-aws
  • example: 1n-aws
  • ticket: CSIT-1876
  • note: The root cause may be the shared environment in the AWS cloud. We may need to use a smaller scale there.

(M) 3n-alt, 3n-snr: testpmd no traffic forwarded

  • last update: 2023-02-09
  • work-to-fix: medium
  • rca: The DUT-DUT link takes too long to come up on some testbeds. This happens *after* a test case with a DPDK app (not VPP, even when using the dpdk plugin), although multiple subsequent tests (even with VPP) may be affected. The real cause is probably in NIC firmware or driver, but as a workaround CSIT could detect port status more reliably.
  • test: testpmd (also l3fwd but hidden by CSIT-1896)
  • frequency: always (almost)
  • testbed: 3n-alt, 3n-snr
  • example: 3n-alt, 3n-snr, 3n-snr
  • ticket: CSIT-1848
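The rca above suggests that better port status detection could serve as a workaround. A minimal sketch of that idea is a polling loop that waits for the link before traffic starts. Everything here is an assumption for illustration: the function name, the default timeout and interval, and the is_link_up callable, which stands in for whatever mechanism actually reads the port state. It is not the CSIT implementation.

```python
import time
from typing import Callable


def wait_for_link(
    is_link_up: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 1.0,
) -> bool:
    """Poll is_link_up() until it returns True or the timeout expires.

    is_link_up is a hypothetical callable supplied by the caller.
    Returns True if the link came up in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_link_up():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller could fail the test early with a clear "link down" message when this returns False, instead of reporting zero forwarded traffic.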

(M) 3n-alt: Tests failing until 40Ge Interface comes up

(M) 2n-spr 200Ge2P1Cx7Veat: TRex sees port line rate as 100 Gbps

(M) 2n-spr: zero traffic on cx7 rdma

  • last update: 2023-04-19
  • work-to-fix: medium
  • rca: VPP reports "tx completion errors", more investigation ongoing.
  • test: All tests on 2n-spr with 200Ge2P1Cx7Veat NIC and RDMA driver.
  • frequency: always (since 2n-spr was set up)
  • testbed: 2n-spr
  • ticket: CSIT-1906
  • note: Currently not visible in trending as CSIT-1905 hits first.

(M) 2n-clx: DPDK 23.03 link failures

(L) 3n-icx: negative ipackets on TB38 AVF 4c l2patch

Not In Trending

(M) all testbeds: some vpp 9000B tests

  • last update: 2023-02-09
  • work-to-fix: hard
  • rca: VPP code: 34839: dpdk: cleanup MTU handling. CSIT needs to rework how it sets MTU / max frame rate (CSIT-1797). Some tests will continue failing due to missing support on VPP side, we will open specific Jira tickets for those.
  • test: see sub-items
  • frequency: always
  • testbed: all
  • examples: see sub-items
  • ticket: CSIT-1809
  • gerrit: https://gerrit.fd.io/r/c/csit/+/37824

(M) tests with 9000B payload frames not forwarded over vhost interfaces

  • last update: 2023-02-09
  • work-to-fix: hard
  • test: 9000B + vhostuser
  • testbed: 2n-skx, 3n-skx, 2n-clx
  • examples: 3n-skx vhostuser
  • ticket: CSIT-1809

tests with 9000B payload frames not forwarded over memif interfaces

  • last update: 2023-02-09
  • work-to-fix: hard
  • test: 9000B + memif
  • testbed: 2n-skx, 3n-skx, 2n-clx
  • examples: 2n-skx Memif
  • ticket: CSIT-1808

9000B payload frames not forwarded over tunnels due to violating supported Max Frame Size (VxLAN, LISP, SRv6)

  • last update: 2023-02-09
  • work-to-fix: medium
  • test: 9000B + (IP4 tunnels VXLAN, IP4 tunnels LISP, Srv6, IpSec)
  • testbed: 2n-icx, 3n-icx
  • examples: 2n-icx VXLAN, 3n-icx
  • ticket: CSIT-1801

(M) 9000b all AVF tests are failing to forward traffic

  • last update: 2023-02-09
  • work-to-fix: hard
  • test: 9000B + AVF
  • testbed: 3n-icx
  • examples: 3n-icx ip4base
  • ticket: CSIT-1885

(M) 2n-clx, 2n-icx, 2n-zn2: DPDK testpmd 9000b tests on xxv710 nic are failing with no traffic

  • last update: 2023-02-09
  • work-to-fix: medium
  • rca: The DPDK app attempts to set the MTU only once, and that attempt fails if the interface is down (CSIT-1848). As a workaround, the MTU could be set on the Linux interface before starting the DPDK app.
  • test: DPDK testpmd 9000b
  • frequency: always
  • testbed: 2n-clx, 2n-icx, 2n-zn2
  • example: 2n-clx, 2n-icx
  • ticket: CSIT-1870
  • note: Vratko will fix, either in general workaround for CSIT-1848 or in a separate change.
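The workaround mentioned in the rca (setting the MTU on the Linux interface before the DPDK app starts) could look roughly like the sketch below, which builds and runs the standard iproute2 command "ip link set dev <interface> mtu <value>". The helper names are hypothetical, and the sketch assumes the interface is still visible to the Linux kernel at that point and that the caller has sufficient privileges. The right MTU value for a 9000B test frame is also left to the caller.

```python
import subprocess
from typing import List


def mtu_command(interface: str, mtu: int) -> List[str]:
    """Build the iproute2 command that sets the MTU on a Linux interface."""
    if not interface:
        raise ValueError("interface name must not be empty")
    if mtu <= 0:
        raise ValueError("mtu must be a positive integer")
    return ["ip", "link", "set", "dev", interface, "mtu", str(mtu)]


def set_mtu(interface: str, mtu: int) -> None:
    """Run the command; raises CalledProcessError if 'ip' reports failure."""
    subprocess.run(mtu_command(interface, mtu), check=True)
```

Keeping the command construction separate from its execution makes it possible to log or review the exact command before it touches a testbed.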

(M) 2n-clx, 2n-icx: all Geneve tests with 1024 tunnels fail

  • last update: before 2023-01-31
  • work-to-fix: hard
  • rca: VPP crash, Failed to add IP neighbor on interface geneve_tunnel258
  • test: avf-ethip4--ethip4udpgeneve-1024tun-ip4base 64B 1518B IMIX 1c 2c 4c
  • frequency: always
  • testbed: 2n-skx, 2n-clx, 2n-icx
  • example: 2n-icx
  • ticket: CSIT-1800

(L) 2n-clx, 2n-icx: nat44ed cps 16M sessions scale fail

  • last update: before 2023-01-31
  • work-to-fix: hard
  • rca: VPP crash, Failed to set NAT44 address range on host 10.30.51.44 (connections-per-second tests only)
  • test: 64B-avf-ethip4tcp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c, 64B-avf-ethip4udp-nat44ed-h262144-p63-s16515072-cps-ndrpdr 1c 2c 4c
  • frequency: always
  • testbeds: 2n-skx, 2n-clx, 2n-icx
  • example: 2n-icx, 2n-clx
  • ticket: CSIT-1799

(L) 2n-clx, 2n-icx: nat44det imix 1M sessions fails to create sessions

  • last update: before 2023-01-31
  • work-to-fix: hard
  • rca:
  • test: IMIX over 1M sessions bidir
  • frequency: always
  • testbed: 2n-skx, 2n-clx, 2n-icx
  • example: 2n-icx
  • ticket: CSIT-1884

Occasional Failures

In Trending

(H) 2n-icx: NFV density VPP does not start in container

  • last update: before 2023-01-31
  • work-to-fix: hard
  • rca:
  • test: all subsequent
  • frequency: medium
  • testbed: 2n-icx
  • example: 2n-icx mrr, 2n-icx ndrpdr
  • ticket: CSIT-1881
  • note: Once VPP breaks, all subsequent tests fail, and all subsequent builds keep failing until Peter gets the testbed working again. Although the failure occurs with medium frequency, it breaks all subsequent builds on the testbed when it happens, hence the (H) priority.

(M) 2n-clx: e810 mlrsearch tests packets forwarding in one direction

  • last update: before 2023-01-31
  • work-to-fix: hard
  • rca:
  • test: e810Cq ip4base, ip6base
  • frequency: high
  • testbed: 2n-clx
  • example: 2n-clx
  • ticket: CSIT-1864

(M) 3n-icx, 3n-snr: wireguard 100 and 1000 tunnels mlrsearch tests failing with 2c and 4c

  • last update: before 2023-01-31
  • work-to-fix: easy
  • rca:
  • test: wireguard 100 tunnels and more
  • frequency: high
  • testbed: 3n-icx, 3n-snr
  • examples: 3n-icx
  • ticket: CSIT-1886

(M) 3n-tsh: vpp in VM starting too slowly

  • last update: before 2023-02-22
  • work-to-fix: medium
  • rca: perhaps related to numa, investigation continues
  • test: 3n-tsh: sporadic VM vhost
  • frequency: high
  • testbed: 3n-tsh
  • example: 3n-tsh, 3n-tsh
  • ticket: CSIT-1877

Rare Failures

In Trending

(M) 3n-icx, 3n-snr: 1518B IPsec packets not passing

(M) all testbeds: mlrsearch fails to find NDR rate

  • last update: before 2023-04-19
  • work-to-fix: hard
  • rca: One possible symptom (perhaps not the only one) is ierrors on the TRex side. It is not clear whether this is a TRex error or VPP sending mangled packets.
  • test: Crypto, Ip4, L2, Srv6, Vm Vhost (all packet sizes, all core configurations affected)
  • frequency: low
  • testbed: 3n-tsh, 3n-alt, 2n-clx
  • example: 2n-icx
  • ticket: CSIT-1804

(M) all testbeds: AF_XDP mlrsearch fails to find NDR rate

  • last update: before 2023-01-31
  • work-to-fix: hard
  • rca:
  • test: af-xdp multicore tests
  • frequency: low
  • testbed: 2n-clx, 2n-skx, 2n-tx2, 2n-icx
  • example: 2n-skx, 2n-clx
  • ticket: CSIT-1802
  • note: This is mainly observed in iterative and coverage jobs. The frequency is very low, roughly 1 in 100.

(L) all testbeds: vpp create avf interface failure in multi-core configs

  • last update: 2023-02-06
  • work-to-fix: hard
  • rca: issue in Intel FVL driver
  • test: multicore AVF
  • frequency: low
  • testbed: all testbeds
  • example: 2n-clx, 3n-icx
  • ticket: CSIT-1782
  • note: A long-standing issue without a final permanent fix.

(L) all testbeds: nat44det 4M and 16M scale 1 session not established

  • last update: 2023-02-14
  • work-to-fix: hard
  • rca: unknown
  • test: nat44det udp 4m and 16m (64k is ok, 1m can fail but more rarely than bigger scales)
  • frequency: low
  • testbed: 2n-zn2, 2n-skx, 2n-icx, 2n-clx
  • example: 2n-zn2, 2n-clx
  • ticket: CSIT-1795

Past Failures

(M) 2n-zn2: All 4c RDMA tests are failing

  • last update: before 2023-04-19
  • work-to-fix: medium?
  • rca: VPP change 38242 causes a crash, stack trace looks the same (if it exists), does not happen with debug VPP build. More specifics not known yet.
  • test: only on RDMA, only on 2n-zn2 (not on 2n-clx), higher core count is affected more
  • frequency: 4c almost always, 2c sometimes, 1c rarely.
  • testbed: 2n-zn2
  • example: 2n-zn2
  • ticket: VPP-2070
  • note: Fixed in VPP: https://gerrit.fd.io/r/c/vpp/+/38527