CSIT/PerformanceTrendingAnalysis
Purpose
With the growing number of features and code changes in the FD.io VPP data plane codebase, it is increasingly difficult to measure and detect VPP data plane performance changes. Likewise, once a degradation is detected, it is getting harder to bisect the source code to find the offending code change or addition. The problem is compounded by the wide range of compute platforms that VPP runs on, including Intel Xeon, Intel Atom and Arm AArch64.
Existing FD.io CSIT continuous performance trending test jobs help, but they rely on humans for anomaly detection. As the volume of data generated by these jobs keeps growing, this manual review becomes error-prone and unreliable.
The proposed solution is to eliminate the human factor and fully automate performance trending, regression and progression detection, and bisecting.
This document describes a high-level design of a system for continuous measurement, trending and performance change detection for the FD.io VPP SW data plane. It builds upon the existing CSIT framework, with extensions to its throughput testing methodology, the CSIT data analytics engine (PAL – Presentation-and-Analytics-Layer) and the associated Jenkins job definitions.
Continuous Performance Trending and Analysis
The proposed design replaces the existing CSIT performance trending jobs and tests with a new Performance Trending (PT) CSIT module and a separate Performance Analysis (PA) module. PA ingests results from PT and analyses, detects and reports any performance anomalies using historical trending data and statistical metrics. PA also produces trending graphs with summary and drill-down views across all specified tests, which can be reviewed and inspected regularly by the FD.io developer and user community.
Trend Analysis
All measured performance trend data is treated as time-series data that can be modelled using a normal distribution. After trimming the outliers, the average and the deviations from the average are used to detect performance change anomalies following the three-sigma rule of thumb (a.k.a. the 68-95-99.7 rule).
Analysis Metrics
The following statistical metrics are proposed as performance trend indicators over the rolling window of the last <N> sets of historical measurement data:
- Quartiles Q1, Q2, Q3 – three points dividing a ranked data set into four equal parts; Q2 is the median of the data.
- Inter Quartile Range IQR=Q3-Q1 – measure of variability, used here to eliminate outliers.
- Outliers – extreme values that are at least 1.5*IQR below Q1, or at least 1.5*IQR above Q3.
- Trimmed Moving Average (TMA) – average across the data set of the rolling window of <N> values without the outliers. Used here to calculate TMSD.
- Trimmed Moving Standard Deviation (TMSD) – standard deviation over the data set of the rolling window of <N> values without the outliers, requires calculating TMA. Used here for anomaly detection.
- Moving Median (MM) - median across the data set of the rolling window of <N> values with all data points, including the outliers. Used here for anomaly detection.
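The metrics listed above can be sketched in a few lines of Python. This is a minimal illustration, not the CSIT PAL implementation: the function name `trend_metrics`, the "inclusive" quartile interpolation and the use of the population standard deviation are all assumptions, since the text does not specify them.

```python
import statistics


def trend_metrics(window):
    """Compute the proposed trend metrics for one rolling window of samples.

    Assumptions (not stated in the design): quartiles use the "inclusive"
    interpolation method and TMSD is a population standard deviation.
    """
    q1, q2, q3 = statistics.quantiles(window, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Outliers are at least 1.5*IQR below Q1 or at least 1.5*IQR above Q3.
    outliers = [x for x in window if x <= low_fence or x >= high_fence]
    trimmed = [x for x in window if low_fence < x < high_fence]
    if not trimmed:
        # Guard for degenerate windows (e.g. all values equal, IQR == 0).
        trimmed = list(window)
    tma = statistics.fmean(trimmed)
    tmsd = statistics.pstdev(trimmed, mu=tma)
    # MM deliberately uses all data points, including the outliers.
    mm = statistics.median(window)
    return {
        "q1": q1, "q2": q2, "q3": q3, "iqr": iqr,
        "outliers": outliers, "tma": tma, "tmsd": tmsd, "mm": mm,
    }


# Hypothetical throughput samples (Mpps) with one obvious low outlier.
samples = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 4.0, 10.3]
metrics = trend_metrics(samples)
print(metrics["outliers"], metrics["mm"])
```

In this made-up window the single low sample is trimmed before TMA and TMSD are computed, while MM is still taken over all eight values.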
The relation between IQR and standard deviation (denoted by sigma) is shown in the figure below.
[Figure: Boxplot_vs_PDF.png]
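For a normal distribution, the relation between IQR and sigma can also be checked numerically. The sketch below uses the standard library's `statistics.NormalDist` (Python 3.8+); the variable names are illustrative only.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, sigma 1, so results are in units of sigma.
std_normal = NormalDist(mu=0.0, sigma=1.0)

q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

# Upper outlier fence used for trimming: Q3 + 1.5*IQR.
high_fence = q3 + 1.5 * iqr

# Share of normally distributed values that fall between the two fences.
coverage = std_normal.cdf(high_fence) - std_normal.cdf(-high_fence)

print(f"IQR = {iqr:.3f} sigma")
print(f"outlier fence = +/-{high_fence:.3f} sigma")
print(f"values kept after trimming = {coverage:.2%}")
```

The IQR comes out at roughly 1.349 sigma and the 1.5*IQR fences at roughly 2.698 sigma from the mean, so trimming outliers of truly normal data removes well under 1% of the samples.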
Anomaly Detection
Based on the assumption that all performance measurements can be modelled using a normal distribution, the three-sigma rule of thumb is proposed as the main criterion for anomaly detection.
The three-sigma rule of thumb, a.k.a. the 68–95–99.7 rule, is shorthand for the percentage of values in a normal distribution that lie within a band around the mean with a width of two, four and six standard deviations. More precisely, 68.27%, 95.45% and 99.73% of the result values should lie within one, two and three standard deviations of the mean, respectively; see figure below.
To verify compliance of a test result with value X against the defined trend analysis metrics and to detect anomalies, three simple evaluation criteria are proposed:
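The three percentages follow from the normal distribution's cumulative distribution function: the probability of a value lying within k standard deviations of the mean is erf(k / sqrt(2)). A short check, with `coverage_within` as an illustrative helper name:

```python
import math


def coverage_within(k):
    """Probability that a normal variable lies within k sigma of its mean."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage_within(k):.2%}")
# within 1 sigma: 68.27%
# within 2 sigma: 95.45%
# within 3 sigma: 99.73%
```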
Evaluation Criteria                 Compliance Confidence Level   Evaluation Result
====================================================================================
(MM-3*TMSD) <= X <= (MM+3*TMSD)     99.73%                        Normal
X < (MM-3*TMSD)                     Anomaly                       Regression
X > (MM+3*TMSD)                     Anomaly                       Progression
MM is used for the central trend reference point instead of TMA as it is more robust to anomalies.
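The three evaluation criteria map directly onto a small classification function. This is a sketch under the definitions above; the function name `classify` and the sample numbers are assumptions made for illustration.

```python
def classify(x, mm, tmsd):
    """Classify test result x against the band MM +/- 3*TMSD."""
    low = mm - 3.0 * tmsd
    high = mm + 3.0 * tmsd
    if x < low:
        return "Regression"
    if x > high:
        return "Progression"
    # low <= x <= high
    return "Normal"


# Hypothetical trend: MM = 10.0 Mpps, TMSD = 0.2 Mpps, so the band is 9.4 to 10.6.
print(classify(10.1, mm=10.0, tmsd=0.2))  # Normal
print(classify(9.0, mm=10.0, tmsd=0.2))   # Regression
print(classify(11.0, mm=10.0, tmsd=0.2))  # Progression
```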
Reporting
Analysis results are reported in text format per test case result, in graphical format with trending graphs and as a cumulative Jenkins job result, as follows:
Test Result Evaluation   Reported Result   Reported Reason   Trending Graph Markers
====================================================================================
Normal                   Pass              Normal            Part of plot line
Regression               Fail              Regression        Red circle
Progression              Pass              Progression       Green circle
Jenkins job cumulative results:
- Pass - if all detection results are Pass or Warning.
- Fail - if any detection result is Fail.
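The mapping from per-test evaluations to the cumulative Jenkins job result could look like the sketch below. The names `REPORTED_RESULT` and `job_result` are hypothetical, and only the three evaluation results listed in the reporting table are handled.

```python
# Per-test mapping taken from the reporting table: evaluation -> reported result.
REPORTED_RESULT = {
    "Normal": "Pass",
    "Regression": "Fail",
    "Progression": "Pass",
}


def job_result(evaluations):
    """Return the cumulative job result for a list of per-test evaluations."""
    reported = [REPORTED_RESULT[evaluation] for evaluation in evaluations]
    # The job fails if any single detection result is Fail.
    return "Fail" if "Fail" in reported else "Pass"


print(job_result(["Normal", "Progression", "Normal"]))  # Pass
print(job_result(["Normal", "Regression"]))             # Fail
```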
Performance Trending (PT)
CSIT PT runs regular performance test jobs that find MRR, PDR and NDR per test case. PT is designed as follows:
- PT job triggers:
  - Periodic, e.g. daily.
  - On-demand, Gerrit triggered.
  - Other periodic TBD.
- Measurements and calculations per test case:
  - MRR Max Received Rate:
    - Measured: unlimited tolerance of packet loss.
    - Send packets at link rate, count total received packets, divide by test trial period.
  - Optimized binary search bounds for PDR and NDR tests:
    - Calculated: high and low bounds for binary search based on MRR and pre-defined Packet Loss Ratio (PLR).
    - HighBound=MRR, LowBound=to-be-determined.
    - PLR – acceptable loss ratio for PDR tests, currently set to 0.5% for all performance tests.
  - PDR and NDR:
    - Run binary search within the calculated bounds, find PDR and NDR.
    - Measured: PDR Partial Drop Rate – limited non-zero tolerance of packet loss.
    - Measured: NDR Non Drop Rate – zero packet loss.
- Archive MRR, PDR and NDR per test case.
- Archive counters collected at MRR, PDR and NDR.
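The measurement steps above can be illustrated with a simplified sketch. It is not the CSIT search algorithm: `mrr`, `search_rate` and `simulated_loss_ratio` are hypothetical names, the traffic generator is replaced by a toy model of a device that forwards at most 7.0 Mpps, and the low bound (to-be-determined in the design) is simply passed in. The search assumes that loss ratio does not decrease as the offered rate increases and that the low bound itself meets the loss criterion.

```python
def mrr(received_packets, trial_duration_s):
    """Max Received Rate: packets received at link-rate load per second."""
    return received_packets / trial_duration_s


def search_rate(loss_ratio_at, low, high, max_loss_ratio, precision):
    """Binary search for the highest rate whose loss ratio <= max_loss_ratio.

    loss_ratio_at is a caller-supplied function running one trial at a rate.
    """
    if loss_ratio_at(high) <= max_loss_ratio:
        return high
    while high - low > precision:
        mid = (low + high) / 2.0
        if loss_ratio_at(mid) <= max_loss_ratio:
            low = mid   # mid is acceptable, search higher
        else:
            high = mid  # too much loss, search lower
    return low


def simulated_loss_ratio(rate_mpps, capacity_mpps=7.0):
    """Toy device model (assumption): drops everything above its capacity."""
    if rate_mpps <= capacity_mpps:
        return 0.0
    return (rate_mpps - capacity_mpps) / rate_mpps


# Hypothetical trial: 14 million packets received in a 2 second trial.
high_bound = mrr(14_000_000, 2.0) / 1e6  # 7.0 Mpps, used as HighBound

# Search up to 10 Mpps here so that the toy model actually shows loss.
ndr = search_rate(simulated_loss_ratio, 0.0, 10.0, max_loss_ratio=0.0, precision=0.001)
pdr = search_rate(simulated_loss_ratio, 0.0, 10.0, max_loss_ratio=0.005, precision=0.001)
print(f"MRR={high_bound:.3f} NDR={ndr:.3f} PDR={pdr:.3f} (Mpps)")
```

With PLR set to 0.5%, the PDR found by the search sits slightly above the NDR, as expected from the definitions.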
Performance Analysis (PA)
CSIT PA runs performance analysis, change detection and trending using specified trend analysis metrics over the rolling window of last <N> sets of historical measurement data. PA is defined as follows:
- PA job triggers:
  - By PT job at its completion.
  - On-demand, Gerrit triggered.
  - Other periodic TBD.
- Download and parse archived historical data and the new data:
  - New data from the latest PT job is evaluated against the rolling window of <N> sets of historical data.
  - Download RF output.xml files and compressed archived data.
  - Parse out the data, filtering the test cases listed in the PA specification (part of the CSIT PAL specification file).
- Calculate trend metrics for the rolling window of <N> sets of historical data:
  - Calculate quartiles Q1, Q2, Q3.
  - Trim outliers using IQR.
  - Calculate TMA and TMSD.
  - Calculate the normal trending range per test case based on MM and TMSD.
- Evaluate new test data against trend metrics:
  - If within the range of (MM +/- 3*TMSD) => Result = Pass, Reason = Normal.
  - If below the range => Result = Fail, Reason = Regression.
  - If above the range => Result = Pass, Reason = Progression.
- Generate and publish results:
  - Relay evaluation result to job result.
  - Generate a new set of trend analysis summary graphs and drill-down graphs.
    - Summary graphs to include measured values with Normal, Progression and Regression markers. MM shown in the background if possible.
    - Drill-down graphs to include MM, TMA and TMSD.
  - Publish trend analysis graphs in html format on https://docs.fd.io/.
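The evaluation part of the PA flow can be sketched as a walk over a time series, where each new sample is judged against the rolling window of the <N> samples preceding it. This is an illustration only: `evaluate_series` is a hypothetical name, the band is centred on MM as in the Anomaly Detection criteria, and the quartile method and population standard deviation are assumptions. Downloading, parsing and graph generation are left out.

```python
import statistics


def evaluate_series(samples, window_size):
    """Evaluate each sample against the window_size samples preceding it.

    Returns a list of (value, result, reason) tuples.
    """
    evaluations = []
    for i in range(window_size, len(samples)):
        window = samples[i - window_size:i]
        q1, _, q3 = statistics.quantiles(window, n=4, method="inclusive")
        iqr = q3 - q1
        # Trim outliers using IQR; fall back to the full window if all are trimmed.
        trimmed = [x for x in window
                   if q1 - 1.5 * iqr < x < q3 + 1.5 * iqr] or window
        tmsd = statistics.pstdev(trimmed)
        mm = statistics.median(window)
        value = samples[i]
        if value < mm - 3.0 * tmsd:
            evaluations.append((value, "Fail", "Regression"))
        elif value > mm + 3.0 * tmsd:
            evaluations.append((value, "Pass", "Progression"))
        else:
            evaluations.append((value, "Pass", "Normal"))
    return evaluations


# Hypothetical daily throughput results (Mpps) with one drop and one jump.
history = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.1, 9.9, 10.0, 8.0, 10.1, 12.0]
for value, result, reason in evaluate_series(history, window_size=8):
    print(f"{value:5.1f}  {result:4s}  {reason}")
```

In this made-up series the drop to 8.0 is flagged as a regression, and because outliers are trimmed before TMSD is computed, that single low sample does not widen the band for the samples that follow.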