Difference between revisions of "CSIT/CSIT LF VIRL testbed"

From fd.io

Revision as of 12:59, 29 May 2017

VIRL infrastructure open tasks

This is the current working list of identified tasks for CSIT VIRL testbeds. It is updated periodically. All listed tasks and sub-tasks are tracked in CSIT Jira:

  • High Priority Tasks grouped by Epic: VIRL-GetWellPlan.
    • CSIT-581 Address all known issues impacting CSIT VIRL testbeds stability and operation.
  • Other Priority Tasks grouped by Epic: VIRL-Optimizations.
    • CSIT-606 Address all known issues to optimize CSIT VIRL testbeds usability and operation.

High Priority Tasks

  1. [DONE] Detecting and clearing stuck VIRL simulations. CSIT-582.
    1. Description: VIRL simulations continue to get stuck due to either LF network connectivity interruptions or a failing CSIT bootstrap teardown.
    2. Solution: automate clearing of old (garbage) simulations and detect unsuccessful simulation teardowns.
    3. Tasks:
      • [DONE] Use built-in VIRL simulation expire timer set to 2hrs.
        • coded default simulation expiry to 120min, semi-weekly to 500min, weekly to 120min. CSIT-579. gr6656.
      • [DONE] Extend CSIT bootstrap teardown stop-simulation API call to verify if it was SUCCESS/FAIL. CSIT-583.
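The expiry values listed above could be selected per job type roughly as follows. This is a minimal Python sketch only: the function name and the idea of matching on the job name are assumptions made for illustration, not the code merged in gr6656.

```python
# Expiry values in minutes, as listed in the task above.
DEFAULT_EXPIRY_MIN = 120
SEMIWEEKLY_EXPIRY_MIN = 500
WEEKLY_EXPIRY_MIN = 120


def expiry_minutes(job_name: str) -> int:
    """Return the simulation expiry for a (hypothetical) job name."""
    # Normalise so that "semi-weekly" and "semiweekly" match alike.
    name = job_name.lower().replace("-", "")
    # Check "semiweekly" first: that string also contains "weekly".
    if "semiweekly" in name:
        return SEMIWEEKLY_EXPIRY_MIN
    if "weekly" in name:
        return WEEKLY_EXPIRY_MIN
    return DEFAULT_EXPIRY_MIN
```

The order of the checks matters, since a plain substring test for "weekly" would also match semi-weekly jobs and give them the shorter timer.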
  2. [WIP] Add VIRL server healthchecks in CSIT. CSIT-584.
    1. Description: no regular automated healthchecks executed against VIRL servers.
    2. Solution: introduce a CSIT health-check monitoring job for VIRL servers' health.
    3. Tasks:
      • [OPEN] Create a new job, executed periodically (6hrs?) for healthchecking all VIRL servers. CSIT-585.
      • [OPEN] VIRL health-check APIs: health status, VIRL API tests, simulation tests. CSIT-586.
      • [OPEN] VIRL capacity check, report number of simulations per virl server. CSIT-587.
      • [IN-REVIEW] Add a pre-check to every start-testcase run to better handle exceptions and print errors. CSIT-579. gr6656.
  3. [OPEN] Address VIRL simulation mgmt IPv4 address depletion. CSIT-588.
    1. Description: Today there is one /24 subnet allocated for all VIRL simulations, split equally across 3 servers, giving 84 /32 addresses per server. Each CSIT simulation takes 4 addresses (mgmt, tg, sut1, sut2), and each csit-vpp and vpp-csit verify job uses 3 simulations to parallelize tests for reduced execution time. This means each server has capacity to run up to 7 verify jobs concurrently (3*4*7 = 84). Once Centos7 tests are productized, where two jobs are always executed in parallel, this will drop to 3 concurrent jobs, which is not enough. It is effectively a show stopper for productizing Centos7 into vpp-csit-verify per-patch jobs.
    2. Solution: Increase the IPv4 address space given to VIRL hosts. Dedicating a /24 subnet per VIRL server will give address capacity for 60 concurrent simulations. Based on previous memory calculations each VIRL host is capable of running 30 simulations (30*3 VMs); this still needs to be verified by testing.
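    3. Tasks:
      • [DONE] Request LF FD.io infra team to allocate at least another /24 for VIRL simulations. CSIT-589.
        • [DONE] Request acknowledged by LF as part of [FD.io Helpdesk #40733]. https://lists.fd.io/pipermail/csit-dev/2017-May/001911.html.
      • [OPEN] Add the new mgmt IPv4 address range to virl hosts. CSIT-590.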
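The address arithmetic in this item can be checked with a short Python sketch. The constant and function names are illustrative, and 192.0.2.0/24 is a documentation prefix used as a stand-in, not the subnet actually allocated by LF.

```python
import ipaddress

ADDRS_PER_SIMULATION = 4   # mgmt, tg, sut1, sut2
SIMS_PER_VERIFY_JOB = 3    # each verify job runs 3 simulations in parallel


def concurrent_jobs(addresses: int, parallel_jobs: int = 1) -> int:
    """Number of concurrent verify-job slots that fit in an address pool."""
    simulations = addresses // ADDRS_PER_SIMULATION
    return simulations // (SIMS_PER_VERIFY_JOB * parallel_jobs)


# Today: 84 addresses per server.
print(concurrent_jobs(84))                    # 7
print(concurrent_jobs(84, parallel_jobs=2))   # 3 (two jobs always run together)

# Proposed: a dedicated /24 per server (stand-in prefix).
subnet = ipaddress.ip_network("192.0.2.0/24")
usable = subnet.num_addresses - 2   # minus network and broadcast addresses
print(usable // ADDRS_PER_SIMULATION)         # 63
```

The result of 63 is an upper bound on simulations per /24. The figure of 60 quoted above is consistent with it if a few addresses are set aside for other uses, which is an assumption here rather than something the task list states.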
  4. [DONE] Script expecting VIRL sim nodes to be active within ca. 120sec after launch request - this is too tight. CSIT-593.
    1. Description: Intermittent test job failures due to 'ERROR: Simulation started OK but devices never changed to ACTIVE state'. A number of these can be avoided by increasing the script timeout to 240sec or so.
    2. Solution: Increase the script timeout to 240sec or so, but don't wait the full 4min every time before checking, as this would add to the overall execution time.
    3. Tasks:
      • [DONE] Increase test script timeout to 240sec. CSIT-593
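One way to raise the limit to 240sec without always waiting 4min is to poll at a short interval and return as soon as the nodes are active. The Python sketch below illustrates that idea only: the function name, the 5sec polling interval, and the `all_active` callback (standing in for whatever check the real script performs against VIRL) are assumptions.

```python
import time
from typing import Callable


def wait_until_active(
    all_active: Callable[[], bool],
    timeout: float = 240.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until all_active() returns True or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if all_active():
            return True       # return early, no need to sit out the timeout
        if clock() >= deadline:
            return False      # nodes never reached ACTIVE within the limit
        sleep(interval)
```

The `clock` and `sleep` parameters exist only so the loop can be exercised with a fake clock; in normal use the defaults apply and a simulation that comes up after 30sec costs about 30sec, not 240.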
  5. [WIP] tb4-virl servers upgrade to ubuntu16.04, VIRL-core ver. 10.32.8, OpenStack Mitaka. CSIT-594
    1. Description: virl upgrade to address issues with Centos7 test instabilities related to QEMU, and to improve general virl system robustness.
    2. Solution: Upgrade the tb4-virl1 server to ubuntu16.04, VIRL-core ver. 10.32.8, OpenStack Mitaka, and verify stability. Follow gradually with the tb4-virl2 and then tb4-virl3 upgrades.
    3. Tasks:
      • [WIP] VIRL1 server 10.30.51.28 - currently in STAGING, resolving issues. testing ongoing. CSIT-595.
      • [OPEN] VIRL1 server 10.30.51.28 - move to PRODUCTION once determined stable. Monitor PRODUCTION performance. CSIT-596.
      • [OPEN] VIRL1 server 10.30.51.28 - complete upgrade process documentation and ansible scripts. CSIT-597.
      • [OPEN] VIRL2 server 10.30.51.29 - upgrade based on documentation and ansible scripts from VIRL1 upgrade process, verify stability. CSIT-598.
      • [OPEN] VIRL2 server 10.30.51.29 - once stable, move to PRODUCTION. CSIT-599.
      • [OPEN] VIRL3 server 10.30.51.30 - upgrade based on documentation and ansible scripts from VIRL1 upgrade process, verify stability. CSIT-600.
      • [OPEN] VIRL3 server 10.30.51.30 - once stable, move to PRODUCTION. CSIT-601.
      • [OPEN] Check the current max CSIT simulation capacity per VIRL host. Focus on upgraded hosts. CSIT-591.
        • Description: Test it with a dedicated patch that runs a specified maximum number of simulations, keeping them short (only a few test cases).
  6. [IN-REVIEW] Optimize VIRL job scheduling algorithm based on available VIRL host capacity. CSIT-607.
    1. Description: Currently VIRL jobs schedule simulations round robin. At busy times this results in reaching the capacity limit of VIRL hosts, and tests fail.
    2. Solution: Adjust the VIRL simulation scheduling algorithm to check available VIRL capacity via API and schedule simulations based on it. If there is no capacity, wait, similarly to how it is done for performance jobs using physical testbeds.
    3. Tasks: to be identified.
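The capacity-based selection described in the solution might look roughly like the Python sketch below. It assumes the free capacity per host has already been obtained (for example via the VIRL API) and is passed in as a plain mapping; the function name and the "most free slots wins" policy are illustrative choices, not the algorithm under review in CSIT-607.

```python
from typing import Mapping, Optional


def pick_host(free_slots: Mapping[str, int], needed: int = 1) -> Optional[str]:
    """Return the host with the most free simulation slots, or None."""
    candidates = {host: n for host, n in free_slots.items() if n >= needed}
    if not candidates:
        return None  # no capacity anywhere: the caller should wait and retry
    # Most free capacity wins; the host name breaks ties deterministically.
    return min(candidates, key=lambda host: (-candidates[host], host))
```

For example, `pick_host({"tb4-virl1": 2, "tb4-virl2": 5, "tb4-virl3": 0})` selects `tb4-virl2`, while a mapping with no free slots returns `None` so the job can queue up and retry later.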
  7. [DONE] Need to periodically delete old files in /tmp directory. CSIT-578.
    1. Tasks:
      • [DONE] Cron job to delete old (more than 2 weeks?) files in /tmp directory on every VIRL server. CSIT-578.
        crontab -e
        0 0 * * * find /var/log/libvirt/qemu -type f -mtime +14 -name "instance*.log" -delete
        0 0 * * * find /tmp -type f -atime +14 -name "*.deb" -delete
        0 0 * * * find /tmp -type f -atime +14 -name "*.rpm" -delete
        0 0 * * * find /nfs/scratch/ -type d -mtime +14 -name "session-*"
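Before enabling deletion it can help to preview which files such a `find` expression would select. The Python sketch below is a hypothetical dry-run helper, not part of the cron setup: it only lists files, and it checks modification time (`-mtime`), whereas the /tmp entries above use access time (`-atime`).

```python
import fnmatch
import os
import time
from typing import List, Optional


def old_files(root: str, pattern: str, days: int = 14,
              now: Optional[float] = None) -> List[str]:
    """List regular files under root whose name matches pattern, selected
    roughly like `find root -type f -mtime +days -name pattern`."""
    now = time.time() if now is None else now
    matches = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not fnmatch.fnmatchcase(name, pattern):
                continue
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # keep regular files only, as -type f does
            # find counts age in whole 24-hour periods; "+14" means more than 14.
            age_days = int((now - os.stat(path).st_mtime) // 86400)
            if age_days > days:
                matches.append(path)
    return sorted(matches)
```

Because `find` discards the fractional part of the age, `-mtime +14` matches files that are at least 15 full days old, so the effective retention is slightly longer than two weeks.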
  8. [WIP] VIRL Centos7 tests productization into vpp-csit-verify. CSIT-602.
    1. Description: Following upgrade of tb4-virl1, Centos7 tests should be ready for productization.
    2. Solution: Proposal to run Centos7 tests periodically (daily) instead of per patch, to avoid VIRL simulations overload.
    3. Tasks:
      • [OPEN] Verify stability of csit-vpp-verify-Centos7 jobs. CSIT-603.
      • [OPEN] Create a daily vpp-csit-verify-Centos7 job. CSIT-604.

Other Priority Tasks

  1. [OPEN] CSIT-90: Nested-VM boot-up failed.
  2. [DELETED?] CSIT-210: Nested VM to include l3fwd startup script.
  3. [OPEN] CSIT-161: Update nested VM qemu library to use 3rd serial console.
  4. [OPEN] CSIT-356: Update VIRL testbed creation to allow specification of centos image.
  5. [OPEN] CSIT-605: Parameterize selection of VIRL nested VM image.
    1. Description: Currently VIRL is using only the latest nested Ubuntu or Centos VM image for all VM tests. Current inventory of VIRL nested Ubuntu VM images is tracked in https://git.fd.io/csit/tree/resources/tools/disk-image-builder/nested/CHANGELOG.
    2. Solution: Parameterize selection of VIRL nested VM image to allow tests to use specific VM image version - start with Ubuntu.
    3. Tasks: to be identified.
  6. [OPEN] CSIT-116: Modify VIRL and nested-VM username/password.
  7. [OPEN] CSIT-159: Nested VM: Replace cisco/cisco credentials with csit/csit.
  8. [OPEN] CSIT-160: Ubuntu VM: Replace cisco login with csit.
  9. [OPEN] CSIT-145: Out-of-band access to SUTs.
  10. [OPEN] CSIT-151: Do not destroy VM in case of test failure due to infrastructure issue.
  11. [OPEN] CSIT-150: Health-check to capture TG/SUT environment after failed test case.
  12. [OPEN] CSIT-202: Execute start/stop-testcase scripts from git repository.
  13. [OPEN] CSIT-115: Usage and status monitoring of VIRL hosts.
  14. [OPEN] CSIT-112: VIRL infrastructure periodic creation and distribution of images.