Difference between revisions of "CSIT/CSIT LF VIRL testbed"

From fd.io
(VIRL infrastructure open tasks)
(VIRL infrastructure open tasks)
 
(19 intermediate revisions by 3 users not shown)
Line 1: Line 1:
== VIRL testbed description ==
+
== VIRL infrastructure open tasks ==
  
''TODO''
+
This is the current working list of identified tasks for CSIT VIRL testbeds. It is updated periodically.
 +
All listed tasks and sub-tasks are tracked in CSIT jira:
  
== VIRL infrastructure open tasks ==
+
* '''High Priority Tasks''' grouped by Epic: '''VIRL-GetWellPlan'''.
 +
** [https://jira.fd.io/browse/CSIT-581 CSIT-581] Address all known issues impacting CSIT VIRL testbeds stability and operation.
 +
* '''Other Priority Tasks''' grouped by Epic: '''VIRL-Optimizations'''.
 +
** [https://jira.fd.io/browse/CSIT-606 CSIT-606] Address all known issues to optimize CSIT VIRL testbeds usability and operation.
 +
 
 +
===High Priority Tasks===
 +
 
 +
# [DONE] Detecting and clearing stuck VIRL simulations. [https://jira.fd.io/browse/CSIT-582 CSIT-582].
 +
## Description: VIRL simulations continue to get stuck due to either LF network connectivity interruptions or failing CSIT bootstrap teardown.
 +
## Solution: automate clearing old (garbage) simulations, detect non-successful simulation teardowns.
 +
## Tasks:
 +
##* [DONE] Use built-in VIRL simulation expire timer set to 2hrs.
 +
##** coded default simulation expiry to 120min, semi-weekly to 500min, weekly to 120min. [https://jira.fd.io/browse/CSIT-579 CSIT-579]. [https://gerrit.fd.io/r/#/c/6656/ gr6656].
 +
##* [DONE] Extend CSIT bootstrap teardown stop-simulation API call to verify if it was SUCCESS/FAIL. [https://jira.fd.io/browse/CSIT-583 CSIT-583].
 +
# [OPEN] Add VIRL server healthchecks in CSIT. [https://jira.fd.io/browse/CSIT-584 CSIT-584].
 +
## Description: no regular automated healthchecks executed against VIRL servers.
 +
## Solution: introduce a CSIT health-check monitoring job for VIRL servers' health.
 +
## Tasks:
 +
##* [OPEN] Create a new job, executed periodically (6hrs?) for healthchecking all VIRL servers. [https://jira.fd.io/browse/CSIT-585 CSIT-585].
 +
##* [OPEN] VIRL health-check APIs: health status, VIRL API tests, simulation tests. [https://jira.fd.io/browse/CSIT-586 CSIT-586].
 +
##* [OPEN] VIRL capacity check, report number of simulations per virl server. [https://jira.fd.io/browse/CSIT-587 CSIT-587].
 +
##* [DONE] Add a pre-check to every start-testcase to better handle exceptions and print errors. [https://jira.fd.io/browse/CSIT-579 CSIT-579]. [https://gerrit.fd.io/r/#/c/6656/ gr6656].
 +
# [WIP] Address VIRL simulation mgmt IPv4 address depletion. [https://jira.fd.io/browse/CSIT-588 CSIT-588].
 +
## Description: Today there is one /24 subnet allocated for all VIRL simulations, split equally across 3 servers, giving 84 /32 addresses per server. Each CSIT simulation takes 4 addresses (mgmt, tg, sut1, sut2), and each csit-vpp and vpp-csit verify job uses 3 simulations to parallelize tests for reduced execution time. This means each server has capacity to run up to 7 verify jobs concurrently (3*4*7 = 84). Once Centos7 tests are productized, where two jobs are always executed in parallel, this will drop to 3 concurrent jobs per server, which is too few. It is effectively a show stopper for productizing Centos7 into vpp-csit-verify per-patch jobs.
 +
## Solution: Increase the IPv4 address space given to VIRL hosts. Dedicating a /24 subnet per VIRL server will give address capacity for about 60 concurrent simulations. Based on previous memory calculations, each VIRL host is capable of running 30 simulations (30*3 VMs); this needs to be tested and verified.
 +
## Tasks:
 +
##* [DONE] Request LF FD.io infra team to allocate at least another /24 for VIRL simulations. [https://jira.fd.io/browse/CSIT-589 CSIT-589].
 +
##** [DONE] Request acknowledged by LF as part of [FD.io Helpdesk #40733]. https://lists.fd.io/pipermail/csit-dev/2017-May/001911.html.
 +
##* [OPEN] Add the new mgmt IPv4 address range to virl hosts. [https://jira.fd.io/browse/CSIT-590 CSIT-590].
 +
# [DONE] Script expecting VIRL sim nodes to be active within ca. 120sec after launch request - this is too tight. [https://jira.fd.io/browse/CSIT-593 CSIT-593].
 +
## Description: Intermittent test job failures due to 'ERROR: Simulation started OK but devices never changed to ACTIVE state'. A number of these can be avoided by increasing the script timeout to 240sec or so.
 +
## Solution: Increase the script timeout to 240sec or so, but do not wait the full 4min every time before checking, as this would add to the overall execution time.
 +
## Tasks:
 +
##* [DONE] Increase test script timeout to 240sec. [https://jira.fd.io/browse/CSIT-593 CSIT-593]
 +
# [OPEN] tb4-virl servers upgrade to ubuntu16.04, VIRL-core ver. 10.32.8, OpenStack Mitaka. [https://jira.fd.io/browse/CSIT-594 CSIT-594]
 +
## Description: Upgrade VIRL to address issues with Centos7 test instabilities related to QEMU, and to improve general VIRL system robustness.
 +
## Solution: Upgrade tb4-virl1 server to ubuntu16.04, VIRL-core ver. 10.32.8, OpenStack Mitaka. Verify stability. Follow gradually with tb4-virl2 and then tb4-virl3 upgrades.
 +
## Tasks:
 +
##* [OPEN] VIRL1 server 10.30.51.28 - currently in STAGING, resolving issues. testing ongoing. [https://jira.fd.io/browse/CSIT-595 CSIT-595].
 +
##* [OPEN] VIRL1 server 10.30.51.28 - move to PRODUCTION once determined stable. Monitor PRODUCTION performance. [https://jira.fd.io/browse/CSIT-596 CSIT-596].
 +
##* [OPEN] VIRL1 server 10.30.51.28 - complete upgrade process documentation and ansible scripts. [https://jira.fd.io/browse/CSIT-597 CSIT-597].
 +
##* [OPEN] VIRL2 server 10.30.51.29 - upgrade based on documentation and ansible scripts from VIRL1 upgrade process, verify stability. [https://jira.fd.io/browse/CSIT-598 CSIT-598].
 +
##* [OPEN] VIRL2 server 10.30.51.29 - once stable, move to PRODUCTION. [https://jira.fd.io/browse/CSIT-599 CSIT-599].
 +
##* [OPEN] VIRL3 server 10.30.51.30 - upgrade based on documentation and ansible scripts from VIRL1 upgrade process, verify stability. [https://jira.fd.io/browse/CSIT-600 CSIT-600].
 +
##* [OPEN] VIRL3 server 10.30.51.30 - once stable, move to PRODUCTION. [https://jira.fd.io/browse/CSIT-601 CSIT-601].
 +
##* [DONE] Check the current max CSIT simulation capacity per VIRL host. Focus on upgraded hosts. [https://jira.fd.io/browse/CSIT-591 CSIT-591].
 +
##** Description: Test it with a dedicated patch that executes a specified maximum number of simulations, keeping them short (only a few test cases).
 +
# [DONE] Optimize VIRL job scheduling algorithm based on available VIRL host capacity. [https://jira.fd.io/browse/CSIT-607 CSIT-607].
 +
## Description: Currently VIRL jobs schedule simulations round robin. At busy times this results in reaching the capacity limit of VIRL hosts, and tests fail.
 +
## Solution: Adjust the VIRL simulation scheduling algorithm to verify VIRL available capacity via API, and schedule simulations based on available capacity. If there is no capacity, then wait, similarly to how it's done for performance jobs using physical testbeds.
 +
## Tasks: to be identified.
 +
# [DONE] Need to periodically delete old files in /tmp directory. [https://jira.fd.io/browse/CSIT-578 CSIT-578].
 +
## Tasks:
 +
##* [DONE] Cron job to delete old (more than 2 weeks?) files in /tmp directory on every VIRL server. [https://jira.fd.io/browse/CSIT-578 CSIT-578].
 +
##*: crontab -e
 +
##*: 0 0 * * * find /var/log/libvirt/qemu -type f -mtime +14 -name "instance*.log" -delete
 +
##*: 0 0 * * * find /tmp -type f -atime +14 -name "*.deb" -delete
 +
##*: 0 0 * * * find /tmp -type f -atime +14 -name "*.rpm" -delete
 +
##*: 0 0 * * * find /nfs/scratch/ -type d -mtime +14 -name "session-*"
 +
# [WIP] VIRL Centos7 tests productization into vpp-csit-verify. [https://jira.fd.io/browse/CSIT-602 CSIT-602].
 +
## Description: Following upgrade of tb4-virl1, Centos7 tests should be ready for productization.
 +
## Solution: Proposal to run Centos7 tests periodically (daily) instead of per patch, to avoid VIRL simulations overload.
 +
## Tasks:
 +
##* [WIP] Verify stability of csit-vpp-verify-Centos7 jobs. [https://jira.fd.io/browse/CSIT-603 CSIT-603].
 +
##* [IN-REVIEW] Create a daily vpp-csit-verify-Centos7 job. [https://jira.fd.io/browse/CSIT-604 CSIT-604].
 +
 
 +
===Other Priority Tasks===
  
# The most important tasks
+
# [OPEN] [https://jira.fd.io/browse/CSIT-90 CSIT-90]: Nested-VM boot-up failed.
#* VIRL server 10.30.51.28 - currently in TESTING status because of some issues (VIRL licence, keystone); we need to move it to PRODUCTION status
+
# [DELETED?] [https://jira.fd.io/browse/CSIT-210 CSIT-210]: Nested VM to include l3fwd startup script.
#* Create cron job to kill old (more than 24h?) sessions on every VIRL server
+
# [OPEN] [https://jira.fd.io/browse/CSIT-161 CSIT-161]: Update nested VM qemu library to use 3rd serial console.
#* Cron job to delete old (more than 2 weeks?) files in /tmp directory on every VIRL server
+
# [OPEN] [https://jira.fd.io/browse/CSIT-356 CSIT-356]: Update VIRL testbed creation to allow specification of centos image.
#*: crontab -e
+
# [OPEN] [https://jira.fd.io/browse/CSIT-605 CSIT-605]: Parameterize selection of VIRL nested VM image.
#*: 0 0 * * * find /var/log/libvirt/qemu -type f -mtime +14 -name "instance*.log" -delete
+
## Description: Currently VIRL is using only the latest nested Ubuntu or Centos VM image for all VM tests. Current inventory of VIRL nested Ubuntu VM images is tracked in https://git.fd.io/csit/tree/resources/tools/disk-image-builder/nested/CHANGELOG.
#*: 0 0 * * * find /var/log/libvirt/qemu -type f -mtime +14 -name "instance*.log" -delete
+
## Solution: Parameterize selection of VIRL nested VM image to allow tests to use specific VM image version - start with Ubuntu.
#*: 0 0 * * * find /tmp -type f -atime +14 -name "*.deb" -delete
+
## Tasks: to be identified.
#*: 0 0 * * * find /tmp -type f -atime +14 -name "*.rpm" -delete
+
# [OPEN] [https://jira.fd.io/browse/CSIT-116 CSIT-116]: Modify VIRL and nested-VM username/password.
#* Currently the latest nested VM image is used for all Ubuntu/Centos images - we need a solution to be able to link different nested VM images to different ubuntu/centos images
+
# [OPEN] [https://jira.fd.io/browse/CSIT-159 CSIT-159]: Nested VM: Replace cisco/cisco credentials with csit/csit.
#* Centos7 tests instability (under investigation by Tom Herbert)
+
# [OPEN] [https://jira.fd.io/browse/CSIT-160 CSIT-160]: Ubuntu VM: Replace cisco login with csit.
# Other tasks
+
# [OPEN] [https://jira.fd.io/browse/CSIT-145 CSIT-145]: Out-of-band access to SUTs.
#* CSIT-116 [https://jira.fd.io/browse/CSIT-116]: Modify VIRL and nested-VM username/password
+
# [OPEN] [https://jira.fd.io/browse/CSIT-151 CSIT-151]: Do not destroy VM in case of test failure due to infrastructure issue.
#* CSIT-159 [https://jira.fd.io/browse/CSIT-159]: Nested VM: Replace cisco/cisco credentials with csit/csit
+
# [OPEN] [https://jira.fd.io/browse/CSIT-150 CSIT-150]: Health-check to capture TG/SUT environment after failed test case.
#* CSIT-160 [https://jira.fd.io/browse/CSIT-160]: Ubuntu VM: Replace cisco login with csit
+
# [OPEN] [https://jira.fd.io/browse/CSIT-202 CSIT-202]: Execute start/stop-testcase scripts from git repository.
#* CSIT-145 [https://jira.fd.io/browse/CSIT-145]: Out-of-band access to SUTs
+
# [OPEN] [https://jira.fd.io/browse/CSIT-115 CSIT-115]: Usage and status monitoring of VIRL hosts.
#* CSIT-151 [https://jira.fd.io/browse/CSIT-151]: Do not destroy VM in case of test failure due to infrastructure issue  
+
# [OPEN] [https://jira.fd.io/browse/CSIT-112 CSIT-112]: VIRL infrastructure periodic creation and distribution of images.
#* CSIT-150 [https://jira.fd.io/browse/CSIT-150]: Health-check to capture TG/SUT environment after failed test case
+
#* CSIT-202 [https://jira.fd.io/browse/CSIT-202]: Execute start/stop-testcase scripts from git repository
+
#* CSIT-115 [https://jira.fd.io/browse/CSIT-115]: Usage and status monitoring of VIRL hosts
+
#* CSIT-112 [https://jira.fd.io/browse/CSIT-112]: VIRL infrastructure periodic creation and distribution of images
+
#* CSIT-90 [https://jira.fd.io/browse/CSIT-90]: Nested-VM boot-up failed
+
#* CSIT-210 [https://jira.fd.io/browse/CSIT-210]: Nested VM to include l3fwd startup script
+
#* CSIT-161 [https://jira.fd.io/browse/CSIT-161]: Update nested VM qemu library to use 3rd serial console
+
#* CSIT-356 [https://jira.fd.io/browse/CSIT-356]: Update VIRL testbed creation to allow specification of centos image
+

Latest revision as of 14:09, 28 June 2017

VIRL infrastructure open tasks

This is the current working list of identified tasks for CSIT VIRL testbeds. It is updated periodically. All listed tasks and sub-tasks are tracked in CSIT jira:

  • High Priority Tasks grouped by Epic: VIRL-GetWellPlan.
    • CSIT-581 Address all known issues impacting CSIT VIRL testbeds stability and operation.
  • Other Priority Tasks grouped by Epic: VIRL-Optimizations.
    • CSIT-606 Address all known issues to optimize CSIT VIRL testbeds usability and operation.

High Priority Tasks

  1. [DONE] Detecting and clearing stuck VIRL simulations. CSIT-582.
    1. Description: VIRL simulations continue to get stuck due to either LF network connectivity interruptions or failing CSIT bootstrap teardown.
    2. Solution: automate clearing old (garbage) simulations, detect non-successful simulation teardowns.
    3. Tasks:
      • [DONE] Use built-in VIRL simulation expire timer set to 2hrs.
        • coded default simulation expiry to 120min, semi-weekly to 500min, weekly to 120min. CSIT-579. gr6656.
      • [DONE] Extend CSIT bootstrap teardown stop-simulation API call to verify if it was SUCCESS/FAIL. CSIT-583.
  2. [OPEN] Add VIRL server healthchecks in CSIT. CSIT-584.
    1. Description: no regular automated healthchecks executed against VIRL servers.
    2. Solution: introduce a CSIT health-check monitoring job for VIRL servers' health.
    3. Tasks:
      • [OPEN] Create a new job, executed periodically (6hrs?) for healthchecking all VIRL servers. CSIT-585.
      • [OPEN] VIRL health-check APIs: health status, VIRL API tests, simulation tests. CSIT-586.
      • [OPEN] VIRL capacity check, report number of simulations per virl server. CSIT-587.
      • [DONE] Add a pre-check to every start-testcase to better handle exceptions and print errors. CSIT-579. gr6656.
  3. [WIP] Address VIRL simulation mgmt IPv4 address depletion. CSIT-588.
    1. Description: Today there is one /24 subnet allocated for all VIRL simulations, split equally across 3 servers, giving 84 /32 addresses per server. Each CSIT simulation takes 4 addresses (mgmt, tg, sut1, sut2), and each csit-vpp and vpp-csit verify job uses 3 simulations to parallelize tests for reduced execution time. This means each server has capacity to run up to 7 verify jobs concurrently (3*4*7 = 84). Once Centos7 tests are productized, where two jobs are always executed in parallel, this will drop to 3 concurrent jobs per server, which is too few. It is effectively a show stopper for productizing Centos7 into vpp-csit-verify per-patch jobs.
    2. Solution: Increase the IPv4 address space given to VIRL hosts. Dedicating a /24 subnet per VIRL server will give address capacity for about 60 concurrent simulations. Based on previous memory calculations, each VIRL host is capable of running 30 simulations (30*3 VMs); this needs to be tested and verified.
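    3. Tasks:
      • [DONE] Request LF FD.io infra team to allocate at least another /24 for VIRL simulations. CSIT-589.
        • [DONE] Request acknowledged by LF as part of [FD.io Helpdesk #40733]. https://lists.fd.io/pipermail/csit-dev/2017-May/001911.html.
      • [OPEN] Add the new mgmt IPv4 address range to virl hosts. CSIT-590.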
  4. [DONE] Script expecting VIRL sim nodes to be active within ca. 120sec after launch request - this is too tight. CSIT-593.
    1. Description: Intermittent test job failures due to 'ERROR: Simulation started OK but devices never changed to ACTIVE state'. A number of these can be avoided by increasing the script timeout to 240sec or so.
    2. Solution: Increase the script timeout to 240sec or so, but do not wait the full 4min every time before checking, as this would add to the overall execution time.
    3. Tasks:
      • [DONE] Increase test script timeout to 240sec. CSIT-593
  5. [OPEN] tb4-virl servers upgrade to ubuntu16.04, VIRL-core ver. 10.32.8, OpenStack Mitaka. CSIT-594
    1. Description: Upgrade VIRL to address issues with Centos7 test instabilities related to QEMU, and to improve general VIRL system robustness.
    2. Solution: Upgrade tb4-virl1 server to ubuntu16.04, VIRL-core ver. 10.32.8, OpenStack Mitaka. Verify stability. Follow gradually with tb4-virl2 and then tb4-virl3 upgrades.
    3. Tasks:
      • [OPEN] VIRL1 server 10.30.51.28 - currently in STAGING, resolving issues. testing ongoing. CSIT-595.
      • [OPEN] VIRL1 server 10.30.51.28 - move to PRODUCTION once determined stable. Monitor PRODUCTION performance. CSIT-596.
      • [OPEN] VIRL1 server 10.30.51.28 - complete upgrade process documentation and ansible scripts. CSIT-597.
      • [OPEN] VIRL2 server 10.30.51.29 - upgrade based on documentation and ansible scripts from VIRL1 upgrade process, verify stability. CSIT-598.
      • [OPEN] VIRL2 server 10.30.51.29 - once stable, move to PRODUCTION. CSIT-599.
      • [OPEN] VIRL3 server 10.30.51.30 - upgrade based on documentation and ansible scripts from VIRL1 upgrade process, verify stability. CSIT-600.
      • [OPEN] VIRL3 server 10.30.51.30 - once stable, move to PRODUCTION. CSIT-601.
      • [DONE] Check the current max CSIT simulation capacity per VIRL host. Focus on upgraded hosts. CSIT-591.
        • Description: Test it with a dedicated patch that executes a specified maximum number of simulations, keeping them short (only a few test cases).
  6. [DONE] Optimize VIRL job scheduling algorithm based on available VIRL host capacity. CSIT-607.
    1. Description: Currently VIRL jobs schedule simulations round robin. At busy times this results in reaching the capacity limit of VIRL hosts, and tests fail.
    2. Solution: Adjust the VIRL simulation scheduling algorithm to verify VIRL available capacity via API, and schedule simulations based on available capacity. If there is no capacity, then wait, similarly to how it's done for performance jobs using physical testbeds.
    3. Tasks: to be identified.
  7. [DONE] Need to periodically delete old files in /tmp directory. CSIT-578.
    1. Tasks:
      • [DONE] Cron job to delete old (more than 2 weeks?) files in /tmp directory on every VIRL server. CSIT-578.
        crontab -e
        0 0 * * * find /var/log/libvirt/qemu -type f -mtime +14 -name "instance*.log" -delete
        0 0 * * * find /tmp -type f -atime +14 -name "*.deb" -delete
        0 0 * * * find /tmp -type f -atime +14 -name "*.rpm" -delete
        0 0 * * * find /nfs/scratch/ -type d -mtime +14 -name "session-*"
  8. [WIP] VIRL Centos7 tests productization into vpp-csit-verify. CSIT-602.
    1. Description: Following upgrade of tb4-virl1, Centos7 tests should be ready for productization.
    2. Solution: Proposal to run Centos7 tests periodically (daily) instead of per patch, to avoid VIRL simulations overload.
    3. Tasks:
      • [WIP] Verify stability of csit-vpp-verify-Centos7 jobs. CSIT-603.
      • [IN-REVIEW] Create a daily vpp-csit-verify-Centos7 job. CSIT-604.
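
The address arithmetic in item 3 and the capacity-based scheduling idea in item 6 can be sketched in a few lines of Python. This is only an illustration, not the CSIT implementation: the function names, the free-slot numbers and the 192.0.2.0/24 documentation prefix are assumptions, and a real scheduler would obtain free capacity from the VIRL API. Only the counts of 4 addresses per simulation, 3 simulations per verify job and 84 addresses per server come from the list above.

```python
import ipaddress

ADDRS_PER_SIMULATION = 4  # mgmt, tg, sut1, sut2 (from item 3)
SIMS_PER_VERIFY_JOB = 3   # simulations each verify job runs in parallel (from item 3)

# Usable host addresses in a /24; the prefix itself is a placeholder (assumption).
USABLE_PER_24 = len(list(ipaddress.ip_network("192.0.2.0/24").hosts()))


def simulations_for(address_count):
    """Number of simulations that fit into a pool of mgmt addresses."""
    return address_count // ADDRS_PER_SIMULATION


def verify_jobs_for(address_count, jobs_in_parallel=1):
    """Number of concurrent verify runs when each run starts `jobs_in_parallel` jobs."""
    sims_per_run = SIMS_PER_VERIFY_JOB * jobs_in_parallel
    return simulations_for(address_count) // sims_per_run


def pick_host(free_simulations, needed=SIMS_PER_VERIFY_JOB):
    """Return the host with the most free simulation slots, or None to signal 'wait'.

    `free_simulations` maps host name -> free slots (hypothetical input).
    """
    host, free = max(free_simulations.items(),
                     key=lambda item: item[1],
                     default=(None, 0))
    return host if free >= needed else None


if __name__ == "__main__":
    print(simulations_for(84))                       # simulations per server today
    print(verify_jobs_for(84))                       # concurrent verify jobs today
    print(verify_jobs_for(84, jobs_in_parallel=2))   # with Ubuntu + Centos7 per patch
    print(simulations_for(USABLE_PER_24))            # with a dedicated /24 per server
    print(pick_host({"tb4-virl1": 1, "tb4-virl2": 5, "tb4-virl3": 3}))
```

With 84 addresses per server this gives 21 simulations, i.e. 7 verify jobs, or 3 when two jobs run per patch. By address count alone a dedicated /24 allows 63 simulations, in line with the roughly 60 quoted in item 3; the memory-based estimate of 30 simulations per host would then be the tighter limit.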

Other Priority Tasks

  1. [OPEN] CSIT-90: Nested-VM boot-up failed.
  2. [DELETED?] CSIT-210: Nested VM to include l3fwd startup script.
  3. [OPEN] CSIT-161: Update nested VM qemu library to use 3rd serial console.
  4. [OPEN] CSIT-356: Update VIRL testbed creation to allow specification of centos image.
  5. [OPEN] CSIT-605: Parameterize selection of VIRL nested VM image.
    1. Description: Currently VIRL is using only the latest nested Ubuntu or Centos VM image for all VM tests. Current inventory of VIRL nested Ubuntu VM images is tracked in https://git.fd.io/csit/tree/resources/tools/disk-image-builder/nested/CHANGELOG.
    2. Solution: Parameterize selection of VIRL nested VM image to allow tests to use specific VM image version - start with Ubuntu.
    3. Tasks: to be identified.
  6. [OPEN] CSIT-116: Modify VIRL and nested-VM username/password.
  7. [OPEN] CSIT-159: Nested VM: Replace cisco/cisco credentials with csit/csit.
  8. [OPEN] CSIT-160: Ubuntu VM: Replace cisco login with csit.
  9. [OPEN] CSIT-145: Out-of-band access to SUTs.
  10. [OPEN] CSIT-151: Do not destroy VM in case of test failure due to infrastructure issue.
  11. [OPEN] CSIT-150: Health-check to capture TG/SUT environment after failed test case.
  12. [OPEN] CSIT-202: Execute start/stop-testcase scripts from git repository.
  13. [OPEN] CSIT-115: Usage and status monitoring of VIRL hosts.
  14. [OPEN] CSIT-112: VIRL infrastructure periodic creation and distribution of images.
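
The parameterized image selection proposed in CSIT-605 (item 5) could look roughly like the sketch below. Everything here is hypothetical: the inventory contents, the function names and the image naming scheme are placeholders for illustration, and the real inventory is the CHANGELOG referenced in item 5.

```python
def version_key(version):
    """Turn '1.10' into (1, 10) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def select_nested_image(inventory, os_name, version=None):
    """Return an image name for `os_name`, pinned to `version` or defaulting to latest.

    `inventory` maps OS name -> list of available image versions (hypothetical data).
    """
    versions = inventory.get(os_name)
    if not versions:
        raise ValueError(f"no nested VM images known for {os_name!r}")
    if version is None:
        # Current behaviour described in item 5: always use the latest image.
        version = max(versions, key=version_key)
    elif version not in versions:
        raise ValueError(f"{os_name} nested image {version!r} is not in the inventory")
    return f"csit-nested-{os_name}-{version}"  # hypothetical naming scheme


# Hypothetical inventory, for illustration only.
INVENTORY = {"ubuntu": ["1.2", "1.9", "1.10"], "centos": ["1.0"]}

if __name__ == "__main__":
    print(select_nested_image(INVENTORY, "ubuntu"))         # latest Ubuntu image
    print(select_nested_image(INVENTORY, "ubuntu", "1.9"))  # pinned version
```

Defaulting to the latest version keeps existing tests unchanged, while a test that needs a specific image can pass the version explicitly.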