Difference between revisions of "CSIT/CSIT LF VIRL testbed"

From fd.io

Revision as of 10:35, 16 May 2017

VIRL infrastructure open tasks

This is the current working list of identified tasks for CSIT VIRL testbeds. It is updated periodically.

High Priority Tasks

  1. [WIP] Detecting and clearing stuck VIRL simulations
    1. Description: VIRL simulations continue to get stuck due to either LF network connectivity interruptions or a failing CSIT bootstrap teardown.
    2. Solution: Automate clearing of old (garbage) simulations and detect unsuccessful simulation teardowns.
    3. Tasks:
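      • [IN-REVIEW] Use built-in VIRL simulation expire timer set to 2hrs. Default simulation expiry coded to 120min, semi-weekly to 500min, weekly to 120min: https://gerrit.fd.io/r/#/c/6656/.
      • [OPEN] Extend the stop-simulation API call in CSIT bootstrap teardown to verify whether it returned SUCCESS or FAIL.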
  2. [WIP] Lack of VIRL server healthcheck in CSIT
    1. Description: No regular automated health-checks are executed against VIRL servers.
    2. Solution: Introduce a CSIT health-check monitoring job for the VIRL servers.
    3. Tasks:
      • [OPEN] Create a new job, executed periodically (every 6hrs?), to health-check all VIRL servers.
      • [OPEN] VIRL health-check APIs: health status, VIRL API tests, simulation tests.
      • [OPEN] VIRL check for old (garbage) simulations.
      • [IN-REVIEW] Add a pre-check to every start-testcase run to better handle exceptions and print errors. https://gerrit.fd.io/r/#/c/6656/.
  3. [OPEN] VIRL simulation IPv4 address depletion
    1. Description: Today one /24 subnet is allocated for all VIRL simulations, split equally across 3 servers, giving 84 /32 addresses per server. Each CSIT simulation takes 4 addresses (mgmt, tg, sut1, sut2), and each csit-vpp and vpp-csit verify job uses 3 simulations to parallelize tests for reduced execution time. This means each server has capacity to run up to 7 verify jobs concurrently (3*4*7 = 84). Once Centos7 tests are productized, with two jobs always executed in parallel, this drops to 3 concurrent jobs per server. This is effectively a show stopper for productizing Centos7 in the per-patch vpp-csit-verify jobs.
    2. Solution options: Increase the IPv4 address space given to VIRL hosts. Dedicating a /24 subnet per VIRL server would give address capacity for about 60 concurrent simulations. Based on previous memory calculations, each VIRL host is capable of running 30 simulations (30*3 VMs); this needs to be checked.
    3. Tasks:
      • [OPEN] Check the current max CSIT simulation capacity per VIRL host. Test it.
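      • [WIP] Request LF FD.io infra team to allocate at least another /24 for VIRL simulations. Request acknowledged by LF as part of [FD.io Helpdesk #40733]: https://lists.fd.io/pipermail/csit-dev/2017-May/001911.html.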
  4. [OPEN] Script expecting VIRL sim nodes to be active within ca. 120sec after launch request - this is too tight
    1. Description: Intermittent test job failures due to 'ERROR: Simulation started OK but devices never changed to ACTIVE state'. A number of these failures can be avoided by increasing the script timeout to 240sec or so.
    2. Solution: Increase the script timeout to 240sec or so, but do not wait the full 4min every time before proceeding, as this would add to the overall execution time.
    3. Tasks:
      • [OPEN] Increase test script timeout to 240sec(?).
  5. [WIP] tb4-virl1 server upgrade to Ubuntu 16.04, VIRL-core ver. 10.32.8, OpenStack Mitaka.
    1. Description: Upgrade VIRL to address Centos7 test instabilities related to QEMU and to improve general VIRL system robustness.
    2. Solution: Upgrade the tb4-virl1 server to Ubuntu 16.04, VIRL-core ver. 10.32.8, OpenStack Mitaka. Verify stability. Follow gradually with tb4-virl2 and then tb4-virl3 upgrades.
    3. Tasks:
      • [WIP] VIRL1 server 10.30.51.28 - currently in STAGING, resolving issues.
      • [OPEN] VIRL1 server 10.30.51.28 - move to PRODUCTION once determined stable. Monitor PRODUCTION performance.
      • [OPEN] VIRL2 server 10.30.51.29 - upgrade, verify stability, move to PRODUCTION.
      • [OPEN] VIRL3 server 10.30.51.30 - upgrade, verify stability, move to PRODUCTION.
  6. [OPEN] Need to periodically delete old files in /tmp directory
    1. Tasks:
      • Cron job to delete old (more than 2 weeks?) files in /tmp directory on every VIRL server
        crontab -e
        # fields: minute hour day-of-month month day-of-week command (runs daily at 00:00)
        0 0 * * * find /var/log/libvirt/qemu -type f -mtime +14 -name "instance*.log" -delete
        0 0 * * * find /tmp -type f -atime +14 -name "*.deb" -delete
        0 0 * * * find /tmp -type f -atime +14 -name "*.rpm" -delete
        # no delete action is given here, so this entry only lists matching directories
        0 0 * * * find /nfs/scratch/ -type d -mtime +14 -name "session-*"
  7. [OPEN] - Currently the latest nested VM image is used for all Ubuntu/Centos images
    1. Description: A solution is needed to link different nested VM images to different Ubuntu/Centos images.
  8. [WIP] Centos7 tests productization
    1. Description: Following the upgrade of tb4-virl1, Centos7 tests should be ready for productization.
    2. Solution: Proposal to run Centos7 tests periodically (daily) instead of per patch, to avoid overloading the VIRL servers with simulations.
    3. Tasks:
      • [OPEN] Create a daily vpp-csit-verify-Centos7 job.
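
The address arithmetic in item 3 (VIRL simulation IPv4 address depletion) can be checked with a short calculation. This is only a sketch and not part of CSIT: the function name and its parameters are hypothetical, and it assumes 254 usable host addresses in a /24 (network and broadcast addresses excluded), 4 addresses per simulation and 3 simulations per verify job, as stated in the description.

```python
def verify_job_capacity(usable_addresses, servers=3, addrs_per_sim=4,
                        sims_per_job=3, parallel_jobs=1):
    """Return (addresses, simulations, concurrent verify runs) per server.

    All names and defaults are illustrative assumptions taken from item 3.
    """
    per_server = usable_addresses // servers        # equal split across servers
    sims = per_server // addrs_per_sim              # mgmt, tg, sut1, sut2
    jobs = per_server // (addrs_per_sim * sims_per_job)
    return per_server, sims, jobs // parallel_jobs  # parallel_jobs=2 models Ubuntu + Centos7


# One shared /24 split across 3 servers (the situation today).
print(verify_job_capacity(254))                              # (84, 21, 7)
# Same split once two jobs always run in parallel.
print(verify_job_capacity(254, parallel_jobs=2))             # (84, 21, 3)
# A dedicated /24 per server.
print(verify_job_capacity(254, servers=1))                   # (254, 63, 21)
```

With a dedicated /24 the address limit (63 simulations) sits well above the 30 simulations per host estimated from memory, so memory rather than addressing would likely become the limiting factor, subject to the capacity test listed in the tasks.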

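Item 4 asks for a longer timeout without paying the full wait on every launch. One way to get both is to poll for the ACTIVE state and return as soon as it is reached. The sketch below is an assumption about how this could look, not the existing CSIT script: `wait_until` and the predicate passed to it are hypothetical names, and a real predicate would query the VIRL server for the node states.

```python
import time


def wait_until(predicate, timeout=240.0, interval=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns True or timeout seconds elapse.

    Returns True as soon as the predicate succeeds, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        # Never sleep past the deadline.
        sleep(min(interval, max(0.0, deadline - clock())))


# Stand-in for "are all simulation nodes ACTIVE?": succeeds on the third poll.
states = iter([False, False, True])
print(wait_until(lambda: next(states), timeout=1.0, interval=0.01))  # True
```

A simulation that becomes ACTIVE after 30 seconds costs about 30 seconds, while a slow one is still given the full 240 seconds before the job fails.
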
Other Tasks

  1. CSIT-116 [https://jira.fd.io/browse/CSIT-116]: Modify VIRL and nested-VM username/password
  2. CSIT-159 [https://jira.fd.io/browse/CSIT-159]: Nested VM: Replace cisco/cisco credentials with csit/csit
  3. CSIT-160 [https://jira.fd.io/browse/CSIT-160]: Ubuntu VM: Replace cisco login with csit
  4. CSIT-145 [https://jira.fd.io/browse/CSIT-145]: Out-of-band access to SUTs
  5. CSIT-151 [https://jira.fd.io/browse/CSIT-151]: Do not destroy VM in case of test failure due to infrastructure issue
  6. CSIT-150 [https://jira.fd.io/browse/CSIT-150]: Health-check to capture TG/SUT environment after failed test case
  7. CSIT-202 [https://jira.fd.io/browse/CSIT-202]: Execute start/stop-testcase scripts from git repository
  8. CSIT-115 [https://jira.fd.io/browse/CSIT-115]: Usage and status monitoring of VIRL hosts
  9. CSIT-112 [https://jira.fd.io/browse/CSIT-112]: VIRL infrastructure periodic creation and distribution of images
  10. CSIT-90 [https://jira.fd.io/browse/CSIT-90]: Nested-VM boot-up failed
  11. CSIT-210 [https://jira.fd.io/browse/CSIT-210]: Nested VM to include l3fwd startup script
  12. CSIT-161 [https://jira.fd.io/browse/CSIT-161]: Update nested VM qemu library to use 3rd serial console
  13. CSIT-356 [https://jira.fd.io/browse/CSIT-356]: Update VIRL testbed creation to allow specification of centos image