Difference between revisions of "CSIT/CSIT LF VIRL testbed"
From fd.io
Mackonstan (Talk | contribs) |
Line 5: | Line 5: | ||
===High Priority Tasks=== | ===High Priority Tasks=== | ||
− | # [WIP] Detecting and clearing stuck VIRL simulations | + | # [WIP] Detecting and clearing stuck VIRL simulations. [https://jira.fd.io/browse/CSIT-582 CSIT-582]. |
## Description: Continue getting stuck VIRL simulations due to either LF network connectivity interruptions or failing CSIT bootstrap teardown. | ## Description: Continue getting stuck VIRL simulations due to either LF network connectivity interruptions or failing CSIT bootstrap teardown. | ||
## Solution: automate clearing old (garbage) simulations, detect non-successful simulation teardowns. | ## Solution: automate clearing old (garbage) simulations, detect non-successful simulation teardowns. | ||
## Tasks: | ## Tasks: | ||
##* [IN-REVIEW] Use built-in VIRL simulation expire timer set to 2hrs. | ##* [IN-REVIEW] Use built-in VIRL simulation expire timer set to 2hrs. | ||
− | ##** coded default simulation expiry to 120min, semi-weekly to 500min, weekly to 120min. https://gerrit.fd.io/r/#/c/6656/. | + | ##** coded default simulation expiry to 120min, semi-weekly to 500min, weekly to 120min. [https://jira.fd.io/browse/CSIT-579 CSIT-579]. [https://gerrit.fd.io/r/#/c/6656/ gr6656]. |
− | ##* [OPEN] Extend CSIT bootstrap teardown stop-simulation API call to verify if it was SUCCESS/FAIL. | + | ##* [OPEN] Extend CSIT bootstrap teardown stop-simulation API call to verify if it was SUCCESS/FAIL. [https://jira.fd.io/browse/CSIT-583 CSIT-583]. |
− | # [WIP] | + | # [WIP] Add VIRL server healthchecks in CSIT. [https://jira.fd.io/browse/CSIT-584 CSIT-584]. |
## Description: no regular automated healthchecks executed against VIRL servers. | ## Description: no regular automated healthchecks executed against VIRL servers. | ||
## Solution: introduce a CSIT health-check monitoring job for VIRL servers' health. | ## Solution: introduce a CSIT health-check monitoring job for VIRL servers' health. | ||
## Tasks: | ## Tasks: | ||
− | ##* [OPEN] Create a new job, executed periodically (6hrs?) for healthchecking all VIRL servers. | + | ##* [OPEN] Create a new job, executed periodically (6hrs?) for healthchecking all VIRL servers. [https://jira.fd.io/browse/CSIT-585 CSIT-585]. |
− | ##* [OPEN] VIRL health-check APIs: health status, VIRL API tests, simulation tests. | + | ##* [OPEN] VIRL health-check APIs: health status, VIRL API tests, simulation tests. [https://jira.fd.io/browse/CSIT-586 CSIT-586]. |
− | ##* [OPEN] VIRL check | + | ##* [OPEN] VIRL capacity check, report number of simulations per virl server. [https://jira.fd.io/browse/CSIT-587 CSIT-587]. |
− | ##* [IN-REVIEW] pre-check to every start-testcase to better handle exceptions and printing errors. https://gerrit.fd.io/r/#/c/6656/. | + | ##* [IN-REVIEW] pre-check to every start-testcase to better handle exceptions and printing errors. [https://jira.fd.io/browse/CSIT-579 CSIT-579]. [https://gerrit.fd.io/r/#/c/6656/ gr6656]. |
− | # [OPEN] VIRL simulation IPv4 address depletion | + | # [OPEN] Address VIRL simulation mgmt IPv4 address depletion. [https://jira.fd.io/browse/CSIT-588 CSIT-588]. |
## Description: Today there is one /24 subnet allocated for all VIRL simulations, split equally across 3 servers, 84 /32 addresses per server. Each CSIT simulation takes 4 addresses (mgmt, tg, sut1, sut2), each csit-vpp and vpp-csit verify job uses 3 simulation to parallized tests for reduced execution time. This means each server has capacity to run up to 7 verify jobs concurrently (3*4*7). Once Centos7 tests productized, where two jobs are always executed in parallel, this will reduce it down to 3 concurrent jobs. Not good. It's basically a show stopper to productize Centos7 into vpp-csit-verify per patch jobs. | ## Description: Today there is one /24 subnet allocated for all VIRL simulations, split equally across 3 servers, 84 /32 addresses per server. Each CSIT simulation takes 4 addresses (mgmt, tg, sut1, sut2), each csit-vpp and vpp-csit verify job uses 3 simulation to parallized tests for reduced execution time. This means each server has capacity to run up to 7 verify jobs concurrently (3*4*7). Once Centos7 tests productized, where two jobs are always executed in parallel, this will reduce it down to 3 concurrent jobs. Not good. It's basically a show stopper to productize Centos7 into vpp-csit-verify per patch jobs. | ||
− | ## Solution | + | ## Solution: Need to increase IPv4 address space given to VIRL hosts. Dedicating /24 subnet per VIRL server, will give address capacity for 60 concurrent simulations. Based on previous memory calcs each VIRL host is capable of doing 30 simulations (30*3 VMs) - need to test verify this. |
## Tasks: | ## Tasks: | ||
− | + | ##* [WIP] Request LF FD.io infra team to allocate at least another /24 for VIRL simulations. [https://jira.fd.io/browse/CSIT-589 CSIT-589]. | |
− | ##* [WIP] Request LF FD.io infra team to allocate at least another /24 for VIRL simulations. | + | |
##** [WIP] Request acknowledged by LF as part of [FD.io Helpdesk #40733]. https://lists.fd.io/pipermail/csit-dev/2017-May/001911.html. | ##** [WIP] Request acknowledged by LF as part of [FD.io Helpdesk #40733]. https://lists.fd.io/pipermail/csit-dev/2017-May/001911.html. | ||
− | # [OPEN] Script expecting VIRL sim nodes to be active within ca. 120sec after launch request - this is too tight | + | ##* [OPEN] Add the new mgmt IPv4 address range to virl hosts. [https://jira.fd.io/browse/CSIT-590 CSIT-590]. |
+ | ##* [OPEN] Check the current max CSIT simulation capacity per VIRL host. Test it. [https://jira.fd.io/browse/CSIT-591 CSIT-591]. | ||
+ | # [WIP] Script expecting VIRL sim nodes to be active within ca. 120sec after launch request - this is too tight. [] | ||
## Description: Intermittent test job failures due to 'ERROR: Simulation started OK but devices never changed to ACTIVE state’. Number of these can be avoided by increasing the script timeout to 240sec or so. | ## Description: Intermittent test job failures due to 'ERROR: Simulation started OK but devices never changed to ACTIVE state’. Number of these can be avoided by increasing the script timeout to 240sec or so. | ||
## Solution: Increasing the script timeout to 240sec or so. But don’t wait 4min every time before trying, as this will add to the overall execution time. | ## Solution: Increasing the script timeout to 240sec or so. But don’t wait 4min every time before trying, as this will add to the overall execution time. | ||
## Tasks: | ## Tasks: | ||
− | ##* [ | + | ##* [WIP] Increase test script timeout to 240sec. [https://jira.fd.io/browse/CSIT-593 CSIT-593] |
− | # [WIP] tb4- | + | # [WIP] tb4-virl servers upgrade to ubuntu16.04, VIRL-core ver. 10.32.8, OpenStack Mitaka. [https://jira.fd.io/browse/CSIT-594 CSIT-594] |
## Description: virl upgrade to address issues with Centos7 test instabilities related to QEMU, and to improve general virl system robustness. | ## Description: virl upgrade to address issues with Centos7 test instabilities related to QEMU, and to improve general virl system robustness. | ||
− | ## Solution: upgrade tb4-virl1 server to ubuntu16.04, VIRL-core ver. 10.32.8, OpenStack Mitaka. verify stability. follow gradually with tb4- | + | ## Solution: upgrade tb4-virl1 server to ubuntu16.04, VIRL-core ver. 10.32.8, OpenStack Mitaka. verify stability. follow gradually with tb4-virl2 and then tb4-virl3 upgrades. |
## Tasks: | ## Tasks: | ||
− | ##* [WIP] VIRL1 server 10.30.51.28 - currently in STAGING, resolving issues. | + | ##* [WIP] VIRL1 server 10.30.51.28 - currently in STAGING, resolving issues. testing ongoing. [https://jira.fd.io/browse/CSIT-595 CSIT-595]. |
− | ##* [OPEN] VIRL1 server 10.30.51.28 - move to PRODUCTION once determined stable. Monitor PRODUCTION performance. | + | ##* [OPEN] VIRL1 server 10.30.51.28 - move to PRODUCTION once determined stable. Monitor PRODUCTION performance. [https://jira.fd.io/browse/CSIT-596 CSIT-596]. |
− | ##* [OPEN] VIRL2 server 10.30.51.29 - upgrade, verify stability, move to PRODUCTION. | + | ##* [OPEN] VIRL1 server 10.30.51.28 - complete upgrade process documentation and ansible scripts. [https://jira.fd.io/browse/CSIT-597 CSIT-597]. |
− | ##* [OPEN] | + | ##* [OPEN] VIRL2 server 10.30.51.29 - upgrade based on documentation and ansible scripts rom VIRL1 uprade process, verify stability. [https://jira.fd.io/browse/CSIT-598 CSIT-598]. |
− | # [ | + | ##* [OPEN] VIRL2 server 10.30.51.29 - once stable, move to PRODUCTION. [https://jira.fd.io/browse/CSIT-599]. |
+ | ##* [OPEN] VIRL3 server 10.30.51.30 - upgrade based on documentation and ansible scripts rom VIRL1 uprade process, verify stability. [https://jira.fd.io/browse/CSIT-600 CSIT-600]. | ||
+ | ##* [OPEN] VIRL3 server 10.30.51.30 - once stable, move to PRODUCTION. [https://jira.fd.io/browse/CSIT-601 CSIT-601]. | ||
+ | # [DONE] Need to periodically delete old files in /tmp directory. [https://jira.fd.io/browse/CSIT-578 CSIT-578]. | ||
## Tasks: | ## Tasks: | ||
− | ##* Cron job to delete old (more then 2 weeks?) files in /tmp directory on every VIRL server | + | ##* [DONE] Cron job to delete old (more then 2 weeks?) files in /tmp directory on every VIRL server. [https://jira.fd.io/browse/CSIT-578 CSIT-578]. |
##*: crontab -e | ##*: crontab -e | ||
##*: 0 0 * * * * find /var/log/libvirt/qemu -type f -mtime +14 -name "instance*.log" -delete | ##*: 0 0 * * * * find /var/log/libvirt/qemu -type f -mtime +14 -name "instance*.log" -delete | ||
Line 48: | Line 52: | ||
##*: 0 0 * * * * find /tmp -type f -atime +14 -name "*.rpm" -delete | ##*: 0 0 * * * * find /tmp -type f -atime +14 -name "*.rpm" -delete | ||
##*: 0 0 * * * * find /nfs/scratch/ -type d -mtime +14 -name "session-*" | ##*: 0 0 * * * * find /nfs/scratch/ -type d -mtime +14 -name "session-*" | ||
− | + | # [WIP] VIRL Centos7 tests productization into vpp-csit-verify. [https://jira.fd.io/browse/CSIT-602 CSIT-602]. | |
− | + | ||
− | # [WIP] Centos7 tests productization | + | |
## Description: Following upgrade of tb4-virl1, Centos7 tests should be ready for productization. | ## Description: Following upgrade of tb4-virl1, Centos7 tests should be ready for productization. | ||
## Solution: Proposal to run Centos7 tests periodically (daily) instead of per patch, to avoid VIRL simulations overload. | ## Solution: Proposal to run Centos7 tests periodically (daily) instead of per patch, to avoid VIRL simulations overload. | ||
## Tasks: | ## Tasks: | ||
− | ##* [OPEN] Create a daily vpp-csit-verify-Centos7 job. | + | ##* [OPEN] Verify stability of csit-vpp-verify-Centos7 jobs. [https://jira.fd.io/browse/CSIT-603 CSIT-603]. |
+ | ##* [OPEN] Create a daily vpp-csit-verify-Centos7 job. [https://jira.fd.io/browse/CSIT-604 CSIT-604]. | ||
===Other Tasks=== | ===Other Tasks=== | ||
Line 71: | Line 74: | ||
# CSIT-161 [https://jira.fd.io/browse/CSIT-161]: Update nested VM qemu library to use 3rd serial console | # CSIT-161 [https://jira.fd.io/browse/CSIT-161]: Update nested VM qemu library to use 3rd serial console | ||
# CSIT-356 [https://jira.fd.io/browse/CSIT-356]: Update VIRL testbed creation to allow specification of centos image | # CSIT-356 [https://jira.fd.io/browse/CSIT-356]: Update VIRL testbed creation to allow specification of centos image | ||
+ | # [OPEN] - Currently the latest nested VM image is used for all Ubuntu/Centos images | ||
+ | ## Description: need solution to be able to link different nested VM images to different ubuntu/centos images |
Revision as of 10:50, 17 May 2017
VIRL infrastructure open tasks
This is the current working list of identified tasks for CSIT VIRL testbeds. It is updated periodically.
High Priority Tasks
- [WIP] Detecting and clearing stuck VIRL simulations. CSIT-582.
- Description: VIRL simulations continue to get stuck due to either LF network connectivity interruptions or a failing CSIT bootstrap teardown.
- Solution: Automate clearing of old (garbage) simulations and detect unsuccessful simulation teardowns.
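- Tasks:
- [IN-REVIEW] Use built-in VIRL simulation expire timer set to 2hrs.
- Coded default simulation expiry to 120min, semi-weekly to 500min, weekly to 120min. CSIT-579. gr6656.
- [OPEN] Extend CSIT bootstrap teardown stop-simulation API call to verify if it was SUCCESS/FAIL. CSIT-583.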
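The garbage-collection idea for stuck simulations can be sketched as a simple age filter. This is only an illustration: the record layout (a dict with `id` and `launched`) and the function name `find_stale_simulations` are hypothetical, and the step that would actually list or stop simulations on a VIRL server is deliberately left out. The 120-minute default mirrors the 2hr expiry timer noted for this task.

```python
from datetime import datetime, timedelta


def find_stale_simulations(simulations, now, max_age_minutes=120):
    """Return ids of simulations launched more than max_age_minutes ago.

    `simulations` is a hypothetical list of dicts with an "id" and a
    "launched" datetime; a real job would build it from the VIRL server.
    """
    cutoff = now - timedelta(minutes=max_age_minutes)
    return [sim["id"] for sim in simulations if sim["launched"] < cutoff]


if __name__ == "__main__":
    now = datetime(2017, 5, 17, 12, 0)
    sims = [
        {"id": "session-a", "launched": datetime(2017, 5, 17, 11, 30)},
        {"id": "session-b", "launched": datetime(2017, 5, 17, 9, 0)},
    ]
    # session-a is 30 minutes old, session-b is 180 minutes old.
    print(find_stale_simulations(sims, now))  # ['session-b']
```

Keeping the selection separate from the stop call makes it possible to log the candidates first and only then act on them.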
- [WIP] Add VIRL server healthchecks in CSIT. CSIT-584.
- Description: No regular automated health checks are executed against VIRL servers.
- Solution: introduce a CSIT health-check monitoring job for VIRL servers' health.
- Tasks:
- [OPEN] Create a new job, executed periodically (6hrs?) for healthchecking all VIRL servers. CSIT-585.
- [OPEN] VIRL health-check APIs: health status, VIRL API tests, simulation tests. CSIT-586.
- [OPEN] VIRL capacity check, report number of simulations per virl server. CSIT-587.
- [IN-REVIEW] Add a pre-check to every start-testcase to better handle exceptions and print errors. CSIT-579. gr6656.
- [OPEN] Address VIRL simulation mgmt IPv4 address depletion. CSIT-588.
- Description: Today there is one /24 subnet allocated for all VIRL simulations, split equally across 3 servers, giving 84 /32 addresses per server. Each CSIT simulation takes 4 addresses (mgmt, tg, sut1, sut2), and each csit-vpp and vpp-csit verify job uses 3 simulations to parallelize tests for reduced execution time. This means each server has capacity to run up to 7 verify jobs concurrently (3*4*7 = 84). Once Centos7 tests are productized, where two jobs are always executed in parallel, this will drop to 3 concurrent jobs. This is a show stopper for productizing Centos7 into vpp-csit-verify per patch jobs.
- Solution: Increase the IPv4 address space given to VIRL hosts. Dedicating a /24 subnet per VIRL server will give address capacity for 60 concurrent simulations. Based on previous memory calculations, each VIRL host is capable of running 30 simulations (30*3 VMs); this needs to be verified by testing.
- Tasks:
- [WIP] Request LF FD.io infra team to allocate at least another /24 for VIRL simulations. CSIT-589.
- [WIP] Request acknowledged by LF as part of [FD.io Helpdesk #40733]. https://lists.fd.io/pipermail/csit-dev/2017-May/001911.html.
- [OPEN] Add the new mgmt IPv4 address range to virl hosts. CSIT-590.
- [OPEN] Check the current max CSIT simulation capacity per VIRL host. Test it. CSIT-591.
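The address arithmetic behind this task can be checked with a short sketch. The constants come from the description (4 addresses per simulation, 3 simulations per verify job, 84 addresses per server); the function names are made up for illustration. The /24 figure computed here is 63 simulations from 254 usable host addresses, which is in line with the roughly 60 quoted in the text; whether the difference reflects reserved addresses is not stated in the source.

```python
ADDRESSES_PER_SIMULATION = 4   # mgmt, tg, sut1, sut2
SIMULATIONS_PER_JOB = 3        # each verify job runs 3 simulations in parallel


def concurrent_jobs(addresses, parallel_job_groups=1):
    """Verify jobs that fit into the given number of addresses.

    parallel_job_groups=2 models the Centos7 case, where two jobs always
    run together and therefore consume address space as a pair.
    """
    per_job = ADDRESSES_PER_SIMULATION * SIMULATIONS_PER_JOB
    return addresses // per_job // parallel_job_groups


def usable_hosts(prefix_length):
    """Usable IPv4 host addresses in a subnet (network and broadcast excluded)."""
    return 2 ** (32 - prefix_length) - 2


if __name__ == "__main__":
    print(concurrent_jobs(84))                                # 7
    print(concurrent_jobs(84, parallel_job_groups=2))         # 3
    print(usable_hosts(24) // ADDRESSES_PER_SIMULATION)       # 63
```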
- [WIP] Script expecting VIRL sim nodes to be active within ca. 120sec after launch request - this is too tight. []
- Description: Intermittent test job failures due to 'ERROR: Simulation started OK but devices never changed to ACTIVE state'. A number of these can be avoided by increasing the script timeout to 240sec or so.
- Solution: Increase the script timeout to 240sec or so, but do not wait the full 4min every time before checking, as this would add to the overall execution time.
- Tasks:
- [WIP] Increase test script timeout to 240sec. CSIT-593
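One way to raise the timeout without always paying the full 4 minutes is to poll at a short interval and return as soon as the nodes report active. The sketch below is a generic polling helper, not the actual CSIT script: `wait_until_active`, the `is_active` callback and the 5-second interval are assumptions made for illustration, and the check that queries VIRL for node state is not shown.

```python
import time


def wait_until_active(is_active, timeout=240.0, interval=5.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll is_active() until it returns True or `timeout` seconds elapse.

    Returns True as soon as the check succeeds, False on timeout.
    `clock` and `sleep` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        if is_active():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Hypothetical check that succeeds on the third poll.
    attempts = []

    def fake_check():
        attempts.append(1)
        return len(attempts) >= 3

    print(wait_until_active(fake_check, timeout=1.0, interval=0.01))  # True
```

With this shape, a simulation that becomes active after 30 seconds costs about 30 seconds, while a slow one still gets the full 240 seconds before the job fails.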
- [WIP] tb4-virl servers upgrade to ubuntu16.04, VIRL-core ver. 10.32.8, OpenStack Mitaka. CSIT-594
- Description: VIRL upgrade to address issues with Centos7 test instabilities related to QEMU, and to improve general VIRL system robustness.
- Solution: Upgrade tb4-virl1 server to ubuntu16.04, VIRL-core ver. 10.32.8, OpenStack Mitaka. Verify stability. Follow gradually with tb4-virl2 and then tb4-virl3 upgrades.
- Tasks:
- [WIP] VIRL1 server 10.30.51.28 - currently in STAGING, resolving issues. testing ongoing. CSIT-595.
- [OPEN] VIRL1 server 10.30.51.28 - move to PRODUCTION once determined stable. Monitor PRODUCTION performance. CSIT-596.
- [OPEN] VIRL1 server 10.30.51.28 - complete upgrade process documentation and ansible scripts. CSIT-597.
- [OPEN] VIRL2 server 10.30.51.29 - upgrade based on documentation and ansible scripts from VIRL1 upgrade process, verify stability. CSIT-598.
- [OPEN] VIRL2 server 10.30.51.29 - once stable, move to PRODUCTION. CSIT-599.
- [OPEN] VIRL3 server 10.30.51.30 - upgrade based on documentation and ansible scripts from VIRL1 upgrade process, verify stability. CSIT-600.
- [OPEN] VIRL3 server 10.30.51.30 - once stable, move to PRODUCTION. CSIT-601.
- [DONE] Need to periodically delete old files in /tmp directory. CSIT-578.
- Tasks:
- [DONE] Cron job to delete old (more than 2 weeks?) files in /tmp directory on every VIRL server. CSIT-578.
- crontab -e
- 0 0 * * * find /var/log/libvirt/qemu -type f -mtime +14 -name "instance*.log" -delete
- 0 0 * * * find /tmp -type f -atime +14 -name "*.deb" -delete
- 0 0 * * * find /tmp -type f -atime +14 -name "*.rpm" -delete
- 0 0 * * * find /nfs/scratch/ -type d -mtime +14 -name "session-*"
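For illustration, the effect of the `find ... -delete` cron entries can be approximated in Python. This is a sketch only, not what runs on the VIRL servers: `delete_old_files` is a hypothetical helper, it compares modification time only (the /tmp entries above use access time), and its age cutoff is a plain 14-day threshold rather than the exact day rounding `find` applies. It defaults to a dry run so nothing is removed by accident.

```python
import fnmatch
import os
import time


def delete_old_files(root, pattern, max_age_days=14, now=None, dry_run=True):
    """Find regular files under `root` matching `pattern` that were last
    modified more than `max_age_days` ago, and delete them unless dry_run.

    Returns the list of matching paths.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    matched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not fnmatch.fnmatch(name, pattern):
                continue
            path = os.path.join(dirpath, name)
            # Skip symlinks so only regular files are touched.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if os.path.getmtime(path) < cutoff:
                matched.append(path)
                if not dry_run:
                    os.remove(path)
    return matched


if __name__ == "__main__":
    # Dry run: list candidate .deb files in /tmp without deleting anything.
    for path in delete_old_files("/tmp", "*.deb"):
        print(path)
```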
- [WIP] VIRL Centos7 tests productization into vpp-csit-verify. CSIT-602.
- Description: Following upgrade of tb4-virl1, Centos7 tests should be ready for productization.
- Solution: Proposal to run Centos7 tests periodically (daily) instead of per patch, to avoid VIRL simulations overload.
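- Tasks:
- [OPEN] Verify stability of csit-vpp-verify-Centos7 jobs. CSIT-603.
- [OPEN] Create a daily vpp-csit-verify-Centos7 job. CSIT-604.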
Other Tasks
- CSIT-116 [2]: Modify VIRL and nested-VM username/password
- CSIT-159 [3]: Nested VM: Replace cisco/cisco credentials with csit/csit
- CSIT-160 [4]: Ubuntu VM: Replace cisco login with csit
- CSIT-145 [5]: Out-of-band access to SUTs
- CSIT-151 [6]: Do not destroy VM in case of test failure due to infrastructure issue
- CSIT-150 [7]: Health-check to capture TG/SUT environment after failed test case
- CSIT-202 [8]: Execute start/stop-testcase scripts from git repository
- CSIT-115 [9]: Usage and status monitoring of VIRL hosts
- CSIT-112 [10]: VIRL infrastructure periodic creation and distribution of images
- CSIT-90 [11]: Nested-VM boot-up failed
- CSIT-210 [12]: Nested VM to include l3fwd startup script
- CSIT-161 [13]: Update nested VM qemu library to use 3rd serial console
- CSIT-356 [14]: Update VIRL testbed creation to allow specification of centos image
- [OPEN] Currently the latest nested VM image is used for all Ubuntu/Centos images.
- Description: A solution is needed to link different nested VM images to different Ubuntu/Centos images.