Nomad Operations and Planning
From fd.io
Latest revision as of 01:23, 23 June 2020
Nomad clusters are hosted on dedicated servers in the FD.io lab and are used to manage Docker-container-based executors for FD.io project CI jobs.
Physical Lab Infrastructure
- The FD.io CSIT git repository keeps an up-to-date FD.io lab specification.
- Server naming convention is specified here.
Nomad Operational Status
- Nomad Physical Topology
- Nomad Configuration
- Nomad Monitoring
- TBD - Add a description of, or link to, the Nomad architecture / configuration.
- TBD - Add links to Nomad monitoring status / data.
Nomad Operations Tasks
This is the current list of high-priority Nomad tasks.
Task Description | Owner | % Complete | ETA |
---|---|---|---|
Move Nomad operational Docker images from the snergster Docker Hub account into the fdiotools Docker Hub account. | Dave W. | 50% | June 1, 2020 |
Update the Ubuntu1804 & Centos7 Nomad Docker images to include the clang-9 toolchain packages required by VPP 'make install-deps' and the lf-infra-publish macro. | Dave W. | 25% | June 2, 2020 |
Create new Jenkins/Nomad labels for production, verify, & sandbox. | Dave W. | | June 1, 2020 |
Nomad server OS upgrades/normalization: use Ansible to create a uniform bare-metal OS environment across all Nomad servers. | Peter M. | 99.9% | May 29, 2020 |
Build & test Ubuntu 20.04 and CentOS 8 Docker images for CI executors to run the respective OS jobs. | Dave W. | | |
Fix server-type-c4-3 (10.32.8.16): replace its SSD with an HDD, reinstall Ubuntu 18.04, and restore it to the Nomad cluster. | Dave W. | Vexxhost Ticket Created | TBD |
Update VPP ci-management configurations to use the global JJB macros (lf-publisher & build-discarder). | Vanessa V. | | |
Export Gerrit & Jenkins logs and other operational data to Nomad servers. | Dave W. & Vanessa V. | LF Ticket Created | TBD |
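For the Docker Hub move in the first task, one possible approach is to pull each image from the snergster account, retag it under fdiotools, and push it. This is a minimal sketch, assuming the docker CLI is installed and already logged in to an account with push rights to fdiotools; the image name 'nomad-example' is a hypothetical placeholder, not one of the real operational images. It only prints the commands unless run() is called with dry_run=False.

```python
"""Copy Docker images between Docker Hub accounts (sketch).

Assumption: the image list below is a hypothetical placeholder, not the
real set of Nomad operational images.
"""
import shlex
import subprocess
from typing import List, Tuple

SOURCE_ACCOUNT = "snergster"
TARGET_ACCOUNT = "fdiotools"


def copy_commands(image: str, tag: str,
                  source: str = SOURCE_ACCOUNT,
                  target: str = TARGET_ACCOUNT) -> List[List[str]]:
    """Return the pull, tag and push commands that copy source/image:tag to target/image:tag."""
    src = f"{source}/{image}:{tag}"
    dst = f"{target}/{image}:{tag}"
    return [
        ["docker", "pull", src],
        ["docker", "tag", src, dst],
        ["docker", "push", dst],
    ]


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them, stopping at the first failure."""
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical placeholder; replace with the real image names and tags.
    images: List[Tuple[str, str]] = [("nomad-example", "latest")]
    for name, tag in images:
        run(copy_commands(name, tag))
```

Each tag has to be copied separately, and the Jenkins/Nomad job definitions that reference the old snergster image names would still need to be updated afterwards.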
Nomad Planning Wish List
This is the list of long-term Nomad tasks. Please move them to the Nomad Operations Tasks list and provide owner/ETA information when they are actively being worked on.
- Clean up old/unused Jenkins jobs.
- Use the Configuration as Code Jenkins plugin to manage Jenkins configuration (including the Nomad plugin) via YAML configuration files.
- Nomad cluster resiliency testing/hardening improvements.
- Nomad Docker image CI/CD pipeline.
- Convert the task list to JIRA tickets/epics for tracking ongoing Nomad work.
- Create ci-management jobs to do automated build/test/verify for the CI process and weekly upgrades of Docker images.
- Add Nomad nodes to LF DNS & make the DNS names the same as the hostnames.
- Add automated Nomad server quorum loss tests.
- Add a VPP "test crash" testcase to 'make test'.
- Add a VPP 'make test-debug w/ ASAN enabled' verify job.
- Investigate Jenkins Nomad-plugin security issues.
- Convert the Nomad/Jenkins/Gerrit monitoring/screen-scraping hacks into an operational monitoring system using exported Gerrit & Jenkins logs & Nomad CLI output.
- Add a mechanism to measure/track the memory consumed by CI jobs inside Docker images. pmikus_comment: It depends on whether we want live monitoring or stored logs (and how long to keep them). I can make Prometheus work for us with a very simple config change.
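The quorum loss tests in the list above need to know how many servers can be stopped before the cluster can no longer elect a leader. Nomad servers use Raft, which needs a majority of the N voting servers, floor(N/2) + 1, so the expected outcome of each test step can be computed up front. This is a sketch only: the function names are hypothetical, and the actual stopping of servers and leader checks would be done by the test harness, which is not shown.

```python
"""Raft quorum arithmetic for Nomad server quorum loss tests (sketch)."""
from typing import List, Tuple


def quorum_size(servers: int) -> int:
    """Number of voting servers that must be up for Raft to elect a leader: floor(N/2) + 1."""
    if servers < 1:
        raise ValueError("a cluster needs at least one server")
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """How many servers can fail while a quorum remains."""
    return servers - quorum_size(servers)


def has_quorum(servers: int, failed: int) -> bool:
    """True if the cluster still has a voting majority after `failed` servers stop."""
    if not 0 <= failed <= servers:
        raise ValueError("failed must be between 0 and the number of servers")
    return servers - failed >= quorum_size(servers)


def quorum_loss_matrix(servers: int) -> List[Tuple[int, bool]]:
    """Hypothetical test matrix: (servers to stop, whether a leader is expected)."""
    return [(failed, has_quorum(servers, failed)) for failed in range(servers + 1)]


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(f"{n} server(s): quorum {quorum_size(n)}, "
              f"tolerates {failure_tolerance(n)} failure(s)")
```

For a 3-server cluster, quorum_loss_matrix(3) returns [(0, True), (1, True), (2, False), (3, False)]: stopping one server should leave a leader in place, while stopping a second should cause quorum loss.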
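For the memory-tracking item, a low-effort complement to the Prometheus option is to read the memory usage that the Linux kernel already records for the job's container cgroup, for example at the end of a CI job. This sketch assumes it runs inside the container with the cgroup filesystem mounted at /sys/fs/cgroup (cgroup v2) or /sys/fs/cgroup/memory (cgroup v1). The file names are standard kernel cgroup interfaces, but memory.peak only exists on Linux 5.19 and later, so the code falls back to the v1 peak file and then to the current (non-peak) usage.

```python
"""Read a CI job container's memory usage from its cgroup (sketch)."""
from pathlib import Path
from typing import Optional, Tuple

# Kernel cgroup files, most useful first:
#   memory.peak               cgroup v2 peak usage (Linux 5.19 and later)
#   memory.max_usage_in_bytes cgroup v1 peak usage
#   memory.current            cgroup v2 current usage (a snapshot, not a peak)
CANDIDATE_FILES = ("memory.peak", "memory.max_usage_in_bytes", "memory.current")


def read_memory_bytes(cgroup_dir: str) -> Optional[Tuple[str, int]]:
    """Return (file name, bytes) for the first readable candidate file, else None."""
    base = Path(cgroup_dir)
    for name in CANDIDATE_FILES:
        try:
            text = (base / name).read_text().strip()
        except OSError:  # missing file, no permission, etc.
            continue
        if text.isdigit():
            return name, int(text)
    return None


if __name__ == "__main__":
    # Assumed mount points inside the container: cgroup v2, then cgroup v1.
    for directory in ("/sys/fs/cgroup", "/sys/fs/cgroup/memory"):
        found = read_memory_bytes(directory)
        if found is not None:
            name, value = found
            print(f"{directory}/{name}: {value / 2**20:.1f} MiB")
            break
    else:
        print("no cgroup memory usage file found")
```

A one-shot reading like this records only a per-job number; keeping a history of it would still need log storage or a time-series system.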
Completed Nomad Tasks
Task Description | Owner | % Complete | Finish Date |
---|---|---|---|
Move Nomad build executor Dockerfiles from https://github.com/snergfdio/* into the ci-management project. | Dave W. | 100% | April 29, 2020 |
Add a sudoer/admin account to all Nomad servers. | Dave W. | 100% | May 18, 2020 |
Move server-type-c4-2 from class 's5ci' to class 'builder' to cover for the t4-virl* Nomad clients during their upgrade. | Dave W. | 100% | May 18, 2020 |
Perform a fresh installation of Ubuntu 18.04 Server on t4-virl1, t4-virl2, & t4-virl3. | Peter M. | 100% | May 25, 2020 |
Restore the Nomad configuration on t4-virl1, t4-virl2, & t4-virl3 and rejoin them to the VPP cluster. | Peter M. | 100% | May 26, 2020 |
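The server-type-c4-2 move above relies on Nomad node classes: jobs presumably select executors through constraints on the node class, so a job pinned to class 'builder' can only be placed on nodes reporting that class. One way to check class coverage before and after such a move is to list client nodes through the Nomad HTTP API (GET /v1/nodes) and group the schedulable ones by their NodeClass field. This sketch assumes the API is reachable at NOMAD_ADDR (default http://127.0.0.1:4646) and, if ACLs are enabled, that a token is supplied in NOMAD_TOKEN; it is not how the FD.io cluster is actually monitored today.

```python
"""Group schedulable Nomad client nodes by node class (sketch)."""
import json
import os
import urllib.request
from collections import defaultdict
from typing import Dict, List, Optional


def fetch_nodes(addr: str, token: Optional[str] = None) -> List[dict]:
    """GET /v1/nodes from the Nomad HTTP API and return the decoded node list."""
    request = urllib.request.Request(f"{addr.rstrip('/')}/v1/nodes")
    if token:
        request.add_header("X-Nomad-Token", token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def schedulable_by_class(nodes: List[dict]) -> Dict[str, List[str]]:
    """Map NodeClass -> sorted names of nodes that are ready, eligible and not draining."""
    by_class: Dict[str, List[str]] = defaultdict(list)
    for node in nodes:
        if (node.get("Status") == "ready"
                and node.get("SchedulingEligibility") == "eligible"
                and not node.get("Drain", False)):
            by_class[node.get("NodeClass") or "<none>"].append(node["Name"])
    return {cls: sorted(names) for cls, names in by_class.items()}


if __name__ == "__main__":
    addr = os.environ.get("NOMAD_ADDR", "http://127.0.0.1:4646")
    try:
        nodes = fetch_nodes(addr, os.environ.get("NOMAD_TOKEN"))
    except OSError as exc:  # URLError/HTTPError, connection refused, timeouts
        print(f"could not query {addr}: {exc}")
    else:
        for node_class, names in sorted(schedulable_by_class(nodes).items()):
            print(f"{node_class}: {len(names)} node(s): {', '.join(names)}")
```

Run once before draining the t4-virl* clients and once after, the output would show whether the 'builder' class still has enough schedulable nodes.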