Nomad Operations and Planning
Revision as of 20:54, 29 May 2020 by Dwallacelf (Talk | contribs)
Nomad clusters are hosted on dedicated servers in the FD.io lab and used to manage Docker container based executors for FD.io project CI jobs.
Physical Lab Infrastructure
- FD.io CSIT git repository keeps an up-to-date FD.io lab specification.
- Server naming convention is specified here.
Nomad Operational Status
- TBD - Add description or link to Nomad architecture / configuration
- TBD - Add links to Nomad monitoring status / data
Nomad Operations Tasks
This is the current list of high-priority Nomad tasks.
Task Description | Owner | % Complete | ETA |
---|---|---|---|
Move Nomad Docker images from https://hub.docker.com/search?q=snergster&type=image into fdiotools dockerhub account. | Dave W. | 50% | June 1, 2020 |
Update Ubuntu1804 & Centos7 Nomad Docker images to include clang-9 toolchain packages required by VPP 'make install-deps' and lf-infra-publish macro. | Dave W. | 25% | June 2, 2020 |
New Jenkins/Nomad labels for production, verify, & sandbox | Dave W. | | June 1, 2020 |
Nomad server OS upgrades/normalization. Utilize ansible to create a uniform bare-metal OS environment across all Nomad servers. | Peter M. | 99.9% | May 29, 2020 |
Build & test Ubuntu 20.04 and CentOS 8 Docker images for CI executors to run respective OS jobs. | Dave W. | | |
Replace the SSD in server-type-c4-3 (10.32.8.16) with an HDD, reinstall Ubuntu 18.04, and restore the server to the Nomad cluster. | Dave W. | Vexxhost Ticket Created | TBD |
Update VPP ci-management configurations to use global jjb macros (lf-publisher & build-discarder) | Vanessa V. | | |
Export Gerrit & Jenkins logs and other operational data to Nomad servers | Dave W. & Vanessa V. | LF Ticket Created | TBD |
Nomad Planning Wish List
This is the list of long-term Nomad tasks. When a task is being actively worked on, please move it to the Nomad Operations Tasks and provide owner/ETA information.
- Convert task list to use JIRA tickets/epics for tracking ongoing Nomad work.
- Create ci-management jobs to do automated build/test/verify for CI process and weekly upgrade of docker images.
- Add Nomad nodes to LF DNS & make the names the same as the hostname.
- Add automated Nomad server quorum loss tests
- Add VPP "test crash" testcase to 'make test'
- Add VPP 'make test-debug w/ ASAN enabled' verify job
- Convert Jenkins Nomad-plugin configuration spreadsheet to JJB managed YAML configuration files.
- Investigate Jenkins Nomad-plugin security issues.
- Convert Nomad/Jenkins/Gerrit monitoring/screen-scraping hacks into an operational monitoring system using exported Gerrit & Jenkins logs & Nomad CLI output.
- Add a mechanism to measure/track the memory consumed by the CI jobs inside Docker images. pmikus_comment: This depends on whether we want live monitoring or stored logs (and for how long). I can make Prometheus work for us with a very simple change in config.
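As a starting point for the automated quorum loss tests listed above, the following is a minimal sketch of the arithmetic such a test could assert against. It assumes only that Nomad servers use Raft consensus, where a quorum is a strict majority of the server set. The function names are hypothetical, and the sketch does not assume any particular server count for the FD.io clusters.

```python
def quorum_size(servers: int) -> int:
    """Smallest number of servers that forms a Raft majority."""
    if servers < 1:
        raise ValueError("a cluster needs at least one server")
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """Number of servers that can fail while a majority remains."""
    return servers - quorum_size(servers)


def has_quorum(servers: int, alive: int) -> bool:
    """True if the servers still alive form a majority of the full set."""
    if not 0 <= alive <= servers:
        raise ValueError("alive must be between 0 and servers")
    return alive >= quorum_size(servers)
```

A test harness could stop `failure_tolerance(n) + 1` servers and then check that the cluster reports no leader, which is the quorum loss condition the test is meant to exercise.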
Completed Nomad Tasks
Task Description | Owner | % Complete | Finish Date |
---|---|---|---|
Move Nomad build executor Dockerfiles from https://github.com/snergfdio/* into the ci-management project. | Dave W. | 100% | April 29, 2020 |
Add a sudoer/admin account to all Nomad Servers. | Dave W. | 100% | May 18, 2020 |
Move server-type-c4-2 from Class 's5ci' to Class 'builder' to cover t4-virl* nomad clients during upgrade. | Dave W. | 100% | May 18, 2020 |
Perform fresh installation of Ubuntu 18.04 Server on t4-virl1, t4-virl2, & t4-virl3 | Peter M. | 100% | May 25, 2020 |
Restore Nomad configuration on t4-virl1, t4-virl2, & t4-virl3 and rejoin the VPP cluster. | Peter M. | 100% | May 26, 2020 |