Difference between revisions of "Nomad Operations and Planning"

From fd.io
(Created page with "= Nomad Operations and Planning = Nomad clusters are hosted on dedicated servers in the FD.io lab and used to manage Docker container based executors for FD.io project CI job...")
 
(Nomad Operations Tasks)
 
(21 intermediate revisions by 3 users not shown)
Line 1: Line 1:
= Nomad Operations and Planning =
 
 
 
Nomad clusters are hosted on dedicated servers in the FD.io lab and used to manage Docker container based executors for FD.io project CI jobs.
 +
 +
= Physical Lab Infrastructure =
 +
* FD.io CSIT git repository keeps an up-to-date [https://github.com/FDio/csit/blob/master/docs/lab/testbed_specifications.md FD.io lab specification].
 +
* Server naming convention is specified [https://github.com/FDio/csit/blob/master/docs/lab/testbed_specifications.md#naming-convention here].
  
 
= Nomad Operational Status =
* TBD - add description or link to Nomad architecture / configuration
+
* [[Nomad Physical Topology]]
 +
* [[Nomad Configuration]]
 +
* [[Nomad Monitoring]]
 +
TBD - add description or link to Nomad architecture / configuration
 
* TBD - Add links to Nomad monitoring status / data
  
Line 16: Line 21:
 
! ETA
|-
| Add a sudoer/admin account to all Nomad Servers.
+
| Move Nomad Operational Docker images from [https://hub.docker.com/u/snergster snergster docker hub account] into [https://hub.docker.com/u/fdiotools fdiotools docker hub account].
 
| Dave W.
| 90%
+
| 50%
| May 18, 2020
+
| June 1, 2020
 
|-
| Move Nomad Docker images from https://hub.docker.com/search?q=snergster&type=image into fdiotools dockerhub account.
+
| Update Ubuntu1804 & Centos7 Nomad Docker images to include clang-9 toolchain packages required by VPP 'make install-deps' and lf-infra-publish macro.
 
| Dave W.
| 10%
+
| 25%
| May 18, 2020
+
| June 2, 2020
 
|-
| Update Ubuntu1804 & Centos7 Nomad Docker images to include clang-9 toolchain packages required by VPP 'make install-deps'.
+
| New Jenkins/Nomad labels for production, verify, & sandbox
 
| Dave W.
| 10%
+
|  
| May 20, 2020
+
| June 1, 2020
 
|-
| Perform fresh installation of Ubuntu 18.04 Server on t4-virl1, t4-virl2, & t4-virl3
+
| Nomad server OS upgrades/normalization. Utilize ansible to create a uniform bare-metal OS environment across all Nomad servers.
 
| Peter M.
|
+
| 99.9%
|
+
| May 29, 2020
 
|-
| Restore Nomad configuration on t4-virl1, t4-virl2, & t4-virl3 and rejoin the VPP cluster.
+
| Build & test ubuntu 20.04 and centos8 Docker images for CI executors to run respective OS jobs.
 
| Dave W.
|
Line 42: Line 47:
 
|-
| Fix server-type-c4-3 (10.32.8.16) SSD with an HDD, reinstall Ubuntu 18.04 and restore to Nomad cluster.
|  
+
| Dave W.
|  
+
| [https://secure.vexxhost.com/billing/viewticket.php?tid=AGA-517631&c=3RxtvTXs Vexxhost Ticket Created]
|  
+
| TBD
 
|-
| Update VPP ci-management configurations to use global jjb macros (publisher &  
+
| Update VPP ci-management configurations to use global jjb macros (lf-publisher & build-discarder)
 
| Vanessa V.
|
|
 +
|-
 +
| Export Gerrit & Jenkins logs and other operational data to Nomad servers
 +
| Dave W. & Vanessa V.
 +
| [https://jira.linuxfoundation.org/servicedesk/customer/portal/2/IT-19811 LF Ticket Created]
 +
| TBD
 
|}

= Nomad Planning Wish List =
This is the long-term list of Nomad tasks.  Please move them to the Nomad Operations Tasks and provide owner/ETA information when they are being actively worked on.
+
This is the list of long-term Nomad tasks.  Please move them to the Nomad Operations Tasks and provide owner/ETA information when they are being actively worked on.
 +
* Clean up old/unused Jenkins jobs.
 +
* Use Configuration as Code Jenkins Plugin to manage Jenkins configuration (including Nomad Plugin) via YAML configuration files.
 +
* Nomad cluster resiliency testing/hardening improvements.
 +
* Nomad docker image CI/CD pipeline.
 +
* Convert task list to use JIRA tickets/epics for tracking ongoing Nomad work.
 +
* Create ci-management jobs to do automated build/test/verify for CI process and weekly upgrade of docker images.
 +
* Add Nomad nodes to LF DNS & make the names the same as the hostname.
 +
* Add automated Nomad server quorum loss tests
 +
* Add VPP "test crash" testcase to 'make test'
 
* Add VPP 'make test-debug w/ ASAN enabled' verify job
* Nomad server OS upgrades/normalization. Utilize ansible to create a uniform bare-metal OS environment across all Nomad servers.
 
* Convert Jenkins Nomad-plugin configuration spreadsheet to JJB managed YAML configuration files.
 
 
* [https://plugins.jenkins.io/nomad/ Investigate Jenkins Nomad-plugin security issues.]
* Export Gerrit & Jenkins logs and other operational data to Nomad servers
 
 
* Convert Nomad/Jenkins/Gerrit monitoring/screen-scraping hacks into an operational monitoring system using exported gerrit & jenkins logs & nomad cli output.
* Add a mechanism to measure/track the memory consumed by the CI jobs inside Docker images
+
* Add a mechanism to measure/track the memory consumed by the CI jobs inside Docker images. pmikus_comment: This depends on whether we want live monitoring or stored logs (and for how long). I can make Prometheus work for us with a very simple config change.
  
 
= Completed Nomad Tasks =
Line 73: Line 89:
 
| 100%
| April 29, 2020
 +
|-
 +
| Add a sudoer/admin account to all Nomad Servers.
 +
| Dave W.
 +
| 100%
 +
| May 18, 2020
 +
|-
 +
| Move server-type-c4-2 from Class 's5ci' to Class 'builder' to cover t4-virl* nomad clients during upgrade.
 +
| Dave W.
 +
| 100%
 +
| May 18, 2020
 +
|-
 +
| Perform fresh installation of Ubuntu 18.04 Server on t4-virl1, t4-virl2, & t4-virl3
 +
| Peter M.
 +
| 100%
 +
| May 25, 2020
 +
|-
 +
| Restore Nomad configuration on t4-virl1, t4-virl2, & t4-virl3 and rejoin the VPP cluster.
 +
| Peter M.
 +
| 100%
 +
| May 26, 2020
 
|}

Latest revision as of 01:23, 23 June 2020

Nomad clusters are hosted on dedicated servers in the FD.io lab and used to manage Docker container based executors for FD.io project CI jobs.

Physical Lab Infrastructure

  • FD.io CSIT git repository keeps an up-to-date FD.io lab specification: https://github.com/FDio/csit/blob/master/docs/lab/testbed_specifications.md
  • Server naming convention is specified here: https://github.com/FDio/csit/blob/master/docs/lab/testbed_specifications.md#naming-convention

Nomad Operational Status

  • Nomad Physical Topology
  • Nomad Configuration
  • Nomad Monitoring

TBD - add description or link to Nomad architecture / configuration

  • TBD - Add links to Nomad monitoring status / data

Nomad Operations Tasks

This is the current list of high priority Nomad tasks.

{|
! Task Description
! Owner
! % Complete
! ETA
|-
| Move Nomad Operational Docker images from [https://hub.docker.com/u/snergster snergster docker hub account] into [https://hub.docker.com/u/fdiotools fdiotools docker hub account].
| Dave W.
| 50%
| June 1, 2020
|-
| Update Ubuntu1804 & Centos7 Nomad Docker images to include clang-9 toolchain packages required by VPP 'make install-deps' and lf-infra-publish macro.
| Dave W.
| 25%
| June 2, 2020
|-
| New Jenkins/Nomad labels for production, verify, & sandbox
| Dave W.
|
| June 1, 2020
|-
| Nomad server OS upgrades/normalization. Utilize ansible to create a uniform bare-metal OS environment across all Nomad servers.
| Peter M.
| 99.9%
| May 29, 2020
|-
| Build & test ubuntu 20.04 and centos8 Docker images for CI executors to run respective OS jobs.
| Dave W.
|
|
|-
| Fix server-type-c4-3 (10.32.8.16) SSD with an HDD, reinstall Ubuntu 18.04 and restore to Nomad cluster.
| Dave W.
| [https://secure.vexxhost.com/billing/viewticket.php?tid=AGA-517631&c=3RxtvTXs Vexxhost Ticket Created]
| TBD
|-
| Update VPP ci-management configurations to use global jjb macros (lf-publisher & build-discarder)
| Vanessa V.
|
|
|-
| Export Gerrit & Jenkins logs and other operational data to Nomad servers
| Dave W. & Vanessa V.
| [https://jira.linuxfoundation.org/servicedesk/customer/portal/2/IT-19811 LF Ticket Created]
| TBD
|}

Nomad Planning Wish List

This is the list of long-term Nomad tasks. Please move them to the Nomad Operations Tasks and provide owner/ETA information when they are being actively worked on.

  • Clean up old/unused Jenkins jobs.
  • Use Configuration as Code Jenkins Plugin to manage Jenkins configuration (including Nomad Plugin) via YAML configuration files.
  • Nomad cluster resiliency testing/hardening improvements.
  • Nomad docker image CI/CD pipeline.
  • Convert task list to use JIRA tickets/epics for tracking ongoing Nomad work.
  • Create ci-management jobs to do automated build/test/verify for CI process and weekly upgrade of docker images.
  • Add Nomad nodes to LF DNS & make the names the same as the hostname.
  • Add automated Nomad server quorum loss tests
  • Add VPP "test crash" testcase to 'make test'
  • Add VPP 'make test-debug w/ ASAN enabled' verify job
  • Investigate Jenkins Nomad-plugin security issues.
  • Convert Nomad/Jenkins/Gerrit monitoring/screen-scraping hacks into an operational monitoring system using exported gerrit & jenkins logs & nomad cli output.
  • Add a mechanism to measure/track the memory consumed by the CI jobs inside Docker images. pmikus_comment: This depends on whether we want live monitoring or stored logs (and for how long). I can make Prometheus work for us with a very simple config change.
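
One possible starting point for the memory-tracking item in the list above, offered as a minimal sketch only. It assumes the CI executor containers expose cgroup v1 memory counters under /sys/fs/cgroup/memory (the common layout on Ubuntu 18.04 hosts, but not confirmed for these images). The function names and the log format are hypothetical. The sketch reads the current and peak usage counters so that a job could log them at the end of a build; it does not cover the Prometheus option raised in the comment.

```python
from pathlib import Path

# Assumption: cgroup v1 memory controller mounted at the usual location
# inside the container. Adjust if the executors use a different layout.
CGROUP_V1_MEMORY_DIR = Path("/sys/fs/cgroup/memory")
USAGE_FILE = "memory.usage_in_bytes"      # current usage counter (cgroup v1)
PEAK_FILE = "memory.max_usage_in_bytes"   # high-water mark counter (cgroup v1)


def read_bytes_counter(path):
    """Return the integer byte count stored in a cgroup counter file."""
    return int(Path(path).read_text().strip())


def memory_snapshot(cgroup_dir=CGROUP_V1_MEMORY_DIR):
    """Return current and peak memory use, in bytes, for one cgroup directory."""
    cgroup_dir = Path(cgroup_dir)
    return {
        "usage_bytes": read_bytes_counter(cgroup_dir / USAGE_FILE),
        "peak_bytes": read_bytes_counter(cgroup_dir / PEAK_FILE),
    }


def format_snapshot(job_name, snapshot):
    """Render a snapshot as one log line (the format is illustrative only)."""
    mib = 1024 * 1024
    return "{} memory: current={:.1f} MiB peak={:.1f} MiB".format(
        job_name,
        snapshot["usage_bytes"] / mib,
        snapshot["peak_bytes"] / mib,
    )
```

A job could call print(format_snapshot("example-job", memory_snapshot())) as its last build step, where "example-job" is a placeholder name. Logging a single peak value per job keeps the change small; live monitoring or longer retention would need a separate collector.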

Completed Nomad Tasks

{|
! Task Description
! Owner
! % Complete
! Finish Date
|-
| Move Nomad build executor Dockerfiles from https://github.com/snergfdio/* into the ci-management project.
| Dave W.
| 100%
| April 29, 2020
|-
| Add a sudoer/admin account to all Nomad Servers.
| Dave W.
| 100%
| May 18, 2020
|-
| Move server-type-c4-2 from Class 's5ci' to Class 'builder' to cover t4-virl* nomad clients during upgrade.
| Dave W.
| 100%
| May 18, 2020
|-
| Perform fresh installation of Ubuntu 18.04 Server on t4-virl1, t4-virl2, & t4-virl3
| Peter M.
| 100%
| May 25, 2020
|-
| Restore Nomad configuration on t4-virl1, t4-virl2, & t4-virl3 and rejoin the VPP cluster.
| Peter M.
| 100%
| May 26, 2020
|}