Initialized “The Keepers of the Swarm” // Prometheus and Grafana // Node Monitoring

$$ High Performance Cluster Monitoring System Implementation Log - 7.14.2024 $$

Author: MrDew

Date: July 14, 2024

Objective: Implement a comprehensive monitoring system for our high-performance cluster using Prometheus and Grafana.

Environment:

Cluster: 29 nodes, Docker Swarm mode enabled
- 1 Manager Node ([email protected])
- 28 Worker Nodes (various IPs)
- Operating Systems: Mixture of Alpine Linux and Ubuntu
Monitoring Tools:
- Prometheus (deployed as a Docker Swarm service)
- Grafana (to be deployed as a Docker Swarm service)
Metrics Exporter: Node Exporter (deployed as a Docker Swarm service on all nodes)

Initial State:

Phase 1: Enable Docker Metrics on all Nodes

Challenge: Docker was not exposing metrics on any nodes besides the manager node where Prometheus was running.

Solution:

Identify Operating System: Determine the operating system running on each node to use the correct package manager (apk for Alpine Linux, apt-get for Ubuntu).
Install Docker (if necessary):
- Alpine Linux:
```
    `sudo apk update
```
  sudo apk add docker sudo addgroup ${USER} docker sudo rc-service docker start sudo rc-update add docker default docker --version`
  
  content_copy Use code with caution.
- Ubuntu:
```
    `sudo apt-get update
```
  sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add - sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" sudo apt-get update sudo apt-get install -y docker-ce docker --version`
  
  content_copy Use code with caution.
Configure Docker Daemon: Create/edit the /etc/docker/daemon.json file on each node:
```
   `{
```
"metrics-addr" : "0.0.0.0:9323", "experimental" : true }`

content_copy Use code with caution.Json
Restart Docker: Restart the Docker daemon on each node to apply the changes.
- Alpine Linux:
```
    `sudo rc-service docker restart`
```
  content_copy Use code with caution.
- Ubuntu:
```
    `sudo systemctl restart docker
```
  or
  
  sudo service docker restart`
  
  content_copy Use code with caution.