Managing the OCI Cluster with Slurm Workload Manager & Grafana

Objective

There are many things you can do to manage the cluster nodes, such as shutting down or starting one or all of the compute nodes, using the Slurm Workload Manager (https://slurm.schedmd.com/).

Assumption: You have already created the cluster by following our previous blog, Create a Highly Scalable Cluster in the cloud using Terraform on OCI.

SSH to the Management Node

Connect to the management node from your workstation using the cluster SSH key

D:\BM>ssh -i bm_ssh_key opc@public_ip
Last login: Sun Dec 16 05:45:13 2018 from public_ip
######################

Welcome to the cluster
In order to create users, run the script "./finish" and follow the instructions

######################
[opc@mgmt ~]$ ls
ansible-pull.log  bm_ssh_key.pub  finish  nodes.yaml       shapes.yaml             test.slm   users.yml.example
bm_ssh_key        config          hosts   oci_api_key.pem  slurm-ansible-playbook  users.yml

Check that you can access one of the compute nodes from the management node

[opc@mgmt ~]$ ssh -i bm_ssh_key opc@public_ip_compute_1
Last login: Sun Dec 16 04:16:13 2018 from public_ip_compute_1
[opc@compute001 ~]$ ls
ansible-pull.log  hosts
[opc@compute001 ~]$ exit
logout
Connection to public_ip_compute_1 closed.

Check if the Nodes are Running

[opc@mgmt ~]$ cluset --list-all
@compute
@state:drained
@role:mgmt
[opc@mgmt ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      4  drain compute[001-004]
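
On a larger cluster the sinfo listing gets long, so it can help to total the nodes per state. The following is a minimal sketch, not part of the cluster tooling: it assumes the default sinfo column layout shown above (NODES in column 4, STATE in column 5), and the function name count_nodes_by_state is hypothetical.

```shell
#!/bin/sh
# Sum the NODES column of default `sinfo` output for each STATE value.
# Reads the sinfo text on stdin, so it also works on saved output.
count_nodes_by_state() {
    awk '$1 != "PARTITION" && NF >= 6 { count[$5] += $4 }
         END { for (s in count) print s, count[s] }' | LC_ALL=C sort
}

# On the management node (assumes sinfo is on the PATH):
#   sinfo | count_nodes_by_state
```

Reading from stdin rather than calling sinfo inside the function keeps the sketch usable against a copied transcript as well as a live cluster.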

Slurm Elastic Computing (Cloud Bursting)

Slurm is configured to use its elastic computing mode. This allows Slurm to automatically turn off any nodes that are not currently running jobs and to turn on any nodes that are needed to run jobs. This is particularly useful in the cloud, because a node that has been shut down is not charged for.

Reference: https://slurm.schedmd.com/elastic_computing.html
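
Elastic computing in Slurm is driven by power-saving settings such as SuspendProgram, ResumeProgram and SuspendTime, which scontrol show config reports. The sketch below filters that output down to the suspend and resume settings. It is an illustration only: the function name power_saving_settings is hypothetical, and it is an assumption (not confirmed by this post) that the cluster points these settings at the stopnode and startnode scripts used later.

```shell
#!/bin/sh
# Keep only the Suspend*/Resume* lines from `scontrol show config` output
# read on stdin (for example SuspendProgram, ResumeProgram, SuspendTime).
power_saving_settings() {
    grep -E '^[[:space:]]*(Suspend|Resume)[A-Za-z]*[[:space:]]*='
}

# On the management node (assumes scontrol is on the PATH):
#   scontrol show config | power_saving_settings
```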

Slurm Commands

[opc@mgmt ~]$ smap

Stopping a Compute Node from the Management Node

[opc@mgmt ~]$ sudo -u slurm /usr/local/bin/stopnode compute001
{
  "data": {
    "availability-domain": "zULs:US-ASHBURN-AD-1",
    "compartment-id": "ocid1.compartment.oc1..aaaaaaaay6kjvt2udXXXXjyy7rmx4eclxcbya",
    "defined-tags": {},
    "display-name": "compute001",
    "extended-metadata": {},
    "fault-domain": "FAULT-DOMAIN-1",
    "freeform-tags": {
      "cluster": "mycluster",
      "nodetype": "compute"
    },
    "id": "ocid1.instance.oc1.iadXXXXXbd7yez5qaltr2aeanya",
    "image-id": "ocid1.image.oc1.iad.aaaaaaaa2mnepqp7wn3XXXXXXli67z6mktdiq",
    "ipxe-script": null,
    "launch-mode": "NATIVE",
    "launch-options": {
      "boot-volume-type": "ISCSI",
      "firmware": "UEFI_64",
      "is-pv-encryption-in-transit-enabled": true,
      "network-type": "VFIO",
      "remote-data-volume-type": "PARAVIRTUALIZED"
    },
    "lifecycle-state": "STOPPING",
    "metadata": {
      "ssh_authorized_keys": "ssh-rsa AAAAB3NzaC1ycXXXXXXNy4P\n",
      "user_data": "IyEvYmluL2Jhc2gK"
    },
    "region": "iad",
    "shape": "VM.Standard1.2",
    "source-details": {
      "boot-volume-size-in-gbs": null,
      "image-id": "ocid1.image.oc1.iad.aaaaaaaa2mnepXXXXXXwf7uc246tcltg4li67z6mktdiq",
      "kms-key-id": null,
      "source-type": "image"
    },
    "time-created": "2018-12-14T17:16:14.298000+00:00",
    "time-maintenance-reboot-due": null
  },
  "etag": "5aa96088b4555d1820ea42bXXXX28266c9e3b711f0b487ef065b70"
}
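
The only field in this JSON that usually matters is lifecycle-state. As a rough sketch, it can be pulled out with sed, which avoids depending on extra tools; the function name lifecycle_state is hypothetical, and the sketch assumes the script prints the pretty-printed JSON shown above with one key per line.

```shell
#!/bin/sh
# Print the value of the first "lifecycle-state" key found on stdin.
# Assumes pretty-printed JSON with one key per line, as in the output above.
lifecycle_state() {
    sed -n 's/.*"lifecycle-state"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# On the management node:
#   sudo -u slurm /usr/local/bin/stopnode compute001 | lifecycle_state
```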

[opc@mgmt ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      1 drain* compute001
compute*     up   infinite      3  drain compute[002-004]
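
In sinfo output, an asterisk after the state (drain* here) means the node is currently not responding, which is what you would expect once the instance has been stopped. A small sketch for listing such nodes follows; it assumes the default sinfo columns shown above, and the function name unresponsive_nodes is hypothetical.

```shell
#!/bin/sh
# Print the NODELIST of every sinfo row whose STATE ends in "*",
# i.e. nodes that Slurm reports as not responding.
# Reads default `sinfo` output on stdin.
unresponsive_nodes() {
    awk '$1 != "PARTITION" && NF >= 6 && $5 ~ /\*$/ { print $6 }'
}

# On the management node (assumes sinfo is on the PATH):
#   sinfo | unresponsive_nodes
```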

Starting a Compute Node from the Management Node

[opc@mgmt ~]$ sudo -u slurm /usr/local/bin/startnode compute001
{
  "data": {
    "availability-domain": "zULs:US-ASHBURN-AD-1",
    "compartment-id": "ocid1.compartment.oc1..aaaaaaaay6kjvtXXXXXXXXmx4eclxcbya",
    "defined-tags": {},
    "display-name": "compute001",
    "extended-metadata": {},
    "fault-domain": "FAULT-DOMAIN-1",
    "freeform-tags": {
      "cluster": "mycluster",
      "nodetype": "compute"
    },
    "id": "ocid1.instance.oc1.iad.abuwcljtdymfpnoppXXXXX7yez5qaltr2aeanya",
    "image-id": "ocid1.image.oc1.iad.aaaaaXXXXX246tcltg4li67z6mktdiq",
    "ipxe-script": null,
    "launch-mode": "NATIVE",
    "launch-options": {
      "boot-volume-type": "ISCSI",
      "firmware": "UEFI_64",
      "is-pv-encryption-in-transit-enabled": true,
      "network-type": "VFIO",
      "remote-data-volume-type": "PARAVIRTUALIZED"
    },
    "lifecycle-state": "STARTING",
    "metadata": {
      "ssh_authorized_keys": "ssh-rsa AAAAB3NzaC1yc2EXXX6/Ny4P\n",
      "user_data": "IyEvYmluL2Jhc2gK"
    },
    "region": "iad",
    "shape": "VM.Standard1.2",
    "source-details": {
      "boot-volume-size-in-gbs": null,
      "image-id": "ocid1.image.oc1.iad.aaaaaaaa2mnXXXXg4li67z6mktdiq",
      "kms-key-id": null,
      "source-type": "image"
    },
    "time-created": "2018-12-14T17:16:14.298000+00:00",
    "time-maintenance-reboot-due": null
  },
  "etag": "613d29962d14f0e98eXXXX4239bbcf139475b1e7"
}

[opc@mgmt ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      4  drain compute[001-004]
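
The instance reports STARTING immediately, but the node takes some time before Slurm sees it responding again. One way to wait for that is to poll sinfo until the trailing asterisk disappears from the node's state. This is a sketch under assumptions: the function name wait_until_responding, the SINFO override variable, and the default of 30 polls at 10 second intervals are all hypothetical choices, not part of the cluster tooling.

```shell
#!/bin/sh
# Command used to query Slurm; can be overridden to try the function
# without a cluster.
SINFO=${SINFO:-sinfo}

# Poll until the node's state no longer ends in "*" (not responding).
# Usage: wait_until_responding NODE [ATTEMPTS] [INTERVAL_SECONDS]
wait_until_responding() {
    node=$1
    attempts=${2:-30}   # hypothetical default: 30 polls
    interval=${3:-10}   # hypothetical default: 10 seconds between polls
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -h: no header, -n: restrict to this node, %t: compact state
        state=$($SINFO -h -n "$node" -o '%t')
        case $state in
            '')  ;;   # nothing reported yet, keep polling
            *\*) ;;   # trailing "*": still not responding
            *)   echo "$node is $state"; return 0 ;;
        esac
        i=$((i + 1))
        sleep "$interval"
    done
    echo "$node still not responding after $attempts checks" >&2
    return 1
}

# On the management node:
#   sudo -u slurm /usr/local/bin/startnode compute001 && wait_until_responding compute001
```

Once the node responds it may still be in the drain state, as in the output above. If that is not intended, scontrol update NodeName=compute001 State=RESUME is the standard Slurm command for returning a node to service.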

Using Grafana to Manage Your Cluster

Grafana is an open platform for analytics and monitoring, and it can serve as the monitoring tool for your cluster.

In effect, you can have MySQL running or stopped on all compute nodes, as controlled from Grafana.

Blog Author : Madhusudhan Rao

Previous Blog: Create a Highly Scalable Cluster in the cloud using Terraform on OCI
