Objective
You can do a lot to manage the cluster nodes, such as shutting down or starting one or all of the compute nodes, using the Slurm Workload Manager (https://slurm.schedmd.com/).
Assumption: You have already created a cluster by following our previous blog, Create a Highly Scalable Cluster in the cloud using Terraform on OCI.
SSH to Management Compute Node
D:\BM>ssh -i bm_ssh_key [email protected]_ip
Last login: Sun Dec 16 05:45:13 2018 from public_ip
######################
Welcome to the cluster
In order to create users, run the script "./finish" and follow the instructions
######################
[[email protected] ~]$ ls
ansible-pull.log  bm_ssh_key.pub  finish  nodes.yaml  shapes.yaml
test.slm  users.yml.example  bm_ssh_key  config  hosts  oci_api_key.pem
slurm-ansible-playbook  users.yml
Check if you can access one of the Compute Nodes from Management Node
[[email protected] ~]$ ssh -i bm_ssh_key [email protected]_ip_compute_1
Last login: Sun Dec 16 04:16:13 2018 from public_ip_compute_1
[[email protected] ~]$ ls
ansible-pull.log  hosts
[[email protected] ~]$ exit
logout
Connection to public_ip_compute_1 closed.
Check if the Nodes are Running
[[email protected] ~]$ cluset --list-all
@compute
@state:drained
@role:mgmt
[[email protected] ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute*  up    infinite  4     drain compute[001-004]
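If you want to check the node states from a script rather than by eye, the sinfo table can be split into rows. The following is only a minimal sketch: the function name parse_sinfo is made up for this example, and it assumes the default sinfo column layout shown above, where no column value contains a space.

```python
def parse_sinfo(text):
    """Parse default `sinfo` output into a list of dicts keyed by column name.

    Assumption: whitespace-separated columns, as in the default sinfo format.
    """
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        # Pair each field with its column header (PARTITION, AVAIL, ...).
        rows.append(dict(zip(header, ln.split())))
    return rows


# Sample text copied from the sinfo output above.
sample = """\
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute*  up    infinite  4     drain compute[001-004]
"""

rows = parse_sinfo(sample)
for row in rows:
    print(row["PARTITION"], row["STATE"], row["NODES"], row["NODELIST"])
```

On the Management Node you could feed it the real output, for example with subprocess.run(["sinfo"], capture_output=True, text=True).stdout.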
Slurm Elastic Computing (Cloud Bursting)
Slurm is configured to use its elastic computing mode. This allows Slurm to automatically turn off any nodes which are not currently being used for running jobs and turn on any nodes which are needed for running jobs. This is particularly useful in the cloud as a node which has been shut down will not be charged for.
Reference: https://slurm.schedmd.com/elastic_computing.html
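To make the idea concrete, here is a toy model of the decision described above. It is not Slurm's scheduler and not how this cluster is implemented. In Slurm itself the behaviour is driven by slurm.conf settings such as SuspendTime, SuspendProgram and ResumeProgram. The Node class, the plan_power_actions function and the "one job needs one node" rule are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    powered_on: bool
    busy: bool = False        # currently running a job
    idle_seconds: int = 0     # how long a powered-on node has been idle


def plan_power_actions(nodes, pending_jobs, suspend_time):
    """Return (to_suspend, to_resume) lists of node names for a toy elastic policy.

    Assumption: every pending job needs exactly one node.
    """
    idle_on = [n for n in nodes if n.powered_on and not n.busy]
    powered_off = [n for n in nodes if not n.powered_on]

    # Pending jobs are first placed on idle nodes that are already running.
    reused = idle_on[:pending_jobs]
    remaining = pending_jobs - len(reused)

    # Jobs that still have no node trigger a resume of powered-off nodes.
    to_resume = [n.name for n in powered_off[:max(remaining, 0)]]

    # Idle nodes with no job to take are powered down once they have
    # been idle for at least suspend_time seconds.
    to_suspend = [
        n.name for n in idle_on[len(reused):] if n.idle_seconds >= suspend_time
    ]
    return to_suspend, to_resume


cluster = [
    Node("compute001", powered_on=True, busy=True),
    Node("compute002", powered_on=True, idle_seconds=600),
    Node("compute003", powered_on=False),
    Node("compute004", powered_on=False),
]

print(plan_power_actions(cluster, pending_jobs=0, suspend_time=300))
print(plan_power_actions(cluster, pending_jobs=2, suspend_time=300))
```

With no pending jobs, the idle node is selected for shutdown. With two pending jobs, the idle node is reused and one powered-off node is selected to be started.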
Slurm Commands
[[email protected] ~]$ smap
Submitting a Job to Stop a Compute Node from Master
[[email protected] ~]$ sudo -u slurm /usr/local/bin/stopnode compute001
{
  "data": {
    "availability-domain": "zULs:US-ASHBURN-AD-1",
    "compartment-id": "ocid1.compartment.oc1..aaaaaaaay6kjvt2udXXXXjyy7rmx4eclxcbya",
    "defined-tags": {},
    "display-name": "compute001",
    "extended-metadata": {},
    "fault-domain": "FAULT-DOMAIN-1",
    "freeform-tags": {
      "cluster": "mycluster",
      "nodetype": "compute"
    },
    "id": "ocid1.instance.oc1.iadXXXXXbd7yez5qaltr2aeanya",
    "image-id": "ocid1.image.oc1.iad.aaaaaaaa2mnepqp7wn3XXXXXXli67z6mktdiq",
    "ipxe-script": null,
    "launch-mode": "NATIVE",
    "launch-options": {
      "boot-volume-type": "ISCSI",
      "firmware": "UEFI_64",
      "is-pv-encryption-in-transit-enabled": true,
      "network-type": "VFIO",
      "remote-data-volume-type": "PARAVIRTUALIZED"
    },
    "lifecycle-state": "STOPPING",
    "metadata": {
      "ssh_authorized_keys": "ssh-rsa AAAAB3NzaC1ycXXXXXXNy4P\n",
      "user_data": "IyEvYmluL2Jhc2gK"
    },
    "region": "iad",
    "shape": "VM.Standard1.2",
    "source-details": {
      "boot-volume-size-in-gbs": null,
      "image-id": "ocid1.image.oc1.iad.aaaaaaaa2mnepXXXXXXwf7uc246tcltg4li67z6mktdiq",
      "kms-key-id": null,
      "source-type": "image"
    },
    "time-created": "2018-12-14T17:16:14.298000+00:00",
    "time-maintenance-reboot-due": null
  },
  "etag": "5aa96088b4555d1820ea42bXXXX28266c9e3b711f0b487ef065b70"
}
[[email protected] ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE  NODELIST
compute*  up    infinite  1     drain* compute001
compute*  up    infinite  3     drain  compute[002-004]
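Two details in this sinfo output are worth decoding: the bracketed range compute[002-004], and the trailing * on drain*, which sinfo uses to mark a node that is not responding. The helpers below are a rough sketch with made-up names (expand_nodelist, split_state). They handle only a single bracket group and only the * suffix, not every form that Slurm can print.

```python
import re


def expand_nodelist(expr):
    """Expand a simple node expression such as 'compute[002-004]'.

    Assumption: at most one [...] group, containing numbers or ranges
    separated by commas. Anything else is returned unchanged.
    """
    m = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", expr)
    if not m:
        return [expr]
    prefix, body = m.groups()
    names = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)  # keep the zero padding, e.g. 002
            names.extend(
                f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)
            )
        else:
            names.append(prefix + part)
    return names


def split_state(state):
    """Split a sinfo state into (base_state, not_responding)."""
    return state.rstrip("*"), state.endswith("*")


print(expand_nodelist("compute[002-004]"))
print(split_state("drain*"))
```

Here compute001 shows up as not responding because it has just been stopped, while the other three nodes are still reachable.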
Submitting a Job to Start Compute Node from Master
[[email protected] ~]$ sudo -u slurm /usr/local/bin/startnode compute001
{
  "data": {
    "availability-domain": "zULs:US-ASHBURN-AD-1",
    "compartment-id": "ocid1.compartment.oc1..aaaaaaaay6kjvtXXXXXXXXmx4eclxcbya",
    "defined-tags": {},
    "display-name": "compute001",
    "extended-metadata": {},
    "fault-domain": "FAULT-DOMAIN-1",
    "freeform-tags": {
      "cluster": "mycluster",
      "nodetype": "compute"
    },
    "id": "ocid1.instance.oc1.iad.abuwcljtdymfpnoppXXXXX7yez5qaltr2aeanya",
    "image-id": "ocid1.image.oc1.iad.aaaaaXXXXX246tcltg4li67z6mktdiq",
    "ipxe-script": null,
    "launch-mode": "NATIVE",
    "launch-options": {
      "boot-volume-type": "ISCSI",
      "firmware": "UEFI_64",
      "is-pv-encryption-in-transit-enabled": true,
      "network-type": "VFIO",
      "remote-data-volume-type": "PARAVIRTUALIZED"
    },
    "lifecycle-state": "STARTING",
    "metadata": {
      "ssh_authorized_keys": "ssh-rsa AAAAB3NzaC1yc2EXXX6/Ny4P\n",
      "user_data": "IyEvYmluL2Jhc2gK"
    },
    "region": "iad",
    "shape": "VM.Standard1.2",
    "source-details": {
      "boot-volume-size-in-gbs": null,
      "image-id": "ocid1.image.oc1.iad.aaaaaaaa2mnXXXXg4li67z6mktdiq",
      "kms-key-id": null,
      "source-type": "image"
    },
    "time-created": "2018-12-14T17:16:14.298000+00:00",
    "time-maintenance-reboot-due": null
  },
  "etag": "613d29962d14f0e98eXXXX4239bbcf139475b1e7"
}
[[email protected] ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute*  up    infinite  4     drain compute[001-004]
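The stopnode and startnode commands print a JSON document, and the field that tells you what is happening is data → lifecycle-state (STOPPING and STARTING in the runs above). Below is a small sketch for pulling that field out of the output. The function name lifecycle_state is hypothetical, and the sample is a trimmed stand-in for the full output, keeping only two of its fields.

```python
import json


def lifecycle_state(output):
    """Return data["lifecycle-state"] from the JSON printed by stopnode/startnode.

    Assumption: the output contains one JSON object, possibly with other
    text before or after it.
    """
    start = output.index("{")  # skip anything printed before the JSON
    # raw_decode parses one JSON value and ignores whatever follows it.
    doc, _end = json.JSONDecoder().raw_decode(output[start:])
    return doc["data"]["lifecycle-state"]


# Trimmed sample: only two of the fields from the real output are kept.
sample_output = """
{
  "data": {
    "display-name": "compute001",
    "lifecycle-state": "STOPPING"
  }
}
"""

print(lifecycle_state(sample_output))
```

On the Management Node, the text could come from running the command itself, for example via subprocess.run(["sudo", "-u", "slurm", "/usr/local/bin/stopnode", "compute001"], capture_output=True, text=True).stdout.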
Using Grafana to Manage Your Cluster
Grafana is an open platform for analytics and monitoring, and it can serve as a monitoring tool for your cluster.
Log in as the admin user at port 3000 on the public IP of the Management Compute Node.
Effectively, you can have MySQL running or stopped on all Compute Nodes, as controlled by Grafana.
Blog Author : Madhusudhan Rao
Previous Blog Link : Create a Highly Scalable Cluster in the cloud using Terraform on OCI
Reference Links :
- https://cluster-in-the-cloud.readthedocs.io/en/latest/running.html
- https://slurm.schedmd.com/overview.html
- https://grafana.com/