Azure CycleCloud is an enterprise-friendly tool for orchestrating and managing High-Performance Computing (HPC) environments on Azure. With CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, and automatically scale the infrastructure to run jobs efficiently at any scale.
Slurm is a widely used open-source HPC scheduler that can manage workloads across clusters of compute nodes. Slurm can also be configured to interact with cloud resources, such as Azure CycleCloud, to dynamically add or remove nodes based on the demand of the jobs. This allows users to optimize their resource utilization and cost efficiency, as well as to access the scalability and flexibility of the cloud.
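Under the hood, Slurm's cloud bursting relies on its power-saving hooks: cloud nodes are defined in slurm.conf with State=CLOUD, and Slurm calls a ResumeProgram/SuspendProgram to create and remove the backing VMs. The cyclecloud-slurm integration described in this post generates this configuration for you, so the excerpt below is only an illustrative sketch; the parameter names are standard Slurm options, but the node names, counts, program paths, and timeout values are placeholders, not the exact values the integration writes.
# Illustrative slurm.conf excerpt for cloud-burst nodes (placeholder values)
NodeName=hpc1-hpc-[1-16] State=CLOUD CPUs=4 RealMemory=15000
PartitionName=hpc Nodes=hpc1-hpc-[1-16] Default=YES MaxTime=INFINITE State=UP
# Resume/suspend scripts are provided by cyclecloud-slurm (paths vary by version)
ResumeProgram=/opt/azurehpc/slurm/resume_program.sh
SuspendProgram=/opt/azurehpc/slurm/suspend_program.sh
ResumeTimeout=1800
SuspendTime=300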
In this blog post, we discuss how to integrate an external Slurm Scheduler with CycleCloud so that jobs can burst into the cloud (cloud bursting: sending on-premises workloads to the cloud for processing) or run in hybrid HPC scenarios. For demonstration purposes, we create a Slurm Scheduler node in Azure as the external scheduler in one VNET, with the execute nodes managed by CycleCloud in a separate VNET. The networking complexities of hybrid scenarios are outside the scope of this post.
Prerequisites
Before we start, we need to have the following items ready:
- An Azure subscription
- CycleCloud Version: 8.6.0-3223
- OS version on the Scheduler and execute nodes: AlmaLinux release 8.7 (almalinux:almalinux-hpc:8_7-hpc-gen2:latest)
- Slurm Version: 23.02.7-1
- cyclecloud-slurm Project: 3.0.6
- An external Slurm Scheduler node in Azure or on-premises. In this example, we are using an Azure VM running AlmaLinux 8.7.
- A network connection between the external Slurm Scheduler node and the CycleCloud cluster. You can use Azure Virtual Network peering, VPN gateway, ExpressRoute, or other methods to establish the connection. In this example, we are using a very basic network setup.
- A shared file system between the external Slurm Scheduler node and the CycleCloud cluster. You can use Azure NetApp Files, Azure Files, NFS, or other methods to mount the same file system on both sides. In this example, we are using the Scheduler VM as an NFS server (see the mount sketch after this list).
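For orientation, the end state is simply that the scheduler and the CycleCloud execute nodes all mount the same /sched and /shared directories. On the CycleCloud side this is configured through the UI in step 3 below; on any other client it would be an ordinary NFS mount, as in this hedged sketch (10.222.1.26 is the example NFS server IP used later in this post; adjust to your environment):
# Illustrative NFS client mounts; CycleCloud configures these automatically
# for the execute nodes via the Network Attached Storage settings in step 3.
sudo mkdir -p /sched /shared
sudo mount -t nfs 10.222.1.26:/sched /sched
sudo mount -t nfs 10.222.1.26:/shared /shared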
Steps
After we have the prerequisites ready, we can follow these steps to integrate the external Slurm Scheduler node with the CycleCloud cluster:
1. On CycleCloud VM:
- Ensure the CycleCloud 8.6 VM is running and accessible via the cyclecloud CLI.
- Clone this repository and import a cluster using the provided CycleCloud template (slurm-headless.txt).
- We are importing a cluster named hpc1 using the slurm-headless.txt template.
git clone https://github.com/vinil-v/slurm-cloud-bursting-using-cyclecloud.git
cyclecloud import_cluster hpc1 -c Slurm-HL -f slurm-cloud-bursting-using-cyclecloud/templates/slurm-headless.txt
Output:
[vinil@cc86 ~]$ cyclecloud import_cluster hpc1 -c Slurm-HL -f slurm-cloud-bursting-using-cyclecloud/cyclecloud-template/slurm-headless.txt
Importing cluster Slurm-HL and creating cluster hpc1....
----------
hpc1 : off
----------
Resource group:
Cluster nodes:
Total nodes: 0
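Optionally, you can confirm the import from the CycleCloud CLI before moving on; at this point the cluster should exist with no nodes:
# Optional check: show the imported cluster definition
cyclecloud show_cluster hpc1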
2. Preparing Scheduler VM:
- Deploy a VM using the specified AlmaLinux image (if you have an existing Slurm Scheduler, you can skip this step).
- Run the Slurm scheduler installation script (slurm-scheduler-builder.sh) and provide the cluster name (hpc1) when prompted.
- This script will install and configure the Slurm Scheduler.
git clone https://github.com/vinil-v/slurm-cloud-bursting-using-cyclecloud.git
cd slurm-cloud-bursting-using-cyclecloud/scripts
sh slurm-scheduler-builder.sh
Output:
------------------------------------------------------------------------------------------------------------------------------
Building Slurm scheduler for cloud bursting with Azure CycleCloud
------------------------------------------------------------------------------------------------------------------------------
Enter Cluster Name: hpc1
------------------------------------------------------------------------------------------------------------------------------
Summary of entered details:
Cluster Name: hpc1
Scheduler Hostname: masternode2
NFSServer IP Address: 10.222.1.26
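The NFS server IP address printed above is what you will enter in the CycleCloud UI in the next step. If you want to confirm that the scheduler VM is actually exporting the shared directories (assuming the script set up NFS exports for /sched and /shared, as in this example), a quick local check looks like this:
# Confirm the NFS exports on the scheduler VM (illustrative check)
sudo exportfs -v
showmount -e localhost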
3. CycleCloud UI:
- Access the CycleCloud UI, edit the hpc1 cluster settings, and configure VM SKUs and networking settings.
- Enter the NFS server IP address for the /sched and /shared mounts in the Network Attached Storage section.
- Save the settings and start the hpc1 cluster.
4. On Slurm Scheduler Node:
- Integrate the external Slurm Scheduler with CycleCloud using the cyclecloud-integrator.sh script.
- Provide the CycleCloud details (username, password, and URL) when prompted. Enter the details manually rather than copying and pasting; pasted values can contain stray whitespace that breaks the connection setup.
cd slurm-cloud-bursting-using-cyclecloud/scripts
sh cyclecloud-integrator.sh
Output:
[root@masternode2 scripts]# sh cyclecloud-integrator.sh
Please enter the CycleCloud details to integrate with the Slurm scheduler
Enter Cluster Name: hpc1
Enter CycleCloud Username: vinil
Enter CycleCloud Password:
Enter CycleCloud URL (e.g., https://10.222.1.19): https://10.222.1.19
------------------------------------------------------------------------------------------------------------------------------
Summary of entered details:
Cluster Name: hpc1
CycleCloud Username: vinil
CycleCloud URL: https://10.222.1.19
------------------------------------------------------------------------------------------------------------------------------
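Once the integration script finishes, the scheduler's Slurm configuration should reference the CycleCloud resume/suspend hooks. A quick, optional check follows; these are standard Slurm parameter names, but the program paths and values are written by cyclecloud-slurm and vary by version, and your slurm.conf may live in a different location:
# Confirm the cloud-bursting hooks were added to the Slurm configuration
grep -E 'ResumeProgram|SuspendProgram|ResumeTimeout|SuspendTime' /etc/slurm/slurm.conf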
5. User and Group Setup:
- Ensure consistent user and group IDs across all nodes.
- Preferably, use a centralized user management system such as LDAP so that UIDs and GIDs stay consistent across all nodes.
- In this example, we use the users.sh script to create a test user vinil and a group for job submission. (The user vinil already exists in CycleCloud.)
cd slurm-cloud-bursting-using-cyclecloud/scripts
sh users.sh
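For reference, the essence of such a user-setup step is simply to create the group and user with fixed IDs on the scheduler that match the IDs used on the CycleCloud execute nodes. The sketch below is illustrative only, not the actual contents of users.sh, and the UID/GID values are placeholders; use the IDs that match your CycleCloud cluster:
# Illustrative only: create a group and user with explicit, matching IDs
sudo groupadd -g 20001 vinil
sudo useradd -u 20001 -g 20001 -m vinil
id vinil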
6. Testing & Job Submission:
- Log in as the test user (vinil in this example) on the Scheduler node.
- Submit a test job to verify the setup.
su - vinil
srun hostname &
Output:
[root@masternode2 scripts]# su - vinil
Last login: Tue May 14 04:54:51 UTC 2024 on pts/0
[vinil@masternode2 ~]$ srun hostname &
[1] 43448
[vinil@masternode2 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1 hpc hostname vinil CF 0:04 1 hpc1-hpc-1
[vinil@masternode2 ~]$ hpc1-hpc-1
You will see a new node being created in the hpc1 cluster to run the job.
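For a slightly more realistic test, you can also submit a small batch job; a minimal sketch follows (save it as test-job.sh and submit it with sbatch test-job.sh; the hpc partition name matches the example output above):
#!/bin/bash
#SBATCH --job-name=burst-test
#SBATCH --partition=hpc
#SBATCH --nodes=1
#SBATCH --output=burst-test.%j.out
# Report which dynamically provisioned node ran the job
hostname
CycleCloud should provision a node to run the job and, once the node has been idle for the configured suspend time, shut it down again.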
Congratulations! You have successfully set up Slurm bursting with CycleCloud on Azure.
Conclusion
In this blog post, we have shown how to integrate an external Slurm Scheduler node with Azure CycleCloud for cloud bursting or hybrid HPC scenarios. This enables users to leverage the power and flexibility of the cloud for their HPC workloads, while maintaining their existing Slurm workflows and tools. We hope this guide helps you to get started with your HPC journey on Azure.
References:
GitHub repo – slurm-cloud-bursting-using-cyclecloud
Azure CycleCloud Documentation