/cyclecloud-symphony

CycleCloud project to enable use of Spectrum Symphony job scheduler in Azure CycleCloud HPC clusters.

Primary LanguagePythonMIT LicenseMIT

IBM Spectrum Symphony

This project installs and configures IBM Spectrum Symphony.

Use of IBM Spectrum Symphony requires a license agreement and Symphony binaries obtained directly from IBM Spectrum Analytics.

NOTE: Currently, this project only supports Linux-only Symphony clusters. Windows workers are not yet supported.

Table of Contents

Pre-Requisites

This project requires running Azure CycleCloud version 7.7.1 and Symphony 7.2.0 or later.

This project requires the following:

  1. A license to use IBM Spectrum Symphony from IBM Spectrum Analytics.

  2. The IBM Spectrum Symphony installation binaries.

    a. Download the binaries from IBM and place them in the ./blobs/symphony/ directory.

    b. If the version is not 7.3.0.0 (the project default), then update the version number in the Files list in ./project.ini and in the cluster template: ./templates/symphony.txt

  3. CycleCloud must be installed and running.

    a. If this is not the case, see the CycleCloud QuickStart Guide for assistance.

  4. The CycleCloud CLI must be installed and configured for use.

    a. If this is not the case, see the CycleCloud CLI Install Guide for assistance.

Configuring the Project

The first step is to configure the project for use with your storage locker:

  1. Open a terminal or Azure Cloud Shell session with the CycleCloud CLI enabled.

  2. Clone this repo into a new directory and change to the new directory:

   $ git clone https://github.com/Azure/cyclecloud-symphony.git
   $ cd ./cyclecloud-symphony
   
  1. Download the IBM Spectrum Symphony installation binaries and license entitlement file from IBM and place them in the ./blobs/symphony directory. * ./blobs/symphony/sym-7.3.0.0.exe * ./blobs/symphony/sym-7.3.0.0_x86_64.bin * ./blobs/symphony/sym_adv_entitlement.dat

    Or, if using an eval edition:

* ./blobs/symphony/sym_adv_ev_entitlement.dat
* ./blobs/symphony/symeval-7.3.0.0_x86_64.bin
* ./blobs/symphony/symeval-7.3.0.0.exe
  1. If the version number is not 7.3.0.0, update the version numbers in project.ini and templates/symphony.txt

Deploying the Project

To upload the project (including any local changes) to your target locker, run the cyclecloud project upload command from the project directory. The expected output looks like this:

   $ cyclecloud project upload my_locker
   Sync completed!

IMPORTANT

For the upload to succeed, you must have a valid Pogo configuration for your target Locker.

Importing the Cluster Template

To import the cluster:

  1. Open a terminal session with the CycleCloud CLI enabled.

  2. Switch to the Symphony directory.

  3. Run cyclecloud import_template symphony -f templates/symphony.txt. The expected output looks like this:

    $ cyclecloud import_template symphony -f templates/symphony.txt
    Importing template symphony....
    ----------------
    symphony : *template*
    ----------------
    Keypair: $Keypair
    Cluster nodes:
        master: off
    Total nodes: 1

Host Factory Provider for Azure CycleCloud

This project extends the Symphony Host Factory with an Azure CycleCloud resource provider: azurecc.

The Host Factory will be configured as the default autoscaler for the cluster.

Installing the azurecc HostFactory

It is also possible to configure an existing Symphony installation to use the azurecc HostFactory to burst into Azure.

Please contact azure support for help with this configuration.

Guide for using Static Templates with HostFactory

Follow below steps if you are not using the chef in project:

  1. Copy cyclecloud-symphony-pkg-{version}.zip to /tmp directory in master node.
  2. Unzip the file in /tmp directory
  3. Ensure python3 is installed
  4. You should have following environment variables set as per your environment otherwise install.sh script under hostfactory/host_provider directory take default values: EGO_TOP HF_TOP HF_VERSION HF_CONFDIR HF_WORKDIR HF_LOGDIR You can run install.sh within the folder and it will install python virtual environment at plugin path like below: $HF_TOP/$HF_VERSION/providerplugins/azurecc/scripts/venv If you also need to generate symphony configuration then you can run install.sh with argument generate_config and other required arguments. This will set all the configurations assuming you have only azurecc provider and enable it.Example: ./install.sh generate_config --cluster <cluster_name> --username --password --web_server

Guide to using scripts

These scripts can be found under $HF_TOP/$HF_VERSION/providerplugins/azurecc/scripts.

  1. generateWeightedTemplates.sh This script is used to generated weighted template. You need to run this script as root. ./generateWeightedTemplates.sh This will create a template based on current cyclecloud template selections and print it. If there are errors please check in /tmp/template_generate.out You must then store template in $HF_TOP/$HF_VERSION/hostfactory/providers/conf/azurecc_templates.json file. After storing the file the change will take effect only after you stop and start HostFactory.
  2. testDryRunWeightedTemplates.sh These are test scripts which has 2 options: a. validate_templates - this will check the template in default path where azurecc_templates.json is stored ./testDryRunWeightedTemplates.sh validate_templates b. create_machines - for this you need to pass input json as an example: input.json: {"template": {"machineCount": 205, "templateId": "execute"}, "dry-run": true} ./testDryRunWeightedTemplates.sh create_machines input.json This will not create machines but show what machines would have been created with given input. Make sure "dry-run" is true.

Testing capacity issues

You can trigger a capacity issue using following python script. Run LastFailureTime.py under hostfactory/host_provider/test/ with follwing arguments:

  1. Number of seconds before current time
  2. Account name
  3. region
  4. machine name Update the offset compared to UTC time to match the time cycle_server is in, like below matches Central US Standard time, that is 5 hrs behind UTC: def format_time(timestamp): return datetime.datetime.strftime(timestamp, "%Y-%m-%d %H:%M:%S.%f-05:00") python LastFailureTime.py 1 westus2 Standard_D8_v5

Output is like below:

AdType = "Cloud.Capacity" ExpirationTime = 2024-03-01 16:34:54.045927-06:00 AccountName = "" StatusUpdatedTime = 2024-03-01 15:34:53.045927-06:00 Region = "westus2" HasCapacity = False Provider = "azure" Name = "region//westus2/Standard_D8_v5" MachineType = "Standard_D8_v5"

Above output is stored in /tmp/test.dat file and run below command to update the Cloud.Capacity record

curl -k -u ${CC_USER} -X POST -H "Content-Type:application/text" --data-binary "@/tmp/test.dat" localhost:8080/db/Cloud.Capacity?format=text

You can also verify this on CC GUI, HasCapacity is false alt text

Note: You should run this script from machine where Cyclecloud is installed.

Additional configs for symphony templates

Following 2 configuration have been added:

  1. enable_weighted_templates -> default value is true. Used to generate template in which templateId corresponds to nodearray and vmTypes are in dictionary format with weight. But you can change it under configuration symphony for master node. In that case you need to use default format of templateId as nodearray+ sku name.
  2. ncpus_use_vcpus -> default value is true. It assumes you want slot ncpu attribute to be based on vcpu count. You can change in symphony configuration for master node.
  3. capacity-failure-backoff -> default value is 300 (unit is seconds). This is the time after which scalelib will wait to make next attempt at allocation after failure occurs.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.