**`plugins/slurm/README.md`** (+59 −12), updated section:

[…] them, the following lines should be added to Slurm's configuration file […]

## Usage

### Ad-hoc storage services

The plugin extends Slurm's command line arguments to allow users to request the deployment of ad-hoc storage services for their jobs. The following arguments are available for `srun`/`sbatch`:

- `--adm-adhoc <type>`: The job requires an ad-hoc storage service. By default, `--adm-adhoc-overlap` is assumed. The type of ad-hoc storage service can be one of:
  - `gekkofs`: The job requires the GekkoFS ad-hoc file system.
  - `expand`: The job requires the Expand ad-hoc file system.
  - `hercules`: The job requires the Hercules ad-hoc file system.
  - `dataclay`: The job requires the dataClay ad-hoc object store.
- `--adm-adhoc-overlap`: The requested ad-hoc storage service will be deployed on the same nodes as the application and the nodes will be shared. The number of nodes assigned to the ad-hoc storage service can be controlled with the `--adm-adhoc-nodes` option. If not specified, the deployed ad-hoc storage service will share all the nodes assigned to the job.
- `--adm-adhoc-exclusive`: The ad-hoc storage service will be deployed on the same nodes as the application, but the ad-hoc nodes will not be shared with the application. The number of nodes assigned to the ad-hoc storage service MUST be specified with the `--adm-adhoc-nodes` option and cannot be greater than the number of nodes assigned to the job. Note, however, that the value of `--adm-adhoc-nodes` must be smaller than the value of `--nodes` (or `--ntasks`); otherwise, the application would have no resources to run on.
- `--adm-adhoc-dedicated`: The ad-hoc storage service will be deployed in an independent job allocation and all the nodes of that allocation will be available to it. An `adhoc-id` will be generated and returned to the user so that other jobs can use the deployed ad-hoc storage service. In this mode, the resources assigned to the ad-hoc storage service can be controlled with the usual Slurm options (e.g. `--nodes`, `--ntasks`, `--time`, etc.).
- `--adm-adhoc-remote <adhoc-id>`: The job will use a remote and dedicated ad-hoc storage service that must have been previously requested in a different submission with the `--adm-adhoc-dedicated` option. An identifier for that ad-hoc storage service must be provided as an argument.

Users can request and control the automatic deployment of a remote ad-hoc storage service using the following `srun`/`sbatch` arguments:

- `--adm-adhoc-dedicated`: The job allocation will be used exclusively for an ad-hoc storage service.
- `--adm-adhoc-nodes`: The number of nodes to use for the ad-hoc storage service. The nodes will be allocated from the same partition as the […]

```console
$ sbatch --adm-adhoc gekkofs --adm-adhoc-overlap script.sh

$ sbatch --adm-adhoc gekkofs --adm-adhoc-dedicated --adm-adhoc-nodes 10 --adm-adhoc-walltime 00:10:00 noop.sh
Submitted batch job 42
Will deploy adhoc storage 123456

# Wait for the adhoc storage service to be started
$ sbatch --adm-adhoc-remote 123456 --dependency=after:42 script.sh
```

### Dataset management

The plugin also provides a set of options to manage datasets:

- `--adm-input <dataset-routing>`: Define datasets that should be transferred between the PFS and the ad-hoc storage service. The `dataset-routing` is defined as `ORIGIN-TIER:PATH TARGET-TIER:PATH`. For example, to transfer the file `input000.dat` from the Lustre PFS to an on-demand GekkoFS ad-hoc storage service, the option could be specified in the following manner: `"lustre:/input.dat gekkofs:/input.dat"`
- `--adm-output <dataset-routing>`: Define datasets that should be automatically transferred between the ad-hoc storage system and the PFS. The ad-hoc storage will guarantee that the dataset is not transferred while there are processes accessing the file. The datasets will be transferred before the job allocation finishes if at all possible, but no hard guarantees are made.
- `--adm-expect-output <dataset-routing>`: Define datasets that should be automatically transferred between the ad-hoc storage system and the PFS. The ad-hoc storage will guarantee that the dataset is not transferred while there are processes accessing the file. The datasets will be transferred before the job allocation finishes. If the transfer cannot be completed before the job allocation finishes, the job will be cancelled.
- `--adm-inout <dataset-routing>`: Define datasets that should be transferred INTO the ad-hoc storage AND BACK when finished.

## References

1. See manual page `spank(7)` and `<slurm/spank.h>`

**`plugins/slurm/slurmadmcli.c`** (+89 −33), updated hunks:

```c
#define TAG_NNODES                0
#define TAG_ADHOC_TYPE            1
#define TAG_ADHOC_OVERLAP         2
#define TAG_ADHOC_EXCLUSIVE       3
#define TAG_ADHOC_DEDICATED       4
#define TAG_ADHOC_REMOTE          5
#define TAG_DATASET_INPUT         6
#define TAG_DATASET_OUTPUT        7
#define TAG_DATASET_EXPECT_OUTPUT 8
#define TAG_DATASET_INOUT         9

// clang-format off
SPANK_PLUGIN (admire-cli, 1)
// clang-format on

/* ... unchanged lines elided ... */

static int scord_flag = 0;

/* scord adhoc options */
static long adhoc_nnodes = 0;
static long adhoc_walltime = 0;
static ADM_adhoc_mode_t adhoc_mode = ADM_ADHOC_MODE_IN_JOB_SHARED;
static ADM_adhoc_storage_type_t adhoc_type = 0;
static char adhoc_id[ADHOCID_LEN] = {0};

/* ... unchanged lines elided ... */

process_opts(int tag, const char* optarg, int remote);

struct spank_option spank_opts[] = {
        {"adm-adhoc", "type",
         "Deploy an ad-hoc storage of type `type` for this job. "
         "Supported ad-hoc storages are: gekkofs, expand, hercules, and "
         "dataclay. By default, it implies `--adm-adhoc-overlap`, but "
         "this behavior can be modified with the "
         "`--adm-adhoc-exclusive` or `--adm-adhoc-dedicated` flags.",
         1,                            /* option takes an argument */
         TAG_ADHOC_TYPE,               /* option tag */
         (spank_opt_cb_f) process_opts /* callback */},
        {"adm-adhoc-overlap", NULL,
         "Deploy the requested ad-hoc storage on the same nodes as the "
         "compute nodes, but request ad-hoc nodes to BE SHARED "
         "with the application. The number of nodes assigned to the "
         "ad-hoc storage CAN be specified with the "
         "`--adm-adhoc-nodes` option.",
         0,                            /* option takes no argument */
         TAG_ADHOC_OVERLAP,            /* option tag */
         (spank_opt_cb_f) process_opts /* callback */},
        {"adm-adhoc-exclusive", NULL,
         "Deploy the requested ad-hoc storage on the same nodes as the "
         "compute nodes, but request ad-hoc nodes to NOT BE SHARED "
         "with the application. The number of nodes assigned to the "
         "ad-hoc storage MUST be specified with the "
         "`--adm-adhoc-nodes` option.",
         0,                            /* option takes no argument */
         TAG_ADHOC_EXCLUSIVE,          /* option tag */
         (spank_opt_cb_f) process_opts /* callback */},
        {"adm-adhoc-dedicated", NULL,
         "The requested ad-hoc storage service will be deployed "
         "in an independent job allocation and all the nodes in this "
         "allocation will be available for it. A specific `adhoc-id` "
         "will be generated for it and will be returned to the user "
         "so that other jobs can refer to this deployed ad-hoc storage "
         "service. In this mode, the resources assigned to the ad-hoc "
         "storage service can be controlled with the normal Slurm "
         "options.",
         0,                            /* option takes no argument */
         TAG_ADHOC_DEDICATED,          /* option tag */
         (spank_opt_cb_f) process_opts /* callback */},
        {"adm-adhoc-remote", "adhoc-id",
         "Use an independent ad-hoc storage already running in its own "
         "allocation. The service must have been previously deployed "
         "with the `--adm-adhoc-dedicated` option.",
         1,                            /* option takes an argument */
         TAG_ADHOC_REMOTE,             /* option tag */
         (spank_opt_cb_f) process_opts /* callback */},
        {"adm-adhoc-nodes", "nnodes",
         "Dedicate `nnodes` to the ad-hoc storage service. Only "
         "valid if paired with `--adm-adhoc-overlap` or "
         "`--adm-adhoc-exclusive`. Ignored otherwise.",
         1,                            /* option takes an argument */
         TAG_NNODES,                   /* option tag */
         (spank_opt_cb_f) process_opts /* callback */},
        {"adm-input", "dataset-routing",
         "Define datasets that should be transferred between the PFS "
         "and the ad-hoc storage service. The `dataset-routing` is "
         "defined as `ORIGIN-TIER:PATH TARGET-TIER:PATH`. For example, "
         "to transfer the file `input000.dat` from the Lustre PFS to "
         "an on-demand GekkoFS ad-hoc storage service, the option "
         "could be specified in the following manner: "
         "\"lustre:/input.dat gekkofs:/input.dat\"",
         1,                            /* option takes an argument */
         TAG_DATASET_INPUT,            /* option tag */
         (spank_opt_cb_f) process_opts /* callback */},
        {"adm-output", "dataset-routing",
         "Define datasets that should be automatically transferred "
         "between the ad-hoc storage system and the PFS. The ad-hoc "
         "storage will guarantee that the dataset is not transferred "
         "while there are processes accessing the file. The datasets "
         "will be transferred before the job allocation finishes if at "
         "all possible, but no hard guarantees are made.",
         1,                            /* option takes an argument */
         TAG_DATASET_OUTPUT,           /* option tag */
         (spank_opt_cb_f) process_opts /* callback */},
        {"adm-expect-output", "dataset-routing",
         "Define datasets that are expected to be generated by the "
         "application. When using this option, the application itself "
         "MUST use the programmatic APIs defined in `scord-user.h` to "
         "explicitly request the transfer of the datasets.",
         1,                            /* option takes an argument */
         TAG_DATASET_EXPECT_OUTPUT,    /* option tag */
         (spank_opt_cb_f) process_opts /* callback */},
        {"adm-expect-inout", "dataset-routing",
         "Define the datasets that should be transferred INTO "
         "the ad-hoc storage AND BACK when finished.",
         1,                            /* option takes an argument */
         TAG_DATASET_INOUT,            /* option tag */
         (spank_opt_cb_f) process_opts /* callback */},
        SPANK_OPTIONS_TABLE_END};

/* ... unchanged lines elided ... */

process_opts(int tag, const char* optarg, int remote) {
            /* ... */
            return -1;

        case TAG_ADHOC_EXCLUSIVE:
            adhoc_mode = ADM_ADHOC_MODE_IN_JOB_DEDICATED;
            return 0;
        case TAG_ADHOC_OVERLAP:
            adhoc_mode = ADM_ADHOC_MODE_IN_JOB_SHARED;
            return 0;
        case TAG_ADHOC_DEDICATED:
            adhoc_mode = ADM_ADHOC_MODE_SEPARATE_NEW;
            return 0;
            /* ... */
```