# Slurm Docker Cluster

This is a multi-container Slurm cluster that uses `docker-compose` with `sshd` and `systemd` enabled.
The compose file creates named volumes for persistent storage of MySQL data files as well as
Slurm state and log directories. It is heavily based on work by [giovtorres/slurm-docker-cluster](
https://github.com/giovtorres/slurm-docker-cluster).

## Containers, Networks, and Volumes

The compose file will run the following containers:

* `mysql`
* `slurmdbd`
* `slurmctld`
* `login` (slurmd)
* `c1`, `c2`, `c3`, `c4` (slurmd)

The compose file will create the following named volumes:

* `etc_munge`         ( -> `/etc/munge`     )
* `slurm_jobdir`      ( -> `/data`          )
* `var_lib_mysql`     ( -> `/var/lib/mysql` )

The compose file will create the `slurm_cluster` network for all containers and will assign the
following IPv4 static addresses:

* slurmctld: 192.18.0.129
* c1: 192.18.0.10
* c2: 192.18.0.11
* c3: 192.18.0.12
* c4: 192.18.0.13
* login: 192.18.0.128
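As a sketch, the network and one compute node might be declared in `docker-compose.yml` along these lines (a hypothetical excerpt; the image name and subnet are assumptions based on the lists above):

```yaml
services:
  c1:
    image: slurm-docker-cluster-node   # assumed image name
    hostname: c1
    volumes:
      - etc_munge:/etc/munge
      - slurm_jobdir:/data
    networks:
      slurm_cluster:
        ipv4_address: 192.18.0.10

networks:
  slurm_cluster:
    ipam:
      config:
        - subnet: 192.18.0.0/24   # assumed; must contain the static addresses

volumes:
  etc_munge:
  slurm_jobdir:
```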


## Package contents

- `docker-compose.yml`: docker-compose file for running the cluster.
- `slurm-docker-cluster/Dockerfile`: Dockerfile for building the main cluster services.
- `slurm-docker-cluster-node/Dockerfile`: Dockerfile with software specific to
  the compute nodes (tailored for scord). NEEDS TO BE BUILT BEFORE RUNNING THE
  CLUSTER.
- `refresh.sh`: script for refreshing the scord installation in the cluster.
  This script uses the `slurm-docker-cluster-node` image to generate the
  binaries so that there are no compatibility issues with dependencies.

  The script relies on the following variables:
    - `REPO`: The repository where the `scord` source code is located.
    - `VOLUMES`: The host directory where the output of the build process will
      be placed.
    - `USER`: The container user that should be used to run the build
      process (so that file ownership matches between the host and the
      container).
      The `scord` build process relies on a CMake preset for Rocky Linux that
      has been configured to match the container environment:

        ```json
          {
              "name": "rocky",
              "displayName": "Rocky Linux",
              "description": "Build options for Rocky Linux",
              "inherits": "base",
              "environment" : {
                "PKG_CONFIG_PATH": "/usr/lib/pkgconfig;/usr/lib64/pkgconfig"
              },
              "generator": "Unix Makefiles",
              "cacheVariables": {
                "CMAKE_CXX_COMPILER_LAUNCHER": "",
                "CMAKE_C_COMPILER_LAUNCHER": "",
                "CMAKE_CXX_FLAGS": "-fdiagnostics-color=always",
                "CMAKE_C_FLAGS": "-fdiagnostics-color=always",
                "CMAKE_PREFIX_PATH": "/usr/lib;/usr/lib64",
                "CMAKE_INSTALL_PREFIX": "/scord_prefix",
                "SCORD_BUILD_EXAMPLES": true,
                "SCORD_BUILD_TESTS": true,
                "SCORD_BIND_ADDRESS": "192.18.0.128"
              }
          }
        ```

- `volumes`: directory for the volumes used by the cluster:
    - `etc_munge`: munge configuration files. A shared `munge.key` needs to be
      generated and placed here.
    - `etc_slurm`: slurm configuration files. At least a `slurm.conf` file needs
      to be placed here, configured with the cluster's compute node and
      partition information. For example:
        ```conf
          # COMPUTE NODES
          NodeName=c[1-4] RealMemory=1000 State=UNKNOWN

          # PARTITIONS
          PartitionName=normal Default=yes Nodes=c[1-4] Priority=50 DefMemPerCPU=500 Shared=NO MaxNodes=4 MaxTime=5-00:00:00 DefaultTime=5-00:00:00 State=UP
        ```
    - `etc_ssh`: ssh configuration files. Server keys and configuration files
      should be placed here.
    - `ld.so.conf.d`: ld.so configuration files.
    - `scord_prefix`: scord installation directory. The scord installation
      should be placed here and should match the directory outside the
      container where the binaries are generated.
    - `user_home`: user home directory. Any files and directories that should
      be available on all compute nodes (e.g. `.ssh`) should be added here.
    - `docker-entrypoint.sh`: the overridden container entry point.
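The `etc_munge` and `etc_ssh` volumes described above both need key material before the first start. A minimal sketch for generating it, assuming the `volumes/` layout used in this README (`mungekey(8)` or `ssh-keygen -A` could be used instead):

```shell
# Generate the shared munge key (1 KiB of random data is a valid key)
mkdir -p volumes/etc_munge volumes/etc_ssh
dd if=/dev/urandom of=volumes/etc_munge/munge.key bs=1024 count=1 2>/dev/null
chmod 400 volumes/etc_munge/munge.key

# Generate one SSH host key for the login/compute nodes (guarded in case
# ssh-keygen is not installed; a real deployment generates all key types)
if command -v ssh-keygen >/dev/null 2>&1; then
    ssh-keygen -q -t ed25519 -N '' -f volumes/etc_ssh/ssh_host_ed25519_key
fi
```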

## Build arguments

The following build arguments are available:

* `SLURM_TAG`: The Slurm Git tag to build. Defaults to `slurm-21-08-6-1`.
* `GOSU_VERSION`: The gosu version to install. Defaults to `1.11`.
* `SHARED_USER_NAME`: The name of the user that will be shared with the cluster. Defaults to `user`.
* `SHARED_USER_UID`: The UID of the user that will be shared with the cluster. Defaults to `1000`.
* `SHARED_GROUP_NAME`: The name of the group that will be shared with the cluster. Defaults to `user`.
* `SHARED_GROUP_GID`: The GID of the group that will be shared with the cluster. Defaults to `1000`.
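For illustration, a Dockerfile might consume these build arguments as follows (a hypothetical fragment; the repository's actual Dockerfiles may differ):

```dockerfile
ARG SHARED_USER_NAME=user
ARG SHARED_USER_UID=1000
ARG SHARED_GROUP_NAME=user
ARG SHARED_GROUP_GID=1000

# Create the shared group/user with the host's IDs so that bind-mounted
# files keep consistent ownership inside the containers
RUN groupadd --gid "${SHARED_GROUP_GID}" "${SHARED_GROUP_NAME}" && \
    useradd --create-home --uid "${SHARED_USER_UID}" \
            --gid "${SHARED_GROUP_GID}" "${SHARED_USER_NAME}"
```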


## Configuration

To run, the cluster services expect some files to be present on the host system. The simplest way to
provide them is to place the files in the `volumes` directory with the correct ownership and permissions
so that they can be mounted into the containers. The `volumes` directory should sit next to the
`docker-compose.yml` file and have the following structure:

```bash
volumes/
├── docker-entrypoint.sh         -> /usr/local/bin/docker-entrypoint.sh
├── etc_munge                    -> /etc/munge
├── etc_slurm                    -> /etc/slurm
├── etc_ssh                      -> /etc/ssh
├── ld.so.conf.d                 -> /etc/ld.so.conf.d
└── user_home                    -> /home/$SHARED_USER_NAME
```

The following ownership and permissions should be set for the cluster to work properly.
The `slurm` and `munge` users are not strictly required to exist on the host system,
as they are created automatically while building the images, though it helps to
create them anyway so that raw UIDs do not show up each time `ls` is called.
Note, however, that if they are created on the host, the `slurm` and `munge`
users/groups need to have the same UIDs/GIDs on the host and in the containers.

```bash
volumes
├── [-rwxrwxr-x example-user example-user 1.9K Jun 29 16:30]  docker-entrypoint.sh
├── [drwxrwxr-x munge        munge    4.0K Jun 17 09:11]  etc_munge
│   └── [-r-------- munge    munge    1.0K Jun 17 09:11]  munge.key
├── [drwxrwxr-x slurm    slurm    4.0K Jul  4 09:49]  etc_slurm
│   ├── [-rw-r--r-- slurm    slurm     216 Jun 16 15:48]  cgroup.conf.example
│   ├── [-rw-r--r-- slurm    slurm     213 Jun 30 14:28]  plugstack.conf
│   ├── [drwxrwxr-x slurm    slurm    4.0K Jun 16 16:13]  plugstack.conf.d
│   ├── [-rw-r--r-- slurm    slurm    2.2K Jun 23 15:24]  slurm.conf
│   ├── [-rw-r--r-- slurm    slurm    3.0K Jun 16 15:48]  slurm.conf.example
│   ├── [-rw------- slurm    slurm     722 Jun 16 15:48]  slurmdbd.conf
│   └── [-rw-r--r-- slurm    slurm     745 Jun 16 15:48]  slurmdbd.conf.example
├── [drwxrwxr-x example-user example-user 4.0K Jun 29 12:46]  etc_ssh
│   ├── [-rw------- root     root     3.6K May  9 19:14]  sshd_config
│   ├── [drwx------ root     root     4.0K Jun 29 12:46]  sshd_config.d [error opening dir]
│   ├── [-rw------- root     root     1.4K Jun 29 11:17]  ssh_host_dsa_key
│   ├── [-rw-r--r-- root     root      600 Jun 29 11:17]  ssh_host_dsa_key.pub
│   ├── [-rw------- root     root      505 Jun 29 11:26]  ssh_host_ecdsa_key
│   ├── [-rw-r--r-- root     root      172 Jun 29 11:26]  ssh_host_ecdsa_key.pub
│   ├── [-rw------- root     root      399 Jun 29 11:26]  ssh_host_ed25519_key
│   ├── [-rw-r--r-- root     root       92 Jun 29 11:26]  ssh_host_ed25519_key.pub
│   ├── [-rw------- root     root     2.5K Jun 29 11:26]  ssh_host_rsa_key
│   └── [-rw-r--r-- root     root      564 Jun 29 11:26]  ssh_host_rsa_key.pub
├── [drwxrwxr-x example-user example-user 4.0K Jun 19 10:46]  ld.so.conf.d
├── [drwxrwxr-x example-user example-user 4.0K Jun 20 11:20]  scord_prefix
└── [drwxr-xr-x example-user example-user 4.0K Jul  7 08:27]  user_home

42 directories, 149 files
```
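Assuming the layout shown in the listing above, the security-sensitive permissions could be set along these lines (a sketch; the placeholder files stand in for real keys and configs, and the `chown` calls require root plus matching users/groups on the host, so they are shown as comments):

```shell
# Recreate the skeleton from the listing above (placeholder files), then
# tighten permissions on the security-sensitive pieces
mkdir -p volumes/etc_munge volumes/etc_slurm volumes/etc_ssh
touch volumes/etc_munge/munge.key volumes/etc_slurm/slurmdbd.conf
chmod 400 volumes/etc_munge/munge.key
chmod 600 volumes/etc_slurm/slurmdbd.conf
# Ownership changes require root and the matching users/groups on the host:
# chown -R munge:munge volumes/etc_munge
# chown -R slurm:slurm volumes/etc_slurm
```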

## Running the cluster
1. Find out the UID and GID of the host user that will be shared with the
   cluster. This can be done by running `id` in the host machine.
2. Build the services, making sure to set the `SHARED_USER_NAME`,
   `SHARED_USER_UID`, `SHARED_GROUP_NAME`, and `SHARED_GROUP_GID` build arguments to the
   values obtained in Step 1:
   ```shell
   $ docker compose build \
       --build-arg SHARED_USER_NAME=example-user \
       --build-arg SHARED_USER_UID=1000 \
       --build-arg SHARED_GROUP_NAME=example-user \
       --build-arg SHARED_GROUP_GID=1000
   ```
3. Run the cluster with `docker compose up -d`.
4. You can log into the cluster containers as root with
   `docker compose exec <container> bash`.
5. Alternatively, if ssh keys for the shared user have been configured in the `user_home` volume and the host's
   `/etc/hosts` file has been updated to include the cluster's IP addresses and hostnames, you can log into
   the cluster login or compute nodes as `$SHARED_USER_NAME` with ssh. For example, if the shared user is `example-user`:

   ```bash
   [example-user@host]$ ssh example-user@login
   ```
6. Jobs can be submitted to the cluster by ssh-ing into the `login` container and
   using the typical Slurm commands:
   ```bash
   [example-user@host]$ ssh example-user@login
   [example-user@login]$ sinfo
   PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
   normal*      up 5-00:00:00      4   idle c[1-4]
   [example-user@login]$ srun -N 4 hostname
   c2
   c3
   c1
   c4
   ```
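The `/etc/hosts` entries mentioned in step 5 can be derived from the static addresses listed earlier. A sketch that writes them to a scratch file (appending to `/etc/hosts` itself requires root, so review the file first):

```shell
# Host entries for the cluster's static addresses (from the network section);
# review and append them to /etc/hosts as root
cat > cluster-hosts.txt <<'EOF'
192.18.0.128 login
192.18.0.129 slurmctld
192.18.0.10 c1
192.18.0.11 c2
192.18.0.12 c3
192.18.0.13 c4
EOF
```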