# Slurm Docker Cluster

This is a multi-container Slurm cluster that uses `docker-compose` with `sshd` and `systemd` enabled.
The compose file creates named volumes for persistent storage of MySQL data files as well as
Slurm state and log directories. It is heavily based on work by [giovtorres/slurm-docker-cluster](
https://github.com/giovtorres/slurm-docker-cluster).

## Containers, Networks, and Volumes

The compose file will run the following containers:

* `mysql`
* `slurmdbd`
* `slurmctld`
* `login` (slurmd)
* `c1`, `c2`, `c3`, `c4` (slurmd)

The compose file will create the following named volumes:

* `etc_munge`         ( -> `/etc/munge`     )
* `slurm_jobdir`      ( -> `/data`          )
* `var_lib_mysql`     ( -> `/var/lib/mysql` )

The compose file will create the `slurm_cluster` network for all containers and will assign the
following IPv4 static addresses:

* slurmctld: 192.18.0.129
* c1: 192.18.0.10
* c2: 192.18.0.11
* c3: 192.18.0.12
* c4: 192.18.0.13
* login: 192.18.0.128
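As a sketch, the network and one compute node might be declared in `docker-compose.yml` along these lines (a hypothetical excerpt; the image name and subnet are assumptions based on the lists above):

```yaml
services:
  c1:
    image: slurm-docker-cluster-node   # assumed image name
    hostname: c1
    volumes:
      - etc_munge:/etc/munge
      - slurm_jobdir:/data
    networks:
      slurm_cluster:
        ipv4_address: 192.18.0.10

networks:
  slurm_cluster:
    ipam:
      config:
        - subnet: 192.18.0.0/24   # assumed; must contain the static addresses

volumes:
  etc_munge:
  slurm_jobdir:
```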


## Package contents

- `docker-compose.yml`: docker-compose file for running the cluster.
- `slurm-docker-cluster/Dockerfile`: Dockerfile for building the main cluster services.
- `slurm-docker-cluster-node/Dockerfile`: Dockerfile with software specific to
  the compute nodes (tailored for scord). NEEDS TO BE BUILT BEFORE RUNNING THE
  CLUSTER.
- `refresh.sh`: script for refreshing the scord installation in the cluster.
  This script uses the `slurm-docker-cluster-node` image to generate the
  binaries so that there are no compatibility issues with dependencies.

  The script relies on the following variables:
    - `REPO`: The repository where the `scord` source code is located.
    - `VOLUMES`: The host directory where the output of the build process will
      be placed.
    - `USER`: The container user that should be used to run the build
      process (so that file ownership matches between the host and the
      container).
      The `scord` build process relies on a CMake preset for Rocky Linux that
      has been configured to match the container environment:

        ```json
          {
              "name": "rocky",
              "displayName": "Rocky Linux",
              "description": "Build options for Rocky Linux",
              "inherits": "base",
              "environment" : {
                "PKG_CONFIG_PATH": "/usr/lib/pkgconfig;/usr/lib64/pkgconfig"
              },
              "generator": "Unix Makefiles",
              "cacheVariables": {
                "CMAKE_CXX_COMPILER_LAUNCHER": "",
                "CMAKE_C_COMPILER_LAUNCHER": "",
                "CMAKE_CXX_FLAGS": "-fdiagnostics-color=always",
                "CMAKE_C_FLAGS": "-fdiagnostics-color=always",
                "CMAKE_PREFIX_PATH": "/usr/lib;/usr/lib64",
                "CMAKE_INSTALL_PREFIX": "/scord_prefix",
                "SCORD_BUILD_EXAMPLES": true,
                "SCORD_BUILD_TESTS": true,
                "SCORD_BIND_ADDRESS": "192.18.0.128"
              }
          }
        ```

- `volumes`: directory for the volumes used by the cluster:
    - `etc_munge`: munge configuration files. A shared `munge.key` needs to be
      generated and placed here.
    - `etc_slurm`: slurm configuration files. At least a `slurm.conf` file needs
      to be placed here, configured with the cluster's compute node and
      partition information. For example:
        ```conf
          # COMPUTE NODES
          NodeName=c[1-4] RealMemory=1000 State=UNKNOWN

          # PARTITIONS
          PartitionName=normal Default=yes Nodes=c[1-4] Priority=50 DefMemPerCPU=500 Shared=NO MaxNodes=4 MaxTime=5-00:00:00 DefaultTime=5-00:00:00 State=UP
        ```
    - `etc_ssh`: ssh configuration files. Server keys and configuration files
      should be placed here.
    - `ld.so.conf.d`: ld.so configuration files.
    - `scord_prefix`: scord installation directory. The scord installation
      should be placed here and should match the directory outside the
      container where the binaries are generated.
    - `user_home`: user home directory. Any files and directories that should
      be available on all compute nodes (e.g. `.ssh`) should be added here.
    - `docker-entrypoint.sh`: the overridden container entry point.
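The `etc_munge` and `etc_ssh` volumes described above both need key material before the first start. A minimal sketch for generating it, assuming the `volumes/` layout used in this README (`mungekey(8)` or `ssh-keygen -A` could be used instead):

```shell
# Generate the shared munge key (1 KiB of random data is a valid key)
mkdir -p volumes/etc_munge volumes/etc_ssh
dd if=/dev/urandom of=volumes/etc_munge/munge.key bs=1024 count=1 2>/dev/null
chmod 400 volumes/etc_munge/munge.key

# Generate one SSH host key for the login/compute nodes (guarded in case
# ssh-keygen is not installed; a real deployment generates all key types)
if command -v ssh-keygen >/dev/null 2>&1; then
    ssh-keygen -q -t ed25519 -N '' -f volumes/etc_ssh/ssh_host_ed25519_key
fi
```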

## Build arguments

The following build arguments are available:

* `SLURM_TAG`: The Slurm Git tag to build. Defaults to `slurm-21-08-6-1`.
* `GOSU_VERSION`: The gosu version to install. Defaults to `1.11`.
* `SHARED_USER_NAME`: The name of the user that will be shared with the cluster. Defaults to `user`.
* `SHARED_USER_UID`: The UID of the user that will be shared with the cluster. Defaults to `1000`.
* `SHARED_GROUP_NAME`: The name of the group that will be shared with the cluster. Defaults to `user`.
* `SHARED_GROUP_GID`: The GID of the group that will be shared with the cluster. Defaults to `1000`.
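For illustration, a Dockerfile might consume these build arguments as follows (a hypothetical fragment; the repository's actual Dockerfiles may differ):

```dockerfile
ARG SHARED_USER_NAME=user
ARG SHARED_USER_UID=1000
ARG SHARED_GROUP_NAME=user
ARG SHARED_GROUP_GID=1000

# Create the shared group/user with the host's IDs so that bind-mounted
# files keep consistent ownership inside the containers
RUN groupadd --gid "${SHARED_GROUP_GID}" "${SHARED_GROUP_NAME}" && \
    useradd --create-home --uid "${SHARED_USER_UID}" \
            --gid "${SHARED_GROUP_GID}" "${SHARED_USER_NAME}"
```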


## Configuration

To run, the cluster services expect some files to be present on the host system. The simplest way to
provide them is to place the files in the `volumes` directory with the correct ownership and permissions
so that they can be mounted into the containers. The `volumes` directory should sit next to the
`docker-compose.yml` file and have the following structure:

```bash
volumes/
├── docker-entrypoint.sh         -> /usr/local/bin/docker-entrypoint.sh
├── etc_munge                    -> /etc/munge
├── etc_slurm                    -> /etc/slurm
├── etc_ssh                      -> /etc/ssh
├── ld.so.conf.d                 -> /etc/ld.so.conf.d
└── user_home                    -> /home/$SHARED_USER_NAME
```

The following ownership and permissions should be set for the cluster to work properly.
The `slurm` and `munge` users are not strictly required to exist on the host system,
as they are created automatically while building the images, though it helps to
create them anyway so that raw UIDs do not show up each time `ls` is called.
Note, however, that if they are created on the host, the `slurm` and `munge`
users/groups need to have the same UIDs/GIDs on the host and in the containers.

```bash
volumes
├── [-rwxrwxr-x example-user example-user 1.9K Jun 29 16:30]  docker-entrypoint.sh
├── [drwxrwxr-x munge        munge    4.0K Jun 17 09:11]  etc_munge
│   └── [-r-------- munge    munge    1.0K Jun 17 09:11]  munge.key
├── [drwxrwxr-x slurm    slurm    4.0K Jul  4 09:49]  etc_slurm
│   ├── [-rw-r--r-- slurm    slurm     216 Jun 16 15:48]  cgroup.conf.example
│   ├── [-rw-r--r-- slurm    slurm     213 Jun 30 14:28]  plugstack.conf
│   ├── [drwxrwxr-x slurm    slurm    4.0K Jun 16 16:13]  plugstack.conf.d
│   ├── [-rw-r--r-- slurm    slurm    2.2K Jun 23 15:24]  slurm.conf
│   ├── [-rw-r--r-- slurm    slurm    3.0K Jun 16 15:48]  slurm.conf.example
│   ├── [-rw------- slurm    slurm     722 Jun 16 15:48]  slurmdbd.conf
│   └── [-rw-r--r-- slurm    slurm     745 Jun 16 15:48]  slurmdbd.conf.example
├── [drwxrwxr-x example-user example-user 4.0K Jun 29 12:46]  etc_ssh
│   ├── [-rw------- root     root     3.6K May  9 19:14]  sshd_config
│   ├── [drwx------ root     root     4.0K Jun 29 12:46]  sshd_config.d [error opening dir]
│   ├── [-rw------- root     root     1.4K Jun 29 11:17]  ssh_host_dsa_key
│   ├── [-rw-r--r-- root     root      600 Jun 29 11:17]  ssh_host_dsa_key.pub
│   ├── [-rw------- root     root      505 Jun 29 11:26]  ssh_host_ecdsa_key
│   ├── [-rw-r--r-- root     root      172 Jun 29 11:26]  ssh_host_ecdsa_key.pub
│   ├── [-rw------- root     root      399 Jun 29 11:26]  ssh_host_ed25519_key
│   ├── [-rw-r--r-- root     root       92 Jun 29 11:26]  ssh_host_ed25519_key.pub
│   ├── [-rw------- root     root     2.5K Jun 29 11:26]  ssh_host_rsa_key
│   └── [-rw-r--r-- root     root      564 Jun 29 11:26]  ssh_host_rsa_key.pub
├── [drwxrwxr-x example-user example-user 4.0K Jun 19 10:46]  ld.so.conf.d
├── [drwxrwxr-x example-user example-user 4.0K Jun 20 11:20]  scord_prefix
└── [drwxr-xr-x example-user example-user 4.0K Jul  7 08:27]  user_home

42 directories, 149 files
```
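Assuming the layout shown in the listing above, the security-sensitive permissions could be set along these lines (a sketch; the placeholder files stand in for real keys and configs, and the `chown` calls require root plus matching users/groups on the host, so they are shown as comments):

```shell
# Recreate the skeleton from the listing above (placeholder files), then
# tighten permissions on the security-sensitive pieces
mkdir -p volumes/etc_munge volumes/etc_slurm volumes/etc_ssh
touch volumes/etc_munge/munge.key volumes/etc_slurm/slurmdbd.conf
chmod 400 volumes/etc_munge/munge.key
chmod 600 volumes/etc_slurm/slurmdbd.conf
# Ownership changes require root and the matching users/groups on the host:
# chown -R munge:munge volumes/etc_munge
# chown -R slurm:slurm volumes/etc_slurm
```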

## Running the cluster
1. Find out the UID and GID of the host user that will be shared with the
   cluster. This can be done by running `id` in the host machine.
2. Build the services, making sure to set the `SHARED_USER_NAME`,
   `SHARED_USER_UID`, `SHARED_GROUP_NAME`, and `SHARED_GROUP_GID` build arguments to the
   values obtained in Step 1:
   ```shell
   $ docker compose build \
       --build-arg SHARED_USER_NAME=example-user \
       --build-arg SHARED_USER_UID=1000 \
       --build-arg SHARED_GROUP_NAME=example-user \
       --build-arg SHARED_GROUP_GID=1000
   ```
3. Run the cluster with `docker compose up -d`.
4. You can log into the cluster containers as root with
   `docker compose exec <container> bash`.
5. Alternatively, if ssh keys for the shared user have been configured in the `user_home` volume and the host's
   `/etc/hosts` file has been updated to include the cluster's IP addresses and hostnames, you can log into
   the cluster login or compute nodes as `$SHARED_USER_NAME` with ssh. For example, if the shared user is `example-user`:

   ```bash
   [example-user@host]$ ssh example-user@login
   ```
6. Jobs can be submitted to the cluster by ssh-ing into the `login` container and
   using the typical Slurm commands:
   ```bash
   [example-user@host]$ ssh example-user@login
   [example-user@login]$ sinfo
   PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
   normal*      up 5-00:00:00      4   idle c[1-4]
   [example-user@login]$ srun -N 4 hostname
   c2
   c3
   c1
   c4
   ```
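The `/etc/hosts` entries mentioned in step 5 can be derived from the static addresses listed earlier. A sketch that writes them to a scratch file (appending to `/etc/hosts` itself requires root, so review the file first):

```shell
# Host entries for the cluster's static addresses (from the network section);
# review and append them to /etc/hosts as root
cat > cluster-hosts.txt <<'EOF'
192.18.0.128 login
192.18.0.129 slurmctld
192.18.0.10 c1
192.18.0.11 c2
192.18.0.12 c3
192.18.0.13 c4
EOF
```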