# GekkoFS

[![License: GPL3](https://img.shields.io/badge/License-GPL3-blue.svg)](https://opensource.org/licenses/GPL-3.0)
[![pipeline status](https://storage.bsc.es/gitlab/hpc/gekkofs/badges/master/pipeline.svg)](https://storage.bsc.es/gitlab/hpc/gekkofs/commits/master)
[![coverage report](https://storage.bsc.es/gitlab/hpc/gekkofs/badges/master/coverage.svg)](https://storage.bsc.es/gitlab/hpc/gekkofs/-/commits/master)

GekkoFS is a file system capable of aggregating the local I/O capacity and performance of each compute node
in an HPC cluster to produce a high-performance storage space that can be accessed in a distributed manner.
This storage space allows HPC applications and simulations to run in isolation from each other with regard
to I/O, which reduces interference and improves performance.

# Dependencies

- \>gcc-8 (including g++) for C++11 support
- General build tools: Git, Curl, CMake >3.6 (>3.11 for GekkoFS testing), Autoconf, Automake
- Miscellaneous: Libtool, Libconfig 
### Debian/Ubuntu
GekkoFS base dependencies: `apt install git curl cmake autoconf automake libtool libconfig-dev`

GekkoFS testing support: `apt install python3-dev python3 python3-venv`
### CentOS/Red Hat
GekkoFS base dependencies: `yum install gcc-c++ git curl cmake autoconf automake libtool libconfig`

GekkoFS testing support: `yum install python38-devel` (**>Python-3.6 required**)

# Step-by-step installation
1. Make sure the above listed dependencies are available on your machine
2. Clone GekkoFS: `git clone --recurse-submodules https://storage.bsc.es/gitlab/hpc/gekkofs.git`
   - (Optional) If you checked out the sources using `git` without the `--recurse-submodules` option, you need to
     execute the following command from the root of the source directory: `git submodule update --init`
3. Set up the necessary environment variables for the location where the compiled direct GekkoFS dependencies will be
   installed (we assume the path `/home/foo/gekkofs_deps/install` in the following)
   - `export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/home/foo/gekkofs_deps/install/lib:/home/foo/gekkofs_deps/install/lib64`
4. Download and compile the direct dependencies, e.g.,
   - Download example: `gekkofs/scripts/dl_dep.sh /home/foo/gekkofs_deps/git`
   - Compilation example: `gekkofs/scripts/compile_dep.sh /home/foo/gekkofs_deps/git /home/foo/gekkofs_deps/install`
   - Consult `-h` for additional arguments for each script
5. Compile GekkoFS and run optional tests
   - Create build directory: `mkdir gekkofs/build && cd gekkofs/build`
   - Configure GekkoFS: `cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH=/home/foo/gekkofs_deps/install ..`
       - add `-DCMAKE_INSTALL_PREFIX=<install_path>` where the GekkoFS client library and server executable should be available 
       - add `-DGKFS_BUILD_TESTS=ON` if tests should be built
   - Build and install GekkoFS: `make -j8 install`
   - Run tests: `make test`

GekkoFS is now available at:
- GekkoFS daemon (server): `<install_path>/bin/gkfs_daemon`
- GekkoFS client interception library: `<install_path>/lib64/libgkfs_intercept.so`
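
As a rough illustration, installation steps 2 to 5 can be collected into one script. This is a sketch under assumptions, not an official installer: `DRY_RUN` and the `run` helper are hypothetical additions, and the paths follow the `/home/foo/gekkofs_deps` example used above. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of installation steps 2-5 (paths follow the README example).
# DRY_RUN and run() are hypothetical helpers, not part of GekkoFS.
DEPS=/home/foo/gekkofs_deps
DRY_RUN=${DRY_RUN:-1}   # 1: only print each command, 0: execute it

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@" || exit 1   # stop at the first failing step
    fi
}

install_gekkofs() {
    # Step 2: clone GekkoFS together with its submodules
    run git clone --recurse-submodules https://storage.bsc.es/gitlab/hpc/gekkofs.git
    # Step 3: make the dependency libraries visible at run time
    export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${DEPS}/install/lib:${DEPS}/install/lib64"
    # Step 4: download and compile the direct dependencies
    run gekkofs/scripts/dl_dep.sh "${DEPS}/git"
    run gekkofs/scripts/compile_dep.sh "${DEPS}/git" "${DEPS}/install"
    # Step 5: configure, build, and install GekkoFS
    run mkdir -p gekkofs/build
    run cd gekkofs/build
    run cmake -DCMAKE_BUILD_TYPE=Release "-DCMAKE_PREFIX_PATH=${DEPS}/install" ..
    run make -j8 install
}

install_gekkofs
```

Setting `DRY_RUN=0` executes the steps instead of printing them; add `-DCMAKE_INSTALL_PREFIX=<install_path>` to the `cmake` line to choose where GekkoFS is installed.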
# Run GekkoFS
## General
On each node a daemon (`gkfs_daemon` binary) has to be started. Other tools can be used to execute
the binary on many nodes, e.g., `srun`, `mpiexec/mpirun`, `pdsh`, or `pssh`.

You need to decide what Mercury NA plugin you want to use for network communication. `ofi+sockets` is the default.
The `-P` argument is used for setting another RPC protocol. See below.

- `ofi+sockets` for using the libfabric plugin with TCP (stable)
- `ofi+tcp` for using the libfabric plugin with TCP (slower than sockets)
- `ofi+verbs` for using the libfabric plugin with InfiniBand verbs (reasonably stable); requires
  the [rdma-core (formerly libibverbs)](https://github.com/linux-rdma/rdma-core) library
- `ofi+psm2` for using the libfabric plugin with Intel Omni-Path (unstable); requires
  the [opa-psm2](https://github.com/cornelisnetworks/opa-psm2) library
## The GekkoFS hostsfile

Each GekkoFS daemon needs to register itself in a shared file (*hostsfile*) which needs to be accessible to _all_ GekkoFS clients and daemons.
The hostsfile therefore describes a GekkoFS file system instance and which nodes are part of it.
In a typical cluster environment this hostsfile should be placed within a POSIX-compliant parallel file system, such as GPFS or Lustre.

*Note: NFS is not strongly consistent and cannot be used for the hostsfile!*

## GekkoFS daemon start and shut down

tl;dr example: `<install_path>/bin/gkfs_daemon -r <fs_data_path> -m <pseudo_gkfs_mount_dir_path> -H <hostsfile_path>`

Run the GekkoFS daemon on each node, specifying:

1. the local directory where the file system data and metadata are stored (`-r/--rootdir <fs_data_path>`), e.g., the node-local SSD;
2. the pseudo mount directory used by clients to access GekkoFS (`-m/--mountdir <pseudo_gkfs_mount_dir_path>`); and
3. the hostsfile path (`-H/--hosts-file <hostsfile_path>`).

Further options are available:

```bash
Allowed options
Usage: src/daemon/gkfs_daemon [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -m,--mountdir TEXT REQUIRED Virtual mounting directory where GekkoFS is available.
  -r,--rootdir TEXT REQUIRED  Local data directory where GekkoFS data for this daemon is stored.
  -s,--rootdir-suffix TEXT    Creates an additional directory within the rootdir, allowing multiple daemons on one node.
  -i,--metadir TEXT           Metadata directory where GekkoFS RocksDB data directory is located. If not set, rootdir is used.
  -l,--listen TEXT            Address or interface to bind the daemon to. Default: local hostname.
                              When used with ofi+verbs the FI_VERBS_IFACE environment variable is set accordingly which associates the verbs device with the network interface. In case FI_VERBS_IFACE is already defined, the argument is ignored. Default 'ib'.
  -H,--hosts-file TEXT        Shared file used by deamons to register their endpoints. (default './gkfs_hosts.txt')
  -P,--rpc-protocol TEXT      Used RPC protocol for inter-node communication.
                              Available: {ofi+sockets, ofi+verbs, ofi+psm2} for TCP, Infiniband, and Omni-Path, respectively. (Default ofi+sockets)
                              Libfabric must have enabled support verbs or psm2.
  --auto-sm                   Enables intra-node communication (IPCs) via the `na+sm` (shared memory) protocol, instead of using the RPC protocol. (Default off)
  -c,--clean-rootdir          Cleans Rootdir >before< launching the deamon
  -f, --clean-rootdir-finish Cleans Rootdir >After< the deamon finishes
  --version                   Print version and exit.

  --dbbackend               'rocksdb' (default) or 'parallaxdb' can be specified as
                            metadata backend, in that case a file in 'metadir' named
                            rocksdbx is created. Parallaxdb support is experimental.
  --parallax_size               'parallaxdb' specific, size of the metadata file in GB. Minimal is 8 GB
                            (default 8, 8 GB)
```
It is possible to run multiple independent GekkoFS instances on the same node. Note that when these GekkoFS instances
are part of the same file system, they must use the same `rootdir` with different `rootdir-suffix`es.

Shut it down by gracefully killing the process (SIGTERM).
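
A minimal sketch of starting one daemon and shutting it down again, assuming hypothetical paths (`/home/foo/gekkofs/install`, `/tmp/gkfs_rootdir`, `/tmp/gkfs_mountdir`). The readiness check also rests on an assumption: it treats a non-empty hostsfile as a sign that the daemon has registered itself.

```shell
# Hypothetical paths; adjust them to your installation and cluster.
GKFS_DAEMON=${GKFS_DAEMON:-/home/foo/gekkofs/install/bin/gkfs_daemon}
GKFS_ROOTDIR=${GKFS_ROOTDIR:-/tmp/gkfs_rootdir}       # node-local storage
GKFS_MOUNTDIR=${GKFS_MOUNTDIR:-/tmp/gkfs_mountdir}    # pseudo mount directory
GKFS_HOSTSFILE=${GKFS_HOSTSFILE:-./gkfs_hosts.txt}    # must be visible to all nodes

start_and_stop_daemon() {
    # Start the daemon in the background and remember its PID
    "$GKFS_DAEMON" -r "$GKFS_ROOTDIR" -m "$GKFS_MOUNTDIR" -H "$GKFS_HOSTSFILE" &
    daemon_pid=$!

    # Assumption: a non-empty hostsfile means the daemon has registered.
    # Give up waiting after 30 seconds.
    tries=0
    until [ -s "$GKFS_HOSTSFILE" ] || [ "$tries" -ge 30 ]; do
        sleep 1
        tries=$((tries + 1))
    done

    # ... run client applications here ...

    kill -TERM "$daemon_pid"   # graceful shutdown via SIGTERM
    wait "$daemon_pid"
}

if [ -x "$GKFS_DAEMON" ]; then
    start_and_stop_daemon
else
    echo "gkfs_daemon not found at $GKFS_DAEMON (skipping)" >&2
fi
```

On a cluster, the same daemon command line would typically be launched on every node through `srun`, `mpiexec/mpirun`, `pdsh`, or `pssh`, as described above.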
## Use the GekkoFS client library

```bash
export LIBGKFS_HOSTS_FILE=<hostsfile_path>
LD_PRELOAD=<install_path>/lib64/libgkfs_intercept.so cp ~/some_input_data <pseudo_gkfs_mount_dir_path>/some_input_data
LD_PRELOAD=<install_path>/lib64/libgkfs_intercept.so md5sum ~/some_input_data <pseudo_gkfs_mount_dir_path>/some_input_data
```
Clients read the hostsfile to determine which daemons are part of the GekkoFS instance. Because the client is an
interposition library that is loaded within the context of the application, this information is passed via the
environment variable `LIBGKFS_HOSTS_FILE` pointing to the hostsfile path. The client library itself is loaded for each
application process via the `LD_PRELOAD` environment variable intercepting file system related calls. If they are
within (or hierarchically under) the GekkoFS mount directory they are processed in the library, otherwise they are
passed to the kernel.
Note that if `LD_PRELOAD` does not point to the library, and hence the client is not loaded, the mounting directory appears
to be empty.

For MPI applications, the `LD_PRELOAD` variable can be passed with the `-x` argument for `mpirun/mpiexec`.
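
A possible command line is sketched below. It assumes Open MPI, whose `mpirun` exports environment variables with `-x`; other MPI launchers use different options. The library path, hostsfile path, process count, and `./my_mpi_app` are hypothetical placeholders, and passing `LIBGKFS_HOSTS_FILE` the same way is an assumption that follows from the client needing it in its environment.

```shell
# Hypothetical paths and application name; adjust them to your setup.
GKFS_LIB=/home/foo/gekkofs/install/lib64/libgkfs_intercept.so
GKFS_HOSTSFILE=/shared/gkfs_hosts.txt

# Assumes Open MPI: -x VAR=value exports VAR to every launched process.
MPI_CMD="mpirun -np 4 -x LD_PRELOAD=${GKFS_LIB} -x LIBGKFS_HOSTS_FILE=${GKFS_HOSTSFILE} ./my_mpi_app"

# Print the command line; run it with: eval "$MPI_CMD"
echo "$MPI_CMD"
```
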
The following environment variables can be used to enable logging in the client
library: `LIBGKFS_LOG=<module>` and `LIBGKFS_LOG_OUTPUT=<path/to/file>` to
configure the output module and set the path to the log file of the client
library. If no path is specified in `LIBGKFS_LOG_OUTPUT`, the client library
will send log messages to `/tmp/gkfs_client.log`.

The following modules are available:

 - `none`: don't print any messages
 - `syscalls`: Trace system calls: print the name of each system call, its
   arguments, and its return value. All system calls are printed after being
   executed save for those that may not return, such as `execve()`,
   `execveat()`, `exit()`, and `exit_group()`. This module will only be
   available if the client library is built in `Debug` mode.
 - `syscalls_at_entry`: Trace system calls: print the name of each system call
   and its arguments. All system calls are printed before being executed and
   therefore their return values are not available in the log. This module will
   only be available if the client library is built in `Debug` mode.
 - `info`: Print information messages.
 - `critical`: Print critical errors.
 - `errors`: Print errors.
 - `warnings`: Print warnings.
 - `mercury`: Print Mercury messages.
 - `debug`: Print debug messages.  This module will only be available if the
   client library is built in `Debug` mode.
 - `most`: All previous options combined except `syscalls_at_entry`. This
   module will only be available if the client library is built in `Debug`
   mode.
 - `all`: All previous options combined.
 - `trace_reads`: Generate a log line with extra information on read operations for the guided distributor.
 - `help`: Print a help message and exit.

When tracing system calls, specific syscalls can be removed from log messages by
setting the `LIBGKFS_LOG_SYSCALL_FILTER` environment variable. For instance,
setting it to `LIBGKFS_LOG_SYSCALL_FILTER=epoll_wait,epoll_create` will filter
out any log entries from the `epoll_wait()` and `epoll_create()` system calls.

Additionally, setting the `LIBGKFS_LOG_OUTPUT_TRUNC` environment variable with
a value different from `0` will instruct the logging subsystem to truncate
the file used for logging, rather than append to it.
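
Put together, a client logging setup might look like the following sketch. The variable names come from the text above; the log path is a hypothetical example, and the `syscalls` module assumes a client library built in `Debug` mode.

```shell
# Client logging setup; the log path is a hypothetical example.
export LIBGKFS_LOG=syscalls                                # needs a Debug build
export LIBGKFS_LOG_OUTPUT=/tmp/gkfs_client_syscalls.log
export LIBGKFS_LOG_SYSCALL_FILTER=epoll_wait,epoll_create  # drop these syscalls from the log
export LIBGKFS_LOG_OUTPUT_TRUNC=1                          # truncate the log instead of appending

# Then run the application with the client library preloaded, e.g.:
#   LD_PRELOAD=<install_path>/lib64/libgkfs_intercept.so ls <pseudo_gkfs_mount_dir_path>

# Show the logging-related variables that are now set
env | grep '^LIBGKFS_LOG'
```
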

For the daemon, the `GKFS_DAEMON_LOG_PATH=<path/to/file>` environment variable
can be provided to set the path to the log file, and the log level can be
selected with the `GKFS_LOG_LEVEL={off,critical,err,warn,info,debug,trace}`
environment variable.

# Miscellaneous
## External functions

GekkoFS allows using external functions in your client code via `LD_PRELOAD`.
The source code needs to be compiled with `-fPIC`. We include a substitute for the IO500 `pfind`,
`examples/gfind/gfind.cpp`, and a non-MPI version, `examples/gfind/sfind.cpp`.

## Data distributors
The data distribution can be selected at compilation time. Two distributors are available:

### Simple Hash (Default)
Chunks are distributed pseudo-randomly, by hashing, across the GekkoFS servers.

### Guided Distributor
The guided distributor allows defining a specific distribution of data on a per directory or file basis. 
The distribution configurations are defined within a shared file (called `guided_config.txt` henceforth) with the following format:
`<path> <chunk_number> <host>`
To enable the distributor, the following CMake compilation flags are required:
* `GKFS_USE_GUIDED_DISTRIBUTION` ON
* `GKFS_USE_GUIDED_DISTRIBUTION_PATH` `<path_guided_config.txt>`

To use a custom distribution, a path needs to have the prefix `#` (e.g., `#/mdt-hard 0 0`), in which case all the data of all files in that directory goes to the same place as the metadata.
Note that a chunk/host configuration is inherited by all child files automatically, even if they do not use the prefix.
In this example, `/mdt-hard/file1` therefore uses the same distribution as the `/mdt-hard` directory.
If no prefix is used, the Simple Hash distributor is used.
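
To make the file format concrete, the sketch below writes a small illustrative configuration and prints its fields. Only the `#/mdt-hard 0 0` entry comes from the text above; the two `/data/input.bin` entries and their host numbers are hypothetical and serve only to show the `<path> <chunk_number> <host>` layout.

```shell
# Write an illustrative guided_config.txt to a temporary file.
# Only '#/mdt-hard 0 0' comes from the README; the other entries are hypothetical.
CONFIG=$(mktemp)
cat > "$CONFIG" <<'EOF'
#/mdt-hard 0 0
/data/input.bin 0 1
/data/input.bin 1 2
EOF

# Print the fields of each entry: <path> <chunk_number> <host>
awk '{
    kind = (substr($1, 1, 1) == "#") ? "prefixed" : "plain"
    printf "%-8s path=%s chunk=%s host=%s\n", kind, $1, $2, $3
}' "$CONFIG"
```

The resulting file would be the one referenced by `GKFS_USE_GUIDED_DISTRIBUTION_PATH` at compilation time.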
#### Guided configuration file
Creating a guided configuration file is based on an I/O trace file of a previous execution of the application.
For this the `trace_reads` tracing module is used (see above).
The `trace_reads` module enables a `TRACE_READS`-level log at the clients that records each client's I/O information, which is used as the input for a script that creates the guided distributor configuration.
Note that capturing the necessary trace records can involve performance degradation.
To capture the I/O of each client within a SLURM environment, i.e., enabling the `trace_reads` module and printing its output to a user-defined path, the following example can be used:
`srun -N 10 -n 320 --export="ALL" /bin/bash -c "export LIBGKFS_LOG=trace_reads; export LIBGKFS_LOG_OUTPUT=${HOME}/test/GLOBAL.txt; LD_PRELOAD=${GKFS_PRLD} <app>"`
Then, the `examples/distributors/guided/generate.py` script is used to create the guided distributor configuration file:
* `python examples/distributors/guided/generate.py ~/test/GLOBAL.txt >> guided_config.txt`

Finally, modify `guided_config.txt` to your distribution requirements.
### Metadata Backends
There are two different metadata backends in GekkoFS. The default one uses `rocksdb`; an alternative based on `PARALLAX` from `FORTH`
is available.
To enable it, use the `-DGKFS_ENABLE_PARALLAX:BOOL=ON` option. You can also disable `rocksdb` with `-DGKFS_ENABLE_ROCKSDB:BOOL=OFF`.

Once it is enabled, the `--dbbackend` option will be functional.


### Acknowledgment

This software was partially supported by the EC H2020 funded NEXTGenIO project (Project ID: 671951, www.nextgenio.eu).

This software was partially supported by the ADA-FS project under the SPPEXA project (http://www.sppexa.de/) funded by the DFG.

This software is partially supported by the FIDIUM project funded by the DFG.

This software is partially supported by the ADMIRE project (https://www.admire-eurohpc.eu/) funded by the European Union’s Horizon 2020 JTI-EuroHPC Research and Innovation Programme (Grant 956748).