CHANGELOG.md +6 −0
@@ -8,6 +8,12 @@ to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 ## [Unreleased]
 ### New
+- Added file system expansion support ([!196](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_requests/196)).
+  - Added the tool `gkfs_malleability` to steer start, status, and finalize requests for expansion operations.
+  - `-DGKFS_BUILD_TOOLS=ON` must be set for CMake to build the tool.
+  - Overhauled the `gkfs` run script to accommodate the new tool.
+  - During expansion, redistribution of data is performed by the daemons. Therefore, an RPC client for daemons was added.
+  - See Readme for usage details.
 - Propagate PKG_CONFIG_PATH to dependency scripts ([!185](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_requests/185)).
 - Added syscall support for listxattr family ([!186](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_request/186)).
 - Remove optimization, removing one RPC per operation ([!195](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_request/195)).
README.md +63 −9
@@ -159,7 +159,7 @@ to be empty.
 For MPI application, the `LD_PRELOAD` variable can be passed with the `-x` argument for `mpirun/mpiexec`.
-## Run GekkoFS daemons on multiple nodes (beta version!)
+## Run GekkoFS daemons on multiple nodes
 The `scripts/run/gkfs` script can be used to simplify starting the GekkoFS daemon on one or multiple nodes. To start
 GekkoFS on multiple nodes, a Slurm environment that can execute `srun` is required. Users can further
@@ -168,9 +168,9 @@ modify `scripts/run/gkfs.conf` to mold default configurations to their environme
 The following options are available for `scripts/run/gkfs`:
 ```bash
-usage: gkfs [-h/--help] [-r/--rootdir <path>] [-m/--mountdir <path>] [-a/--args <daemon_args>] [-f/--foreground <false>]
-        [--srun <false>] [-n/--numnodes <jobsize>] [--cpuspertask <64>] [--numactl <false>] [-v/--verbose <false>]
-        {start,stop}
+usage: gkfs [-h/--help] [-r/--rootdir <path>] [-m/--mountdir <path>] [-a/--args <daemon_args>] [--proxy <false>] [-f/--foreground <false>]
+        [--srun <false>] [-n/--numnodes <jobsize>] [--cpuspertask <64>] [-v/--verbose <false>]
+        {start,expand,stop}
     This script simplifies the starting and stopping GekkoFS daemons. If daemons are started on multiple nodes,
@@ -178,21 +178,23 @@ usage: gkfs [-h/--help] [-r/--rootdir <path>] [-m/--mountdir <path>] [-a/--args
     additional permanent configurations can be set.
     positional arguments:
-            command                 Command to execute: 'start' and 'stop'
+            COMMAND                 Command to execute: 'start', 'stop', 'expand'
     optional arguments:
             -h, --help              Shows this help message and exits
-            -r, --rootdir <path>    Providing the rootdir path for GekkoFS daemons.
-            -m, --mountdir <path>   Providing the mountdir path for GekkoFS daemons.
-            -a, --args <daemon_arguments>
+            -r, --rootdir <path>    The rootdir path for GekkoFS daemons.
+            -m, --mountdir <path>   The mountdir path for GekkoFS daemons.
+            -d, --daemon_args <daemon_arguments>
             --proxy                 Start proxy after the daemons are running.
                                     Add various additional daemon arguments, e.g., "-l ib0 -P ofi+psm2".
+            -p, --proxy_args <proxy_arguments>
             -f, --foreground        Starts the script in the foreground. Daemons are stopped by pressing 'q'.
             --srun                  Use srun to start daemons on multiple nodes.
             -n, --numnodes <n>      GekkoFS daemons are started on n nodes.
                                     Nodelist is extracted from Slurm via the SLURM_JOB_ID env variable.
             --cpuspertask <#cores>  Set the number of cores the daemons can use. Must use '--srun'.
-            --numactl               Use numactl for the daemon. Modify gkfs.conf for further numactl configurations.
+            -c, --config            Path to configuration file. By defaults looks for a 'gkfs.conf' in this directory.
+            -e, --expand_hostfile   Path to the hostfile with new nodes where GekkoFS should be extended to (hostfile contains one line per node).
             -v, --verbose           Increase verbosity
 ```
@@ -415,6 +417,58 @@ Press 'q' to exit
 Please consult `include/config.hpp` for additional configuration options. Note, GekkoFS proxy does not support replication.
+### File system expansion
+
+GekkoFS supports extending the current daemon configuration to additional compute nodes. This includes redistribution of
+the existing data and metadata and therefore scales file system performance and capacity of existing data. Note,
+that it is the user's responsibility to not access the GekkoFS file system during redistribution. A corresponding feature
+that is transparent to the user is planned. Note also, if the GekkoFS proxy is used, they need to be manually restarted,
+after expansion.
+
+To enable this feature, the following CMake compilation flags are required to build the `gkfs_malleability` tool:
+`-DGKFS_BUILD_TOOLS=ON`. The `gkfs_malleability` tool is then available in the `build/tools` directory. Please consult `-h`
+for its arguments. While the tool can be used manually to expand the file system, the `scripts/run/gkfs` script should be used
+instead which invokes the `gkfs_malleability` tool.
+
+The only requirement for extending the file system is a hostfile containing the hostnames/IPs of the new nodes (one line
+per host). Example starting the file system. The `DAEMON_NODELIST` in the `gkfs.conf` is set to a hostfile containing the
+initial set of file system nodes.:
+
+```bash
+~/gekkofs/scripts/run/gkfs -c ~/run/gkfs_verbs_expandtest.conf start
+* [gkfs] Starting GekkoFS daemons (4 nodes) ...
+* [gkfs] GekkoFS daemons running
+* [gkfs] Startup time: 10.853 seconds
+```
+
+... Some computation ...
+
+Expanding the file system. Using `-e <hostfile>` to specify the new nodes. Redistribution is done automatically with a
+progress bar. When finished, the file system is ready to use in the new configuration:
+
+```bash
+~/gekkofs/scripts/run/gkfs -c ~/run/gkfs_verbs_expandtest.conf -e ~/hostfile_expand expand
+* [gkfs] Starting GekkoFS daemons (8 nodes) ...
+* [gkfs] GekkoFS daemons running
+* [gkfs] Startup time: 1.058 seconds
+Expansion process from 4 nodes to 12 nodes launched...
+* [gkfs] Expansion progress: [####################] 0/4 left
+* [gkfs] Redistribution process done. Finalizing ...
+* [gkfs] Expansion done.
+```
+
+Stop the file system:
+
+```bash
+~/gekkofs/scripts/run/gkfs -c ~/run/gkfs_verbs_expandtest.conf stop
+* [gkfs] Stopping daemon with pid 16462
+srun: sending Ctrl-C to StepId=282378.1
+* [gkfs] Stopping daemon with pid 16761
+srun: sending Ctrl-C to StepId=282378.2
+* [gkfs] Shutdown time: 1.032 seconds
+```
+
 ## Acknowledgment
 This software was partially supported by the EC H2020 funded NEXTGenIO project (Project ID: 671951, www.nextgenio.eu).
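The README text in this diff states that the only requirement for expansion is a hostfile listing the new nodes, one per line. A minimal sketch of one way to derive such a file, assuming two plain-text inputs with one hostname per line: the initial node list and a list of all nodes that should be part of the file system after expansion. The function name `make_expand_hostfile` and all file names are hypothetical and not part of GekkoFS.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of GekkoFS): print every host listed in
# "$2" (all nodes) that is not already listed in "$1" (initial nodes).
make_expand_hostfile() {
    local initial="$1" all="$2"
    # -F: fixed strings, -x: match whole lines only (node1 must not hide node10),
    # -v: invert the match, -f: read the patterns from the initial hostfile.
    # grep exits with 1 when nothing is selected, which is not an error here.
    grep -vxFf "$initial" "$all" || true
}
```

The output could then be redirected into a file and passed to the run script with `-e`, as in the `expand` example of the README hunk.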
scripts/run/gkfs +0 −2
@@ -449,8 +449,6 @@ add_daemons() {
     NODE_CNT_EXPAND=$((${node_cnt_initial}+$(cat ${EXPAND_NODELIST} | wc -l)))
     # start new set of daemons
     start_daemons
-    # TODO REMOVE
-    # sed -i '0,/evie/! s/evie/evie2/' ${HOSTSFILE}
     export LIBGKFS_HOSTS_FILE=${HOSTSFILE}
     # start expansion which redistributes metadata and data
     ${GKFS_MALLEABILITY_BIN_} expand start
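The context lines of this hunk compute the expanded node count by adding `wc -l` of the expansion hostfile to the initial count. Since `wc -l` counts newline characters, a hostfile whose last line lacks a trailing newline is undercounted, and blank lines are counted as hosts. The following is a sketch of a more tolerant count, not what the script does; the function names `count_hosts` and `expanded_node_count` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count the lines of a hostfile that contain at least
# one non-whitespace character. Unlike `wc -l`, this ignores blank lines and
# still counts a final line that has no trailing newline.
count_hosts() {
    # grep -c prints 0 and exits with 1 when nothing matches; keep the 0.
    grep -c '[^[:space:]]' "$1" || true
}

# Hypothetical helper: initial node count plus the hosts in the hostfile.
expanded_node_count() {
    local initial="$1" hostfile="$2"
    echo $(( initial + $(count_hosts "$hostfile") ))
}
```

With a four-node initial setup and an eight-line hostfile, `expanded_node_count 4 hostfile_expand` yields the same total of 12 that the README example reports.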
scripts/run/gkfs.conf +4 −1
@@ -6,7 +6,10 @@ DAEMON_BIN=../../build/src/daemon/gkfs_daemon
 PROXY_BIN=../../build/src/proxy/gkfs_proxy

 # client configuration (needs to be set for all clients)
-LIBGKFS_HOSTS_FILE=/home/evie/workdir/gkfs_hosts.txt
+LIBGKFS_HOSTS_FILE=/home/XXX/workdir/gkfs_hosts.txt
+
+# tools (if build)
+GKFS_MALLEABILITY_BIN=../../build/tools/gkfs_malleability

 ## daemon configuration
 #DAEMON_ROOTDIR=/dev/shm/vef_gkfs_rootdir
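The new `GKFS_MALLEABILITY_BIN` entry points at a binary that only exists when GekkoFS was configured with `-DGKFS_BUILD_TOOLS=ON`, as the CHANGELOG and README hunks note. A minimal sketch of a guard that a wrapper could run before attempting an expansion; the function name `require_malleability_tool` and its message are hypothetical and do not appear in the `gkfs` script.

```bash
#!/usr/bin/env bash
# Hypothetical guard (not taken from scripts/run/gkfs): succeed only if the
# given path is an existing, executable regular file.
require_malleability_tool() {
    local bin="$1"
    if [ -f "$bin" ] && [ -x "$bin" ]; then
        return 0
    fi
    echo "gkfs_malleability not found or not executable at '$bin';" \
         "was GekkoFS built with -DGKFS_BUILD_TOOLS=ON?" >&2
    return 1
}
```

A caller would pass the configured path, for example `require_malleability_tool "$GKFS_MALLEABILITY_BIN" || exit 1`.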
src/daemon/malleability/malleable_manager.cpp +7 −3
@@ -51,6 +51,8 @@ namespace fs = std::filesystem;
 namespace gkfs::malleable {
+// TODO The following three functions are almost identical to the proxy code
+// They should be moved to a common and shared between the proxy and the daemon
 vector<pair<string, string>>
 MalleableManager::load_hostfile(const std::string& path) {
@@ -198,7 +200,7 @@ int
 MalleableManager::redistribute_metadata() {
     uint64_t count = 0;
     auto estimate_db_size = GKFS_DATA->mdb()->db_size();
-    auto percent_interval = estimate_db_size / 1000;
+    auto percent_interval = estimate_db_size / 100;
     GKFS_DATA->spdlogger()->info(
             "{}() Starting metadata redistribution for '{}' estimated number of KV pairs...",
             __func__, estimate_db_size);
@@ -206,6 +208,7 @@ MalleableManager::redistribute_metadata() {
     string key, value;
     auto iter =
             static_cast<rocksdb::Iterator*>(GKFS_DATA->mdb()->iterate_all());
+    // TODO parallelize
     for(iter->SeekToFirst(); iter->Valid(); iter->Next()) {
         key = iter->key().ToString();
         value = iter->value().ToString();
@@ -213,11 +216,11 @@ MalleableManager::redistribute_metadata() {
             continue;
         }
         auto dest_id = RPC_DATA->distributor()->locate_file_metadata(key, 0);
-        GKFS_DATA->spdlogger()->info(
+        GKFS_DATA->spdlogger()->trace(
                 "{}() Migration: key {} and value {}. From host {} to host {}",
                 __func__, key, value, RPC_DATA->local_host_id(), dest_id);
         if(dest_id == RPC_DATA->local_host_id()) {
-            GKFS_DATA->spdlogger()->info("{}() SKIPPERS", __func__);
+            GKFS_DATA->spdlogger()->trace("{}() SKIP", __func__);
             continue;
         }
         auto err = gkfs::malleable::rpc::forward_metadata(key, value, dest_id);
@@ -248,6 +251,7 @@ MalleableManager::redistribute_data() {
     auto chunk_dir = fs::path(GKFS_DATA->storage()->get_chunk_directory());
     auto dir_iterator = GKFS_DATA->storage()->get_all_chunk_files();
+    // TODO this can be parallelized, e.g., async chunk I/O
     for(const auto& entry : dir_iterator) {
         if(!entry.is_regular_file()) {
             continue;