Commit cddedd6f authored by Marc Vef

Merge branch 'marc/294-file-system-expansion-during-runtime' into 'master'

Resolve "File system expansion during runtime"

# Description

GekkoFS supports extending the current daemon configuration to additional compute nodes. Expansion redistributes the
existing data and metadata across the enlarged set of daemons, so file system performance and capacity scale for
existing data as well. Note that it is the user's responsibility not to access the GekkoFS file system during redistribution.
A corresponding feature that is transparent to the user is planned. Note also that, if the GekkoFS proxy is used, the proxies need to be restarted manually after expansion.

To enable this feature, the `gkfs_malleability` tool must be built, which requires the CMake flag `-DGKFS_BUILD_TOOLS=ON`.
The tool is then available in the `build/tools` directory. Please consult `-h` for its arguments.
While the tool can be used manually to expand the file system, the `scripts/run/gkfs` script, which invokes `gkfs_malleability`, should be used instead.

The only requirement for extending the file system is a hostfile containing the hostnames/IPs of the new nodes (one line per host).

Example of starting the file system. The `DAEMON_NODELIST` in the `gkfs.conf` is set to a hostfile containing the initial set of file system nodes:
```bash
~/gekkofs/scripts/run/gkfs -c ~/run/gkfs_verbs_expandtest.conf start
* [gkfs] Starting GekkoFS daemons (4 nodes) ...
* [gkfs] GekkoFS daemons running
* [gkfs] Startup time: 10.853 seconds
```
... Some computation ...

Expanding the file system, using `-e <hostfile>` to specify the new nodes. Redistribution is done automatically, and a progress bar shows its status.
When finished, the file system is ready to use in the new configuration:
```bash
~/gekkofs/scripts/run/gkfs -c ~/run/gkfs_verbs_expandtest.conf -e ~/hostfile_expand expand
* [gkfs] Starting GekkoFS daemons (8 nodes) ...
* [gkfs] GekkoFS daemons running
* [gkfs] Startup time: 1.058 seconds
Expansion process from 4 nodes to 12 nodes launched...
* [gkfs] Expansion progress:
[####################] 0/4 left
* [gkfs] Redistribution process done. Finalizing ...
* [gkfs] Expansion done.
```
Stop the file system:
```bash
~/gekkofs/scripts/run/gkfs -c ~/run/gkfs_verbs_expandtest.conf stop
* [gkfs] Stopping daemon with pid 16462
srun: sending Ctrl-C to StepId=282378.1
* [gkfs] Stopping daemon with pid 16761
srun: sending Ctrl-C to StepId=282378.2
* [gkfs] Shutdown time: 1.032 seconds
```
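
As a rough illustration of how the node counts in the expansion log relate to the hostfile, the following C++ sketch counts the hosts in a hostfile (one per line) and adds them to the number of running nodes. The function names, the host names, and the choice to skip blank lines are assumptions made for this example; it does not reflect how the `gkfs` script or `gkfs_malleability` is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Counts the hosts listed in a hostfile stream (one hostname or IP per line).
// Skipping blank lines is an assumption of this sketch, not documented
// GekkoFS behavior.
std::size_t
count_hosts(std::istream& hostfile) {
    std::size_t hosts = 0;
    std::string line;
    while(std::getline(hostfile, line)) {
        // A line counts only if it contains a non-whitespace character.
        if(line.find_first_not_of(" \t\r") != std::string::npos) {
            ++hosts;
        }
    }
    return hosts;
}

// Total number of daemon nodes after expanding by the hosts in `hostfile`.
std::size_t
nodes_after_expansion(std::size_t current_nodes, std::istream& hostfile) {
    return current_nodes + count_hosts(hostfile);
}

// Mirrors the log above: 4 running nodes plus a hostfile with 8 entries
// (hypothetical host names) gives 12 nodes.
std::size_t
expansion_example() {
    std::istringstream hostfile("n05\nn06\nn07\nn08\nn09\nn10\nn11\nn12\n");
    const std::size_t total = nodes_after_expansion(4, hostfile);
    assert(total == 12);
    return total;
}
```

In the log above, the script starts daemons on 8 new nodes and reports an expansion from 4 to 12 nodes, which matches a hostfile with eight entries.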

# Results
IOR results for writing/reading 768 GiB sequentially (192 procs) before and after expansion

![image](/uploads/57bd8f3a07a56c496b1ae0b096da24ef/image.png)

MDTest results for creating, stating, and removing 19200000 (192 procs) before and after expansion

![image](/uploads/7e2f58d864789e657140ced3e9e9716e/image.png)

Closes #294

See merge request !196
parents 318e5c76 49263be8
+6 −0
@@ -8,6 +8,12 @@ to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [Unreleased]
### New

- Added file system expansion support ([!196](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_requests/196)).
  - Added the tool `gkfs_malleability` to steer start, status, and finalize requests for expansion operations.
  - `-DGKFS_BUILD_TOOLS=ON` must be set for CMake to build the tool.
  - Overhauled the `gkfs` run script to accommodate the new tool.
  - During expansion, redistribution of data is performed by the daemons. Therefore, an RPC client for daemons was added.
  - See Readme for usage details.
- Propagate PKG_CONFIG_PATH to dependency scripts ([!185](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_requests/185)).
- Added syscall support for listxattr family ([!186](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_request/186)).
- Remove optimization, removing one RPC per operation ([!195](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_request/195)).
+63 −9
@@ -159,7 +159,7 @@ to be empty.

For MPI applications, the `LD_PRELOAD` variable can be passed with the `-x` argument for `mpirun/mpiexec`.

## Run GekkoFS daemons on multiple nodes (beta version!)
## Run GekkoFS daemons on multiple nodes

The `scripts/run/gkfs` script can be used to simplify starting the GekkoFS daemon on one or multiple nodes. To start
GekkoFS on multiple nodes, a Slurm environment that can execute `srun` is required. Users can further
@@ -168,9 +168,9 @@ modify `scripts/run/gkfs.conf` to mold default configurations to their environme
The following options are available for `scripts/run/gkfs`:

```bash
usage: gkfs [-h/--help] [-r/--rootdir <path>] [-m/--mountdir <path>] [-a/--args <daemon_args>] [-f/--foreground <false>]
        [--srun <false>] [-n/--numnodes <jobsize>] [--cpuspertask <64>] [--numactl <false>] [-v/--verbose <false>]
        {start,stop}
usage: gkfs [-h/--help] [-r/--rootdir <path>] [-m/--mountdir <path>] [-a/--args <daemon_args>] [--proxy <false>] [-f/--foreground <false>]
        [--srun <false>] [-n/--numnodes <jobsize>] [--cpuspertask <64>] [-v/--verbose <false>]
        {start,expand,stop}


    This script simplifies the starting and stopping GekkoFS daemons. If daemons are started on multiple nodes,
@@ -178,21 +178,23 @@ usage: gkfs [-h/--help] [-r/--rootdir <path>] [-m/--mountdir <path>] [-a/--args
    additional permanent configurations can be set.

    positional arguments:
            command                 Command to execute: 'start' and 'stop'
            COMMAND                 Command to execute: 'start', 'stop', 'expand'

    optional arguments:
            -h, --help              Shows this help message and exits
            -r, --rootdir <path>    Providing the rootdir path for GekkoFS daemons.
            -m, --mountdir <path>   Providing the mountdir path for GekkoFS daemons.
            -a, --args <daemon_arguments>
            -r, --rootdir <path>    The rootdir path for GekkoFS daemons.
            -m, --mountdir <path>   The mountdir path for GekkoFS daemons.
            -d, --daemon_args <daemon_arguments>
            --proxy                 Start proxy after the daemons are running.
                                    Add various additional daemon arguments, e.g., "-l ib0 -P ofi+psm2".
            -p, --proxy_args <proxy_arguments>
            -f, --foreground        Starts the script in the foreground. Daemons are stopped by pressing 'q'.
            --srun                  Use srun to start daemons on multiple nodes.
            -n, --numnodes <n>      GekkoFS daemons are started on n nodes.
                                    Nodelist is extracted from Slurm via the SLURM_JOB_ID env variable.
            --cpuspertask <#cores>  Set the number of cores the daemons can use. Must use '--srun'.
            --numactl               Use numactl for the daemon. Modify gkfs.conf for further numactl configurations.
            -c, --config            Path to configuration file. By defaults looks for a 'gkfs.conf' in this directory.
            -e, --expand_hostfile   Path to the hostfile with new nodes where GekkoFS should be extended to (hostfile contains one line per node).
            -v, --verbose           Increase verbosity
```

@@ -415,6 +417,58 @@ Press 'q' to exit
Please consult `include/config.hpp` for additional configuration options. Note, GekkoFS proxy does not support
replication.

### File system expansion

GekkoFS supports extending the current daemon configuration to additional compute nodes. Expansion redistributes the
existing data and metadata across the enlarged set of daemons, so file system performance and capacity scale for
existing data as well. Note that it is the user's responsibility not to access the GekkoFS file system during
redistribution. A corresponding feature that is transparent to the user is planned. Note also that, if the GekkoFS proxy
is used, the proxies need to be restarted manually after expansion.

To enable this feature, the `gkfs_malleability` tool must be built, which requires the CMake flag
`-DGKFS_BUILD_TOOLS=ON`. The tool is then available in the `build/tools` directory. Please consult `-h` for its
arguments. While the tool can be used manually to expand the file system, the `scripts/run/gkfs` script, which invokes
`gkfs_malleability`, should be used instead.

The only requirement for extending the file system is a hostfile containing the hostnames/IPs of the new nodes (one line
per host).

Example of starting the file system. The `DAEMON_NODELIST` in the `gkfs.conf` is set to a hostfile containing the
initial set of file system nodes:

```bash
~/gekkofs/scripts/run/gkfs -c ~/run/gkfs_verbs_expandtest.conf start
* [gkfs] Starting GekkoFS daemons (4 nodes) ...
* [gkfs] GekkoFS daemons running
* [gkfs] Startup time: 10.853 seconds
```

... Some computation ...

Expanding the file system, using `-e <hostfile>` to specify the new nodes. Redistribution is done automatically, and a
progress bar shows its status. When finished, the file system is ready to use in the new configuration:

```bash
~/gekkofs/scripts/run/gkfs -c ~/run/gkfs_verbs_expandtest.conf -e ~/hostfile_expand expand
* [gkfs] Starting GekkoFS daemons (8 nodes) ...
* [gkfs] GekkoFS daemons running
* [gkfs] Startup time: 1.058 seconds
Expansion process from 4 nodes to 12 nodes launched...
* [gkfs] Expansion progress:
[####################] 0/4 left
* [gkfs] Redistribution process done. Finalizing ...
* [gkfs] Expansion done.
```

Stop the file system:

```bash
~/gekkofs/scripts/run/gkfs -c ~/run/gkfs_verbs_expandtest.conf stop
* [gkfs] Stopping daemon with pid 16462
srun: sending Ctrl-C to StepId=282378.1
* [gkfs] Stopping daemon with pid 16761
srun: sending Ctrl-C to StepId=282378.2
* [gkfs] Shutdown time: 1.032 seconds
```

## Acknowledgment

This software was partially supported by the EC H2020 funded NEXTGenIO project (Project ID: 671951, www.nextgenio.eu).
+1 −0
@@ -72,6 +72,7 @@ target_sources(
         rpc/forward_management.hpp
         rpc/forward_metadata.hpp
         rpc/forward_data.hpp
    rpc/forward_malleability.hpp
         syscalls/args.hpp
         syscalls/decoder.hpp
         syscalls/errno.hpp
+45 −0
/*
  Copyright 2018-2024, Barcelona Supercomputing Center (BSC), Spain
  Copyright 2015-2024, Johannes Gutenberg Universitaet Mainz, Germany

  This software was partially supported by the
  EC H2020 funded project NEXTGenIO (Project ID: 671951, www.nextgenio.eu).

  This software was partially supported by the
  ADA-FS project under the SPPEXA project funded by the DFG.

  This file is part of GekkoFS' POSIX interface.

  GekkoFS' POSIX interface is free software: you can redistribute it and/or
  modify it under the terms of the GNU Lesser General Public License as
  published by the Free Software Foundation, either version 3 of the License,
  or (at your option) any later version.

  GekkoFS' POSIX interface is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  GNU Lesser General Public License for more details.

  You should have received a copy of the GNU Lesser General Public License
  along with GekkoFS' POSIX interface.  If not, see
  <https://www.gnu.org/licenses/>.

  SPDX-License-Identifier: LGPL-3.0-or-later
*/

#ifndef GEKKOFS_CLIENT_FORWARD_MALLEABILITY_HPP
#define GEKKOFS_CLIENT_FORWARD_MALLEABILITY_HPP

namespace gkfs::malleable::rpc {

int
forward_expand_start(int old_server_conf, int new_server_conf);

int
forward_expand_status();

int
forward_expand_finalize();
} // namespace gkfs::malleable::rpc

#endif // GEKKOFS_CLIENT_FORWARD_MALLEABILITY_HPP
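
The header above only declares the three forwarding functions. As a hedged sketch of how a caller might sequence them (start, poll status, finalize), the following self-contained C++ example uses a hypothetical `expand_ops` bundle of callables in place of the real functions. The return-value meanings are assumptions, not taken from the source: here `start` and `finalize` return 0 on success, and `status` returns the number of daemons still redistributing (0 when done, negative on error), loosely modeled on the `0/4 left` progress output. The two `int` arguments are assumed to describe the old and new server configurations, for example node counts.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-ins for forward_expand_start(), forward_expand_status()
// and forward_expand_finalize(). The return-value semantics are assumptions
// of this sketch (see the text above), not the GekkoFS contract.
struct expand_ops {
    std::function<int(int, int)> start;
    std::function<int()> status;
    std::function<int()> finalize;
};

// Drives one expansion: start it, poll until no daemon is left
// redistributing, then finalize. Returns 0 on success, otherwise the first
// error value reported by one of the callables.
int
run_expansion(const expand_ops& ops, int old_conf, int new_conf,
              std::chrono::milliseconds poll_interval) {
    assert(ops.start && ops.status && ops.finalize);

    const int start_err = ops.start(old_conf, new_conf);
    if(start_err != 0) {
        return start_err;
    }
    for(;;) {
        const int remaining = ops.status();
        if(remaining < 0) {
            return remaining; // status reported an error
        }
        if(remaining == 0) {
            break; // redistribution finished
        }
        std::this_thread::sleep_for(poll_interval);
    }
    return ops.finalize();
}
```

Passing the operations as callables keeps the sketch independent of the RPC layer, so the control flow can be exercised with simple lambdas.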
+326 −2
@@ -63,7 +63,9 @@ hg_proc_void_t(hg_proc_t proc, void* data) {

} // namespace hermes::detail

namespace gkfs::rpc {
namespace gkfs {

namespace rpc {

//==============================================================================
// definitions for fs_config
@@ -3693,8 +3695,330 @@ struct get_dirents_extended_proxy {
        size_t m_dirents_size;
    };
};
} // namespace rpc
namespace malleable::rpc {

//==============================================================================
// definitions for expand_start
struct expand_start {

    // forward declarations of public input/output types for this RPC
    class input;

    class output;

    // traits used so that the engine knows what to do with the RPC
    using self_type = expand_start;
    using handle_type = hermes::rpc_handle<self_type>;
    using input_type = input;
    using output_type = output;
    using mercury_input_type = rpc_expand_start_in_t;
    using mercury_output_type = rpc_err_out_t;

    // RPC public identifier
    // (N.B: we reuse the same IDs assigned by Margo so that the daemon
    // understands Hermes RPCs)
    constexpr static const uint64_t public_id = 50;

    // RPC internal Mercury identifier
    constexpr static const hg_id_t mercury_id = 0;

    // RPC name
    constexpr static const auto name = gkfs::malleable::rpc::tag::expand_start;

    // requires response?
    constexpr static const auto requires_response = true;

    // Mercury callback to serialize input arguments
    constexpr static const auto mercury_in_proc_cb =
            HG_GEN_PROC_NAME(rpc_expand_start_in_t);

    // Mercury callback to serialize output arguments
    constexpr static const auto mercury_out_proc_cb =
            HG_GEN_PROC_NAME(rpc_err_out_t);

    class input {

        template <typename ExecutionContext>
        friend hg_return_t
        hermes::detail::post_to_mercury(ExecutionContext*);

    public:
        input(const uint32_t old_server_conf, uint32_t new_server_conf)
            : m_old_server_conf(old_server_conf),
              m_new_server_conf(new_server_conf) {}

        input(input&& rhs) = default;

        input(const input& other) = default;

        input&
        operator=(input&& rhs) = default;

        input&
        operator=(const input& other) = default;

        uint32_t
        old_server_conf() const {
            return m_old_server_conf;
        }

        uint32_t
        new_server_conf() const {
            return m_new_server_conf;
        }

        explicit input(const rpc_expand_start_in_t& other)
            : m_old_server_conf(other.old_server_conf),
              m_new_server_conf(other.new_server_conf) {}

        explicit operator rpc_expand_start_in_t() {
            return {m_old_server_conf, m_new_server_conf};
        }

    private:
        uint32_t m_old_server_conf;
        uint32_t m_new_server_conf;
    };

    class output {

        template <typename ExecutionContext>
        friend hg_return_t
        hermes::detail::post_to_mercury(ExecutionContext*);

    public:
        output() : m_err() {}

        output(int32_t err) : m_err(err) {}

        output(output&& rhs) = default;

        output(const output& other) = default;

        output&
        operator=(output&& rhs) = default;

        output&
        operator=(const output& other) = default;

        explicit output(const rpc_err_out_t& out) {
            m_err = out.err;
        }

        int32_t
        err() const {
            return m_err;
        }

    private:
        int32_t m_err;
    };
};

//==============================================================================
// definitions for expand_status
struct expand_status {

    // forward declarations of public input/output types for this RPC
    class input;

    class output;

    // traits used so that the engine knows what to do with the RPC
    using self_type = expand_status;
    using handle_type = hermes::rpc_handle<self_type>;
    using input_type = input;
    using output_type = output;
    using mercury_input_type = hermes::detail::hg_void_t;
    using mercury_output_type = rpc_err_out_t;

    // RPC public identifier
    // (N.B: we reuse the same IDs assigned by Margo so that the daemon
    // understands Hermes RPCs)
    constexpr static const uint64_t public_id = 51;

    // RPC internal Mercury identifier
    constexpr static const hg_id_t mercury_id = 0;

    // RPC name
    constexpr static const auto name = gkfs::malleable::rpc::tag::expand_status;

    // requires response?
    constexpr static const auto requires_response = true;

    // Mercury callback to serialize input arguments
    constexpr static const auto mercury_in_proc_cb =
            hermes::detail::hg_proc_void_t;

    // Mercury callback to serialize output arguments
    constexpr static const auto mercury_out_proc_cb =
            HG_GEN_PROC_NAME(rpc_err_out_t);

    class input {

        template <typename ExecutionContext>
        friend hg_return_t
        hermes::detail::post_to_mercury(ExecutionContext*);

    public:
        input() {}

        input(input&& rhs) = default;

        input(const input& other) = default;

        input&
        operator=(input&& rhs) = default;

        input&
        operator=(const input& other) = default;

        explicit input(const hermes::detail::hg_void_t& other) {}

        explicit operator hermes::detail::hg_void_t() {
            return {};
        }
    };

    class output {

        template <typename ExecutionContext>
        friend hg_return_t
        hermes::detail::post_to_mercury(ExecutionContext*);

    public:
        output() : m_err() {}

        output(int32_t err) : m_err(err) {}

        output(output&& rhs) = default;

        output(const output& other) = default;

        output&
        operator=(output&& rhs) = default;

        output&
        operator=(const output& other) = default;

        explicit output(const rpc_err_out_t& out) {
            m_err = out.err;
        }

        int32_t
        err() const {
            return m_err;
        }

    private:
        int32_t m_err;
    };
};

//==============================================================================
// definitions for expand_finalize
struct expand_finalize {

    // forward declarations of public input/output types for this RPC
    class input;

    class output;

    // traits used so that the engine knows what to do with the RPC
    using self_type = expand_finalize;
    using handle_type = hermes::rpc_handle<self_type>;
    using input_type = input;
    using output_type = output;
    using mercury_input_type = hermes::detail::hg_void_t;
    using mercury_output_type = rpc_err_out_t;

    // RPC public identifier
    // (N.B: we reuse the same IDs assigned by Margo so that the daemon
    // understands Hermes RPCs)
    constexpr static const uint64_t public_id = 52;

    // RPC internal Mercury identifier
    constexpr static const hg_id_t mercury_id = 0;

    // RPC name
    constexpr static const auto name =
            gkfs::malleable::rpc::tag::expand_finalize;

    // requires response?
    constexpr static const auto requires_response = true;

    // Mercury callback to serialize input arguments
    constexpr static const auto mercury_in_proc_cb =
            hermes::detail::hg_proc_void_t;

    // Mercury callback to serialize output arguments
    constexpr static const auto mercury_out_proc_cb =
            HG_GEN_PROC_NAME(rpc_err_out_t);

    class input {

        template <typename ExecutionContext>
        friend hg_return_t
        hermes::detail::post_to_mercury(ExecutionContext*);

    public:
        input() {}

        input(input&& rhs) = default;

        input(const input& other) = default;

        input&
        operator=(input&& rhs) = default;

        input&
        operator=(const input& other) = default;

        explicit input(const hermes::detail::hg_void_t& other) {}

        explicit operator hermes::detail::hg_void_t() {
            return {};
        }
    };

    class output {

        template <typename ExecutionContext>
        friend hg_return_t
        hermes::detail::post_to_mercury(ExecutionContext*);

    public:
        output() : m_err() {}

        output(int32_t err) : m_err(err) {}

        output(output&& rhs) = default;

        output(const output& other) = default;

        output&
        operator=(output&& rhs) = default;

        output&
        operator=(const output& other) = default;

        explicit output(const rpc_err_out_t& out) {
            m_err = out.err;
        }

        int32_t
        err() const {
            return m_err;
        }

    private:
        int32_t m_err;
    };
};

} // namespace gkfs::rpc
} // namespace malleable::rpc
} // namespace gkfs


#endif // GKFS_RPCS_TYPES_HPP
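
The three RPC definitions above share one pattern: a C++ `input` class that converts to and from the Mercury wire struct through an explicit constructor and an explicit conversion operator, and an `output` class that wraps the returned error code. The following Mercury-free sketch isolates that pattern. The types `wire_expand_start_in` and `wire_err_out` are hypothetical stand-ins for `rpc_expand_start_in_t` and `rpc_err_out_t`, whose real definitions are generated elsewhere and are not shown in this diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the generated wire struct rpc_expand_start_in_t.
struct wire_expand_start_in {
    std::uint32_t old_server_conf;
    std::uint32_t new_server_conf;
};

// Hypothetical stand-in for rpc_err_out_t.
struct wire_err_out {
    std::int32_t err;
};

class expand_start_input {
public:
    expand_start_input(std::uint32_t old_server_conf,
                       std::uint32_t new_server_conf)
        : m_old_server_conf(old_server_conf),
          m_new_server_conf(new_server_conf) {}

    // wire -> C++: used when decoding a received request
    explicit expand_start_input(const wire_expand_start_in& other)
        : m_old_server_conf(other.old_server_conf),
          m_new_server_conf(other.new_server_conf) {}

    // C++ -> wire: used when the request is handed to the transport
    explicit operator wire_expand_start_in() const {
        return {m_old_server_conf, m_new_server_conf};
    }

    std::uint32_t old_server_conf() const { return m_old_server_conf; }
    std::uint32_t new_server_conf() const { return m_new_server_conf; }

private:
    std::uint32_t m_old_server_conf;
    std::uint32_t m_new_server_conf;
};

class err_output {
public:
    err_output() : m_err(0) {}
    explicit err_output(const wire_err_out& out) : m_err(out.err) {}
    std::int32_t err() const { return m_err; }

private:
    std::int32_t m_err;
};

// Round trip of a request: C++ object -> wire struct -> C++ object.
void
round_trip_demo() {
    const expand_start_input request(4, 12);
    const auto wire = static_cast<wire_expand_start_in>(request);
    const expand_start_input decoded(wire);
    assert(decoded.old_server_conf() == 4);
    assert(decoded.new_server_conf() == 12);
}
```

Making both conversions `explicit` means a wire struct is never silently turned into the C++ type or back, which keeps the boundary to the serialization layer visible at each call site.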