Commit c300a359 authored by Marc Vef

Merge branch 'marc/62-shared-file-metadata-congestion-2' into 'master'

Resolve "Shared file metadata congestion"

During write operations, the client must update the file size on the responsible metadata daemon. The write size cache
can reduce the metadata load on the daemon and reduce the number of RPCs during write operations, especially for many
small I/O operations. In the past, we have observed that a daemon can become network-congested, especially with a
single shared file, many processes, and small I/O operations, which bottlenecks the overall I/O throughput. Beyond
that scenario, the cache benefits small I/O operations in general: the RPC for updating the size is no longer sent on
every write, which already improves small file I/O on a single node.

Note that this cache may impact file size consistency: stat operations may not reflect the actual file size
until the file is closed. The cache does not impact the consistency of the file data itself. We did not observe any
issues with the cache for HPC applications and benchmarks, but it technically breaks POSIX. So, for now, I suggest
keeping it experimental and opt-in.

- `LIBGKFS_WRITE_SIZE_CACHE` - Enable caching the write size of files (default: OFF).
- `LIBGKFS_WRITE_SIZE_CACHE_THRESHOLD` - Set the number of write operations after which the file size is synchronized
  with the corresponding daemon (default: 1000). The file size is further synchronized when the file is `close()`d or
  when `fsync()` is called.
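
The threshold behaviour described above can be illustrated with a minimal sketch. It is not the GekkoFS implementation: `SketchWriteSizeCache`, `count_size_updates`, and the choice to keep the largest size seen per path are assumptions made for illustration. The sketch counts how many size updates a client would send for a sequence of appending writes to one file.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the client-side cache; not the GekkoFS code.
class SketchWriteSizeCache {
public:
    explicit SketchWriteSizeCache(std::size_t threshold)
        : threshold_(threshold) {}

    // Record the file size resulting from one write. Returns true when the
    // caller should now send the cached size to the metadata daemon.
    bool record(const std::string& path, std::size_t size) {
        std::lock_guard<std::mutex> lock(mtx_);
        auto& [cnt, cached] = cache_[path];
        ++cnt;
        cached = std::max(cached, size); // assumption: keep the largest size
        return cnt >= threshold_;
    }

    // Return the cached size. With evict == true the entry is dropped,
    // otherwise only its write counter restarts at 0.
    std::size_t reset(const std::string& path, bool evict) {
        std::lock_guard<std::mutex> lock(mtx_);
        auto it = cache_.find(path);
        if(it == cache_.end())
            return 0;
        const std::size_t size = it->second.second;
        if(evict)
            cache_.erase(it);
        else
            it->second.first = 0;
        return size;
    }

private:
    // path -> <write counter, cached size>
    std::unordered_map<std::string, std::pair<std::size_t, std::size_t>> cache_;
    std::mutex mtx_;
    std::size_t threshold_;
};

// Simulate `writes` appending writes of `io_size` bytes to one file and
// count the size updates that would be sent to the daemon.
std::size_t count_size_updates(std::size_t writes, std::size_t io_size,
                               std::size_t threshold) {
    SketchWriteSizeCache cache(threshold);
    std::size_t rpcs = 0;
    for(std::size_t i = 1; i <= writes; ++i) {
        if(cache.record("/shared_file", i * io_size)) {
            cache.reset("/shared_file", false);
            ++rpcs; // threshold reached: one size update
        }
    }
    cache.reset("/shared_file", true);
    ++rpcs; // final synchronization when the file is closed
    return rpcs;
}
```

Under these assumptions, `count_size_updates(2500, 4096, 1000)` yields 3 size updates (two threshold flushes plus the final one on close), compared with one update per write, i.e. 2500, without the cache.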

Depends on !194

Closes #62

See merge request !193
parents 950ba459 680fe6b5
+10 −1
@@ -8,9 +8,18 @@ to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [Unreleased]
### New

- Added a write size cache to the file system client to reduce potential metadata network bottlenecks during small I/O
  operations ([!193](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_requests/193)).
  - The cache is experimental and thus disabled by default. Added the following environment variables.
  - `LIBGKFS_WRITE_SIZE_CACHE` - Enable caching the write size of files (default: OFF).
  - `LIBGKFS_WRITE_SIZE_CACHE_THRESHOLD` - Set the number of write operations after which the file size is synchronized
    with the corresponding daemon (default: 1000). The file size is further synchronized when the file is `close()`d or
    when `fsync()` is called.
- Added a directory cache for the file system client to improve `ls -l` type operations by avoiding consecutive stat calls
  ([!194](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_requests/194)).
-  - The cache is experimental and thus disabled by default and can be enabled with the env variable `LIBGKFS_DISABLE_DIR_CACHE` set to `ON`.
  - The cache is experimental and thus disabled by default. Added the following environment variables.
  - `LIBGKFS_DENTRY_CACHE` - Enable caching directory entries until closing the directory (default: OFF).
      Further compile-time settings available at `include/config.hpp`.
- Added file system expansion support ([!196](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_requests/196)).
  - Added the tool `gkfs_malleability` to steer start, status, and finalize requests for expansion operations.
  - `-DGKFS_BUILD_TOOLS=ON` must be set for CMake to build the tool.
+18 −2
@@ -517,8 +517,24 @@ Client-metrics require the CMake argument `-DGKFS_ENABLE_CLIENT_METRICS=ON` (see
- `LIBGKFS_PROXY_PID_FILE` - Path to the proxy pid file (when using the GekkoFS proxy).
- `LIBGKFS_NUM_REPL` - Number of replicas for data.
#### Caching
##### Dentry cache
Improves performance for `ls -l` type operations by caching file metadata for subsequent `stat()` operations during
`readdir()`. Depending on the size of the directory, this can avoid a significant number of stat RPCs.
- `LIBGKFS_DENTRY_CACHE` - Enable caching directory entries until closing the directory (default: OFF).
-Improves performance for `ls -l` type operations. Further compile-time settings available at `include/config.hpp`.
  Further compile-time settings available at `include/config.hpp`.

##### Write size cache
During write operations, the client must update the file size on the responsible metadata daemon. The write size cache
can reduce the metadata load on the daemon and reduce the number of RPCs during write operations, especially for many
small I/O operations.

Note that this cache may impact file size consistency: stat operations may not reflect the actual file size
until the file is closed. The cache does not impact the consistency of the file data itself.

- `LIBGKFS_WRITE_SIZE_CACHE` - Enable caching the write size of files (default: OFF).
- `LIBGKFS_WRITE_SIZE_CACHE_THRESHOLD` - Set the number of write operations after which the file size is synchronized
  with the corresponding daemon (default: 1000). The file size is further synchronized when the file is `close()`d or
  when `fsync()` is called.
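
As an illustration of how the two variables above could be interpreted, here is a minimal sketch. It is not the GekkoFS parsing code: the struct `WriteSizeCacheSettings`, the function `read_write_size_cache_settings`, and the assumption that the value `ON` enables the cache are hypothetical; only the variable names and defaults come from the list above.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical settings holder, initialized with the documented defaults.
struct WriteSizeCacheSettings {
    bool enabled = false;         // documented default: OFF
    std::size_t threshold = 1000; // documented default: 1000
};

// Read both variables from the environment, keeping the defaults when a
// variable is unset or its value cannot be interpreted.
WriteSizeCacheSettings read_write_size_cache_settings() {
    WriteSizeCacheSettings s;
    if(const char* v = std::getenv("LIBGKFS_WRITE_SIZE_CACHE")) {
        // Assumption: the literal value "ON" enables the cache.
        s.enabled = (std::string(v) == "ON");
    }
    if(const char* v = std::getenv("LIBGKFS_WRITE_SIZE_CACHE_THRESHOLD")) {
        char* end = nullptr;
        const unsigned long long n = std::strtoull(v, &end, 10);
        // Accept only plain positive decimal numbers.
        if(v[0] >= '0' && v[0] <= '9' && *end == '\0' && n > 0)
            s.threshold = static_cast<std::size_t>(n);
    }
    return s;
}
```

With both variables unset, the sketch returns the documented defaults (cache disabled, threshold 1000).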

### Daemon
#### Logging
+55 −0
@@ -39,6 +39,7 @@
#include <mutex>
#include <optional>
#include <cstdint>
#include <utility>

namespace gkfs::cache {

@@ -132,6 +133,60 @@ public:
};
} // namespace dir

namespace file {
class WriteSizeCache {
private:
    // <path<cnt, size>>
    std::unordered_map<std::string, std::pair<size_t, size_t>> size_cache;
    std::mutex mtx_;

    // Flush threshold in number of write ops per file
    size_t flush_threshold_{0};

public:
    WriteSizeCache() = default;

    virtual ~WriteSizeCache() = default;

    /**
     * @brief Record the size of a file and add it to the cache
     * @param path gekkofs path
     * @param size current size to set for given path
     * @return [size_update counter, current cached size]
     */
    std::pair<size_t, size_t>
    record(std::string path, size_t size);

    /**
     * @brief Reset an entry in the cache
     * @param path
     * @param evict if true, the entry is removed from the cache; otherwise
     * its counter is reset to 0
     * @return [size_update counter, current cached size]
     */
    std::pair<size_t, size_t>
    reset(const std::string& path, bool evict);

    /**
     * @brief Flush the cache for a given path contacting the corresponding
     * daemon
     * @param path
     * @param evict during flush: if true, the entry is removed from the
     * cache; otherwise its counter is reset to 0
     * @return error code and flushed size
     */
    std::pair<int, off64_t>
    flush(const std::string& path, bool evict = true);


    // GETTER/SETTER
    size_t
    flush_threshold() const;

    void
    flush_threshold(size_t flush_threshold);
};
} // namespace file
} // namespace gkfs::cache

#endif // GKFS_CLIENT_CACHE
+6 −1
@@ -60,7 +60,12 @@ static constexpr auto METRICS_IP_PORT = ADD_PREFIX("METRICS_IP_PORT");

static constexpr auto NUM_REPL = ADD_PREFIX("NUM_REPL");
static constexpr auto PROXY_PID_FILE = ADD_PREFIX("PROXY_PID_FILE");
-static constexpr auto DENTRY_CACHE = ADD_PREFIX("DENTRY_CACHE");
namespace cache {
static constexpr auto DENTRY = ADD_PREFIX("DENTRY_CACHE");
static constexpr auto WRITE_SIZE = ADD_PREFIX("WRITE_SIZE_CACHE");
static constexpr auto WRITE_SIZE_THRESHOLD =
        ADD_PREFIX("WRITE_SIZE_CACHE_THRESHOLD");
} // namespace cache

} // namespace gkfs::env

+3 −0
@@ -157,6 +157,9 @@ gkfs_getdents64(unsigned int fd, struct linux_dirent64* dirp,
int
gkfs_rmdir(const std::string& path);

int
gkfs_fsync(unsigned int fd);

int
gkfs_close(unsigned int fd);
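
The commit message states that the cached file size is synchronized when the file is closed or when `fsync()` is called, and that `stat` may see an outdated size before that. The following sketch models only this visibility effect. It is not GekkoFS code: `SketchClient` and all of its members are hypothetical, the flush threshold is ignored, and whether `fsync()` keeps the cache entry while `close()` evicts it is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical model: `daemon_size` stands in for the metadata daemon,
// `cached_size` for the client-side write size cache.
struct SketchClient {
    std::unordered_map<std::string, std::size_t> daemon_size;
    std::unordered_map<std::string, std::size_t> cached_size;

    // A write only updates the client-side cache.
    void write_to(const std::string& path, std::size_t new_size) {
        auto& s = cached_size[path];
        if(new_size > s)
            s = new_size;
    }

    // stat asks the daemon and may therefore return a stale size.
    std::size_t stat_size(const std::string& path) const {
        auto it = daemon_size.find(path);
        return it == daemon_size.end() ? 0 : it->second;
    }

    // fsync: push the cached size, keep the cache entry (assumption).
    void sync(const std::string& path) { flush(path, false); }

    // close: push the cached size and evict the entry (assumption).
    void close_file(const std::string& path) { flush(path, true); }

private:
    void flush(const std::string& path, bool evict) {
        auto it = cached_size.find(path);
        if(it == cached_size.end())
            return;
        auto& d = daemon_size[path];
        if(it->second > d)
            d = it->second;
        if(evict)
            cached_size.erase(it);
    }
};
```

In this model, a `stat_size()` call issued between `write_to()` and the next `sync()` or `close_file()` returns the previously synchronized size, which is the file size inconsistency the commit message warns about; the file data itself is not part of the model.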
