Commit 8930cfc7 authored by Marc Vef

Merge branch 'marc/254-support-for-parallel-append-operations' into 'master'

Resolve "Support for (parallel) append operations"

This MR adds (parallel) append support for write operations. Some append code already existed, but it ran for each `pwrite` when the file size was updated. As a result, parts of strings were serialized and deserialized within RocksDB's merge operation even when not needed. Previously, `open()` returned `ENOTSUP` when `O_APPEND` was used. Simply removing this statement did not make append functional because of how size updates and the append case worked. Overall, `gkfs_pwrite()`, which first updates the file size and then writes the file, was quite messy, with unused return values and arguments. Further, the server calculated the updated size without regard to what occurred in the KV store. Therefore, as part of this MR, the entire size update process within `pwrite()` was refactored.

Parallel appends are achieved by hooking into RocksDB's `Merge Operator`, which is triggered at some point (e.g., during `Get()`). When append is not used, the client already knows the offset, so the file size is updated to the `offset + count` set in `gkfs_pwrite()`. No further coordination is required since overlapping offsets are the user's responsibility. The code path for non-append operations was slightly optimized but largely remains the same.
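
The non-append size update described above can be sketched as follows. This is only an illustration: `requested_size()` and `merged_size()` are hypothetical helpers that do not appear in this MR, and keeping the larger of the stored and the requested size is an assumption (a write that lands inside the file should not shrink it).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not GekkoFS code): the size a client requests for a
// non-append write of `count` bytes starting at `offset`.
inline int64_t
requested_size(int64_t offset, std::size_t count) {
    return offset + static_cast<int64_t>(count);
}

// Assumption: a write never shrinks the file, so the stored size becomes the
// larger of the current size and the requested size.
inline int64_t
merged_size(int64_t stored_size, int64_t requested) {
    return std::max(stored_size, requested);
}

// Small self-check of the two helpers.
inline void
check_non_append_example() {
    // Writing 50 bytes at offset 100 asks for a size of 150.
    assert(requested_size(100, 50) == 150);
    // A write that ends inside the existing file leaves the size unchanged.
    assert(merged_size(150, requested_size(100, 20)) == 150);
}
```

Because the client computes the target size itself, no offset has to travel back from the daemon on this path.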

Append operations are treated differently because, during a write operation, it is not clear where a process calling `write()` should start writing. The EOF information loaded during open may be outdated when multiple processes try to append at the same time, which causes a race condition. Since the size update on the daemon is atomic, a process that updates the size before performing a write can reserve a corresponding byte interval `[EOF, EOF + count]`. However, calling `Merge()` on RocksDB does not immediately trigger a merge operation, since multiple merges are batched before the operation is run. For append, the merge operation is therefore forced by running `Get()` on RocksDB. The corresponding merge operation then returns the starting write offset to the process updating the size. As a result, appends are more expensive than non-appends.
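
The reservation scheme can be illustrated with a toy model. The sketch below uses neither GekkoFS nor RocksDB code: `ToySizeEntry`, `merge_append()`, `get()`, and `reserve()` are hypothetical names, a mutex stands in for the atomic size update on the daemon, and a queue of pending operands stands in for batched merges that are only applied once a read forces them.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy model of a size entry with lazily applied merge operands.
// All names are hypothetical; this is neither GekkoFS nor RocksDB code.
class ToySizeEntry {
public:
    explicit ToySizeEntry(uint64_t initial_size) : size_(initial_size) {}

    // Queue an append of `count` bytes. As with a batched merge, nothing is
    // applied yet. Returns an id that identifies this operand.
    uint64_t
    merge_append(uint64_t count) {
        std::lock_guard<std::mutex> lock(mtx_);
        pending_.emplace_back(next_id_, count);
        return next_id_++;
    }

    // Reading the entry forces all queued operands to be applied.
    uint64_t
    get() {
        std::lock_guard<std::mutex> lock(mtx_);
        apply_pending();
        return size_;
    }

    // Append path: queue the operand, force the merge, and return the start
    // offset of the reserved byte range to the caller.
    uint64_t
    reserve(uint64_t count) {
        const uint64_t id = merge_append(count);
        std::lock_guard<std::mutex> lock(mtx_);
        apply_pending();
        const uint64_t start = offsets_.at(id);
        offsets_.erase(id);
        return start;
    }

private:
    // Apply operands in arrival order. The size seen before an operand is
    // applied becomes that operand's start offset. Caller must hold mtx_.
    void
    apply_pending() {
        for(const auto& op : pending_) {
            offsets_[op.first] = size_;
            size_ += op.second;
        }
        pending_.clear();
    }

    std::mutex mtx_;
    uint64_t size_;
    uint64_t next_id_ = 0;
    std::vector<std::pair<uint64_t, uint64_t>> pending_; // (id, count)
    std::unordered_map<uint64_t, uint64_t> offsets_;     // id -> start offset
};
```

For an entry that starts at size 100, `reserve(10)` followed by `reserve(5)` would return 100 and 110 and leave the size at 115, so the two writers get disjoint ranges without consulting a possibly stale EOF. The extra forced merge per append is what makes this path more expensive than the non-append one.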

Lastly, some missing documentation was added.

As reported, this MR adds support for the DASI application, used in IO-SEA.

Note: This MR does not consider failing writes, which would require us to collapse a reserved interval and close the resulting hole in the file.

Closes #254
Closes #12

Closes #12 and #254

See merge request !164
parents a2b88702 6c264285
+8 −2
@@ -17,8 +17,10 @@ to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
- Support for increasing file size via `truncate()`
  added ([!159](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_requests/159)
- Added PowerPC support ([!151](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_requests/151)).
-- GKFS_RENAME_SUPPORT adds support for renaming files. It includes the use case of renaming opened files using the fd
-- FLOCK and fcntl functions for locks, are not supported, but they are available.
+- GKFS_RENAME_SUPPORT added to support renaming files. This specifically targets the use case for opened files using an
+  existing file descriptor ([!133](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_requests/133)).
+- Added `FLOCK` and `fcntl` functions for locks to interception albeit not supported by GekkoFS and returning the
+  corresponding error code ([!133](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_requests/133)).
- Added support for [CMake presets](https://cmake.org/cmake/help/latest/manual/cmake-presets.7.html) to simplify build 
  configurations ([!163](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_requests/163#note_8179)).
- Several improvements to CMake scripts ([!143](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_requests/143))):
@@ -33,6 +35,8 @@ to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
  - Adds the `gkfs_feature_summary()` to allow printing a summary of all
    GekkoFS configuration options and their values. This should help users
    when building to precisely see how a GekkoFS instance has been configured.
+- Added (parallel) append support for consecutive writes with file descriptor opened
+  with `O_APPEND` ([!164](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_requests/164)).

### Changed

@@ -72,6 +76,8 @@ to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
  version ([!162](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_requests/162))
- Fixed an issue where nlohmann json failed to download in
  CMake ([!167](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_requests/167)).
+- Refactored update file size during write
+  operations ([!164](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_requests/164)).

## [0.9.1] - 2022-04-29

+1 −1
@@ -106,7 +106,7 @@ gkfs_readlink(const std::string& path, char* buf, int bufsize);

ssize_t
gkfs_pwrite(std::shared_ptr<gkfs::filemap::OpenFile> file, const char* buf,
-            size_t count, off64_t offset);
+            size_t count, off64_t offset, bool update_pos = false);

ssize_t
gkfs_pwrite_ws(int fd, const void* buf, size_t count, off64_t offset);
+2 −3
@@ -42,9 +42,8 @@ struct ChunkStat {
// an exception.

std::pair<int, ssize_t>
-forward_write(const std::string& path, const void* buf, bool append_flag,
-              off64_t in_offset, size_t write_size,
-              int64_t updated_metadentry_size);
+forward_write(const std::string& path, const void* buf, off64_t offset,
+              size_t write_size);

std::pair<int, ssize_t>
forward_read(const std::string& path, void* buf, off64_t offset,
+5 −5
@@ -1158,10 +1158,10 @@ struct update_metadentry_size {
        hermes::detail::post_to_mercury(ExecutionContext*);

    public:
-        output() : m_err(), m_ret_size() {}
+        output() : m_err(), m_ret_offset() {}

        output(int32_t err, int64_t ret_size)
-            : m_err(err), m_ret_size(ret_size) {}
+            : m_err(err), m_ret_offset(ret_size) {}

        output(output&& rhs) = default;

@@ -1175,7 +1175,7 @@ struct update_metadentry_size {

        explicit output(const rpc_update_metadentry_size_out_t& out) {
            m_err = out.err;
-            m_ret_size = out.ret_size;
+            m_ret_offset = out.ret_offset;
        }

        int32_t
@@ -1185,12 +1185,12 @@ struct update_metadentry_size {

        int64_t
        ret_size() const {
-            return m_ret_size;
+            return m_ret_offset;
        }

    private:
        int32_t m_err;
-        int64_t m_ret_size;
+        int64_t m_ret_offset;
    };
};

+3 −0
@@ -40,6 +40,9 @@ namespace gkfs::metadata {

constexpr mode_t LINK_MODE = ((S_IRWXU | S_IRWXG | S_IRWXO) | S_IFLNK);

+uint16_t
+gen_unique_id(const std::string& path);
+
class Metadata {
private:
    time_t atime_{}; // access time. gets updated on file access unless mounted