
Resolve "Support for (parallel) append operations"

Marc Vef requested to merge marc/254-support-for-parallel-append-operations into master

This MR adds (parallel) append support for write operations. Some append code already existed and ran for each pwrite when the file size was updated. As a result, parts of strings were serialized and deserialized within RocksDB's merge operation even when this was not needed. Previously, open() returned ENOTSUP when O_APPEND was used. After removing this restriction, append was still not functional because of how size updates interacted with the append case. Overall, gkfs_pwrite(), which first updates the file size and then writes the file, was quite messy, with unused return values and arguments. Further, the server calculated the updated size without regard to what occurred in the KV store. Therefore, as part of this MR, the entire size-update process within pwrite() was refactored.

Parallel appends are achieved by hooking into RocksDB's Merge Operator, which is triggered at some later point (e.g., during Get()). When append is not used, the client already knows the offset, so the file size is simply updated to offset + count as set in gkfs_pwrite(). No further coordination is required, since overlapping offsets are the user's responsibility. The code path for non-append operations was slightly optimized but largely remains the same.
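The non-append size update described above can be sketched as follows. This is a hypothetical illustration, not GekkoFS code; the function name and signature are made up. The key point is that the daemon only grows the size when offset + count exceeds the current size:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch: daemon-side size update for a non-append write.
// The client already knows the write offset, so no coordination is needed;
// the size only grows if the write extends past the current EOF.
inline std::uint64_t update_size_non_append(std::uint64_t current_size,
                                            std::uint64_t offset,
                                            std::uint64_t count) {
    return std::max(current_size, offset + count);
}
```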

Append operations are treated differently because, during a write operation, it is not clear where a process calling write() should start writing. The EOF information loaded during open may be outdated when multiple processes append at the same time, causing a race condition. Since the size update on the daemon is atomic, a process (updating the size before performing a write) can be reserved a corresponding byte interval [EOF, EOF + count). Calling Merge() on RocksDB does not immediately trigger the merge operation, since multiple merges are batched before the operator runs. For append, the merge operation is forced by calling Get() on RocksDB. The merge operation then returns the starting write offset to the process updating the size. Appends are therefore more expensive than non-appends.
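The reservation idea above can be sketched without RocksDB. What matters is that the size update is atomic, so each appending process atomically reserves its byte interval and receives the previous EOF as its starting write offset. A hypothetical sketch, with std::atomic standing in for the daemon's atomic size update in the KV store (merge) path:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch of the append reservation. fetch_add atomically grows
// the file size by count and returns the previous EOF, which becomes the
// starting write offset for this process: it has reserved [EOF, EOF + count).
inline std::uint64_t reserve_append(std::atomic<std::uint64_t>& file_size,
                                    std::uint64_t count) {
    return file_size.fetch_add(count);
}
```

Two processes appending concurrently thus receive disjoint intervals regardless of ordering, which is what makes parallel appends safe.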

Lastly, some missing documentation was added.

As reported, this MR adds support for the DASI application, used in IO-SEA.

Note: This MR does not consider failing writes, which would require us to collapse a reserved interval and close the resulting hole in the file.

Closes #254, closes #12
