
Resolve "[daemon] chunk storage backend crashes the server on error"

Closes #75 (closed)

The ChunkStorage backend class on the daemon threw system_error exceptions that were never caught, which crashed the server. ChunkStorage now uses a designated error class for the errors that might occur. In addition, the dependency on Argobots, which was only used to trigger ABT_eventuals, was removed from the class, laying the groundwork for future non-Argobots IO implementations. Further, the whole class was refactored for consistency and failure resistance.
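As a rough illustration of the error-class idea, the sketch below shows a storage error type derived from std::system_error that a handler catches and turns into an error code instead of letting it propagate. The names `ChunkStorageError`, `write_chunk_stub`, and `handle_write_stub` are hypothetical and are not taken from this merge request.

```cpp
#include <cerrno>
#include <string>
#include <system_error>

// Hypothetical error type for this sketch: carries an errno value and a message.
class ChunkStorageError : public std::system_error {
public:
    ChunkStorageError(int err_code, const std::string& msg)
        : std::system_error(err_code, std::generic_category(), msg) {}
};

// Stand-in for a storage call; `fail` simulates e.g. a failed open().
void write_chunk_stub(bool fail) {
    if (fail) {
        throw ChunkStorageError(ENOENT, "failed to open chunk file");
    }
}

// Stand-in for an RPC handler: the exception is caught here and mapped
// to an errno-style code, so the error never escapes the handler.
int handle_write_stub(bool fail) {
    try {
        write_chunk_stub(fail);
        return 0;
    } catch (const ChunkStorageError& e) {
        return e.code().value();
    }
}
```

With this shape, a failing storage call yields a non-zero code for the RPC response, while the success path returns 0.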

A new class, ChunkOperation, is introduced that wraps Argobots' IO task operations. This allows IO-queue-specific code to be removed from the RPC handlers, i.e., the read and write handlers. The idea is to separate eventuals, tasks, and their arguments from the handler logic into a designated class. To this end, each handler instantiates an object of a class derived from ChunkOperation, which drives all IO tasks. The corresponding code was added to the read and write RPC handlers. Note that ChunkOperation is not thread-safe and is supposed to be called by a single thread.
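The separation described above could look roughly like the following sketch. It is a simplified model with hypothetical names (`ChunkOperationSketch`, `ChunkWriteOperationSketch`, `add_task`, `wait_for_tasks`): tasks are plain callables that run inline, whereas the real class would hand them to the Argobots IO pool and wait on eventuals.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical base class: owns the tasks so that handlers do not have to.
// Like the class described above, it is not thread-safe.
class ChunkOperationSketch {
public:
    virtual ~ChunkOperationSketch() = default;

protected:
    // Each task returns bytes processed, or a negative errno value on failure.
    std::vector<std::function<long()>> tasks_;
};

// Hypothetical derived class that a write handler would instantiate.
class ChunkWriteOperationSketch : public ChunkOperationSketch {
public:
    // Register one per-chunk write task.
    void add_task(std::function<long()> task) {
        tasks_.push_back(std::move(task));
    }

    // Run every task and return {first error code or 0, total bytes written}.
    std::pair<int, std::size_t> wait_for_tasks() {
        int err = 0;
        std::size_t total = 0;
        for (auto& task : tasks_) {
            const long rc = task();
            if (rc < 0) {
                if (err == 0) {
                    err = static_cast<int>(-rc); // remember the first error only
                }
            } else {
                total += static_cast<std::size_t>(rc);
            }
        }
        tasks_.clear();
        return {err, total};
    }
};
```

A handler then only registers tasks and reads back one error code and one byte count, without touching queue details itself.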

In addition, truncate was reworked for error handling (it crashed the server on error). It now uses the IO queue as well, since truncate causes a write operation and should not overtake IO tasks already in the queue.
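The ordering argument can be illustrated with a toy FIFO queue. This is a sketch under simplifying assumptions: `IoQueueSketch` and `write_then_truncate` are hypothetical names, the "chunk" is an in-memory string, and tasks run on the calling thread rather than in an Argobots pool.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical single FIFO queue: tasks run strictly in submission order.
class IoQueueSketch {
public:
    void push(std::function<void()> task) { queue_.push_back(std::move(task)); }

    void drain() {
        while (!queue_.empty()) {
            auto task = std::move(queue_.front());
            queue_.pop_front();
            task();
        }
    }

private:
    std::deque<std::function<void()>> queue_;
};

// Submit a write and then a truncate through the same queue. Because the
// truncate is queued behind the write, it cannot overtake it.
std::string write_then_truncate(const std::string& data, std::size_t new_size) {
    std::string chunk; // in-memory stand-in for a chunk file
    IoQueueSketch queue;
    queue.push([&chunk, data] { chunk = data; });
    queue.push([&chunk, new_size] { chunk.resize(new_size); });
    queue.drain();
    return chunk;
}
```

If the truncate bypassed the queue and ran first, the later write would restore the full data and the truncate would effectively be lost.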

The chunk stat RPC handler was refactored for error handling and now uses error codes as well.

Further minor changes:

  • dead chunk stat code has been removed
  • some namespaces were missing: gkfs::rpc
  • more flexible handler cleanup and response code
  • fixed a bug where the chunk dir wasn't removed when the metadata didn't exist on the same node

Misc: There was some discussion about putting the removal of the chunk directory into the IO queue as well, with the same argument as for truncate. I refrained from doing so, as it would likely hurt remove performance noticeably. I think we can file this under eventual consistency and call it a day for now. Truncate was another story, as glibc makes heavy use of truncate in various operations.

Edited by Marc Vef
