Commit 137dd33f authored by Marc Vef

Merge branch 'rnou/replication' into 'master'

Data replication (client side, synchronous)

This MR adds support for data replication using one environment variable:

`LIBGKFS_NUM_REPL=<num repl>`
The number of replicas should range from 0 to the number of servers - 1. Replication is driven by the
client, so it reduces write performance while maintaining the same level of consistency.
On the other hand, it may increase read performance in some corner-case scenarios.
Metadata replication is also implemented.
The replication environment variable can be set independently for each client.

If a server is down, the data is read from another replica. Metadata is likewise served from another replica.
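
A minimal sketch of this read fallback follows. It assumes a hypothetical placement rule in which copy `r` of a chunk lives on server `(primary + r) % num_servers`; the MR does not state the actual formula. `replica_server`, `read_with_failover`, and the `try_read` callback (a stand-in for the read RPC) are illustrative names, not GekkoFS functions.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>

// Hypothetical placement rule (an assumption, not GekkoFS's formula):
// copy r of a chunk lives on server (primary + r) % num_servers; r = 0 is the original.
std::size_t replica_server(std::size_t primary, int replica, std::size_t num_servers) {
    assert(num_servers > 0);
    return (primary + static_cast<std::size_t>(replica)) % num_servers;
}

// Stand-in for the read RPC: returns the chunk data, or std::nullopt if the server is down.
using ReadFn = std::function<std::optional<std::string>(std::size_t server)>;

// Try the original copy first, then each replica in order, and return the first answer.
std::optional<std::string> read_with_failover(std::size_t primary, int num_repl,
                                              std::size_t num_servers, const ReadFn& try_read) {
    for (int r = 0; r <= num_repl; ++r) {
        if (auto data = try_read(replica_server(primary, r, num_servers))) {
            return data;
        }
    }
    return std::nullopt;  // every copy is unreachable
}
```

Trying the original copy first keeps the common case identical to an unreplicated read. The replicas are only contacted when a server does not answer.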

Replication is done synchronously. A new function, `forward_write`, is used to send the data to the different replicas. Reads are also distributed, but this should not produce a performance improvement because the resulting distribution is similar to the original one.

For writes, the original data is sent to the target servers first, and then the replicas are processed. This avoids issues if a server that should host a replica is unavailable.
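
The write ordering can be sketched with a hypothetical planner. It is not the actual `forward_write` implementation. It schedules every original copy before any replica, using the same assumed placement rule `(primary + r) % num_servers`. `SendOp` and `plan_write` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One send operation: which chunk goes to which server, and which copy it is.
struct SendOp {
    std::size_t chunk;
    std::size_t server;
    int replica;  // 0 = original copy, 1..num_repl = replicas
};

// Hypothetical planner: all originals are scheduled first, then replica 1 of every
// chunk, then replica 2, and so on. The placement (primary + r) % num_servers is assumed.
std::vector<SendOp> plan_write(const std::vector<std::size_t>& chunks,
                               const std::vector<std::size_t>& primaries,
                               int num_repl, std::size_t num_servers) {
    assert(chunks.size() == primaries.size() && num_servers > 0);
    std::vector<SendOp> plan;
    for (int r = 0; r <= num_repl; ++r) {
        for (std::size_t i = 0; i < chunks.size(); ++i) {
            const std::size_t server = (primaries[i] + static_cast<std::size_t>(r)) % num_servers;
            plan.push_back({chunks[i], server, r});
        }
    }
    return plan;
}
```

With at most `num_servers - 1` replicas, this assumed rule places every copy of a chunk on a distinct server.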

To process the replicas, a new method is included to check whether a chunk needs to be processed by a given server: a 1024-bit bitset is sent, encoded in base-64 as a string. This allows up to 1024 chunks per write or read operation. If that limit is exceeded, the server falls back to the normal per-chunk hash check. However, exceeding this value disables the replica capabilities and may produce unpredictable behaviour.

This limit could potentially be increased.
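
The chunk bitset can be illustrated with `std::bitset<1024>` and the standard base-64 alphabet. This is a sketch under stated assumptions. The MR specifies only 1024 bits encoded in base-64 as a string. The byte layout (bit `i` stored in byte `i / 8`, position `i % 8`), the alphabet, and the helper names `encode_mask`, `decode_mask`, and `chunk_is_mine` are hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::size_t kMaxChunks = 1024;  // chunks per operation, as stated in the MR
using ChunkMask = std::bitset<kMaxChunks>;

const char kB64[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

int b64_value(char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;  // '=' padding or an invalid character
}

// Pack bit i into byte i / 8 (position i % 8), then base-64 encode the 128 bytes.
std::string encode_mask(const ChunkMask& mask) {
    std::vector<std::uint8_t> bytes(kMaxChunks / 8, 0);
    for (std::size_t i = 0; i < kMaxChunks; ++i) {
        if (mask[i]) bytes[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
    }
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); i += 3) {
        const bool has1 = i + 1 < bytes.size(), has2 = i + 2 < bytes.size();
        std::uint32_t n = std::uint32_t{bytes[i]} << 16;
        if (has1) n |= std::uint32_t{bytes[i + 1]} << 8;
        if (has2) n |= bytes[i + 2];
        out += kB64[(n >> 18) & 63];
        out += kB64[(n >> 12) & 63];
        out += has1 ? kB64[(n >> 6) & 63] : '=';
        out += has2 ? kB64[n & 63] : '=';
    }
    return out;
}

// Reverse of encode_mask: base-64 decode, then unpack the bytes into bits.
ChunkMask decode_mask(const std::string& encoded) {
    std::vector<std::uint8_t> bytes;
    std::uint32_t acc = 0;
    int nbits = 0;
    for (char c : encoded) {
        const int v = b64_value(c);
        if (v < 0) break;  // padding marks the end of the data
        acc = (acc << 6) | static_cast<std::uint32_t>(v);
        nbits += 6;
        if (nbits >= 8) {
            nbits -= 8;
            bytes.push_back(static_cast<std::uint8_t>((acc >> nbits) & 0xFF));
        }
    }
    ChunkMask mask;
    for (std::size_t i = 0; i < kMaxChunks && i / 8 < bytes.size(); ++i) {
        mask[i] = ((bytes[i / 8] >> (i % 8)) & 1u) != 0;
    }
    return mask;
}

// Server side: bit (chunk - first_chunk) says whether this server handles the chunk.
bool chunk_is_mine(const ChunkMask& mask, std::size_t first_chunk, std::size_t chunk) {
    assert(chunk >= first_chunk && chunk - first_chunk < kMaxChunks);
    return mask.test(chunk - first_chunk);
}
```

128 bytes encode to 172 base-64 characters, so under this layout the mask adds a fixed-size string to each request.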

Finally, most operations are replica-aware, but some are still missing, e.g., dirent.

See merge request !166
parents 0916c555 2835322a
+4 −0
@@ -6,6 +6,10 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]
- Replication handled by the client rather than the servers. The `LIBGKFS_NUM_REPL` env variable (0 <= NUM_REPL < num_servers) defines the number of
replicas ([!166](https://storage.bsc.es/gitlab/hpc/gekkofs/-/merge_requests/166)).
- Modified writes and reads to use a bitset instead of the traditional per-chunk hash check in the server.
- Added retry support to `get_fs_config`: if the initial server fails, the request is reattempted on other servers.

### New
### Changed
+9 −1
@@ -320,6 +320,13 @@ Support for fstat in renamed files is included.

This is disabled by default.

### Replication

The user can enable the data replication feature by setting the replication environment variable:
`LIBGKFS_NUM_REPL=<num repl>`.
The number of replicas should go from `0` to the `number of servers - 1`. The replication environment variable can be
set up for each client independently.
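
A client-side check of the variable might look like the following sketch. `read_num_repl` is a hypothetical helper. Treating an unset variable as `0` and rejecting out-of-range values with an exception are assumptions, not documented GekkoFS behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: read LIBGKFS_NUM_REPL and check it lies in [0, num_servers - 1].
// Treating an unset variable as 0 (no replication) is an assumption.
int read_num_repl(std::size_t num_servers) {
    assert(num_servers > 0);
    const char* raw = std::getenv("LIBGKFS_NUM_REPL");
    if (raw == nullptr) return 0;
    const std::string value(raw);
    std::size_t used = 0;
    const int repl = std::stoi(value, &used);  // throws std::invalid_argument if not a number
    if (used != value.size() || repl < 0 || static_cast<std::size_t>(repl) > num_servers - 1) {
        throw std::invalid_argument("LIBGKFS_NUM_REPL must be in [0, num_servers - 1]");
    }
    return repl;
}
```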

## Acknowledgment

This software was partially supported by the EC H2020 funded NEXTGenIO project (Project ID: 671951, www.nextgenio.eu).
@@ -330,7 +337,8 @@ the DFG.
This software is partially supported by the FIDIUM project funded by the DFG.

This work was partially funded by the European Union’s Horizon 2020 and the German Ministry of Education and Research (
BMBF) under the ``Adaptive multi-tier intelligent data manager for Exascale (ADMIRE)''
project (https://www.admire-eurohpc.eu/); Grant Agreement number:
956748-ADMIRE-H2020-JTI-EuroHPC-2019-1. Further, this work was partially supported by the Spanish Ministry of Economy
and Competitiveness (MINECO) under grants PID2019-107255GB, and the Generalitat de Catalunya under contract
2021-SGR-00412. This publication is part of the project ADMIRE PCI2021-121952, funded by MCIN/AEI/10.13039/501100011033.
+86 −0
@@ -218,3 +218,89 @@ the logging subsystem to truncate the file used for logging, rather than append
For the daemon, the `GKFS_DAEMON_LOG_PATH=<path/to/file>` environment variable can be provided to set the path to the
log file, and the log level can be selected with the `GKFS_DAEMON_LOG_LEVEL={off,critical,err,warn,info,debug,trace}`
environment variable, where `trace` produces the most trace records and `info` is the default value.

## Miscellaneous

### External functions

GekkoFS allows using external functions in your client code via LD_PRELOAD.
The source code needs to be compiled with `-fPIC`. We include a pfind IO500 replacement,
`examples/gfind/gfind.cpp`, and a non-MPI version, `examples/gfind/sfind.cpp`.

### Data distributors

The data distributor is selected at compile time. Two distributors are available:

#### Simple Hash (Default)

Chunks are distributed pseudo-randomly across the GekkoFS servers based on a hash.
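
The idea behind hash-based placement can be sketched as hashing the file path together with the chunk ID and taking the result modulo the number of servers. This is an illustration only. `locate_chunk` is a hypothetical name, and `std::hash` stands in for whatever hash function GekkoFS actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical sketch of hash-based chunk placement. std::hash stands in for the
// real hash function; all clients must use the same function to agree on placement.
std::size_t locate_chunk(const std::string& path, std::uint64_t chunk_id,
                         std::size_t num_servers) {
    assert(num_servers > 0);
    const std::size_t h = std::hash<std::string>{}(path + std::to_string(chunk_id));
    return h % num_servers;
}
```

As long as every client uses the same hash function, all clients compute the same server for a given chunk without contacting any server. The placement only looks random.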

#### Guided Distributor

The guided distributor allows defining a specific distribution of data on a per directory or file basis.
The distribution configurations are defined within a shared file (called `guided_config.txt` henceforth) with the
following format:
`<path> <chunk_number> <host>`

To enable the distributor, the following CMake compilation flags are required:

* `GKFS_USE_GUIDED_DISTRIBUTION` ON
* `GKFS_USE_GUIDED_DISTRIBUTION_PATH` `<path_guided_config.txt>`

To use a custom distribution, a path needs the prefix `#` (e.g., `#/mdt-hard 0 0`), in which case all the data of all
files in that directory goes to the same place as the metadata.
Note that a chunk/host configuration is automatically inherited by all child files, even without the prefix.
In this example, `/mdt-hard/file1` therefore also uses the same distribution as the `/mdt-hard` directory.
If no prefix is used, the Simple Hash distributor is used.
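
Reading one line of `guided_config.txt` can be sketched as follows. `GuidedEntry` and `parse_guided_line` are hypothetical names. Treating `<host>` as a numeric server index is an assumption based on the `#/mdt-hard 0 0` example.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>

// One line of guided_config.txt: "<path> <chunk_number> <host>".
struct GuidedEntry {
    std::string path;         // path with the optional '#' prefix removed
    bool prefix = false;      // true if the line used the '#' prefix
    std::uint64_t chunk = 0;  // chunk number
    unsigned host = 0;        // assumed to be a numeric server index
};

// Parse one configuration line; returns std::nullopt if a field is missing or malformed.
std::optional<GuidedEntry> parse_guided_line(const std::string& line) {
    std::istringstream in(line);
    GuidedEntry e;
    if (!(in >> e.path >> e.chunk >> e.host)) return std::nullopt;
    if (e.path.front() == '#') {  // path is non-empty because extraction succeeded
        e.prefix = true;
        e.path.erase(0, 1);
    }
    return e;
}
```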

##### Guided configuration file

Creating a guided configuration file is based on an I/O trace file of a previous execution of the application.
For this the `trace_reads` tracing module is used (see above).

The `trace_reads` module enables a `TRACE_READS` level log at the clients writing the I/O information of the client
which is used as the input for a script that creates the guided distributor setting.
Note that capturing the necessary trace records can involve performance degradation.
To capture the I/O of each client within a SLURM environment, that is, to enable the `trace_reads` module and print its
output to a user-defined path, the following example can be used:
`srun -N 10 -n 320 --export="ALL" /bin/bash -c "export LIBGKFS_LOG=trace_reads LIBGKFS_LOG_OUTPUT=${HOME}/test/GLOBAL.txt; LD_PRELOAD=${GKFS_PRLD} <app>"`

Then, the `examples/distributors/guided/generate.py` script is used to create the guided distributor configuration file:

* `python examples/distributors/guided/generate.py ~/test/GLOBAL.txt >> guided_config.txt`

Finally, modify `guided_config.txt` to your distribution requirements.

### Metadata Backends

There are two different metadata backends in GekkoFS. The default one uses `rocksdb`; an alternative based
on `PARALLAX` from `FORTH`
is also available. To enable it, use the `-DGKFS_ENABLE_PARALLAX:BOOL=ON` option. You can also disable `rocksdb`
with `-DGKFS_ENABLE_ROCKSDB:BOOL=OFF`.

Once it is enabled, the `--dbbackend` option becomes functional.

### Statistics

GekkoFS daemons are able to output general operation statistics (`--enable-collection`) and data chunk
statistics (`--enable-chunkstats`) to a specified output file via `--output-stats <FILE>`. Prometheus can also be used
instead of, or in addition to, the output file. It must be enabled at compile time via the CMake
argument `-DGKFS_ENABLE_PROMETHEUS` and the daemon argument `--enable-prometheus`. The corresponding statistics are then
pushed to the Prometheus instance.

### Advanced experimental features

#### Rename

`-DGKFS_RENAME_SUPPORT` allows the application to rename files.
This is an experimental feature, and some scenarios may not work properly.
Support for fstat in renamed files is included.

This is disabled by default.

#### Replication

The user can enable the data replication feature by setting the replication environment variable:
`LIBGKFS_NUM_REPL=<num repl>`.
The number of replicas should go from `0` to the `number of servers - 1`. The replication environment variable can be
set up for each client independently.
+1 −1
@@ -52,7 +52,7 @@ static constexpr auto HOSTS_FILE = ADD_PREFIX("HOSTS_FILE");
#ifdef GKFS_ENABLE_FORWARDING
static constexpr auto FORWARDING_MAP_FILE = ADD_PREFIX("FORWARDING_MAP_FILE");
#endif

static constexpr auto NUM_REPL = ADD_PREFIX("NUM_REPL");
} // namespace gkfs::env

#undef ADD_PREFIX
+7 −0
@@ -105,6 +105,7 @@ private:
    bool internal_fds_must_relocate_;
    std::bitset<MAX_USER_FDS> protected_fds_;
    std::string hostname;
    int replicas_;

public:
    static PreloadContext*
@@ -216,6 +217,12 @@ public:

    std::string
    get_hostname();

    void
    set_replicas(const int repl);

    int
    get_replicas();
};

} // namespace preload