hpc issues
https://storage.bsc.es/gitlab/groups/hpc/-/issues
2023-03-21T12:55:05+01:00

https://storage.bsc.es/gitlab/hpc/gekkofs/-/issues/207
UCX support
2023-03-21T12:55:05+01:00
Hector Wu
The latest Mercury RPC framework supports UCX in addition to OFI as a network abstraction. We should support UCX to enable the deployment of GekkoFS in a wider range of networks.
v0.9.3
Marc Vef

https://storage.bsc.es/gitlab/hpc/gekkofs/-/issues/284
RPCs write/read: host size no longer necessary
2024-03-10T19:31:53+01:00
Marc Vef
Since we moved to using a bitset instead of calculating the chunks manually, we can remove the host_size from the corresponding RPC functions.
v0.9.3

https://storage.bsc.es/gitlab/hpc/gekkofs/-/issues/283
Write/read path: Global refactor necessary
2024-03-10T19:27:24+01:00
Marc Vef
The code path for write/read has become too complex and needs a full rework. The functions are too large and not modular. Replication and chunk calculation, for instance, should not be part of the RPC sender and receiver functions directly.
v0.9.4

https://storage.bsc.es/gitlab/hpc/gekkofs/-/issues/282
Fix CMake warnings regarding compatibility with older versions
2024-03-08T19:05:34+01:00
Marc Vef
When running CMake, there are warnings for our external libraries which should be fixed:
```
CMake Deprecation Warning at external/fmt/CMakeLists.txt:1 (cmake_minimum_required):
Compatibility with CMake < 3.5 will be removed from a future version of
CMake.
Update the VERSION argument <min> value or use a ...<max> suffix to tell
CMake that the project does not need compatibility with older versions.
```
v0.9.3

https://storage.bsc.es/gitlab/hpc/gekkofs/-/issues/281
Refactor path resolution
2024-03-05T13:44:20+01:00
Marc Vef
Currently, path resolution uses `lstat()` to ensure that each path component exists. This can add unnecessary overhead, especially when the mount point is located under a long path within a parallel file system like Lustre.
The requirements for path checking:
1. Detect paths within GekkoFS and pass them (without prefix) to GekkoFS.
2. No system calls for checking path component existence.
3. When a path is not within GekkoFS, the *unmodified* path is passed to the kernel, unless a GekkoFS path is part of the path, in which case that part must be removed first.

Therefore, the following features/changes are necessary:
- A prefix is defined as the absolute path to the GekkoFS mountpoint. E.g., for `/tmp/gkfs_mount/foofile` the prefix is `/tmp/gkfs_mount`.
- No system call like `lstat()` should be called.
- Path checking is done by prefix matching. Therefore, a non-absolute path first needs to be resolved:
  - If the middle of the path uses `..`, the path needs to be resolved for prefix checking (similar to now).
  - Relative paths (based on the current working directory) also need to be resolved for prefix checking.
- If a path is not within GekkoFS, the path should be passed to the kernel unmodified.
  - When part of the GekkoFS namespace is in the middle of a path and then undone via `..`, the path needs to be resolved before passing it to the kernel, as the kernel is unaware of GekkoFS.
  - This also applies to relative paths.
- If a path is within GekkoFS, the prefix is cut from the path and the remainder is passed to GekkoFS.
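The rules above could be sketched roughly as follows. This is a minimal illustration, not the GekkoFS implementation: the names `resolved_path`, `split`, `has_prefix`, `join`, and `resolve` are hypothetical, the current working directory is passed in as a string, and the mountpoint is assumed to be absolute and already normalized. Note that `..` is resolved purely lexically here (no system calls), which ignores symlinks.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type: where the path goes and which string is forwarded.
struct resolved_path {
    bool internal;     // true if the path lies within the GekkoFS mountpoint
    std::string path;  // prefix-free path for GekkoFS, or the path for the kernel
};

// Split on '/', dropping empty components and ".".
static std::vector<std::string> split(const std::string& p) {
    std::vector<std::string> out;
    std::stringstream ss(p);
    std::string c;
    while (std::getline(ss, c, '/')) {
        if (!c.empty() && c != ".") out.push_back(c);
    }
    return out;
}

// Component-wise prefix match, so "/tmp/gkfs_mountx" does not match "/tmp/gkfs_mount".
static bool has_prefix(const std::vector<std::string>& p,
                       const std::vector<std::string>& prefix) {
    return p.size() >= prefix.size() &&
           std::equal(prefix.begin(), prefix.end(), p.begin());
}

// Join components starting at index `from` into an absolute path string.
static std::string join(const std::vector<std::string>& c, std::size_t from) {
    std::string out;
    for (std::size_t i = from; i < c.size(); ++i) out += "/" + c[i];
    return out.empty() ? "/" : out;
}

resolved_path resolve(const std::string& raw, const std::string& cwd,
                      const std::string& mountdir) {
    const auto prefix = split(mountdir);
    // Relative paths are resolved against the current working directory.
    const std::string abs =
        (!raw.empty() && raw[0] == '/') ? raw : cwd + "/" + raw;

    std::vector<std::string> stack;
    bool touched = false;  // did any intermediate prefix enter GekkoFS?
    for (const auto& c : split(abs)) {
        if (c == "..") {
            if (!stack.empty()) stack.pop_back();
        } else {
            stack.push_back(c);
        }
        touched = touched || has_prefix(stack, prefix);
    }

    if (has_prefix(stack, prefix)) {
        return {true, join(stack, prefix.size())};  // cut the prefix
    }
    // External path: unmodified, unless it passed through GekkoFS on the way.
    return {false, touched ? join(stack, 0) : raw};
}
```

With the mountpoint `/tmp/gkfs_mount`, `/tmp/gkfs_mount/foofile` would be forwarded to GekkoFS as `/foofile`, while `/tmp/gkfs_mount/../other/x` would be handed to the kernel in resolved form as `/tmp/other/x`.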
Part of these improvements are included but unfinished in [this branch](https://storage.bsc.es/gitlab/hpc/gekkofs/-/tree/marc/100-client-fails-when-mountdir-does-not-exist-on-underlying-fs)
v0.9.3
Julius Athenstaedt

https://storage.bsc.es/gitlab/hpc/cargo/-/issues/43
Sequentiality of the transfers
2024-03-04T14:07:26+01:00
Ramon Nou
Now, once we set up the transfers, the first part of the transfer ops()() is done. Although it depends on the backend, normally it reads the file and opens the fd of the output backend. So in the case of gekkofs (or adhocfs) it starts a client instance.
It would be better to move the initialization to the pending ops loop, instead of the MPI message receive loop.
This can be easily solved by initializing index to -1 (because no pending operation will start with a -1, and if the operation finishes we just delete the op).
Ramon Nou

https://storage.bsc.es/gitlab/hpc/cargo/-/issues/42
Some issues initializing and closing adhoc instances
2024-03-04T13:42:28+01:00
Ramon Nou
We initialize and finalize per file. Although this may work in some scenarios, it seems that gekkofs also closes the hosts lists, and then an exception happens inside the GekkoFS client that escalates to the mpio_read (et al.) exception handler.
Solved in !30, creating a single instance of each adhocfs.
Ramon Nou

https://storage.bsc.es/gitlab/hpc/cargo/-/issues/41
Scheduling thread launches transfers according conf/prob
2024-02-27T17:57:36+01:00
Ramon Nou
future

https://storage.bsc.es/gitlab/hpc/cargo/-/issues/40
Interface with FTIO
2024-03-07T09:36:25+01:00
Ramon Nou
In order to get the information from FTIO we will enable a CLI: `cargo_ftio --server xxxx --confidence <float> --probability <float>`.
The information will be loaded inside cargo, and a scheduling thread will talk with gekkofs to do the stage-in / stage-out needed.
- [x] CLI command to send confidence and probability inside cargo
- [x] RPC gathering info from ftio command
- [x] Scheduling thread
- [x] Scheduling thread launches transfers according conf/prob
- [x] Adding gkfs_stat and gkfs_dir* to obtain the contents of the directory.
- [x] Adding period
- [x] Delete files
- [x] Scheduling thread monitors input file/dir
- [ ] Define which unit is period (now seconds)

future
Ramon Nou

https://storage.bsc.es/gitlab/hpc/gekkofs/-/issues/280
Hooks narrow cast types
2024-02-19T09:43:00+01:00
Marc Vef
Currently, our client hooks return whatever syscall intercept returns. This is usually a cast from `long` to `int`. This is a narrowing operation and therefore we should use `narrow_cast` from GSL to make clear that this is acceptable (we do not need an exception in this case). See https://github.com/microsoft/GSL/blob/main/docs/headers.md#user-content-H-util-narrow_cast
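As a rough illustration of the intent, the sketch below marks a `long` to `int` narrowing explicitly. To stay buildable without GSL it defines a stand-in `sketch::narrow_cast` with the same shape as `gsl::narrow_cast` (a plain `static_cast` that only documents intent and never throws). The names `fake_syscall_result` and `hook_example` are hypothetical and do not correspond to actual GekkoFS hooks.

```cpp
#include <utility>

// Stand-in mirroring the shape of gsl::narrow_cast, so the sketch builds without GSL.
// With GSL available, gsl::narrow_cast would be used instead.
namespace sketch {
template <class T, class U>
constexpr T narrow_cast(U&& u) noexcept {
    return static_cast<T>(std::forward<U>(u));
}
} // namespace sketch

// Hypothetical stand-in for a call that hands back a raw syscall result as long.
long fake_syscall_result(long value) {
    return value;
}

// Hypothetical hook: the narrowing from long to int is intentional and visible.
int hook_example(long value) {
    return sketch::narrow_cast<int>(fake_syscall_result(value));
}
```

Compared with an implicit conversion, the explicit cast tells readers and static analysis tools that the truncation is deliberate.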
This issue includes adding GSL to GekkoFS and adding `gsl::narrow_cast` to all hooks that narrow.
v0.9.3
Julius Athenstaedt

https://storage.bsc.es/gitlab/hpc/gekkofs/-/issues/279
At very large scale, several daemon processes hang with 100%(200%) CPU usage and clients experience timeouts
2024-02-01T09:43:46+01:00
Hector Wu
I tested GekkoFS at a relatively large scale. We had 200 nodes: 100 acted as servers while the remaining 100 were clients. On each server node, we ran 4 daemons, as we noticed that running multiple daemons on a single node can improve performance. Each client node, in turn, ran 4 client processes running the IOR benchmark. With very high probability, some daemon processes hang with 100% (200%) CPU usage and clients experience timeouts, so the benchmark cannot run to completion. We are using UCX, BTW.
future
Hector Wu

https://storage.bsc.es/gitlab/hpc/gekkofs/-/issues/277
Cleanup client smart pointers
2023-12-05T23:45:50+01:00
Marc Vef
Smart pointer usage in `preload_context.hpp` needs to be consistent. Further, the `shared_ptr`s are not required and should be `unique_ptr`s, as they are not actually shared.
The Preload Context owns these pointers.
future

https://storage.bsc.es/gitlab/hpc/gekkofs/-/issues/276
du shows incorrect size
2023-11-28T20:00:18+01:00
Marc Vef
The `du` command shows an incorrect size. This is because GekkoFS stores data differently in the backend. Directories are in RocksDB and directory hierarchies are a single directory in the chunk directory.
It needs to be discussed how we should represent sizes via `du` in the future.
future

https://storage.bsc.es/gitlab/hpc/gekkofs/-/issues/275
Using different mercury/margo versions ends with different RPC id
2023-11-28T11:30:21+01:00
Ramon Nou
```
["mercury"]="v2.1.0"
["margo"]="v0.9.6"
```
Using a different (newer) version ends up with the client <-> server RPCs mixed up. This is problematic in spack environments, since version dependencies are sometimes relaxed because the package reuses other installations.
future
Marc Vef

https://storage.bsc.es/gitlab/hpc/gekkofs/-/issues/274
rocksdb not found (missing cmake)
2024-02-23T16:51:22+01:00
Ramon Nou
According to https://storage.bsc.es/gitlab/hpc/gekkofs/-/commit/80b60005c340ffd9fd56fadc30f2c86d39ef253a rocksdb installs a CMake file, but this does not happen with spack.
Alternatively, a fallback could be to use the pc file, but then we need to change the target (Rocksdb::xxx).
I suggest restoring the file.
Marc Vef

https://storage.bsc.es/gitlab/hpc/gekkofs/-/issues/273
Client invalid_argument error if GKFS_RENAMING_SUPPORT is On
2024-02-29T12:52:05+01:00
Julius Athenstaedt
When the CMake flag GKFS_RENAMING_SUPPORT is activated,
`LD_PRELOAD=/home/usrname/gekkofs_deps/install/lib/libgkfs_intercept.so ls`
fails with
```
terminate called after throwing an instance of 'std::invalid_argument'
what(): stol
Aborted (core dumped)
```

https://storage.bsc.es/gitlab/hpc/cargo/-/issues/26
Function `transfer_datasets` should be a member function of class `cargo::server`
2023-11-03T09:32:10+01:00
Alberto Miranda
future

https://storage.bsc.es/gitlab/hpc/cargo/-/issues/24
Use asynchronous I/O primitives
2023-11-03T09:14:00+01:00
Alberto Miranda
Cargo currently relies on synchronous I/O primitives from both POSIX and MPI-IO, which severely limits performance and scalability. It would be better to rely on asynchronous I/O (e.g. `uring`, `MPI_File_i(read|write)_all`).
future

https://storage.bsc.es/gitlab/hpc/cargo/-/issues/21
Add bandwidth control mechanisms to MPI-IO transfers
2023-11-03T09:16:52+01:00
Alberto Miranda
Active control of data transfers can be added to Cargo. Alternatively (or complementarily), Lustre's TBF could be leveraged for this purpose as discussed in https://storage.bsc.es/gitlab/eu/admire/io-scheduler/-/issues/157.
0.3.0
Marc Vef

https://storage.bsc.es/gitlab/hpc/gekkofs/-/issues/272
OPX support
2023-10-23T12:40:44+02:00
Marc Vef
Add OPX support to GekkoFS and Hermes. This replaces ofi+psm2, which seems to be dysfunctional at this point.
v0.9.3
Marc Vef