
Tensorflow-IO DAOS Plugin #1603

Open · wants to merge 91 commits into base: master

91 commits (changes from 1 commit shown):
7369e32
Skeleton in Place + Build Correctly
Sep 6, 2021
b055a0c
Merge branch 'FT-dfs-skeleton-OM' into 'devel'
omar-hmarzouk Sep 10, 2021
230aadd
Parsing Function Added and Tested Separately
Sep 10, 2021
a9b128b
Merge branch '2-parse-dfs-path' into 'devel'
omar-hmarzouk Sep 10, 2021
17a21d7
DAOS library installed as an http archive and linked
Sep 12, 2021
20900e9
Removed DAOS API call test
Sep 12, 2021
c1970e9
Linking DAOS Shared Libraries
Sep 12, 2021
6441bf0
Merge branch 'FT-daos-lib-integration-OM' into 'devel'
omar-hmarzouk Sep 12, 2021
b1c0128
Added Skeleton + Connect/Disconnect Functionality
Sep 15, 2021
e3654b9
Merge branch '4-plugin-skeleton' into 'devel'
omar-hmarzouk Sep 15, 2021
a45d6e4
Init+Cleanup+Mount+Unmount
Sep 15, 2021
903b89f
Query + Moving Class and Helpers to header file
Sep 16, 2021
477e81a
Added Path Lookup Functionality
Sep 20, 2021
ae6f84a
Support for Multiple Connections
Sep 21, 2021
9a3f111
Merge branch 'FT-filesystem-ops-OM' into 'devel'
omar-hmarzouk Sep 21, 2021
9f29eee
Directory Checking + Creation & Deletion (Single/Recursive)
Sep 23, 2021
14073c5
Merge branch 'FT-directory-operation-OM' into 'devel'
omar-hmarzouk Sep 23, 2021
8dd9fbb
File Size Calculation
Sep 23, 2021
c8e261a
File Deletion
Sep 23, 2021
958b645
Creation of Random Access + Writable + Appendable Operations
Sep 25, 2021
96647da
Rename/Moving of File
Sep 26, 2021
038dfce
Completed FileSystem Operations Table
Sep 27, 2021
2f29df0
Merge branch 'FT-file-ops-OM' into 'devel'
omar-hmarzouk Sep 27, 2021
5b2d864
Refactor
Sep 28, 2021
b14fc84
Merge branch '10-refactor-of-dfs-plugin-class' into 'devel'
omar-hmarzouk Sep 28, 2021
10b09de
Writable File Ops Done
Sep 29, 2021
26b0f36
Merge branch 'FT-writable-file-ops-OM' into 'devel'
omar-hmarzouk Sep 29, 2021
83fd151
Random Access File Ops Done
Sep 29, 2021
6810f04
Merge branch 'FT-random-access-ops-OM' into 'devel'
omar-hmarzouk Sep 29, 2021
ec328c1
Tests Added (Bug in isDirectory and Wildcard Matching)
Oct 6, 2021
cadd6f6
Tests completed & passed & wildcard matching to be checked
Oct 10, 2021
dc664ae
Added Tutorial
Oct 14, 2021
366d63c
Tutorial tested and configured
Oct 14, 2021
8a8cabf
Implementation of Wildcard Matching
Oct 19, 2021
3614d0c
Merge branch '13-wildcard-matching' into FT-tutorial-example-OM
Oct 19, 2021
866ebd7
Merge branch '13-wildcard-matching' into 'devel'
omar-hmarzouk Oct 19, 2021
36e93cf
Merge branch 'FT-tutorial-example-OM' into 'devel'
omar-hmarzouk Oct 19, 2021
6ee86d0
Added Dummy ROM-Region
Oct 27, 2021
3745d4f
Merge branch 'FT-rom-region-dummy-OM' into 'devel'
omar-hmarzouk Oct 27, 2021
c0ba4c4
Update to DAOS1.3.106 + Decoupling of DAOS API init + Handling Pool a…
Dec 5, 2021
a764fdf
Adjusted Example + Added Build Documentation + Fixed Indentation
Dec 5, 2021
d5016e5
Refactor + Update Tests + Update Docs
Dec 6, 2021
eebe0af
Markdown updated
MernaMoawad Dec 6, 2021
9b3baf0
UNS NO_CHECK_PATH
Dec 6, 2021
7035255
Linking DAOS
Dec 7, 2021
7c1c736
Linking DAOS libraries
Dec 7, 2021
9d72efd
Updated Docs
Dec 8, 2021
b823fa1
Merge branch 'devel' of https://github.com/daos-stack/tensorflow-io-d…
Dec 8, 2021
a2cc29e
Updating Docs and moving them to docs/
Dec 8, 2021
4a8b3ec
Updated Docs
Dec 10, 2021
ba9074f
Merge branch 'devel' into FT-unified-name-space-OM
Dec 10, 2021
ea94561
UNS Supports UID and Label
Dec 12, 2021
a8f0a85
Merge branch 'FT-unified-name-space-OM' into 'devel'
omar-hmarzouk Dec 12, 2021
efbb408
Updated Tutorial + Documentation
Dec 15, 2021
5027ff7
Updated Notebook
Dec 15, 2021
62b092c
Cleared Output
Dec 15, 2021
c83c378
Output Cleared
Dec 15, 2021
b827a7b
Style and Formatting
Dec 16, 2021
de0a601
Merge branch 'tensorflow:master' into devel
omar-hmarzouk Jan 2, 2022
7840b77
PyLint modifications
Jan 4, 2022
9987dd8
Merge branch 'tensorflow:master' into devel
omar-hmarzouk Jan 4, 2022
df84926
Bazel lint
Jan 5, 2022
5ef973c
Merge branch 'devel' of https://github.com/daos-stack/tensorflow-io-d…
Jan 5, 2022
24641ce
Integrating daos build changes
Jan 30, 2022
673ffba
Linting
Jan 30, 2022
0cf005c
Replacing usage of C++ API
Jan 30, 2022
c6dab2a
Formatted DAOS notebook
Feb 2, 2022
5cf7335
Pool Connection Error Handling
Feb 22, 2022
0858a52
Synchronous Read Ahead
Mar 8, 2022
9689f99
Asynchronous read ahead
Mar 10, 2022
db74d80
Merge remote-tracking branch 'original-repo/master' into 17-read-ahea…
Apr 2, 2022
7488654
Finalize Read Ahead
Apr 3, 2022
dbf0bce
Linting
Apr 4, 2022
57ebc1e
Merge branch '17-read-ahead-buffering' into 'devel'
omar-hmarzouk Apr 4, 2022
e8cf9c0
Linting Merge
Apr 4, 2022
e3887d5
Existing File Deletion when Opened in Write Mode
May 6, 2022
3fd58df
Removing Comments
May 6, 2022
7a22a32
Read Ahead Bug Fixes
May 8, 2022
afbc18b
Event Queue De-Centralization
May 10, 2022
79b9d18
Bug Fix
May 10, 2022
fab7f15
Various fixes to the DAOS tensorflow-io plugin. (#2)
krehm Jun 5, 2022
6746017
Merge branch 'tensorflow:master' into devel
omar-hmarzouk Jun 5, 2022
b0d5ad2
Linting
Jun 5, 2022
1af1181
Optimizations Added
Jun 6, 2022
b0a7b7d
Adjustments to Reading, Single Event Queue Handle, Paths Caching, and…
Jun 6, 2022
eb90e69
Add support for dynamically loaded DAOS libraries (#4)
krehm Jun 22, 2022
5f54aee
Linting
Jun 22, 2022
89c39e3
Various additional plugin fixes
krehm Jul 12, 2022
c53f6a0
Various additional plugin fixes (#6)
krehm Jul 25, 2022
66a764a
Adding Patches for Linting and skipping Windows & macOS Build
omar-hmarzouk Jul 26, 2022
83596c8
Merging Upstream
omar-hmarzouk Jul 26, 2022
Various additional plugin fixes
Global Changes:
  * The plugin was using duplicate definitions of internal DAOS client
    structures (dfs_obj_t, dfs_t, dfs_entry_t), and would create malloc'd
    copies of those structs in order to be able to access their private
    fields.  Should DAOS modify those structures in future releases, the
    plugin would break for those releases.  The dependencies on internal
    fields have been removed, the DAOS client API is now strictly followed.
  * The path_map and size_map caches used DFS mount-point-relative
    pathnames as keys.  If more than one DFS filesystem is mounted during
    the same run, then the same relative pathname could be in use in
    both filesystems.  Callers that retrieved values from the caches
    could get results for the wrong filesystem.  Code was changed to create
    path_map and size_map caches per-filesystem.
  * A number of fields (connected, daos_fs, pool, container) were
    stored in the global DFS structure, which meant that any time a
    path was presented to the plugin that was in a different DFS
    filesystem than the previous path, the current filesystem would
    have to be unmounted and then the new filesystem would be mounted.
    The application could have had files open at the time in the
    filesystem that was unmounted.  The code was changed to maintain
    filesystem state relative to each pool/container combo, and so
    any number of DFS filesystems can now be mounted simultaneously.
  * None of the code in the DFS Cleanup() function was ever being called.
    This is a known tensorflow issue, see
        tensorflow/tensorflow#27535
    The workaround is to call Cleanup() via the atexit() function.
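The atexit() workaround can be sketched as follows. This is an illustrative stand-in, not the plugin's actual code: PluginCleanup and RegisterCleanupAtExit are hypothetical names, and the body only logs where the real Cleanup() would release DAOS resources.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for the plugin's Cleanup() teardown logic.
// In the real plugin this would disconnect containers/pools and
// release DAOS resources; here it only logs.
static void PluginCleanup() {
  std::puts("plugin cleanup ran");
}

// Called once from the plugin's init path.  Because TensorFlow never
// invokes the filesystem Cleanup() hook (tensorflow/tensorflow#27535),
// the teardown is registered to run at process exit instead.
static void RegisterCleanupAtExit() {
  std::atexit(PluginCleanup);
}
```

With this in place, PluginCleanup() runs when the process exits normally, regardless of whether TensorFlow ever calls the plugin's Cleanup() hook.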
  * The RenameFile function was enhanced to delete the cached size
    of the source file and store that cached size for the destination
    file.
  * The dfsLookUp() routine required the caller to indicate whether
    or not the object to be looked up was a directory.  This was
    necessary because dfs_open() was being used to open the object,
    and that call requires a different open_mode for directories and
    files.  However, the caller does not always know the type of the
    object being looked up, e.g. PathExists() and IsDir().  If the
    caller guesses wrong, then the dfs_open() call fails, either with
    EINVAL or ENOTDIR.  The caller would map these errors to ENOENT,
    which is incorrect.  Code was changed to replace the dfs_open()
    call with dfs_lookup_rel(), which removes the requirement that
    the caller know the object's type a priori, the caller can check
    the type of the object after it has been opened.
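The "look up first, check the type afterwards" pattern can be illustrated in miniature with plain POSIX calls (this is an analogy only; the plugin uses dfs_lookup_rel() and DAOS mode bits, and ClassifyPath is a made-up name):

```cpp
#include <sys/stat.h>

// Instead of requiring the caller to guess whether a path names a
// directory or a file before opening it, resolve the path generically
// and inspect the returned mode afterwards.  dfs_lookup_rel() enables
// the same order of operations in the plugin.
// Returns 1 for a directory, 0 for a non-directory, -1 if lookup fails.
static int ClassifyPath(const char* path) {
  struct stat st;
  if (stat(path, &st) != 0) return -1;  // genuine lookup failure (ENOENT etc.)
  return S_ISDIR(st.st_mode) ? 1 : 0;   // type checked after the lookup
}
```

The key property is that a lookup failure is now unambiguous: it means the object does not exist, not that the caller guessed the wrong type.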
  * The dfsLookUp() routine required all callers to implement three
    different behaviors depending upon the type of object being opened.
    1. If a DFS root directory, a null dfs_obj_t would be returned;
       this would have to be special-cased by the caller.
    2. If a non-root directory, a non-null dfs_obj_t would be returned
       which the caller must never release because the dfs_obj_t is
       also an entry in the path_map cache.  Releasing the entry would
       cause future requests that use that cache entry to fail.  There
       were a few cases in the code where this was occurring.
    3. If a non-directory, a non-null dfs_obj_t would be returned
       which the caller must always release when done with it.
    The code was changed so that a DFS root directory returns a
    non-null dfs_obj_t.  Also, whenever a directory that is in the
    path_map cache is referenced, dfs_dup() is used to make a (cheap)
    copy of the dfs_obj_t to return to the caller, so that the cached
    copy is never used outside of the cache.  As a result, dfsLookUp()
    now always returns a non-null dfs_obj_t which must be released when
    no longer in use.  Another advantage of using dfs_dup() is that it
    is then safe at any moment to clear a filesystem's path_map cache,
    there is no possibility that some caller is using a cached dfs_obj_t
    at that time.
  * All relative path references in the code have been replaced with
    references to a dfs_path_t class which encapsulates everything
    known about a particular DFS path, including the filesystem in
    which the path resides.  Member functions make it easy to update
    the correct caches for the correct filesystem for each path.
    Also, there were many places throughout the code where string
    manipulation was being done, e.g. to extract a parent pathname or
    a basename.  That code has been replaced with dfs_path_t member
    functions so that the actual string manipulation only occurs in
    a single place in the plugin.
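The idea behind dfs_path_t can be sketched as below. The member names and layout here are assumptions for illustration, not the actual class: the point is that one object carries the filesystem identity plus the relative path, and the string manipulation lives in member functions rather than being repeated at call sites.

```cpp
#include <string>
#include <utility>

// Hedged sketch of a dfs_path_t-like class (names are assumptions).
class dfs_path_t {
 public:
  dfs_path_t(std::string pool, std::string cont, std::string rel)
      : pool_(std::move(pool)), cont_(std::move(cont)), rel_(std::move(rel)) {}

  // Parent directory of the relative path ("/" for top-level entries).
  std::string parent() const {
    auto pos = rel_.find_last_of('/');
    return (pos == std::string::npos || pos == 0) ? "/" : rel_.substr(0, pos);
  }

  // Final component of the relative path.
  std::string basename() const {
    auto pos = rel_.find_last_of('/');
    return (pos == std::string::npos) ? rel_ : rel_.substr(pos + 1);
  }

  // Key identifying the filesystem, e.g. for per-filesystem cache maps.
  std::string fs_key() const { return pool_ + "/" + cont_; }

 private:
  std::string pool_, cont_, rel_;
};
```

Because the object knows its pool and container, helpers like fs_key() make it straightforward to update the cache maps of the correct filesystem for each path.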
  * Setup() now initializes a dfs_path_t instead of global pool, cont,
    and rel_path variables.  It also does some minor lexical
    normalization of the rel_path member, as opposed to doing so in
    multiple places in the code downstream.
  * Code was modified in various places so that 100 of the tests in
    the tensorflow 'modular_filesystem_test' test suite pass.  There are
    three remaining failing tests.  One is an incorrect test, one is
    checking a function not implemented in the plugin.  The third is
    reporting failures in TranslateName() which will be handled in a
    separate PR.
  * The plugin was coded to use 'dfs://' as the filesystem prefix, but
    the DAOS client API is coded internally to use 'daos://' as the
    prefix.  The plugin was changed to use 'daos://' so that pathnames
    used by one application would not have to be munged in order to
    also work with tensorflow.
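The shape of the daos:// naming scheme can be shown with a minimal splitter. This is only a sketch of the URI layout; the real plugin delegates validation to duns_resolve_path(), and SplitDaosUri/ParsedUri are hypothetical names:

```cpp
#include <optional>
#include <string>

struct ParsedUri { std::string pool, cont, rel; };

// Minimal sketch: split "daos://<pool>/<cont>[/<rel_path>]" into its
// components.  No validation of pool/container ids or labels is done
// here; the plugin leaves that to duns_resolve_path().
static std::optional<ParsedUri> SplitDaosUri(const std::string& uri) {
  const std::string prefix = "daos://";
  if (uri.rfind(prefix, 0) != 0) return std::nullopt;  // wrong scheme
  std::string rest = uri.substr(prefix.size());        // pool/cont[/rel]
  auto p1 = rest.find('/');
  if (p1 == std::string::npos) return std::nullopt;    // no container part
  auto p2 = rest.find('/', p1 + 1);
  ParsedUri out;
  out.pool = rest.substr(0, p1);
  if (p2 == std::string::npos) {
    out.cont = rest.substr(p1 + 1);
    out.rel = "/";                                     // container root
  } else {
    out.cont = rest.substr(p1 + 1, p2 - p1 - 1);
    out.rel = rest.substr(p2);
  }
  return out;
}
```

With the plugin and the DAOS client agreeing on the daos:// prefix, a path such as daos://pool1/cont1/dir/file works unchanged in both.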

Per file changes:

dfs_utils.h:
    * The per-container class cont_info_t was added that maintains all
      per-filesystem state.
    * Class dfs_path_t was added that maintains all per-file state.
      The class knows which filesystem the file is in, e.g. to update
      the correct cache maps.
    * The global fields connected, daos_fs, pool, container, path_map,
      and size_map are removed, replaced by the per-filesystem versions.
    * Mounts are now done at the same time as connection to the container,
      filesystems remain mounted until their containers are disconnected.

dfs_filesystem.cc:
  * Many of the functions were made static so that they don't show up
    in the library's symbol table, avoiding potential conflicts with
    other plugins.
  * Changed path references to dfs_path_t references throughout.
  DFSRandomAccessFile()
    * Replaced the dpath string with the dfs_path_t as a constructor
      parameter so that the per-filesystem size cache can be updated.
  DFSWritableFile()
    * Replaced the dpath string with the dfs_path_t as a constructor
      parameter so that the per-filesystem size cache can be updated
      whenever the file is appended to.
  NewWritableFile()
    * Changed file creation mode parameter to include S_IRUSR so that
      files can be read when the filesystem is mounted via fuse.
  NewAppendableFile()
    * Changed file creation mode parameter to include S_IRUSR so that
      files can be read when the filesystem is mounted via fuse.
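The mode change can be illustrated in POSIX terms (the plugin passes an analogous mode to its DAOS file-creation call; the function name here is made up):

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// A file created with S_IWUSR alone is write-only for its owner, so it
// cannot be read back when the filesystem is mounted via fuse.  OR-ing
// in S_IRUSR makes the new file owner-readable as well.
static int CreateWritableReadableFile(const char* path) {
  return open(path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
}
```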
  PathExists()
    * Reworked the code to work with the new dfsLookUp() behavior.
      The dfsPathExists() call was removed as it no longer provided
      anything not already provided by dfsLookUp().  Many errors
      returned by dfsPathExists() were mapped to TF_NOT_FOUND, which
      was incorrect.  In addition, PathExists() can be called for
      either files or directories, but dfsPathExists() internally
      called dfsLookUp() with isDirectory = false, so callers that
      passed in a directory path would get failures.
  Stat()
    * Previously called dfs_get_size() and then dfs_ostat(), but the
      file size is already available in stbuf, so the dfs_get_size()
      call was extra overhead and was removed.
  FlushCaches()
    * Previously called ClearConnections(), which unmounted any
      filesystem and disconnected from its container and pool even
      when there could be files open for read or write.  The
      ClearConnections() call was removed.  Code was added to clear
      the size caches as well as the directory caches.

dfs_utils.cc
  * New functions were added for clearing individual filesystem caches
    and all filesystem caches for all mounted filesystems.
  * There was code in many places for path string manipulation, checking
    if an object was a directory, etc.  dfs_path_t member functions were
    created to replace all these so that a particular operation was only
    implemented in one spot in the code.
  DFS::~DFS()
    * The code to clear the directory cache only released the first entry,
      there was no code to iterate through the container.  Replaced with
      function calls to clear all directory and size caches.
  Unmount()
    * Now done automatically as part of disconnecting a container, a
      separate function was no longer needed.
  ParseDFSPath()
    * The code assumed that any path it was given would have both pool
      and container components, so it was unable to handle malformed
      paths.  Code was changed to let duns_resolve_path() validate the
      path components.  There used to be two calls to
      duns_resolve_path() because DUNS_NO_CHECK_PATH was not set, and
      duns_resolve_path() only recognizes uuids when that flag is not
      set, so the first call would fail if the pool and container
      components were labels.  When pool and/or container labels were
      used, the duns_resolve_path() code would check the path against
      locally mounted filesystems, and would hopefully fail.  The code
      then prepended dfs:: and tried again, which would be recognized
      as a "direct path".  Paths which only contained uuids were
      successfully parsed with the first duns_resolve_path() call.  By
      using the DUNS_NO_CHECK_PATH flag and always including the
      daos:// prefix, only a single system call is needed.
  Setup()
    * Reworked to populate a dfs_path_t instead of separate pool,
      cont, and relpath variables.  A filesystem is now automatically
      mounted as part of connecting to the container, so a separate
      function was no longer needed.
  ClearConnections()
    * The code for looping through pools and containers didn't work
      properly because the subroutines erase their map entries
      internally, which invalidates the iterators being used in
      ClearConnections().  Code was changed so that the iterators
      are reinitialized each time through the loop.
    * Code to disconnect all the containers in a pool was moved to
      the DisconnectPool() function, so that it is not possible to
      disconnect a pool without first disconnecting all its containers.
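The iterator hazard fixed above can be reproduced with a plain std::map. The names here are illustrative stand-ins for the plugin's container map and disconnect routine:

```cpp
#include <map>
#include <string>

// Stand-in for the plugin's map of open container handles.
static std::map<std::string, int> handles;

// Like the plugin's disconnect subroutines, this erases its own map
// entry, invalidating any iterator the caller was holding.
static void Disconnect(const std::string& key) {
  handles.erase(key);
}

// Safe loop: reinitialize the iterator on every pass instead of
// advancing an iterator that Disconnect() has just invalidated.
static void ClearAll() {
  while (!handles.empty()) {
    Disconnect(handles.begin()->first);
  }
}
```

Advancing a cached iterator after Disconnect() would be undefined behavior; restarting from begin() after each erase sidesteps the problem entirely.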
  dfsDeleteObject()
    * Enhanced to only clear the directory cache for the filesystem in
      which the object existed.
    * If the object was a file, the size cache entry for that file is
      deleted.  If a directory was being recursively deleted, the
      filesystem's size cache is now also cleared.
  dfsLookUp() and dfsFindParent()
    * As mentioned at the top, code was rewritten so that cached
      directory entries are never returned to a caller, instead
      a dup reference is returned so that the caller is always
      given an object reference it must release.
  dfsCreateDir()
    * Error exit statuses were enhanced in order to pass the tensorflow
      'modular_filesystem_test' test suite.
  ConnectPool()
    * Simplified somewhat as the pool id_handle_t is no longer needed.
  ConnectContainer()
    * Simplified somewhat as the cont id_handle_t is no longer needed.
    * Added code to immediately mount any container that is connected.
    * Code was added to initialize all the per-filesystem state
      variables.
  DisconnectPool()
    * Added code to disconnect any containers before disconnecting the
      pool.
  DisconnectContainer()
    * Added code to unmount any filesystem before disconnecting its
      container.
  * Added all the dfs_path_t member function implementations.
  * Included a few new dsym references for dfs function calls that have
    been added.

Signed-off-by: Kevan Rehm <kevan.rehm@hpe.com>
krehm committed Jul 13, 2022
commit 89c39e3ed5061d595310ba10d9a9d2ccb5f77a53
4 changes: 2 additions & 2 deletions docs/daos_tf_docs.md
@@ -1,6 +1,6 @@
# DAOS-TensorFlow IO GUIDE

## Table Of Content
## Table Of Contents

- [Features](#features)
- [Prerequisites](#prerequisites)
@@ -15,7 +15,7 @@

## Prerequisites

* A valid DAOS installation, currently based on [version v1.3.106](https://github.com/daos-stack/daos/releases/tag/v1.3.106-tb)
* A valid DAOS installation, currently based on [version v2.0.2](https://github.com/daos-stack/daos/releases/tag/v2.0.2)
* An installation guide and steps can be accessed from [here](https://docs.daos.io/admin/installation/)

## Environment Setup
6 changes: 3 additions & 3 deletions docs/tutorials/daos.ipynb
@@ -71,8 +71,8 @@
"\n",
"The pool and container id or label are part of the filename uri:\n",
"```\n",
"dfs://<pool_id>/<cont_id>/<path>\n",
"dfs://<pool-label>/cont-label/<path>\n",
"daos://<pool_id>/<cont_id>/<path>\n",
"daos://<pool-label>/cont-label/<path>\n",
"```"
]
},
@@ -230,7 +230,7 @@
},
"outputs": [],
"source": [
"dfs_url = \"dfs://TEST_POOL/TEST_CONT/\" # This the path you'll be using to load and access the dataset\n",
"dfs_url = \"daos://TEST_POOL/TEST_CONT/\" # This the path you'll be using to load and access the dataset\n",
"pwd = !pwd\n",
"posix_url = pwd[0] + \"/tests/test_dfs/\""
]