
Add a filesystem parameter for cf.read()/cfdm.read() #931

@bnlawrence

Description

Context

There are situations where we have high latency in remote file systems, and we don't want cf-python to be opening such file systems every time. There are also some types of remote file system that cf-python doesn't support, and adding support for each one individually seems an unnecessary burden now that we have a pure-Python backend available that can take fsspec objects.

Clearly I didn't write all this myself, but it is the result of a conversation with AI about what we need/want to do ...

Summary

Add a filesystem keyword argument to cf.read() (and cfdm.read()) that accepts a
pre-authenticated fsspec AbstractFileSystem
object. When present, cfdm uses filesystem.open(path, "rb") to obtain a file-like
object and passes it directly to h5netcdf.File. This requires no changes to h5netcdf
or pyfive, unlocks SSH/SFTP natively, and allows warm connection reuse for any protocol.


Background

What works today

cf.read("s3://bucket/path.nc", storage_options={...}) works because cfdm's
netcdfread.py has an explicit branch:

if u.scheme == "s3":
    fs = s3fs.S3FileSystem(**storage_options)
    path = fs.open(uri)          # → file-like
    ...open h5netcdf with path...

cf.read("https://server/path.nc") works because the string falls through to the
netCDF4-C backend, whose OPeNDAP support recognises http URLs.

What does not work

cf.read("ssh://host/path.nc") raises DatasetTypeError.

Verified from source: there is no ssh/sftp handling in either the cf or cfdm
packages (confirmed by exhaustive grep and a runtime test).

The actual blockage

The barrier is not in h5netcdf or pyfive. It is entirely in cfdm's _datasets()
generator (cfdm read.py, line ~351):

for datasets1 in datasets:
    datasets1 = expanduser(expandvars(datasets1))   # ← fails on non-strings
    u = urisplit(datasets1)
    if u.scheme not in (None, "file"):
        yield datasets1    # remote URI passed as string — no fs object ever created
        continue
    ...iglob, walk, etc...

Every item in datasets is required to be a str. A file-like object, a
pathlib.Path, an (fs, path) tuple, or an fsspec.core.OpenFile all fail here.
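This bottleneck can be reproduced with the standard library alone (no cfdm needed); here an io.BytesIO stands in for a remote file-like object:

```python
import io
from os.path import expanduser, expandvars

# Reproduces the failure mode described above: expanduser()/expandvars()
# accept only str, bytes, or os.PathLike, so a file-like object raises
# TypeError before any backend is ever reached.
file_like = io.BytesIO(b"\x89HDF\r\n\x1a\n")  # pretend opened remote dataset

try:
    expanduser(expandvars(file_like))
    failed = False
except TypeError:
    failed = True

print("file-like rejected:", failed)
```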

The subsequent NetCDF read path (netcdfread.py lines 520–585) only constructs an
s3fs.S3FileSystem from storage_options; for all other remote schemes the string is
passed verbatim to the netCDF4-C / h5netcdf constructors, which either reject it
(ssh://) or interpret it as an OPeNDAP URL (http://).

h5netcdf and pyfive already support file-like objects

h5netcdf.File(path, ...) explicitly handles three input types (from its source):

if isinstance(path, str):
    h5file = h5py.File(path, mode, **kwargs)       # string path or http URL
elif isinstance(path, h5py.File):
    return path, (mode in {"r", "r+", "a"}), False # already-open h5py handle
else:
    h5file = h5py.File(path, mode, **kwargs)        # ← file-like object

h5py.File itself states in its docstring:

name: Name of the file on disk, or file-like object.

_open_pyfive(path, mode) simply calls pyfive.File(path, mode) — so pyfive receives
whatever h5netcdf passes, including file-like objects.

Conclusion: the entire h5netcdf / pyfive stack already handles file-like objects
today. Only cfdm's string-only input pipeline prevents their use.


Proposed Change

New keyword argument

Add filesystem to both cf.read() and cfdm.read():

cf.read(
    datasets,
    ...,
    storage_options=None,    # existing
    filesystem=None,         # NEW: a pre-authenticated fsspec AbstractFileSystem
)

Semantics

When filesystem is not None:

  1. datasets must be a single path string (or list of path strings) that the given
    filesystem understands.
  2. cfdm bypasses the URI-dispatch and s3fs-construction logic entirely.
  3. For each path, cfdm calls filesystem.open(path, "rb") to obtain a seekable
    file-like object.
  4. The file-like object is passed as the path argument to h5netcdf.File.
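As an illustration only, the four steps can be sketched with a stand-in filesystem; FakeFileSystem and open_datasets are hypothetical names, not cfdm code:

```python
import io

class FakeFileSystem:
    """Stand-in for fsspec's AbstractFileSystem (illustration only): the
    single method this proposal relies on is open(path, "rb"), which must
    return a seekable file-like object."""

    def __init__(self, files):
        self._files = files

    def open(self, path, mode="rb"):
        return io.BytesIO(self._files[path])

def open_datasets(datasets, filesystem):
    """Sketch of steps 1-4: accept a path or list of paths, bypass URI
    dispatch entirely, and return the file-like objects that would then
    be handed to h5netcdf.File(file_object, mode="r")."""
    if isinstance(datasets, str):
        datasets = [datasets]                        # step 1
    return [filesystem.open(p, "rb") for p in datasets]  # steps 2-3

fs = FakeFileSystem({"/data/run1.nc": b"\x89HDF\r\n\x1a\n"})
handles = open_datasets("/data/run1.nc", fs)
print(handles[0].read(4))  # b'\x89HDF'
```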

This makes the call sites look like:

# SSH (currently impossible)
import fsspec
fs = fsspec.filesystem("ssh", host="hpc.example.ac.uk", username="user",
                       key_filename="~/.ssh/id_rsa")
cf.read("/data/model/run1.nc", filesystem=fs)

# S3 with pre-authenticated, reused connection (warmed up earlier)
import s3fs
fs = s3fs.S3FileSystem(key=KEY, secret=SECRET, endpoint_url=ENDPOINT)
cf.read("s3://bucket/path/run1.nc", filesystem=fs)

# SFTP via ProxyJump (handled entirely by fsspec/asyncssh)
fs = fsspec.filesystem("sftp", host="internal.hpc", username="user",
                       key_filename="...", proxy_jump="gateway.example.ac.uk")
cf.read("/scratch/run1.nc", filesystem=fs)

Scope of changes in cfdm

The change is narrow and self-contained. The only file that needs modification is
cfdm/read_write/read.py (the _datasets() generator) and
cfdm/read_write/netcdf/netcdfread.py (the open logic).

_datasets() — skip string processing when filesystem is given

if kwargs.get("filesystem") is not None:
    # filesystem provided — datasets items are paths on that fs, not local strings
    for path in self._flat(kwargs["datasets"]):
        n_datasets += 1
        yield path
    return

This short-circuits before expanduser, urisplit, and iglob.
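The flattening the snippet relies on (the real helper is cfdm's self._flat; the behaviour below is an assumption) amounts to yielding leaf path strings from an arbitrarily nested sequence:

```python
def flatten_paths(datasets):
    """Hypothetical sketch of the _flat behaviour assumed above: yield
    leaf path strings from an arbitrarily nested sequence."""
    if isinstance(datasets, str):
        yield datasets
    else:
        for item in datasets:
            yield from flatten_paths(item)

print(list(flatten_paths(["a.nc", ["b.nc", ("c.nc",)]])))
# ['a.nc', 'b.nc', 'c.nc']
```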

netcdfread.py — open via filesystem when provided

In the existing open_netcdf / local-open block (currently if u.scheme == "s3": ...),
add a parallel branch:

filesystem = kwargs.get("filesystem")
if filesystem is not None:
    file_object = filesystem.open(dataset, "rb")
    nc = h5netcdf.File(file_object, mode="r", ...)
else:
    # existing s3 / local / opendap dispatch
    ...

The dataset_type() class method that probes format also needs a guard: when
filesystem is provided, skip the string-based urisplit check and probe by attempting
to open with h5netcdf directly (or assume netCDF4/HDF5 and let the caller specify
dataset_type= explicitly if needed).
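One possible shape for that guard, probing the leading bytes of the already-open file-like object instead of a string path (probe_dataset_type is a hypothetical helper; the magic numbers are the standard HDF5 and classic netCDF signatures):

```python
import io

HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"   # HDF5 superblock signature (netCDF-4)

def probe_dataset_type(file_object):
    """Hypothetical guard for dataset_type(): with filesystem= there is
    no string to urisplit, so sniff the first bytes of the file-like
    object instead, restoring the stream position afterwards."""
    pos = file_object.tell()
    magic = file_object.read(8)
    file_object.seek(pos)
    if magic == HDF5_MAGIC:
        return "netCDF-4"
    if magic[:3] == b"CDF":          # classic netCDF-3 signature
        return "netCDF-3"
    return None                       # caller should pass dataset_type=

print(probe_dataset_type(io.BytesIO(HDF5_MAGIC + b"...")))  # netCDF-4
```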

Total line count of change: estimated 20–40 lines across two files.


Why the h5netcdf / pyfive Backend Is the Right Target

The netCDF4 (C-library) backend does not natively accept file-like objects (it
has a memory= parameter for in-memory bytes buffers, but that requires a full copy
in memory before reading begins).

The h5netcdf backend (with either h5py or pyfive) accepts file-like objects natively
as shown above.

Since pyfive is a pure-Python HDF5 reader and the intended future preferred backend for
cf-python, and since pyfive's File(path, mode) already accepts anything that h5netcdf
passes, this change leverages the pure-Python stack cleanly with no C-library
constraints.

The proposal can therefore be described as:

File-like input support for the h5netcdf/pyfive backend path.

The netCDF4 backend would continue to require string paths (its existing behaviour is
unchanged).


Motivation: Connection Warm-Up for Remote Files

The immediate motivation comes from latency hiding in applications that browse remote
filesystems before opening a file.

An application (e.g. xconv2) uses fsspec to browse files on S3 or SSH while the user
navigates. When the user finally selects a file, the fsspec filesystem object is already
authenticated and connected. Without filesystem=:

  • S3: must reconstruct s3fs.S3FileSystem from credentials — nearly instant but
    wasteful if credentials need re-validation or a new connection is opened.
  • SSH: impossible without staging or FUSE mount; cf.read("ssh://...") raises
    DatasetTypeError.

With filesystem=:

  • The warm, authenticated AbstractFileSystem is passed directly.
  • No redundant authentication round-trip.
  • SSH, SFTP, and any other fsspec-supported protocol work identically.

The HTTP Case: OPeNDAP vs Plain Range-Get

What happens today with http:// URIs

cf.read("http://server/...") reaches open_netcdf() as a bare string (the
_datasets() generator yields all non-file/None scheme URIs unchanged).
Inside open_netcdf(), the s3 branch is not taken, so the string goes directly
to the backend loop:

  1. h5netcdf/h5py: h5netcdf.File("http://...", "r") calls
    h5py.File("http://...", "r") with no driver= argument. h5py does have a
    ros3 driver in its driver list (h5fd.ROS3D is present) but in the conda
    work26 build ros3 is not compiled in: attempting driver='ros3' raises
    ValueError: h5py was built without ROS3 support. Without ros3, h5py treats
    the URL as a local filesystem path and raises FileNotFoundError. The h5netcdf
    backend therefore fails for any http URL.

  2. netCDF4-C: netCDF4.Dataset("http://...", "r") uses the netCDF-C library's
    OPeNDAP stack (libdap- or libcurl-based DAP client). This succeeds only if the
    server speaks the DAP2/DAP4 protocol (OPeNDAP, THREDDS, Hyrax, etc.). A
    plain nginx-served HDF5 file will fail with an OPeNDAP parse error because the
    server returns raw HDF5 bytes, not a DAP response.

Summary: cf.read("http://...") today means OPeNDAP only. It does not support
plain HTTP servers that serve HDF5/netCDF4 files with byte-range requests (nginx,
Apache, any static file host).

Why filesystem= solves plain HTTP

fsspec's HTTPFileSystem implements the HDF5 access pattern correctly.
HTTPFile (a subclass of AbstractBufferedFile) issues Range: bytes=X-Y requests
and presents a seekable file-like interface to the caller. This can be verified from
the fsspec source: `async_fetch_range()` sets `headers["Range"] = f"bytes={start}-{end - 1}"`
and validates `Content-Range` in the response.

h5py's registered_drivers() includes 'fileobj' — it accepts seekable file-like
objects via that driver when h5py.File(file_like_obj, "r") is called. h5netcdf
passes non-string paths straight to h5py.File unchanged (the else branch in
_open_h5py), and pyfive similarly receives whatever h5netcdf passes.

Therefore, with the proposed filesystem= parameter:

import fsspec
fs = fsspec.filesystem("http")
# Plain nginx server, no OPeNDAP - works because fsspec issues Range requests
cf.read("/path/to/data.nc", filesystem=fs)

cfdm calls fs.open("http://server/path/to/data.nc", "rb") → returns an HTTPFile
with range-get → passed to h5netcdf.File → h5py fileobj driver → random-access
HDF5 reads over HTTP.

This path does not require ros3 in the h5py build, does not require a DAP server,
and works with any HTTP/HTTPS server that honours Range headers (nginx, Apache,
object storage HTTP endpoints, etc.).

OPeNDAP is unaffected

The existing OPeNDAP path (bare http:// string → netCDF4-C DAP client) continues
to work exactly as before for users who do not supply filesystem=. Supplying
filesystem=fsspec.filesystem("http") explicitly opts in to the range-get path instead.


Relationship to Existing storage_options

storage_options will continue to work as-is for programmatic S3 credential injection
without a pre-built filesystem. The new filesystem parameter is not a replacement;
it is an escape hatch for callers that already hold a live filesystem object.

If both are provided, filesystem takes precedence (the pre-built object is presumably
more up-to-date) and a warning can optionally be emitted.
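The precedence rule could be as small as this; resolve_filesystem is an illustrative name, not proposed API:

```python
import warnings

def resolve_filesystem(storage_options=None, filesystem=None):
    """Hypothetical precedence sketch: a pre-built filesystem wins over
    storage_options, with a warning so the caller knows the credentials
    dict was ignored."""
    if filesystem is not None:
        if storage_options is not None:
            warnings.warn(
                "Both filesystem= and storage_options= supplied; "
                "using filesystem= and ignoring storage_options="
            )
        return filesystem
    return storage_options  # fall through to the existing s3fs path

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    chosen = resolve_filesystem(storage_options={"key": "k"}, filesystem="fs")

print(chosen, len(caught))  # fs 1
```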


Out of Scope

  • Zarr datasets via fsspec FSStore / zarr.storage — Zarr already has its own
    fsspec integration and is a separate dispatch path in cfdm.
  • CDL string datasets — unchanged.
  • Write (cf.write) — write paths are not considered here.
  • Validation that the provided filesystem is seekable — left to h5netcdf/h5py/pyfive to
    raise naturally.

Summary of Benefits

| Scenario | Before | After |
| --- | --- | --- |
| S3 with pre-authenticated s3fs | Must reconstruct fs from storage_options | Pass existing fs directly |
| SSH / SFTP | Impossible (DatasetTypeError) | Works via fsspec/asyncssh |
| SFTP via ProxyJump | Impossible | Works |
| HTTP OPeNDAP server | Works (netCDF4-C DAP client) | Unchanged — still works without filesystem= |
| HTTP plain file server (nginx, range-get) | Fails (not OPeNDAP, no ros3) | Works via filesystem=fsspec.filesystem("http") |
| Any future fsspec backend | Impossible | Works |
| Change size in cfdm | ~20–40 lines across 2 files | |
| Changes to h5netcdf | None required | |
| Changes to pyfive | None required | |
