Context
There are situations where we have high latency in remote file systems, and we don't want cf-python to be opening such file systems every time. There are also some types of remote file system that cf-python doesn't support, and adding support for each one individually seems an unnecessary burden now that we have a pure-Python backend available that can take fsspec objects.
Clearly I didn't write all this myself, but it is the result of a conversation with AI about what we need/want to do ...
Summary
Add a filesystem keyword argument to cf.read() (and cfdm.read()) that accepts a
pre-authenticated fsspec AbstractFileSystem
object. When present, cfdm uses filesystem.open(path, "rb") to obtain a file-like
object and passes it directly to h5netcdf.File. This requires no changes to h5netcdf
or pyfive, unlocks SSH/SFTP natively, and allows warm connection reuse for any protocol.
Background
What works today
cf.read("s3://bucket/path.nc", storage_options={...}) works because cfdm's
netcdfread.py has an explicit branch:

```python
if u.scheme == "s3":
    fs = s3fs.S3FileSystem(**storage_options)
    path = fs.open(uri)  # → file-like
    # ...open h5netcdf with path...
```

cf.read("https://server/path.nc") works because the netCDF4-C backend recognises
http URLs and delegates to its OPeNDAP support.
What does not work
cf.read("ssh://host/path.nc") raises DatasetTypeError.
Verified from source: there is no ssh/sftp handling in either the cf or cfdm
package (confirmed by exhaustive grep and runtime test).
The actual blockage
The barrier is not in h5netcdf or pyfive. It is entirely in cfdm's _datasets()
generator (cfdm read.py, line ~351):
```python
for datasets1 in datasets:
    datasets1 = expanduser(expandvars(datasets1))  # ← fails on non-strings
    u = urisplit(datasets1)
    if u.scheme not in (None, "file"):
        yield datasets1  # remote URI passed as string — no fs object ever created
        continue
    # ...iglob, walk, etc...
```

Every item in datasets is required to be a str. A file-like object, a
pathlib.Path, an (fs, path) tuple, or an fsspec.core.OpenFile all fail here.
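The type restriction can be demonstrated with the standard library alone: os.path.expanduser (the first call in the loop above) accepts only str or os.PathLike inputs, so a file-like object fails before cfdm ever inspects the scheme. A minimal sketch:

```python
import io
import os.path

# A file-like object, such as fsspec's open() would return, is neither
# str nor os.PathLike, so expanduser() raises TypeError immediately.
file_like = io.BytesIO(b"\x89HDF\r\n\x1a\n")

try:
    os.path.expanduser(file_like)
    failed = False
except TypeError:
    failed = True

print(failed)  # → True: the pipeline breaks before any backend is reached
```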
The subsequent NetCDF read path (netcdfread.py lines 520–585) only constructs an
s3fs.S3FileSystem from storage_options; for all other remote schemes the string is
passed verbatim to the netCDF4-C / h5netcdf constructors, which either reject it
(ssh://) or interpret it as an OPeNDAP URL (http://).
h5netcdf and pyfive already support file-like objects
h5netcdf.File(path, ...) explicitly handles three input types (from its source):

```python
if isinstance(path, str):
    h5file = h5py.File(path, mode, **kwargs)  # string path or http URL
elif isinstance(path, h5py.File):
    return path, (mode in {"r", "r+", "a"}), False  # already-open h5py handle
else:
    h5file = h5py.File(path, mode, **kwargs)  # ← file-like object
```

h5py.File itself states in its docstring:

    name: Name of the file on disk, or file-like object.

_open_pyfive(path, mode) simply calls pyfive.File(path, mode), so pyfive receives
whatever h5netcdf passes, including file-like objects.

Conclusion: the entire h5netcdf / pyfive stack already handles file-like objects
today. Only cfdm's string-only input pipeline prevents their use.
Proposed Change
New keyword argument
Add filesystem to both cf.read() and cfdm.read():
```python
cf.read(
    datasets,
    ...,
    storage_options=None,  # existing
    filesystem=None,       # NEW: a pre-authenticated fsspec AbstractFileSystem
)
```

Semantics
When filesystem is not None:
- datasets must be a single path string (or list of path strings) that the given filesystem understands.
- cfdm bypasses the URI-dispatch and s3fs-construction logic entirely.
- For each path, cfdm calls filesystem.open(path, "rb") to obtain a seekable file-like object.
- The file-like object is passed as the path argument to h5netcdf.File.
This makes the call sites look like:
```python
# SSH (currently impossible)
import fsspec

fs = fsspec.filesystem("ssh", host="hpc.example.ac.uk", username="user",
                       key_filename="~/.ssh/id_rsa")
cf.read("/data/model/run1.nc", filesystem=fs)

# S3 with pre-authenticated, reused connection (warmed up earlier)
import s3fs

fs = s3fs.S3FileSystem(key=KEY, secret=SECRET, endpoint_url=ENDPOINT)
cf.read("s3://bucket/path/run1.nc", filesystem=fs)

# SFTP via ProxyJump (handled entirely by fsspec/asyncssh)
fs = fsspec.filesystem("sftp", host="internal.hpc", username="user",
                       key_filename="...", proxy_jump="gateway.example.ac.uk")
cf.read("/scratch/run1.nc", filesystem=fs)
```

Scope of changes in cfdm
The change is narrow and self-contained. The only files that need modification are
cfdm/read_write/read.py (the _datasets() generator) and
cfdm/read_write/netcdf/netcdfread.py (the open logic).
_datasets() — skip string processing when filesystem is given
```python
if kwargs.get("filesystem") is not None:
    # filesystem provided — datasets items are paths on that fs, not local strings
    for path in self._flat(kwargs["datasets"]):
        n_datasets += 1
        yield path
    return
```

This short-circuits before expanduser, urisplit, and iglob.
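The dispatch behaviour can be sketched in isolation with pure Python (the helper names _flat and DummyFS are illustrative stand-ins, not cfdm's actual code): when a filesystem is supplied, every item is yielded verbatim; otherwise the existing string-only processing runs.

```python
import os.path


def _flat(datasets):
    # Flatten a single path or a list of paths (stand-in for cfdm's helper)
    if isinstance(datasets, (list, tuple)):
        yield from datasets
    else:
        yield datasets


def _datasets(datasets, filesystem=None):
    if filesystem is not None:
        # Short-circuit: paths are meaningful only to the given filesystem
        yield from _flat(datasets)
        return
    for d in _flat(datasets):
        # Existing behaviour: string-only local processing
        yield os.path.expanduser(d)


class DummyFS:  # stands in for an fsspec AbstractFileSystem
    pass


paths = list(_datasets(["/data/run1.nc", "/data/run2.nc"], filesystem=DummyFS()))
print(paths)  # → ['/data/run1.nc', '/data/run2.nc'] — yielded verbatim
```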
netcdfread.py — open via filesystem when provided
In the existing open_netcdf / local-open block (currently if u.scheme == "s3": ...),
add a parallel branch:
```python
filesystem = kwargs.get("filesystem")
if filesystem is not None:
    file_object = filesystem.open(dataset, "rb")
    nc = h5netcdf.File(file_object, mode="r", ...)
else:
    # existing s3 / local / opendap dispatch
    ...
```

The dataset_type() class method that probes format also needs a guard: when
filesystem is provided, skip the string-based urisplit check and probe by attempting
to open with h5netcdf directly (or assume netCDF4/HDF5 and let the caller specify
dataset_type= explicitly if needed).
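One possible shape for that guard is to sniff the magic bytes of the already-open file-like object instead of parsing the path string. This is a hypothetical sketch, not cfdm's actual probe (and it ignores HDF5 signatures at non-zero superblock offsets), but it shows the idea of probing and restoring the stream position:

```python
import io

# Signature at offset 0: HDF5 (netCDF-4); classic netCDF-3 starts with CDF\x01/\x02
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"


def probe_dataset_type(f):
    """Hypothetical format probe for a seekable file-like object."""
    pos = f.tell()
    magic = f.read(8)
    f.seek(pos)  # leave the handle where we found it
    if magic.startswith(HDF5_MAGIC):
        return "netCDF-4"
    if magic[:3] == b"CDF":
        return "netCDF-3"
    return "unknown"


fobj = io.BytesIO(HDF5_MAGIC + b"\x00" * 100)  # fake HDF5 header
print(probe_dataset_type(fobj))  # → netCDF-4
print(fobj.tell())  # → 0, position restored for the real reader
```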
Total line count of change: estimated 20–40 lines across two files.
Why the h5netcdf / pyfive Backend Is the Right Target
The netCDF4 (C-library) backend does not natively accept file-like objects (it
has a memory= parameter for in-memory bytes buffers, but that requires a full copy
in memory before reading begins).
The h5netcdf backend (with either h5py or pyfive) accepts file-like objects natively
as shown above.
Since pyfive is a pure-Python HDF5 reader and the intended future preferred backend for
cf-python, and since pyfive's File(path, mode) already accepts anything that h5netcdf
passes, this change leverages the pure-Python stack cleanly with no C-library
constraints.
The proposal can therefore be described as:
File-like input support for the h5netcdf/pyfive backend path.
The netCDF4 backend would continue to require string paths (its existing behaviour is
unchanged).
Motivation: Connection Warm-Up for Remote Files
The immediate motivation comes from latency hiding in applications that browse remote
filesystems before opening a file.
An application (e.g. xconv2) uses fsspec to browse files on S3 or SSH while the user
navigates. When the user finally selects a file, the fsspec filesystem object is already
authenticated and connected. Without filesystem=:
- S3: must reconstruct s3fs.S3FileSystem from credentials — nearly instant but wasteful if credentials need re-validation or a new connection is opened.
- SSH: impossible without staging or a FUSE mount; cf.read("ssh://...") raises DatasetTypeError.
With filesystem=:
- The warm, authenticated AbstractFileSystem is passed directly.
- No redundant authentication round-trip.
- SSH, SFTP, and any other fsspec-supported protocol work identically.
The HTTP Case: OPeNDAP vs Plain Range-Get
What happens today with http:// URIs
cf.read("http://server/...") reaches open_netcdf() as a bare string (the
_datasets() generator yields all non-file/None scheme URIs unchanged).
Inside open_netcdf(), the s3 branch is not taken, so the string goes directly
to the backend loop:
- h5netcdf/h5py — h5netcdf.File("http://...", "r") calls h5py.File("http://...", "r") with no driver= argument. h5py does have a ros3 driver in its driver list (h5fd.ROS3D is present), but in the conda work26 build ros3 is not compiled in: attempting driver='ros3' raises ValueError: h5py was built without ROS3 support. Without ros3, h5py treats the URL as a local filesystem path and raises FileNotFoundError. The h5netcdf backend therefore fails for any http URL.
- netCDF4-C — netCDF4.Dataset("http://...", "r") uses the libnetCDF4-C OPeNDAP stack (libdap or libcurl-based DAP client). This succeeds only if the server speaks the DAP2/DAP4 protocol (OPeNDAP, THREDDS, Hyrax, etc.). A plain nginx-served HDF5 file will fail with an OPeNDAP parse error because the server returns raw HDF5 bytes, not a DAP response.
Summary: cf.read("http://...") today means OPeNDAP only. It does not support
plain HTTP servers that serve HDF5/netCDF4 files with byte-range requests (nginx,
Apache, any static file host).
Why filesystem= solves plain HTTP
fsspec's HTTPFileSystem implements the HDF5 access pattern correctly.
HTTPFile (a subclass of AbstractBufferedFile) issues Range: bytes=X-Y requests
and presents a seekable file-like interface to the caller. This can be verified from
the fsspec source: async_fetch_range() sets headers["Range"] = f"bytes={start}-{end - 1}"
and validates Content-Range in the response.
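The request pattern fsspec relies on can be illustrated with a pure-Python sketch (no network, helper name illustrative): a seek followed by a read maps to one Range header of exactly the form quoted above, with an inclusive end byte.

```python
def range_header(offset, length):
    """Build the HTTP Range header for reading `length` bytes at `offset`.

    Mirrors the f-string quoted above: the end byte is inclusive,
    hence the `- 1`.
    """
    start = offset
    end = offset + length  # exclusive, as in a Python slice
    return {"Range": f"bytes={start}-{end - 1}"}


# h5py asking for an 8-byte signature at offset 0 becomes:
print(range_header(0, 8))  # → {'Range': 'bytes=0-7'}
# a 4 KiB chunk read at offset 1 MiB becomes:
print(range_header(1024 * 1024, 4096))  # → {'Range': 'bytes=1048576-1052671'}
```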
h5py's registered_drivers() includes 'fileobj' — it accepts seekable file-like
objects via that driver when h5py.File(file_like_obj, "r") is called. h5netcdf
passes non-string paths straight to h5py.File unchanged (the else branch in
_open_h5py), and pyfive similarly receives whatever h5netcdf passes.
Therefore, with the proposed filesystem= parameter:
```python
import fsspec

fs = fsspec.filesystem("http")
# Plain nginx server, no OPeNDAP - works because fsspec issues Range requests
cf.read("http://server/path/to/data.nc", filesystem=fs)
```

cfdm calls fs.open("http://server/path/to/data.nc", "rb") → returns an HTTPFile
with range-gets → passed to h5netcdf.File → h5py fileobj driver → random-access
HDF5 reads over HTTP.

This path does not require ros3 in the h5py build, does not require a DAP server,
and works with any HTTP/HTTPS server that honours Range headers (nginx, Apache,
object storage HTTP endpoints, etc.).
OPeNDAP is unaffected
The existing OPeNDAP path (bare http:// string → netCDF4-C DAP client) continues
to work exactly as before for users who do not supply filesystem=. Supplying
filesystem=fsspec.filesystem("http") explicitly opts in to the range-get path instead.
Relationship to Existing storage_options
storage_options will continue to work as-is for programmatic S3 credential injection
without a pre-built filesystem. The new filesystem parameter is not a replacement;
it is an escape hatch for callers that already hold a live filesystem object.
If both are provided, filesystem takes precedence (the pre-built object is presumably
more up-to-date) and a warning can optionally be emitted.
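The precedence rule could live in one small helper near the top of read(). A hedged sketch (the helper name and warning text are illustrative, not cfdm API):

```python
import warnings


def resolve_filesystem(filesystem=None, storage_options=None):
    """Hypothetical resolution of the two keywords: filesystem wins."""
    if filesystem is not None:
        if storage_options is not None:
            warnings.warn(
                "Both 'filesystem' and 'storage_options' given; "
                "'storage_options' is ignored."
            )
        return filesystem
    # Fall through: existing behaviour, e.g. build s3fs from storage_options
    return storage_options


class WarmFS:  # stands in for a pre-authenticated AbstractFileSystem
    pass


fs = WarmFS()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    chosen = resolve_filesystem(filesystem=fs, storage_options={"key": "..."})

print(chosen is fs)      # → True: the pre-built object takes precedence
print(len(caught) == 1)  # → True: one warning emitted
```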
Out of Scope
- Zarr datasets via fsspec FSStore/zarr.storage — Zarr already has its own fsspec integration and is a separate dispatch path in cfdm.
- CDL string datasets — unchanged.
- Write (cf.write) — write paths are not considered here.
- Validation that the provided filesystem is seekable — left to h5netcdf/h5py/pyfive to raise naturally.
Summary of Benefits
| Scenario | Before | After |
|---|---|---|
| S3 with pre-authenticated s3fs | Must reconstruct fs from storage_options | Pass existing fs directly |
| SSH / SFTP | Impossible (DatasetTypeError) | Works via fsspec/asyncssh |
| SFTP via ProxyJump | Impossible | Works |
| HTTP OPeNDAP server | Works (netCDF4-C DAP client) | Unchanged — still works without filesystem= |
| HTTP plain file server (nginx, range-get) | Fails (not OPeNDAP, no ros3) | Works via filesystem=fsspec.filesystem("http") |
| Any future fsspec backend | Impossible | Works |
| Change size in cfdm | — | ~20–40 lines across 2 files |
| Changes to h5netcdf | — | None required |
| Changes to pyfive | — | None required |