Context
There are situations where we have high latency in remote file systems, and we don't want cf-python to be opening such file systems every time. There are also some types of remote file system that cf-python doesn't support, and adding support for each one individually seems an unnecessary burden now that we have a pure-Python backend available that can take fsspec objects.
Clearly I didn't write all this myself, but it is the result of a conversation with AI about what we need/want to do ...
Summary
Add a filesystem keyword argument to cf.read() (and cfdm.read()) that accepts a
pre-authenticated fsspec AbstractFileSystem
object. When present, cfdm uses filesystem.open(path, "rb") to obtain a file-like
object and passes it directly to h5netcdf.File. This requires no changes to h5netcdf
or pyfive, unlocks SSH/SFTP natively, and allows warm connection reuse for any protocol.
Background
What works today
cf.read("s3://bucket/path.nc", storage_options={...}) works because cfdm's
netcdfread.py has an explicit branch:

```python
if u.scheme == "s3":
    fs = s3fs.S3FileSystem(**storage_options)
    path = fs.open(uri)  # → file-like
    # ...open h5netcdf with path...
```

cf.read("https://server/path.nc") works because the netCDF4-C backend recognises
http URLs and delegates to its OPeNDAP support.
What does not work
cf.read("ssh://host/path.nc") raises DatasetTypeError.
Verified from source: there is no ssh/sftp handling in either the cf or cfdm
package (confirmed by exhaustive grep and runtime test).
The actual blockage
The barrier is not in h5netcdf or pyfive. It is entirely in cfdm's _datasets()
generator (cfdm read.py, line ~351):
```python
for datasets1 in datasets:
    datasets1 = expanduser(expandvars(datasets1))  # ← fails on non-strings
    u = urisplit(datasets1)
    if u.scheme not in (None, "file"):
        yield datasets1  # remote URI passed as string — no fs object ever created
        continue
    # ...iglob, walk, etc...
```

Every item in datasets is required to be a str. A file-like object, a
pathlib.Path, an (fs, path) tuple, or an fsspec.core.OpenFile all fail here.
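The type restriction can be demonstrated with the standard library alone: os.path.expanduser (the first call in the loop above) accepts only str or os.PathLike inputs, so a file-like object fails before cfdm ever inspects the scheme. A minimal sketch:

```python
import io
import os.path

# A file-like object, such as fsspec's open() would return, is neither
# str nor os.PathLike, so expanduser() raises TypeError immediately.
file_like = io.BytesIO(b"\x89HDF\r\n\x1a\n")

try:
    os.path.expanduser(file_like)
    failed = False
except TypeError:
    failed = True

print(failed)  # → True: the pipeline breaks before any backend is reached
```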
The subsequent NetCDF read path (netcdfread.py lines 520–585) only constructs an
s3fs.S3FileSystem from storage_options; for all other remote schemes the string is
passed verbatim to the netCDF4-C / h5netcdf constructors, which either reject it
(ssh://) or interpret it as an OPeNDAP URL (http://).
h5netcdf and pyfive already support file-like objects
h5netcdf.File(path, ...) explicitly handles three input types (from its source):

```python
if isinstance(path, str):
    h5file = h5py.File(path, mode, **kwargs)  # string path or http URL
elif isinstance(path, h5py.File):
    return path, (mode in {"r", "r+", "a"}), False  # already-open h5py handle
else:
    h5file = h5py.File(path, mode, **kwargs)  # ← file-like object
```

h5py.File itself states in its docstring:

    name: Name of the file on disk, or file-like object.

_open_pyfive(path, mode) simply calls pyfive.File(path, mode), so pyfive receives
whatever h5netcdf passes, including file-like objects.

Conclusion: the entire h5netcdf / pyfive stack already handles file-like objects
today. Only cfdm's string-only input pipeline prevents their use.
Proposed Change
New keyword argument
Add filesystem to both cf.read() and cfdm.read():
```python
cf.read(
    datasets,
    ...,
    storage_options=None,  # existing
    filesystem=None,       # NEW: a pre-authenticated fsspec AbstractFileSystem
)
```

Semantics
When filesystem is not None:
- datasets must be a single path string (or list of path strings) that the given filesystem understands.
- cfdm bypasses the URI-dispatch and s3fs-construction logic entirely.
- For each path, cfdm calls filesystem.open(path, "rb") to obtain a seekable file-like object.
- The file-like object is passed as the path argument to h5netcdf.File.
This makes the call sites look like:
```python
# SSH (currently impossible)
import fsspec

fs = fsspec.filesystem("ssh", host="hpc.example.ac.uk", username="user",
                       key_filename="~/.ssh/id_rsa")
cf.read("/data/model/run1.nc", filesystem=fs)

# S3 with pre-authenticated, reused connection (warmed up earlier)
import s3fs

fs = s3fs.S3FileSystem(key=KEY, secret=SECRET, endpoint_url=ENDPOINT)
cf.read("s3://bucket/path/run1.nc", filesystem=fs)

# SFTP via ProxyJump (handled entirely by fsspec/asyncssh)
fs = fsspec.filesystem("sftp", host="internal.hpc", username="user",
                       key_filename="...", proxy_jump="gateway.example.ac.uk")
cf.read("/scratch/run1.nc", filesystem=fs)
```

Scope of changes in cfdm
The change is narrow and self-contained. The only files that need modification are
cfdm/read_write/read.py (the _datasets() generator) and
cfdm/read_write/netcdf/netcdfread.py (the open logic).
_datasets() — skip string processing when filesystem is given
```python
if kwargs.get("filesystem") is not None:
    # filesystem provided — datasets items are paths on that fs, not local strings
    for path in self._flat(kwargs["datasets"]):
        n_datasets += 1
        yield path
    return
```

This short-circuits before expanduser, urisplit, and iglob.
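The dispatch behaviour can be sketched in isolation with pure Python (the helper names _flat and DummyFS are illustrative stand-ins, not cfdm's actual code): when a filesystem is supplied, every item is yielded verbatim; otherwise the existing string-only processing runs.

```python
import os.path


def _flat(datasets):
    # Flatten a single path or a list of paths (stand-in for cfdm's helper)
    if isinstance(datasets, (list, tuple)):
        yield from datasets
    else:
        yield datasets


def _datasets(datasets, filesystem=None):
    if filesystem is not None:
        # Short-circuit: paths are meaningful only to the given filesystem
        yield from _flat(datasets)
        return
    for d in _flat(datasets):
        # Existing behaviour: string-only local processing
        yield os.path.expanduser(d)


class DummyFS:  # stands in for an fsspec AbstractFileSystem
    pass


paths = list(_datasets(["/data/run1.nc", "/data/run2.nc"], filesystem=DummyFS()))
print(paths)  # → ['/data/run1.nc', '/data/run2.nc'] — yielded verbatim
```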
netcdfread.py — open via filesystem when provided
In the existing open_netcdf / local-open block (currently if u.scheme == "s3": ...),
add a parallel branch:
```python
filesystem = kwargs.get("filesystem")
if filesystem is not None:
    file_object = filesystem.open(dataset, "rb")
    nc = h5netcdf.File(file_object, mode="r", ...)
else:
    # existing s3 / local / opendap dispatch
    ...
```

The dataset_type() class method that probes format also needs a guard: when
filesystem is provided, skip the string-based urisplit check and probe by attempting
to open with h5netcdf directly (or assume netCDF4/HDF5 and let the caller specify
dataset_type= explicitly if needed).
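One possible shape for that guard is to sniff the magic bytes of the already-open file-like object instead of parsing the path string. This is a hypothetical sketch, not cfdm's actual probe (and it ignores HDF5 signatures at non-zero superblock offsets), but it shows the idea of probing and restoring the stream position:

```python
import io

# Signature at offset 0: HDF5 (netCDF-4); classic netCDF-3 starts with CDF\x01/\x02
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"


def probe_dataset_type(f):
    """Hypothetical format probe for a seekable file-like object."""
    pos = f.tell()
    magic = f.read(8)
    f.seek(pos)  # leave the handle where we found it
    if magic.startswith(HDF5_MAGIC):
        return "netCDF-4"
    if magic[:3] == b"CDF":
        return "netCDF-3"
    return "unknown"


fobj = io.BytesIO(HDF5_MAGIC + b"\x00" * 100)  # fake HDF5 header
print(probe_dataset_type(fobj))  # → netCDF-4
print(fobj.tell())  # → 0, position restored for the real reader
```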
Total line count of change: estimated 20–40 lines across two files.
Why the h5netcdf / pyfive Backend Is the Right Target
The netCDF4 (C-library) backend does not natively accept file-like objects (it
has a memory= parameter for in-memory bytes buffers, but that requires a full copy
in memory before reading begins).
The h5netcdf backend (with either h5py or pyfive) accepts file-like objects natively
as shown above.
Since pyfive is a pure-Python HDF5 reader and the intended future preferred backend for
cf-python, and since pyfive's File(path, mode) already accepts anything that h5netcdf
passes, this change leverages the pure-Python stack cleanly with no C-library
constraints.
The proposal can therefore be described as:
File-like input support for the h5netcdf/pyfive backend path.
The netCDF4 backend would continue to require string paths (its existing behaviour is
unchanged).
Motivation: Connection Warm-Up for Remote Files
The immediate motivation comes from latency hiding in applications that browse remote
filesystems before opening a file.
An application (e.g. xconv2) uses fsspec to browse files on S3 or SSH while the user
navigates. When the user finally selects a file, the fsspec filesystem object is already
authenticated and connected. Without filesystem=:
- S3: must reconstruct s3fs.S3FileSystem from credentials — nearly instant but wasteful if credentials need re-validation or a new connection is opened.
- SSH: impossible without staging or a FUSE mount; cf.read("ssh://...") raises DatasetTypeError.
With filesystem=:
- The warm, authenticated AbstractFileSystem is passed directly.
- No redundant authentication round-trip.
- SSH, SFTP, and any other fsspec-supported protocol work identically.
The HTTP Case: OPeNDAP vs Plain Range-Get
What happens today with http:// URIs
cf.read("http://server/...") reaches open_netcdf() as a bare string (the
_datasets() generator yields all non-file/None scheme URIs unchanged).
Inside open_netcdf(), the s3 branch is not taken, so the string goes directly
to the backend loop:
- h5netcdf/h5py — h5netcdf.File("http://...", "r") calls h5py.File("http://...", "r") with no driver= argument. h5py does have a ros3 driver in its driver list (h5fd.ROS3D is present), but in the conda work26 build ros3 is not compiled in: attempting driver='ros3' raises ValueError: h5py was built without ROS3 support. Without ros3, h5py treats the URL as a local filesystem path and raises FileNotFoundError. The h5netcdf backend therefore fails for any http URL.
- netCDF4-C — netCDF4.Dataset("http://...", "r") uses the libnetCDF4-C OPeNDAP stack (libdap or libcurl-based DAP client). This succeeds only if the server speaks the DAP2/DAP4 protocol (OPeNDAP, THREDDS, Hyrax, etc.). A plain nginx-served HDF5 file will fail with an OPeNDAP parse error because the server returns raw HDF5 bytes, not a DAP response.
Summary: cf.read("http://...") today means OPeNDAP only. It does not support
plain HTTP servers that serve HDF5/netCDF4 files with byte-range requests (nginx,
Apache, any static file host).
Why filesystem= solves plain HTTP
fsspec's HTTPFileSystem implements the HDF5 access pattern correctly.
HTTPFile (a subclass of AbstractBufferedFile) issues Range: bytes=X-Y requests
and presents a seekable file-like interface to the caller. This can be verified from
the fsspec source: async_fetch_range() sets headers["Range"] = f"bytes={start}-{end - 1}"
and validates Content-Range in the response.
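The request pattern fsspec relies on can be illustrated with a pure-Python sketch (no network, helper name illustrative): a seek followed by a read maps to one Range header of exactly the form quoted above, with an inclusive end byte.

```python
def range_header(offset, length):
    """Build the HTTP Range header for reading `length` bytes at `offset`.

    Mirrors the f-string quoted above: the end byte is inclusive,
    hence the `- 1`.
    """
    start = offset
    end = offset + length  # exclusive, as in a Python slice
    return {"Range": f"bytes={start}-{end - 1}"}


# h5py asking for an 8-byte signature at offset 0 becomes:
print(range_header(0, 8))  # → {'Range': 'bytes=0-7'}
# a 4 KiB chunk read at offset 1 MiB becomes:
print(range_header(1024 * 1024, 4096))  # → {'Range': 'bytes=1048576-1052671'}
```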
h5py's registered_drivers() includes 'fileobj' — it accepts seekable file-like
objects via that driver when h5py.File(file_like_obj, "r") is called. h5netcdf
passes non-string paths straight to h5py.File unchanged (the else branch in
_open_h5py), and pyfive similarly receives whatever h5netcdf passes.
Therefore, with the proposed filesystem= parameter:
```python
import fsspec

fs = fsspec.filesystem("http")
# Plain nginx server, no OPeNDAP - works because fsspec issues Range requests
cf.read("http://server/path/to/data.nc", filesystem=fs)
```

cfdm calls fs.open("http://server/path/to/data.nc", "rb") → returns an HTTPFile
with range-gets → passed to h5netcdf.File → h5py fileobj driver → random-access
HDF5 reads over HTTP.

This path does not require ros3 in the h5py build, does not require a DAP server,
and works with any HTTP/HTTPS server that honours Range headers (nginx, Apache,
object storage HTTP endpoints, etc.).
OPeNDAP is unaffected
The existing OPeNDAP path (bare http:// string → netCDF4-C DAP client) continues
to work exactly as before for users who do not supply filesystem=. Supplying
filesystem=fsspec.filesystem("http") explicitly opts in to the range-get path instead.
Relationship to Existing storage_options
storage_options will continue to work as-is for programmatic S3 credential injection
without a pre-built filesystem. The new filesystem parameter is not a replacement;
it is an escape hatch for callers that already hold a live filesystem object.
If both are provided, filesystem takes precedence (the pre-built object is presumably
more up-to-date) and a warning can optionally be emitted.
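The precedence rule could live in one small helper near the top of read(). A hedged sketch (the helper name and warning text are illustrative, not cfdm API):

```python
import warnings


def resolve_filesystem(filesystem=None, storage_options=None):
    """Hypothetical resolution of the two keywords: filesystem wins."""
    if filesystem is not None:
        if storage_options is not None:
            warnings.warn(
                "Both 'filesystem' and 'storage_options' given; "
                "'storage_options' is ignored."
            )
        return filesystem
    # Fall through: existing behaviour, e.g. build s3fs from storage_options
    return storage_options


class WarmFS:  # stands in for a pre-authenticated AbstractFileSystem
    pass


fs = WarmFS()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    chosen = resolve_filesystem(filesystem=fs, storage_options={"key": "..."})

print(chosen is fs)      # → True: the pre-built object takes precedence
print(len(caught) == 1)  # → True: one warning emitted
```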
Out of Scope
- Zarr datasets via fsspec FSStore/zarr.storage — Zarr already has its own fsspec integration and is a separate dispatch path in cfdm.
- CDL string datasets — unchanged.
- Write (cf.write) — write paths are not considered here.
- Validation that the provided filesystem is seekable — left to h5netcdf/h5py/pyfive to raise naturally.
Summary of Benefits
| Scenario | Before | After |
|---|---|---|
| S3 with pre-authenticated s3fs | Must reconstruct fs from storage_options | Pass existing fs directly |
| SSH / SFTP | Impossible (DatasetTypeError) | Works via fsspec/asyncssh |
| SFTP via ProxyJump | Impossible | Works |
| HTTP OPeNDAP server | Works (netCDF4-C DAP client) | Unchanged — still works without filesystem= |
| HTTP plain file server (nginx, range-get) | Fails (not OPeNDAP, no ros3) | Works via filesystem=fsspec.filesystem("http") |
| Any future fsspec backend | Impossible | Works |
| Change size in cfdm | — | ~20–40 lines across 2 files |
| Changes to h5netcdf | — | None required |
| Changes to pyfive | — | None required |