Remove dataset keyword from code base and use keyword asset #4324
Description
Task Summary
Title: Refactor dataset to generalized asset concept to support new resource types
Context
Currently in Texera, we use the keyword dataset to represent a group of files a user has uploaded into a LakeFS repository. Essentially, a dataset acts as a file system. This concept is heavily coupled throughout our stack: it functions as a resource in the dashboard, defines types and structures during workflow execution, and serves as a direct reference point within UDFs and Python code.
Motivation
We are planning to introduce a new resource type called model. Under the hood, a model is architecturally identical to a dataset: it is simply a repository containing files and folders in a tree structure (similar to an S3 bucket), backed by LakeFS and MinIO.
Because models and datasets share the exact same underlying file-system storage mechanism, the interpretation of the files stored in MinIO should be decoupled from the storage structure itself. Instead, the reader (e.g., a UDF operator reading a file as raw binary) should dictate how the content is interpreted.
Proposed Solution
To generalize our current LakeFS/MinIO storage architecture to support both datasets and models, we need to refactor the codebase to use a broader abstraction.
We propose introducing a new core keyword: asset.
The asset concept will act as the universal pointer to our storage layer, encompassing various specific resource types, including both dataset and model.
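A minimal sketch of what this abstraction could look like, written in Python because UDFs are one of the places that reference storage. All names here (`AssetType`, `Asset`, `read_text`, `read_json`, the `repository` field) are hypothetical and do not exist in the codebase, and an in-memory dict stands in for the LakeFS/MinIO backend. The point it illustrates is that the asset only returns raw bytes, while the reader decides how to interpret them.

```python
import json
from dataclasses import dataclass
from enum import Enum


class AssetType(Enum):
    """Resource types that share the same repository-backed storage."""
    DATASET = "dataset"
    MODEL = "model"


@dataclass
class Asset:
    """Hypothetical pointer to a file tree in the storage layer."""
    asset_type: AssetType
    repository: str  # hypothetical repository name
    files: dict      # in-memory stand-in for the backend: path -> raw bytes

    def read_bytes(self, path: str) -> bytes:
        # The storage layer returns raw bytes and never interprets them.
        return self.files[path]


def read_text(asset: Asset, path: str, encoding: str = "utf-8") -> str:
    # Interpretation is chosen by the reader, not by the asset type.
    return asset.read_bytes(path).decode(encoding)


def read_json(asset: Asset, path: str):
    # A second reader over the same bytes, again independent of asset type.
    return json.loads(read_text(asset, path))
```

Under this sketch, a dataset and a model differ only in their `asset_type` tag; the same readers work on both.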
Tasks & Acceptance Criteria
To implement this abstraction, we need to replace occurrences of dataset with the generalized asset keyword across the stack.
- Database: Rename all `dataset` occurrences in Postgres table names.
- Common Utilities: Update all storage utilities in the `common` directory that currently refer to `dataset`.
- File Service: Refactor all files within `file-service` to use the `asset` terminology.
- UDF Definitions: Update UDF classes and type definitions that currently hardcode `dataset` as the sole reference to storage.
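To track progress on the tasks above, one option is a small script that lists remaining occurrences of the keyword in a directory tree. The following is only a sketch: the function name `find_keyword` and the default file suffixes are assumptions, not part of the project. Since dataset remains a valid asset type, a hit is a candidate for review rather than necessarily something to rename.

```python
import re
from pathlib import Path


def find_keyword(root, keyword="dataset", suffixes=(".py", ".scala", ".sql", ".ts")):
    """Return (path, line_number, line) for each line containing the keyword."""
    # Case-insensitive so identifiers such as DatasetReader are also reported.
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and file types outside the assumed suffix list.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for line_no, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), line_no, line.strip()))
    return hits
```

For example, `find_keyword("file-service")` would scan that directory from the repository root, assuming the script is run there.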
Priority
P2 – Medium
Task Type
- Code Implementation
- Documentation
- Refactor / Cleanup
- Testing / QA
- DevOps / Deployment