Skip to content

Remove dataset keyword from code base and use keyword asset #4324

@aicam

Description

@aicam

Task Summary

Title: Refactor dataset to generalized asset concept to support new resource types

Context

Currently in Texera, we use the keyword dataset to represent a group of files a user has uploaded into a LakeFS repository. Essentially, a dataset acts as a file system. This concept is heavily coupled throughout our stack: it functions as a resource in the dashboard, defines types and structures during workflow execution, and serves as a direct reference point within UDFs and Python code.

Motivation

We are planning to introduce a new resource type called model. Under the hood, a model is architecturally identical to a dataset: it is simply a repository containing files and folders in a tree structure (similar to an S3 bucket), backed by LakeFS and MinIO.

Because models and datasets share the exact same underlying file-system storage mechanism, the interpretation of the files stored in MinIO should be decoupled from the storage structure itself. Instead, the reading process (e.g., a UDF operator reading a file as a binary) should dictate how the content is interpreted.

Proposed Solution

To generalize our current LakeFS/MinIO storage architecture to support both datasets and models, we need to refactor the codebase to use a broader abstraction.

We propose introducing a new core keyword: asset.
The asset concept will act as the universal pointer to our storage layer, encompassing various specific resource types, including both dataset and model.

Tasks & Acceptance Criteria

To implement this abstraction, we need to replace occurrences of dataset with the generalized asset keyword across the stack.

  • Database: Rename all dataset occurrences in Postgres table names.
  • Common Utilities: Update all storage utilities in the common directory that currently refer to dataset.
  • File Service: Refactor all files within file-service to use the asset terminology.
  • UDF Definitions: Update UDF classes and type definitions that currently hardcode dataset as the sole reference to storage.

Priority

P2 – Medium

Task Type

  • Code Implementation
  • Documentation
  • Refactor / Cleanup
  • Testing / QA
  • DevOps / Deployment

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions