
[SPARK-36600][SQL] Avoid unnecessary list conversion in createDataFrame #54882

Open

adityaksolves wants to merge 1 commit into apache:master from adityaksolves:fix-pyspark-memory-spark-36600

Conversation

@adityaksolves

What changes were proposed in this pull request?

This PR refactors _createFromLocal in python/pyspark/sql/session.py to support lazy evaluation of generators when a StructType schema is provided. Previously, the code forced a list(data) conversion, which caused OutOfMemoryError for large datasets.
Key changes:

  • Used itertools.tee to "peek" at the first element of the input data so the VariantVal type checks still run early, without consuming the caller's stream.
  • Replaced the list comprehension with map() for the internal_data conversion, so the pipeline remains a generator all the way to self._sc.parallelize.

Why are the changes needed?
Currently, createDataFrame consumes the entire input collection into a local Python list before parallelizing it. When users pass a generator containing millions of rows (even with a predefined schema), the driver node exhausts its memory. This change allows Spark to stream the generator data directly into the RDD creation process.
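As a minimal illustration of why the laziness matters: map() over a generator only materializes the rows actually pulled through the pipeline, whereas an up-front list(data) would buffer every row on the driver. The names below are illustrative, not Spark code:

```python
import itertools

def rows():
    # Stand-in for a huge (here: unbounded) row generator; on the
    # eager code path, list(rows()) would exhaust driver memory.
    for i in itertools.count():
        yield (i,)

# map() keeps the pipeline a generator: only the rows actually
# consumed downstream ever exist in memory at once.
pipeline = map(lambda r: (r[0] * 2,), rows())
first_three = list(itertools.islice(pipeline, 3))
```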

Does this PR introduce any user-facing change?
No.

How was this patch tested?
The patch was tested using the existing PySpark unit test suite.

  • Verified that python/pyspark/sql/tests/test_types.py passes, specifically test_variant_type, which requires early detection of unsupported types.
  • Verified that the logic passes mypy type checking and ruff linting (no unused imports and correct camelCase parameter naming).
  • Manually verified that a large generator no longer triggers a local list conversion in session.py.

Was this patch authored or co-authored using generative AI tooling?
No.

@adityaksolves
Author

cc @HyukjinKwon - I have fixed the memory issue and all tests are passing (Build #4). This is ready for review and should be linked to [SPARK-36600].

Can the Spark QA bot please trigger the JIRA link?
