
[SPARK-36600][SQL] Avoid unnecessary list conversion in createDataFrame #54882

Open

adityaksolves wants to merge 1 commit into apache:master from adityaksolves:fix-pyspark-memory-spark-36600

Conversation

@adityaksolves

What changes were proposed in this pull request?

This PR refactors _createFromLocal in python/pyspark/sql/session.py to support lazy evaluation of generators when a StructType schema is provided. Previously, the code forced a list(data) conversion, which caused OutOfMemoryError for large datasets.
Key changes:

  • Used itertools.tee to "peek" at the first element of the input data so the VariantVal type checks still run early, without consuming the caller's stream.
  • Replaced the list comprehension with map() for the internal_data conversion, so the pipeline remains a generator all the way to self._sc.parallelize.

Why are the changes needed?
Currently, createDataFrame consumes the entire input collection into a local Python list before parallelizing it. When users pass a generator containing millions of rows (even with a predefined schema), the driver node exhausts its memory. This change allows Spark to stream the generator data directly into the RDD creation process.
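As a minimal illustration of why the laziness matters: map() over a generator only materializes the rows actually pulled through the pipeline, whereas an up-front list(data) would buffer every row on the driver. The names below are illustrative, not Spark code:

```python
import itertools

def rows():
    # Stand-in for a huge (here: unbounded) row generator; on the
    # eager code path, list(rows()) would exhaust driver memory.
    for i in itertools.count():
        yield (i,)

# map() keeps the pipeline a generator: only the rows actually
# consumed downstream ever exist in memory at once.
pipeline = map(lambda r: (r[0] * 2,), rows())
first_three = list(itertools.islice(pipeline, 3))
```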

Does this PR introduce any user-facing change?
No.

How was this patch tested?
The patch was tested using the existing PySpark unit test suite.

  • Verified that python/pyspark/sql/tests/test_types.py passes, specifically test_variant_type, which requires early detection of unsupported types.
  • Verified that the logic passes mypy type checking and ruff linting (no unused imports and correct camelCase parameter naming).
  • Manually verified that a large generator no longer triggers a local list conversion in session.py.

Was this patch authored or co-authored using generative AI tooling?
No.

@adityaksolves
Author

cc @HyukjinKwon - I have fixed the memory issue and all tests are passing (Build #4). This is ready for review and should be linked to [SPARK-36600].

Can the Spark QA bot please trigger the JIRA link?
