fix: missing first page in to_dataframe #15575
Conversation
Summary of Changes: Hello @daniel-sanche, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical data loss bug in the ReadRowsIterable.to_dataframe() method.
Code Review
The pull request effectively addresses the data loss issue in to_dataframe() by implementing a 'peek' strategy on the first page of the stream. This prevents the self.pages iterator from being partially consumed and lost when the Arrow optimization fails. The introduction of the _empty_dataframe helper method improves maintainability and handles empty streams more robustly. I have suggested a few minor improvements to ensure consistency in schema handling and to prevent potential KeyError exceptions when custom dtypes are provided.
try:
    # Optimization: If it's an Arrow stream, calling to_arrow, then converting to a
    # pandas dataframe is about 2x faster. This is because pandas.concat is
    # rarely no-copy, whereas pyarrow.Table.from_batches + to_pandas is
    # usually no-copy.
    record_batches = [
        p.to_arrow() for p in itertools.chain([first_page], pages)
    ]
    table = pyarrow.Table.from_batches(record_batches)
    df = table.to_pandas()
    for column in dtypes:
        df[column] = pandas.Series(df[column], dtype=dtypes[column])
    return df
While the current implementation correctly handles the case where the first page fails to convert to Arrow, it is technically susceptible to data loss if p.to_arrow() succeeds for the first page but fails for a subsequent one (as the pages iterator would have been partially consumed before the except block is reached). Although BigQuery Storage streams are typically homogeneous in format, it is safer to verify the first page individually before committing to the full iteration.
Suggested change, replacing:

try:
    # Optimization: If it's an Arrow stream, calling to_arrow, then converting to a
    # pandas dataframe is about 2x faster. This is because pandas.concat is
    # rarely no-copy, whereas pyarrow.Table.from_batches + to_pandas is
    # usually no-copy.
    record_batches = [
        p.to_arrow() for p in itertools.chain([first_page], pages)
    ]
    table = pyarrow.Table.from_batches(record_batches)
    df = table.to_pandas()
    for column in dtypes:
        df[column] = pandas.Series(df[column], dtype=dtypes[column])
    return df

with:

try:
    # Optimization: If it's an Arrow stream, calling to_arrow, then converting to a
    # pandas dataframe is about 2x faster.
    first_batch = first_page.to_arrow()
    record_batches = [first_batch] + [
        p.to_arrow() for p in pages
    ]
    table = pyarrow.Table.from_batches(record_batches)
    df = table.to_pandas()
    for column, dtype in dtypes.items():
        if column in df.columns:
            df[column] = pandas.Series(df[column], dtype=dtype)
        else:
            df[column] = pandas.Series([], dtype=dtype)
    return df
Yes, this code assumes that if the first page is in Arrow format, the rest of the stream will be as well. @Linchin let me know if that's a wrong assumption
I think I'll change the code to raise an exception if we do get into this unexpected state though, to be explicit about what we expect
I think it's reasonable to expect the following pages to be arrow too. Thanks for adding the exception, just in case.
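For illustration, here is a minimal sketch of the kind of explicit check being discussed; the helper name and the assumption that to_arrow() raises NotImplementedError for a non-Arrow page are mine, not the exact code merged in this PR.

def _arrow_batches(first_batch, pages):
    # The first page has already been converted successfully, so yield it as-is.
    yield first_batch
    for page in pages:
        try:
            yield page.to_arrow()
        except NotImplementedError as exc:
            # The stream started as Arrow, so every later page should be Arrow
            # too; fail loudly instead of silently dropping the pages that have
            # already been consumed.
            raise ValueError(
                "Received a non-Arrow page in an Arrow stream; "
                "mixed-format streams are not supported."
            ) from exc

Turning the per-page conversion into a generator keeps the fast path lazy while still surfacing a clear error the moment a mixed-format stream appears.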
…torage_v1/reader.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
/gemini review
Code Review
This pull request effectively resolves the bug where the first page of data could be lost when converting a ReadRowsIterable to a pandas DataFrame. The implemented "peek" strategy for the first page, along with the new _empty_dataframe helper method, provides a robust solution for handling both Arrow and Avro streams, as well as empty streams. The addition of comprehensive unit tests, including scenarios for lost records, empty streams, and mid-stream format changes, significantly improves the code's reliability. The minor refactoring in _dtypes_from_avro also contributes to better maintainability. Overall, the changes are well-designed and thoroughly tested.
        df[column] = pandas.Series(df[column], dtype=dtype)
    else:
        df[column] = pandas.Series([], dtype=dtype)
return df
I asked Gemini for a code review, and it suggested simplifying this block to:

if isinstance(self._stream_parser, _ArrowStreamParser):
    self._stream_parser._parse_arrow_schema()
    # Initialize DataFrame directly from the Arrow schema
    df = self._stream_parser._schema.empty_table().to_pandas()
    # Apply dtypes efficiently
    for column, dtype in dtypes.items():
        df[column] = pandas.Series(df.get(column, []), dtype=dtype)
    return df
I ran nox -s unit-3.14 -r -- -k test_to_dataframe_empty_w_dtypes_arrow locally and it passed with this refactor
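For context, the pyarrow pattern in that suggestion can be exercised on its own. This is an illustrative sketch with a made-up schema, not code from the PR, and it assumes a pyarrow release that provides Schema.empty_table().

import pandas
import pyarrow

# Hypothetical schema standing in for the one parsed from the Arrow stream.
schema = pyarrow.schema([("name", pyarrow.string()), ("age", pyarrow.int64())])
dtypes = {"age": "int32"}

# An empty table with the right columns, converted to an empty DataFrame.
df = schema.empty_table().to_pandas()

# df.get(column, []) avoids a KeyError for dtype keys that are missing from
# the schema; they simply become empty Series of the requested dtype.
for column, dtype in dtypes.items():
    df[column] = pandas.Series(df.get(column, []), dtype=dtype)

print(df.dtypes)  # age is int32, name stays object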
if self._stream_parser is None:
    df = pandas.DataFrame(columns=dtypes.keys())
    for col, dtype in dtypes.items():
        df[col] = pandas.Series([], dtype=dtype)
    return df
I asked Gemini for a code review, and it suggested simplifying this block to:

if self._stream_parser is None:
    return pandas.DataFrame(columns=dtypes.keys()).astype(dtypes)
I ran unit tests with this suggestion and they passed
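A quick self-contained check of what that one-liner produces (the dtypes dict below is just an example): an empty DataFrame whose columns already carry the requested dtypes, which is exactly what the longer per-column loop built.

import pandas

dtypes = {"name": "object", "age": "int64"}

# One-liner: empty frame with the requested columns, cast in a single call.
short = pandas.DataFrame(columns=dtypes.keys()).astype(dtypes)

# The longer form it replaces: assign an empty, typed Series per column.
long = pandas.DataFrame(columns=dtypes.keys())
for col, dtype in dtypes.items():
    long[col] = pandas.Series([], dtype=dtype)

# Both frames are empty and carry identical dtypes.
assert short.empty and long.empty
assert (short.dtypes == long.dtypes).all()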
There was a bug in ReadRowsIterable.to_dataframe(), where it would attempt to use .to_arrow() to build a dataframe, before falling back to a generic method if .to_arrow() was unsupported.

This was a problem because both methods read from a single self.pages iterator, so the first attempt would pull a page of data out of the feed, which would then be lost if .to_arrow() was unsupported.

This PR fixes the bug by pulling a single item off of the stream and calling .to_arrow() on it individually, instead of reading off the self.pages generator directly.

Fixes #14900
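For readers skimming the diff, here is a minimal standalone sketch of the peek-then-chain fix described above. The function name is invented, and it assumes page objects that expose to_arrow() (raising NotImplementedError on non-Arrow streams) and a to_dataframe() fallback, so it is an approximation of the approach rather than the exact merged code.

import itertools

import pandas
import pyarrow


def pages_to_dataframe(pages_iterable, dtypes=None):
    # Sketch of the peek strategy: pull exactly one page off the shared
    # iterator instead of letting two code paths consume it independently.
    dtypes = dtypes or {}
    pages = iter(pages_iterable)

    first_page = next(pages, None)
    if first_page is None:
        # Empty stream: nothing was consumed, return an empty typed frame
        # (the PR adds an _empty_dataframe helper for this case).
        return pandas.DataFrame(columns=dtypes.keys()).astype(dtypes)

    try:
        # Probe the Arrow fast path on the peeked page only ...
        first_batch = first_page.to_arrow()
    except NotImplementedError:
        # ... and if it is not an Arrow stream, fall back to the generic path
        # with the peeked page stitched back in front, so no data is lost.
        frames = [
            p.to_dataframe(dtypes)
            for p in itertools.chain([first_page], pages)
        ]
        return pandas.concat(frames, ignore_index=True)

    # Arrow fast path: the first batch is already converted, convert the rest.
    record_batches = [first_batch] + [p.to_arrow() for p in pages]
    df = pyarrow.Table.from_batches(record_batches).to_pandas()
    for column, dtype in dtypes.items():
        df[column] = pandas.Series(df[column], dtype=dtype)
    return df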