
fix: missing first page in to_dataframe #15575

Open

daniel-sanche wants to merge 12 commits into main from fix_14900_2

Conversation

@daniel-sanche
Contributor

There was a bug in ReadRowsIterable.to_dataframe(): it would attempt to use .to_arrow() to build a dataframe before falling back to a generic method if .to_arrow() was unsupported.

This was a problem because both methods read from the same self.pages iterator, so the first attempt would pull a page of data out of the feed, and that page was then lost if to_arrow() turned out to be unsupported.

This PR fixes the bug by pulling a single page off the stream and calling to_arrow() on it individually, instead of reading from the self.pages generator directly.
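
In essence, the fix peeks at the first page and then re-attaches it to the iterator so nothing is lost. A minimal sketch of that core idea (illustrative only; the helper name _peek_first_page is a stand-in, not part of the actual reader.py code):

import itertools

def _peek_first_page(pages):
    # Pull one page off the generator, then return both that page and an
    # iterator that still yields every page, including the one we pulled.
    pages = iter(pages)
    first_page = next(pages, None)
    if first_page is None:
        return None, iter(())
    return first_page, itertools.chain([first_page], pages)

to_dataframe() can then call to_arrow() on the peeked page to decide between the Arrow and Avro paths, while iterating the chained iterator so no page is dropped.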

Fixes #14900

@daniel-sanche daniel-sanche requested review from a team as code owners February 18, 2026 02:32
@gemini-code-assist
Contributor

Summary of Changes

Hello @daniel-sanche, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical data loss bug in the ReadRowsIterable.to_dataframe() method within the BigQuery Storage client library. Previously, an optimization attempt to use Apache Arrow could inadvertently consume the first page of data, causing it to be lost if the Arrow conversion failed. The solution implements a robust 'peek' strategy to safely determine the stream type and handles empty data streams more consistently, ensuring all data is processed correctly. This change significantly improves the reliability of data retrieval when converting BigQuery storage streams to pandas DataFrames.

Highlights

  • Data Loss Fix in to_dataframe(): Addressed a critical bug in ReadRowsIterable.to_dataframe() where an attempt to use to_arrow() would prematurely consume the first page of data, leading to data loss if the Arrow conversion was not supported.
  • Implemented 'Peek' Strategy: Introduced a 'peek' mechanism using itertools.chain to safely inspect the first page of data without consuming it from the self.pages generator, ensuring all data remains available for subsequent processing.
  • Centralized Empty DataFrame Handling: Refactored the logic for creating empty DataFrames into a new private helper method, _empty_dataframe, to consistently handle cases where the stream is empty, regardless of whether an Arrow or Avro parser is used (see the sketch after this list).
  • Improved Avro Type Parsing: Adjusted the _dtypes_from_avro method to more robustly handle type information extraction from Avro schemas, particularly for union types.
  • Comprehensive Unit Tests Added: New unit tests were added to specifically verify the fix for data loss in Avro streams, ensure correct behavior with empty data streams (both with and without a session), and confirm the expected consumption behavior of to_arrow().
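
For illustration, a helper along those lines might look roughly like this (a sketch based on the diff excerpts quoted later in the review; the actual _empty_dataframe in the PR is more involved and also consults the stream parser when one exists):

import pandas

def _empty_dataframe(dtypes):
    # Zero-row frame with one column per requested dtype.
    df = pandas.DataFrame(columns=list(dtypes.keys()))
    for column, dtype in dtypes.items():
        df[column] = pandas.Series([], dtype=dtype)
    return df
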
Changelog
  • packages/google-cloud-bigquery-storage/google/cloud/bigquery_storage_v1/reader.py
    • Imported itertools for enhanced stream manipulation.
    • Modified to_arrow to return an empty pyarrow.Table with an empty schema when no stream parser is initialized (see the snippet after this changelog).
    • Refactored to_dataframe to use a 'peek' strategy with itertools.chain to prevent premature data consumption.
    • Extracted the logic for creating empty DataFrames into a new helper method, _empty_dataframe.
    • Updated _dtypes_from_avro to ensure type_info is correctly assigned before type checking.
  • packages/google-cloud-bigquery-storage/tests/unit/test_reader_v1.py
    • Added test_to_dataframe_avro_no_lost_records to confirm to_dataframe retains all records for Avro streams.
    • Included test_to_dataframe_empty_stream_no_session and test_to_dataframe_empty_stream_with_session to validate empty stream handling.
    • Added test_to_arrow_avro_consumes_first_page to verify to_arrow behavior when the format is unknown.
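
For reference, one way to construct such an empty table (not necessarily the exact call used in the PR):

import pyarrow

# An empty schema yields a zero-row, zero-column table.
empty = pyarrow.schema([]).empty_table()
print(empty.num_rows, empty.num_columns)  # 0 0
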
Activity
  • No specific activity (comments, reviews, or progress updates) has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize Gemini Code Assist for the GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  [1] Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

The pull request effectively addresses the data loss issue in to_dataframe() by implementing a 'peek' strategy on the first page of the stream. This prevents the self.pages iterator from being partially consumed and lost when the Arrow optimization fails. The introduction of the _empty_dataframe helper method improves maintainability and handles empty streams more robustly. I have suggested a few minor improvements to ensure consistency in schema handling and to prevent potential KeyError exceptions when custom dtypes are provided.

Comment on lines 411 to 424
try:
    # Optimization: If it's an Arrow stream, calling to_arrow, then converting to a
    # pandas dataframe is about 2x faster. This is because pandas.concat is
    # rarely no-copy, whereas pyarrow.Table.from_batches + to_pandas is
    # usually no-copy.
    record_batches = [
        p.to_arrow() for p in itertools.chain([first_page], pages)
    ]

    table = pyarrow.Table.from_batches(record_batches)
    df = table.to_pandas()
    for column in dtypes:
        df[column] = pandas.Series(df[column], dtype=dtypes[column])
    return df

gemini-code-assist (bot) commented:

medium

While the current implementation correctly handles the case where the first page fails to convert to Arrow, it is technically susceptible to data loss if p.to_arrow() succeeds for the first page but fails for a subsequent one (as the pages iterator would have been partially consumed before the except block is reached). Although BigQuery Storage streams are typically homogeneous in format, it is safer to verify the first page individually before committing to the full iteration.

Suggested change
Current:

try:
    # Optimization: If it's an Arrow stream, calling to_arrow, then converting to a
    # pandas dataframe is about 2x faster. This is because pandas.concat is
    # rarely no-copy, whereas pyarrow.Table.from_batches + to_pandas is
    # usually no-copy.
    record_batches = [
        p.to_arrow() for p in itertools.chain([first_page], pages)
    ]
    table = pyarrow.Table.from_batches(record_batches)
    df = table.to_pandas()
    for column in dtypes:
        df[column] = pandas.Series(df[column], dtype=dtypes[column])
    return df

Suggested:

try:
    # Optimization: If it's an Arrow stream, calling to_arrow, then converting to a
    # pandas dataframe is about 2x faster.
    first_batch = first_page.to_arrow()
    record_batches = [first_batch] + [
        p.to_arrow() for p in pages
    ]
    table = pyarrow.Table.from_batches(record_batches)
    df = table.to_pandas()
    for column, dtype in dtypes.items():
        if column in df.columns:
            df[column] = pandas.Series(df[column], dtype=dtype)
        else:
            df[column] = pandas.Series([], dtype=dtype)
    return df

@daniel-sanche (Contributor, Author)

Yes, this code assumes that if the first page is in Arrow format, the rest of the stream will be as well. @Linchin, let me know if that's a wrong assumption.

I think I'll change the code to raise an exception if we do get into this unexpected state, though, to be explicit about what we expect.
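
A rough sketch of such a guard (assuming, as the PR description implies, that to_arrow() raises when a page is not in Arrow format; the exception types and message here are placeholders, not the PR's exact code):

import pyarrow

def _collect_arrow_table(first_page, pages):
    # first_page has already been verified to support to_arrow(); any later
    # page that cannot produce an Arrow batch is treated as a hard error
    # instead of being silently skipped.
    record_batches = [first_page.to_arrow()]
    for page in pages:
        try:
            record_batches.append(page.to_arrow())
        except NotImplementedError as exc:  # assumed failure mode for non-Arrow pages
            raise ValueError(
                "stream switched away from Arrow mid-read; refusing to drop data"
            ) from exc
    return pyarrow.Table.from_batches(record_batches)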

Contributor

I think it's reasonable to expect the following pages to be Arrow too. Thanks for adding the exception, just in case.

daniel-sanche and others added 4 commits February 17, 2026 18:42
…torage_v1/reader.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@daniel-sanche changed the title from "fix: first page data loss in to_dataframe" to "fix: missing first page in to_dataframe" on Feb 18, 2026
@parthea
Contributor

parthea commented Feb 18, 2026

/gemini review

@gemini-code-assist (bot) left a comment


Code Review

This pull request effectively resolves the bug where the first page of data could be lost when converting a ReadRowsIterable to a pandas DataFrame. The implemented "peek" strategy for the first page, along with the new _empty_dataframe helper method, provides a robust solution for handling both Arrow and Avro streams, as well as empty streams. The addition of comprehensive unit tests, including scenarios for lost records, empty streams, and mid-stream format changes, significantly improves the code's reliability. The minor refactoring in _dtypes_from_avro also contributes to better maintainability. Overall, the changes are well-designed and thoroughly tested.

            df[column] = pandas.Series(df[column], dtype=dtype)
        else:
            df[column] = pandas.Series([], dtype=dtype)
    return df

Contributor

I asked Gemini for a code review, and it suggested simplifying this block to:

        if isinstance(self._stream_parser, _ArrowStreamParser):
            self._stream_parser._parse_arrow_schema()
            
            # Initialize DataFrame directly from the Arrow schema
            df = self._stream_parser._schema.empty_table().to_pandas()

            # Apply dtypes efficiently
            for column, dtype in dtypes.items():
                df[column] = pandas.Series(df.get(column, []), dtype=dtype)
                
            return df

I ran nox -s unit-3.14 -r -- -k test_to_dataframe_empty_w_dtypes_arrow locally and it passed with this refactor.

Comment on lines +442 to +446
if self._stream_parser is None:
    df = pandas.DataFrame(columns=dtypes.keys())
    for col, dtype in dtypes.items():
        df[col] = pandas.Series([], dtype=dtype)
    return df

@parthea (Contributor) commented on Feb 18, 2026

I asked Gemini for a code review, and it suggested simplifying this block to:

if self._stream_parser is None:
    return pandas.DataFrame(columns=dtypes.keys()).astype(dtypes)

I ran unit tests with this suggestion and they passed.
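
For reference, a quick check of the suggested one-liner (illustrative column names and dtypes, not taken from the PR):

import pandas

dtypes = {"name": "object", "score": "float64"}
df = pandas.DataFrame(columns=dtypes.keys()).astype(dtypes)
print(len(df), dict(df.dtypes))  # 0 rows; columns carry the requested dtypes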



Development

Successfully merging this pull request may close these issues.

bigquery-storage read session to_dataframe() loses 1152 records per stream

3 participants
