
BigQueryReadClient.create_read_session returning multiple empty streams #733

Closed
@frederickmannings

Description


Summary

The BigQueryReadClient returns multiple 'empty' streams when a read session is created via create_read_session. Invoking to_dataframe() on the reader for any of the empty streams yields an AttributeError.

Firstly, is this behaviour abnormal? If not, I will just wrap the call in a try/except and plow on.

Environment details

  • OS type and version: Linux - WSL - Ubuntu 22.04.3 LTS
  • Python version: 3.9.18
  • pip version: 23.3.1
  • package manager: [email protected]
  • google-cloud-bigquery-storage version: 2.24.0

Steps to reproduce

    from google.cloud.bigquery_storage_v1 import BigQueryReadClient, types

    client = BigQueryReadClient()

    requested_session = types.ReadSession()
    requested_session.table = "projects/<project>/datasets/<dataset>/tables/<table>"
    requested_session.data_format = types.DataFormat.AVRO
    requested_session.read_options.selected_fields = <some_fields>
    requested_session.read_options.row_restriction = <some_row_restriction>

    parent = "projects/<project_id>"
    session = client.create_read_session(
        parent=parent,
        read_session=requested_session,
    )

    dfs = []
    for stream in session.streams:
        reader = client.read_rows(stream.name)
        # Error raised here for all but 1 of the streams:
        # 'NoneType' object has no attribute '_parse_avro_schema'
        sub_df = reader.to_dataframe()
        dfs.append(sub_df)
        ...

Stack trace

Exception has occurred: AttributeError
'NoneType' object has no attribute '_parse_avro_schema'
  File "reader.py", line 424, in to_dataframe
    self._stream_parser._parse_avro_schema()
  File "reader.py", line 299, in to_dataframe
    return self.rows(read_session=read_session).to_dataframe(dtypes=dtypes)
AttributeError: 'NoneType' object has no attribute '_parse_avro_schema'

The relevant line:

self._stream_parser._parse_avro_schema()

So clearly, the object is not being populated as expected. After inspecting the data from the one stream that does yield data, it seems that the remaining streams are simply empty.
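A quick way to confirm this is to count the raw ReadRowsResponse messages per stream; ReadRowsStream is iterable, so a rough sketch:

    # Count the messages delivered by each stream. Empty streams never
    # deliver a message, so the reader's schema parser is never initialised.
    for stream in session.streams:
        reader = client.read_rows(stream.name)
        n_messages = sum(1 for _ in reader)
        print(stream.name, n_messages)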

Detail

The problem is specific to the table I am accessing, and to the combination of the row filter and the type of the requested field. The minimal case where it occurs is querying a single BYTES field; the field is roughly 0.1 MB per value.

The issue persists even when querying a single row. I can request 1 row of just this BYTES field from the BigQuery table and I will get some 13 empty streams and 1 populated stream.
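Presumably one way to sidestep the fan-out entirely is to cap the session at a single stream: create_read_session takes a max_stream_count argument (the server treats it as an upper bound and may still return fewer). I haven't verified whether this changes anything else, but a rough sketch:

    # Cap the session at one stream so there are no empty streams to skip.
    # This trades away read parallelism; placeholder names as in the repro above.
    session = client.create_read_session(
        parent=parent,
        read_session=requested_session,
        max_stream_count=1,
    )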

If I wrap each read in a try/except, I am able to successfully grab the data from the one populated stream.
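Concretely, the workaround looks something like this (it just skips any stream whose reader raises the AttributeError above):

    import pandas as pd

    dfs = []
    for stream in session.streams:
        reader = client.read_rows(stream.name)
        try:
            dfs.append(reader.to_dataframe())
        except AttributeError:
            # Empty stream: no ReadRowsResponse ever arrives, so the
            # reader's internal Avro schema parser is never created.
            continue

    df = pd.concat(dfs, ignore_index=True)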

Am I doing something wrong here, or is this normal?

Metadata

Labels

  • api: bigquerystorage (Issues related to the googleapis/python-bigquery-storage API.)
  • priority: p3 (Desirable enhancement or fix. May not be included in next release.)
  • type: feature request (‘Nice-to-have’ improvement, new feature or different behavior or design.)
