Description
Summary
The `BigQueryReadClient` returns multiple 'empty' streams when instantiating a read session via `create_read_session`. Attempting to invoke `to_dataframe()` on the reader for each of these streams yields an `AttributeError`.
Firstly, is this behaviour abnormal? If not, I will just wrap the call in a try/except and plow on.
Environment details
- OS type and version: Linux - WSL - Ubuntu 22.04.3 LTS
- Python version: 3.9.18
- pip version: 23.3.1
- package manager: [email protected]
- `google-cloud-bigquery-storage` version: 2.24.0
Steps to reproduce
```python
from google.cloud.bigquery_storage_v1 import BigQueryReadClient, types

client = BigQueryReadClient()

requested_session = types.ReadSession()
requested_session.table = "projects/<project>/datasets/<dataset>/tables/<table>"
requested_session.data_format = types.DataFormat.AVRO
requested_session.read_options.selected_fields = <some_fields>
requested_session.read_options.row_restriction = <some_row_restriction>

parent = "projects/<project_id>"
session = client.create_read_session(
    parent=parent,
    read_session=requested_session,
)

dfs = []
for stream in session.streams:
    reader = client.read_rows(stream.name)
    sub_df = reader.to_dataframe()  # < error raised here, for all but 1 of the streams: 'NoneType' object has no attribute '_parse_avro_schema'
    dfs.append(sub_df)
...
```
Stack trace
```
Exception has occurred: AttributeError
'NoneType' object has no attribute '_parse_avro_schema'
  File "reader.py", line 424, in to_dataframe
    self._stream_parser._parse_avro_schema()
  File "reader.py", line 299, in to_dataframe
    return self.rows(read_session=read_session).to_dataframe(dtypes=dtypes)
AttributeError: 'NoneType' object has no attribute '_parse_avro_schema'
```
The relevant line is `self._stream_parser._parse_avro_schema()` in `reader.py`: the reader's `_stream_parser` attribute is still `None` when `to_dataframe()` is called. So clearly, the object is not being populated as expected. After inspecting the data from the one stream that does yield data, it seems that the remaining streams are empty.
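For reference, this is how I convinced myself the other streams are empty without hitting the `AttributeError`. A rough sketch, assuming that iterating `.pages` on an empty stream simply yields nothing (it never needs the schema parser that is `None` in the trace above):

```python
# Sketch: count items per stream without calling to_dataframe().
# Assumes iterating .pages on an empty stream yields no pages
# rather than raising.
for stream in session.streams:
    reader = client.read_rows(stream.name)
    total = sum(page.num_items for page in reader.rows(session).pages)
    print(f"{stream.name}: {total} items")
```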
Detail
The emergence of this problem is specific to the table that I am accessing, and to the combination of the filtering and the type of the requested field. The minimal case where this occurs is when querying a single `BYTES`-type field. The approximate size of this field is 0.1 MB.
The issue persists when querying one row. I can query 1 row of just this `BYTES` field from the BigQuery table, and I will still get some 13 empty streams and 1 populated stream.
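As an aside, `create_read_session` accepts a `max_stream_count` argument, so for a one-row read like this it may be reasonable to cap the session at a single stream, which should sidestep the empty streams entirely. A sketch, using the same placeholder names as the repro above:

```python
# Sketch: cap the session at a single stream for a small read.
session = client.create_read_session(
    parent=parent,
    read_session=requested_session,
    max_stream_count=1,
)
```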
If I try/except over the streams, I am able to successfully grab the data from the one populated stream.
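Concretely, the workaround looks something like this (a sketch that just skips any stream whose reader raises the `AttributeError`):

```python
dfs = []
for stream in session.streams:
    reader = client.read_rows(stream.name)
    try:
        dfs.append(reader.to_dataframe())
    except AttributeError:
        # Empty stream: no schema ever arrived, so the internal
        # _stream_parser is still None. Skip it.
        continue
```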
Am I doing something wrong here, or is this normal?