Description
Version 2.4.0 of the library allocates much more memory than the previous version, 2.3.1, when running multiple queries. In particular, it seems that the QueryJob object retains the query results internally, and that memory is never deallocated. I think the problem is related to #374.
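A rough way to test whether those bytes hang off the job object is to sum the sizes of everything reachable from a finished QueryJob. The helper below is only a sketch (`reachable_bytes` is an illustrative name, not part of the library) and skips classes, modules, and functions so the walk stays close to instance data:

```python
import gc
import sys
import types

def reachable_bytes(root):
    """Very rough estimate of the memory retained via references from root."""
    seen = {id(root)}
    stack = [root]
    total = 0
    while stack:
        obj = stack.pop()
        total += sys.getsizeof(obj)
        for ref in gc.get_referents(obj):
            # Skip classes/modules/functions so the walk stays near the data.
            if isinstance(ref, (type, types.ModuleType, types.FunctionType)):
                continue
            if id(ref) not in seen:
                seen.add(id(ref))
                stack.append(ref)
    return total

# e.g. print(f"job retains ~{reachable_bytes(job) / 1e6:.1f} MB")
```

Comparing this figure for a job before and after calling `result().to_dataframe()` should show whether the cached rows are reachable through the job.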
Environment details
- OS: macOS 11.0.1 (also observed on Linux in a production environment)
- Python version: 3.8.6
- pip version: 20.1.1
- google-cloud-bigquery version: 2.4.0
Steps to reproduce
Run the script in the code example with google-cloud-bigquery versions 2.4.0 and 2.3.1.
You will also need to install:
```
google-cloud-bigquery-storage==2.1.0
pandas==1.1.4
psutil==5.7.3
```
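To make sure the exact pins are in place, the installed versions can be printed from Python itself (importlib.metadata ships with Python 3.8):

```python
# Print the installed versions of the packages used in the repro
# (importlib.metadata is in the standard library from Python 3.8).
from importlib import metadata

for pkg in ("google-cloud-bigquery", "google-cloud-bigquery-storage",
            "pandas", "psutil"):
    print(pkg, metadata.version(pkg))
```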
The outputs on my machine are:
With 2.4.0:
```
Initial memory used: 77 MB
Memory used: 642 MB
Memory used: 875 MB
Memory used: 1117 MB
Memory used: 1342 MB
Memory used: 1568 MB
Memory used: 1792 MB
Memory used: 2039 MB
Memory used: 2265 MB
Memory used: 2505 MB
Memory used: 2725 MB
```
With 2.3.1:
```
Initial memory used: 77 MB
Memory used: 97 MB
Memory used: 98 MB
Memory used: 99 MB
Memory used: 99 MB
Memory used: 99 MB
Memory used: 99 MB
Memory used: 100 MB
Memory used: 101 MB
Memory used: 101 MB
Memory used: 101 MB
```
Code example
Please note that we are storing a reference to the QueryJob objects, but not to the resulting DataFrames.
```python
import os

import psutil
from google.cloud import bigquery

if __name__ == '__main__':
    client = bigquery.Client()
    process = psutil.Process(os.getpid())
    print(f"Initial memory used: {process.memory_info().rss / 1e6:.0f} MB")

    jobs = []
    for i in range(10):
        job = client.query("SELECT x FROM UNNEST(GENERATE_ARRAY(1, 1000000)) AS x")
        # Materialize the result as a DataFrame, keeping a reference to the
        # job but discarding the DataFrame itself.
        job.result().to_dataframe()
        jobs.append(job)
        print(f"Memory used: {process.memory_info().rss / 1e6:.0f} MB")
```
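For comparison, here is a variant that keeps no reference to the QueryJob (a sketch of my own, not part of the original repro): if memory stays flat on 2.4.0 with this loop, the growth is tied to the retained job objects rather than to the DataFrame conversion.

```python
import os

import psutil
from google.cloud import bigquery

if __name__ == '__main__':
    client = bigquery.Client()
    process = psutil.Process(os.getpid())
    print(f"Initial memory used: {process.memory_info().rss / 1e6:.0f} MB")

    for i in range(10):
        job = client.query("SELECT x FROM UNNEST(GENERATE_ARRAY(1, 1000000)) AS x")
        job.result().to_dataframe()
        # Drop the only reference so the job (and anything it caches)
        # becomes eligible for collection.
        del job
        print(f"Memory used: {process.memory_info().rss / 1e6:.0f} MB")
```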