
BigQuery seems to automatically convert STRING to BYTES if STRING > 186 bytes #1563

Closed
cgeudens opened this issue Apr 29, 2023 · 3 comments
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. priority: p3 Desirable enhancement or fix. May not be included in next release.

Comments

cgeudens commented Apr 29, 2023

Environment details

  • OS type and version: macOS Ventura 13.3.1
  • Python version: 3.9.12
  • pip version: pip 23.1.1
  • google-cloud-bigquery version: 3.8.0, 3.9.0, 3.10.0

Steps to reproduce

See the code sample below. Loading a string > 186 bytes into a STRING column raises the following exception: google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table. Field longstring has changed type from STRING to BYTES

Code example

import json
from google.cloud import bigquery
from google.cloud.bigquery.schema import SchemaField
from google.oauth2.service_account import Credentials
from pocs.poc_settings import settings

service_acc = json.loads(settings.GCP_BQ_SERVICE_ACC)
credentials = Credentials.from_service_account_info(info=service_acc)
bq_client = bigquery.Client(credentials=credentials)

def add_data(user_id):

    dataset_ref = bigquery.DatasetReference(project="myproject", dataset_id="mydataset")
    table_ref = bigquery.TableReference(dataset_ref=dataset_ref, table_id=user_id)

    json_rows = [
        {"longstring": "a" * 138}   # 187 bytes
    ]

    job_config = bigquery.LoadJobConfig(
        autodetect=False,
        create_disposition="CREATE_NEVER",
        max_bad_records=0,
        write_disposition="WRITE_APPEND"
    )

    job = bq_client.load_table_from_json(
        json_rows=json_rows,
        destination=table_ref,
        num_retries=2,
        job_config=job_config
    )

    job.result()

def create_table(user_id):

    dataset_ref = bigquery.DatasetReference(project="myproject", dataset_id="mydataset")
    table_ref = bigquery.TableReference(dataset_ref=dataset_ref, table_id=user_id)
    schema = [
        SchemaField(name="longstring", field_type="STRING")
    ]
    table = bigquery.Table(table_ref=table_ref, schema=schema)
    table = bq_client.create_table(table)

create_table("test")
add_data("test")

Stack trace

google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table. Field longstring has changed type from STRING to BYTES

UPDATE: After some more debugging, I'm starting to believe this is a bug in the load_table_from_json() method. Replacing it with a call to insert_rows_json() worked for me:

def add_data(user_id):

    dataset_ref = bigquery.DatasetReference(project="myproject", dataset_id="mydataset")
    table_ref = bigquery.TableReference(dataset_ref=dataset_ref, table_id=user_id)
    json_rows = [
        {"longstring": a * 138}   # 187 bytes
    ]

    # insert_rows_json() returns a list of per-row insert errors (empty if every row was accepted)
    pot_errors = bq_client.insert_rows_json(table=table_ref, json_rows=json_rows, skip_invalid_rows=False)
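
(For what it's worth, insert_rows_json() goes through the streaming insert API, which writes rows against the table's existing schema rather than auto-detecting one from the payload, which is presumably why the long string is accepted there.)
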
product-auto-label bot added the api: bigquery label on Apr 29, 2023
chalmerlowe added the priority: p3 label on Aug 17, 2023
Linchin (Contributor) commented Jan 17, 2024

I tried to use load_table_from_file() with the same JSON file, and it works with long strings, too.
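
For reference, a minimal sketch of that approach, assuming the rows are first written out as a newline-delimited JSON file (the file name and the project/dataset/table IDs are placeholders taken from the original report):

import json
from google.cloud import bigquery

bq_client = bigquery.Client()  # assumes default credentials

dataset_ref = bigquery.DatasetReference(project="myproject", dataset_id="mydataset")
table_ref = bigquery.TableReference(dataset_ref=dataset_ref, table_id="test")

# Write the rows as newline-delimited JSON, the format the load job expects.
rows = [{"longstring": "a" * 187}]
with open("rows.json", "w") as fh:
    for row in rows:
        fh.write(json.dumps(row) + "\n")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition="WRITE_APPEND",
)

with open("rows.json", "rb") as fh:
    job = bq_client.load_table_from_file(fh, destination=table_ref, job_config=job_config)

job.result()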

Linchin (Contributor) commented Jan 18, 2024

The problem goes away if I set job_config.autodetect = False. It might be related to the schema auto-detect rules.

Linchin (Contributor) commented Jan 18, 2024

Hi @cgeudens, thank you for bringing up this issue. I hope the following will answer your question:

When we set job_config.autodetect = True, the backend does its best to guess the schema of the data. For a string like "a" * 187, the backend cannot tell whether it is base64-encoded BYTES or a STRING, and the autodetect algorithm guesses BYTES. If you add a character that is not valid base64, such as !, even longer strings are interpreted as STRING. So I believe this is working as intended.

A quick workaround is to set autodetect to False. But I do wonder whether it makes sense to pull the existing schema from the backend and include it in the job config; I believe we already do this in some other methods.
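
As a rough sketch of that workaround using the public client API (this is an illustration, not a description of the library's current internal behavior; the project/dataset/table IDs are the placeholders from the original report):

from google.cloud import bigquery

bq_client = bigquery.Client()  # assumes default credentials

dataset_ref = bigquery.DatasetReference(project="myproject", dataset_id="mydataset")
table_ref = bigquery.TableReference(dataset_ref=dataset_ref, table_id="test")

# Fetch the destination table's existing schema and pass it to the load job,
# so the backend never has to guess the field types.
existing_schema = bq_client.get_table(table_ref).schema

job_config = bigquery.LoadJobConfig(
    schema=existing_schema,
    autodetect=False,
    write_disposition="WRITE_APPEND",
)

job = bq_client.load_table_from_json(
    json_rows=[{"longstring": "a" * 187}],
    destination=table_ref,
    job_config=job_config,
)
job.result()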

I will close this issue for now, but feel free to reopen, or create a new issue if you have any further questions!

(a similar internal issue: b/154546830)
