
BigQuery seems to automatically convert STRING to BYTES if STRING > 186 bytes #1563

Closed
cgeudens opened this issue Apr 29, 2023 · 3 comments
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. priority: p3 Desirable enhancement or fix. May not be included in next release.

Comments

cgeudens commented Apr 29, 2023

Environment details

  • OS type and version: macOS Ventura 13.3.1
  • Python version: 3.9.12
  • pip version: pip 23.1.1
  • google-cloud-bigquery version: 3.8.0, 3.9.0, 3.10.0

Steps to reproduce

See the code sample below. Loading a string > 186 bytes into a STRING column raises the following exception: google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table. Field longstring has changed type from STRING to BYTES

Code example

import json
from google.cloud import bigquery
from google.cloud.bigquery.schema import SchemaField
from google.oauth2.service_account import Credentials
from pocs.poc_settings import settings

service_acc = json.loads(settings.GCP_BQ_SERVICE_ACC)
credentials = Credentials.from_service_account_info(info=service_acc)
bq_client = bigquery.Client(credentials=credentials)

def add_data(user_id):

    dataset_ref = bigquery.DatasetReference(project="myproject", dataset_id="mydataset")
    table_ref = bigquery.TableReference(dataset_ref=dataset_ref, table_id=user_id)

    json_rows = [
        {"longstring": "a" * 138}   # 187 bytes
    ]

    job_config = bigquery.LoadJobConfig(
        autodetect=False,
        create_disposition="CREATE_NEVER",
        max_bad_records=0,
        write_disposition="WRITE_APPEND"
    )

    job = bq_client.load_table_from_json(
        json_rows=json_rows,
        destination=table_ref,
        num_retries=2,
        job_config=job_config
    )

    job.result()

def create_table(user_id):

    dataset_ref = bigquery.DatasetReference(project="myproject", dataset_id="mydataset")
    table_ref = bigquery.TableReference(dataset_ref=dataset_ref, table_id=user_id)
    schema = [
        SchemaField(name="longstring", field_type="STRING")
    ]
    table = bigquery.Table(table_ref=table_ref, schema=schema)
    table = bq_client.create_table(table)

create_table("test")
add_data("test")

Stack trace

google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table. Field longstring has changed type from STRING to BYTES

UPDATE: After some more debugging, I'm starting to believe this is a bug in the load_table_from_json() method. Replacing it with a call to insert_rows_json() worked for me:

def add_data(user_id):

    dataset_ref = bigquery.DatasetReference(project="myproject", dataset_id="mydataset")
    table_ref = bigquery.TableReference(dataset_ref=dataset_ref, table_id=user_id)
    json_rows = [
        {"longstring": a * 138}   # 187 bytes
    ]

    # insert_rows_json() returns a list of per-row insert errors (empty if every row was accepted)
    pot_errors = bq_client.insert_rows_json(table=table_ref, json_rows=json_rows, skip_invalid_rows=False)
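
(For what it's worth, insert_rows_json() goes through the streaming insert API, which writes rows against the table's existing schema rather than auto-detecting one from the payload, which is presumably why the long string is accepted there.)
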
product-auto-label bot added the api: bigquery label on Apr 29, 2023
chalmerlowe added the priority: p3 label on Aug 17, 2023
Linchin (Contributor) commented Jan 17, 2024

I tried to use load_table_from_file() with the same JSON file, and it works with long strings, too.
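
For reference, a minimal sketch of that approach, assuming the rows are first written out as a newline-delimited JSON file (the file name and the project/dataset/table IDs are placeholders taken from the original report):

import json
from google.cloud import bigquery

bq_client = bigquery.Client()  # assumes default credentials

dataset_ref = bigquery.DatasetReference(project="myproject", dataset_id="mydataset")
table_ref = bigquery.TableReference(dataset_ref=dataset_ref, table_id="test")

# Write the rows as newline-delimited JSON, the format the load job expects.
rows = [{"longstring": "a" * 187}]
with open("rows.json", "w") as fh:
    for row in rows:
        fh.write(json.dumps(row) + "\n")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition="WRITE_APPEND",
)

with open("rows.json", "rb") as fh:
    job = bq_client.load_table_from_file(fh, destination=table_ref, job_config=job_config)

job.result()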

Linchin (Contributor) commented Jan 18, 2024

The problem goes away if I set job_config.autodetect = False. It might be related to the schema auto-detect rules.

Linchin (Contributor) commented Jan 18, 2024

Hi @cgeudens, thank you for bringing up this issue. I hope the following will answer your question:

When we set job_config.autodetect = True, the backend does its best to guess the schema of the data. For a string like "a" * 187, the backend cannot tell whether it is base64-encoded BYTES or a STRING, and the autodetect algorithm guesses BYTES. If you add a character that is not valid base64, such as !, even longer strings are interpreted as STRING. So I believe this is working as intended.

A quick workaround is to set autodetect to False. But I do wonder whether it makes sense to pull the existing schema from the backend and include it in the job config; I believe we already do this in some other methods.
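
As a rough sketch of that workaround using the public client API (this is an illustration, not a description of the library's current internal behavior; the project/dataset/table IDs are the placeholders from the original report):

from google.cloud import bigquery

bq_client = bigquery.Client()  # assumes default credentials

dataset_ref = bigquery.DatasetReference(project="myproject", dataset_id="mydataset")
table_ref = bigquery.TableReference(dataset_ref=dataset_ref, table_id="test")

# Fetch the destination table's existing schema and pass it to the load job,
# so the backend never has to guess the field types.
existing_schema = bq_client.get_table(table_ref).schema

job_config = bigquery.LoadJobConfig(
    schema=existing_schema,
    autodetect=False,
    write_disposition="WRITE_APPEND",
)

job = bq_client.load_table_from_json(
    json_rows=[{"longstring": "a" * 187}],
    destination=table_ref,
    job_config=job_config,
)
job.result()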

I will close this issue for now, but feel free to reopen, or create a new issue if you have any further questions!

(a similar internal issue: b/154546830)
