empty content field in text anchor object when using DocumentAI's form processor #4740

Open
pauleeeeee opened this issue Oct 16, 2023 · 0 comments
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@pauleeeeee

I searched everywhere I could and found only one Reddit thread describing this same behavior (https://www.reddit.com/r/googlecloud/comments/16xwhzt/entities_from_gc_document_ai_always_have/)

Environment details

  • which product (packages/*): Google Cloud Document AI Node.js + TypeScript
  • OS: Windows 10
  • Node.js version: 16
  • npm version: latest
  • google-cloud-node version: latest ("@google-cloud/documentai": "^8.0.1")

Problem description

Expected behavior: when using Document AI's Form Processor to extract table data, the content field in the various nodes of the document response object should return a value (i.e., response[i].document?.pages[i].tables[i].headerRows[i].cells[i].layout?.textAnchor?.content should return the table cell's content).

Actual behavior: the content field on every textAnchor object is an empty string ("").

Steps to reproduce

import * as fs from 'fs';
import { DocumentProcessorServiceClient } from '@google-cloud/documentai';
import { google } from '@google-cloud/documentai/build/protos/protos';


// Set up the Google Cloud Document AI client (a single instance is enough)
const docProcessor = new DocumentProcessorServiceClient();
const projectId = 'your-project';
const location = 'region';
const processorId = 'processor ID';

// Read the test file from local storage
const testPDF = fs.readFileSync("src/test.pdf");

// Define the request object for processing the document
const request:google.cloud.documentai.v1.IProcessRequest = {
    name: `projects/${projectId}/locations/${location}/processors/${processorId}`,
    rawDocument: {
        content: testPDF.toString('base64'),
        mimeType: 'application/pdf',
      }
};

// Log the textAnchor content of every header and body cell in every table.
// processDocument() resolves to a tuple whose first element is the response.
function parseTables(response:[google.cloud.documentai.v1.IProcessResponse, ...unknown[]]):void{
    response[0].document?.pages?.forEach((page) => {
        page.tables?.forEach((table, tableIndex) => {
            table.headerRows?.forEach((row) => {
                row.cells?.forEach((cell) => {
                    console.log(cell.layout?.textAnchor?.content);
                });
            });
            table.bodyRows?.forEach((row) => {
                row.cells?.forEach((cell) => {
                    console.log(cell.layout?.textAnchor?.content);
                });
            });
            // Index of the table within the current page
            console.log(tableIndex);
        });
    });
}

// Send the request, log each table cell's content, then dump the raw response
function processDocument(){
    docProcessor.processDocument(request).then(
        data => {
            parseTables(data);
            console.log(data);
        }).catch(function(error:any) {
            console.error(error);
        });
}
processDocument();

Additional information

It is clearly stated in the documentation that this content field should be filled for user convenience: https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#TextAnchor
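
As a possible workaround while content comes back empty: the same TextAnchor reference describes textSegments, whose startIndex and endIndex are offsets into the document-level text field, so the cell text can likely be rebuilt from those. Below is a minimal sketch of that idea, not the library's own API. The helper name resolveAnchorText and the two local interfaces are hypothetical and only mirror the fields used here. It assumes the indices arrive as numbers or numeric strings (the client may also return Long objects, which this sketch does not handle) and that a missing startIndex means 0.

```typescript
// Simplified local shapes (assumed) mirroring the TextAnchor fields used below.
interface TextSegment {
  startIndex?: number | string | null;
  endIndex?: number | string | null;
}

interface TextAnchor {
  content?: string | null;
  textSegments?: TextSegment[] | null;
}

// Hypothetical helper: prefer textAnchor.content when it is populated,
// otherwise slice the full document text using each segment's offsets.
function resolveAnchorText(documentText: string, anchor?: TextAnchor | null): string {
  if (!anchor) {
    return '';
  }
  if (anchor.content) {
    return anchor.content;
  }
  const segments = anchor.textSegments ?? [];
  return segments
    .map((segment) => {
      // A missing startIndex is treated as 0 (assumption, see lead-in).
      const start = Number(segment.startIndex ?? 0);
      const end = Number(segment.endIndex ?? 0);
      return documentText.substring(start, end);
    })
    .join('');
}

// Example with made-up data: content is empty, so the segment offsets are used.
const sampleText = 'Name Age\n';
const sampleAnchor: TextAnchor = { content: '', textSegments: [{ endIndex: '4' }] };
console.log(resolveAnchorText(sampleText, sampleAnchor)); // prints "Name"
```

In the reproduction above, this would be called as resolveAnchorText(response[0].document?.text ?? '', cell.layout?.textAnchor) in place of reading content directly.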

@sofisl sofisl added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Oct 19, 2023
@sofisl sofisl self-assigned this Oct 20, 2023