[go: nahoru, domu]

Skip to content

Commit

Permalink
- Batch processing fix (#11)
Browse files Browse the repository at this point in the history
  • Loading branch information
evekhm committed Jun 7, 2023
1 parent f863c47 commit 33c858c
Show file tree
Hide file tree
Showing 13 changed files with 97 additions and 34 deletions.
4 changes: 3 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -140,12 +140,14 @@ When exposed, the end point (via domain name) will be accessible via Internet an
on all the end points.
When protected, you will need machine in the same internal network in order to access the UI (for testing, you could create Windows VM in the same network and access it via RDP using IAP tunnel).

By default, the end-point is private (so then when upgrading customer accidentally end point does not become open unintentionally).
The preference can be set in `terraform/environments/dev/terraform.tfvars` file via `cda_external_ui` parameter:

```shell
cda_external_ui = true # Expose UI to the Internet: true or false
cda_external_ui = false # Expose UI to the Internet: true or false
```

For simple demo purposes you probably want to expose the end point (`cda_external_ui = true`).

### When deploying using Shared VPC
As is often the case in real-world configurations, this blueprint accepts as input an existing [Shared-VPC](https://cloud.google.com/vpc/docs/shared-vpc)
Expand Down
2 changes: 1 addition & 1 deletion cloudrun/queue/src/routes/queue.py
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ async def publish_msg(request: Request, response: Response):

try:
envelope = await request.json()
Logger.error(f"Pub/Sub envelope: {envelope}")
Logger.info(f"Pub/Sub envelope: {envelope}")

except json.JSONDecodeError:
response.status_code = status.HTTP_400_BAD_REQUEST
Expand Down
19 changes: 13 additions & 6 deletions common/src/common/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,9 @@

PDF_MIME_TYPE = "application/pdf"

DOC_CLASS_SPLIT_DISPLAY_NAME = "Sub-documents"
DOC_TYPE_SPLIT_DISPLAY_NAME = "Package"

# ========= Document upload ======================
BUCKET_NAME = f"{PROJECT_ID}-document-upload"
TOPIC_ID = "queue-topic"
Expand Down Expand Up @@ -93,7 +96,7 @@ def __init__(self, case_id, uid, gcs_url, document_type,


def init_bucket(bucketname, filename):
print(f"init_bucket with bucketname={bucketname}")
Logger.debug(f"init_bucket with bucketname={bucketname}")
global gcs
if not gcs:
gcs = storage.Client()
Expand Down Expand Up @@ -123,7 +126,7 @@ def load_config(bucketname, filename):
return config_data
else:
Logger.info(
f"load_config - It seems config file has changed....Reloading config from: {filename}")
f"load_config - It seems config file has changed....Reloading config from: {filename}")
try:
if blob.exists():
config_data = json.loads(blob.download_as_text(encoding="utf-8"))
Expand All @@ -137,7 +140,7 @@ def load_config(bucketname, filename):
# Fall-back to local file
Logger.warning(f"load_config - Warning: Using local {filename}")
json_file = open(
os.path.join(os.path.dirname(__file__), "config", filename))
os.path.join(os.path.dirname(__file__), "config", filename))
config_data = json.load(json_file)
return config_data

Expand Down Expand Up @@ -244,11 +247,15 @@ def get_document_type(doc_name):

def get_display_name_by_doc_class(doc_class):
Logger.debug(f"get_display_name_by_doc_class {doc_class}")
if doc_class is None:
return None

doc = get_document_types_config().get(doc_class)
if not doc:
Logger.error(
f"doc_class {doc_class} not present in document_types_config")
return None
if doc_class != DOC_CLASS_SPLIT_DISPLAY_NAME:
Logger.warning(
f"doc_class {doc_class} not present in document_types_config")
return doc_class

display_name = doc.get("display_name")
Logger.debug(f"Using doc_class={doc_class}, display_name={display_name}")
Expand Down
18 changes: 18 additions & 0 deletions docs/Lab1.md
Original file line number Diff line number Diff line change
Expand Up @@ -100,6 +100,24 @@ export API_DOMAIN=<YOUR_DOMAIN>
```

### Terraform

Copy terraform sample variable file as `terraform.tfvars`:
```shell
cp terraform/environments/dev/terraform.sample.tfvars terraform/environments/dev/terraform.tfvars
vi terraform/environments/dev/terraform.tfvars
```

Verify `cda_external_ip` points to the reserved External IP name inside `terraform/environments/dev/terraform.tfvars`:
```
cda_external_ip = "cda-ip" #IP-ADDRESS-NAME-HERE
```
For quickstart simple demo, you want to change and make end point public (change from `false` to `true`):
```shell
cda_external_ui = true # Expose UI to the Internet: true or false
```


> If you are missing `~/.kube/config` file on your system (never run `gcloud cluster get-credentials`), you will need to modify terraform file.
>
> If following command does not locate a file:
Expand Down
17 changes: 10 additions & 7 deletions init.sh
Original file line number Diff line number Diff line change
Expand Up @@ -43,13 +43,13 @@ export ORGANIZATION_ID=$(gcloud organizations list --format="value(name)")
export ADMIN_EMAIL=$(gcloud auth list --filter=status:ACTIVE --format="value(account)")
export TF_VAR_admin_email=${ADMIN_EMAIL}

FILE="${DIR}/terraform/environments/dev/terraform.tfvars"
if test -f "$FILE"; then
:
else
echo "Creating default terraform.tfvars ..."
cp "${DIR}/terraform/environments/dev/terraform.sample.tfvars" "${DIR}/terraform/environments/dev/terraform.tfvars"
fi
#FILE="${DIR}/terraform/environments/dev/terraform.tfvars"
#if test -f "$FILE"; then
# :
#else
# echo "Creating default terraform.tfvars ..."
# cp "${DIR}/terraform/environments/dev/terraform.sample.tfvars" "${DIR}/terraform/environments/dev/terraform.tfvars"
#fi

bash "${DIR}"/setup/setup_terraform.sh 2>&1 | tee -a "$LOG"

Expand All @@ -67,6 +67,9 @@ bash ../../../setup/update_config.sh 2>&1 | tee -a "$LOG"
subscription=$(terraform output -json eventarc_subscription)
subscription_name=$(echo "$subscription" | tr -d '"' | sed 's:.*/::')
gcloud alpha pubsub subscriptions update "$subscription_name" --ack-deadline=120 --project $PROJECT_ID
subscription=$(terraform output -json queue_subscription)
subscription_name=$(echo "$subscription" | tr -d '"' | sed 's:.*/::')
gcloud alpha pubsub subscriptions update "$subscription_name" --ack-deadline=120 --project $PROJECT_ID

#TODO fix in TF to perform this step
if [ -n "$DOCAI_PROJECT_ID" ] && [ "$DOCAI_PROJECT_ID" != "$PROJECT_ID" ]; then
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,8 @@

from common.utils.logging_handler import Logger
from common.config import get_document_type
from common.config import STATUS_SUCCESS, STATUS_ERROR, STATUS_SPLIT
from common.config import STATUS_SUCCESS, STATUS_ERROR, STATUS_SPLIT,\
DOC_CLASS_SPLIT_DISPLAY_NAME, DOC_TYPE_SPLIT_DISPLAY_NAME
from utils.classification.split_and_classify import classify, DocumentConfig

router = APIRouter(prefix="/classification")
Expand Down Expand Up @@ -158,8 +159,8 @@ async def classification(payload: ProcessTask):
# Document was split, need to update status of the original document,
# since it will not be sent for extraction
update_classification_status(case_id, uid, STATUS_SPLIT,
document_class="Sub-documents",
document_type="Package")
document_class=DOC_CLASS_SPLIT_DISPLAY_NAME,
document_type=DOC_TYPE_SPLIT_DISPLAY_NAME)

for doc_prediction_result in prediction:

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -63,10 +63,6 @@ def __init__(self, case_id, uid, gcs_url, out_folder,
self.context = context

self.pdf_path = f'{out_folder}/{case_id}_{uid}_' + basename(gcs_url)
Logger.info(f'Downloading from {gcs_url} to {self.pdf_path}')

self.blob = download_pdf_gcs(gcs_uri=gcs_url, output_filename=self.pdf_path)
self.doc_path = self.pdf_path
self.gcs_url = gcs_url


Expand All @@ -78,7 +74,7 @@ def get_classification_predictions(gcs_input_uris, timeout: int = 400):
result = {} # Contains per processed document
if not parser_details:
default_class = get_classification_default_class()
Logger.error(f"No classification parser defined, exiting classification, "
Logger.warning(f"No classification parser defined, exiting classification, "
f"using {default_class}")
for uri in gcs_input_uris:
result[uri] = [{'predicted_class': default_class,
Expand Down Expand Up @@ -115,10 +111,7 @@ def get_classification_predictions(gcs_input_uris, timeout: int = 400):
gcs_output_uri = f"gs://{DOCAI_OUTPUT_BUCKET_NAME}"

timestamp = datetime.datetime.utcnow().strftime("%Y-%m-%d_%H-%M-%S-%f")
gcs_output_uri_prefix = "extractor_out_" + timestamp
letters = string.ascii_lowercase
temp_folder = "".join(random.choice(letters) for i in range(10))
gcs_output_uri_prefix = "classifier_out_" + temp_folder
gcs_output_uri_prefix = "classifier_out_" + timestamp
# temp folder location
destination_uri = f"{gcs_output_uri}/{gcs_output_uri_prefix}/"

Expand Down Expand Up @@ -355,6 +348,8 @@ def split(doc_prediction: Dict, config: DocumentConfig):
f" output_filename={output_filename}")

try:
Logger.info(f'Downloading from {config.gcs_url} to {config.pdf_path}')
blob = download_pdf_gcs(gcs_uri=config.gcs_url, output_filename=config.pdf_path)
with Pdf.open(config.pdf_path) as original_pdf:
Logger.info(f"Creating: {output_filename} (confidence: {confidence})")
Logger.info(f"original_pdf.pages={original_pdf.pages}")
Expand Down
8 changes: 4 additions & 4 deletions microservices/matching_service/src/routes/matching.py
Original file line number Diff line number Diff line change
Expand Up @@ -103,7 +103,7 @@ async def match_document(case_id: str, uid: str):
active="active").filter(document_type="application_form").get()

if af_doc and af_doc.entities is not None:
Logger.info(f"Matching document with case_id {case_id}"\
Logger.info(f"Matching document with case_id {case_id}"
f" and uid {uid} with the corresponding Application form")

#Get Supporting Document data from DB
Expand Down Expand Up @@ -135,13 +135,13 @@ async def match_document(case_id: str, uid: str):

if dsm_status.status_code == status.HTTP_200_OK:
Logger.info(
f"Matching document with case_id {case_id} and "\
f"Matching document with case_id {case_id} and "
f"uid {uid} was successful"
)
return {"status": STATUS_SUCCESS, "score": overall_score}
else:
Logger.error(
f"Matching document with case_id {case_id} and "\
f"Matching document with case_id {case_id} and "
f"uid {uid} Failed. Doc status not updated"
)
raise HTTPException(
Expand All @@ -157,7 +157,7 @@ async def match_document(case_id: str, uid: str):
# "Error in getting matching score")

else:
Logger.error(f"Matching with case_id {case_id} and uid {uid}: "\
Logger.warning(f"Matching with case_id {case_id} and uid {uid}: "
f"Application form with entities not found with given case_id: {case_id}")
return {"status": STATUS_SUCCESS, "score": 0}
# No matching => Nothing to match unless it is Configured!
Expand Down
1 change: 1 addition & 0 deletions sql-scripts/count.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
SELECT Count(*) as Count FROM `validation.validation_table` WHERE STARTS_WITH(case_id, @LABEL)
12 changes: 10 additions & 2 deletions sql-scripts/run_query.sh
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,8 @@ DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
#source "$DIR/../SET"

SCRIPT=$1
LABEL=$2

if [ -z "$SCRIPT" ]; then
echo " Usage: ./run_query.sh QUERY_NAME"
echo " Options:"
Expand All @@ -12,12 +14,18 @@ if [ -z "$SCRIPT" ]; then
echo " - corrected_values"
echo " - entities"
echo " - confidence"

echo " - count"
exit
fi


FILTER="--parameter=LABEL:STRING:$LABEL"
#echo "Running $SCRIPT"
bq query --project_id="$PROJECT_ID" --dataset_id=$BIGQUERY_DATASET --nouse_legacy_sql --flagfile="${DIR}/${SCRIPT}.sql" 2> /dev/null
bq query --project_id="$PROJECT_ID" --dataset_id=$BIGQUERY_DATASET --nouse_legacy_sql "$FILTER" --flagfile="${DIR}/${SCRIPT}.sql" 2> /dev/null

#f=$( cat "${DIR}/${SCRIPT}.sql" )
#bq query --project_id="$PROJECT_ID" --dataset_id=$BIGQUERY_DATASET --nouse_legacy_sql "$FILTER" "$f" 2> /dev/null


#./run_query.sh query_extraction_confidence.sql
#./run_query.sh diagnose.sql
Expand Down
4 changes: 4 additions & 0 deletions terraform/environments/dev/output.tf
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,10 @@ output "eventarc_subscription" {
value = module.cloudrun-startspipeline-eventarc.event-subscription
}

output "queue_subscription" {
value = module.cloudrun-queue-pubsub.queue-subscription
}

output "ingress-ip" {
value = module.ingress.ingress_ip_address
}
Expand Down
2 changes: 1 addition & 1 deletion terraform/environments/dev/terraform.sample.tfvars
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,6 @@
// }
// region = "us-central1"
//}
cda_external_ui = true # Expose UI to the Internet: true or false
cda_external_ui = false # Expose UI to the Internet: true or false
cda_external_ip = "cda-ip" # Name of the reserved IP address. Must be reserved in the Service Project, Global IP address
//master_ipv4_cidr_block = "172.16.0.0/28" # MASTER.CIDR/28 When using a different cidr block, make sure to add a firewall rule on port 8443 (see setup/setup_vpc_host_project.sh)
24 changes: 24 additions & 0 deletions terraform/modules/pubsub/outputs.tf
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
/**
* Copyright 2022 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* https://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
*/
//
output "queue-subscription" {
value = google_eventarc_trigger.queue-topic-trigger.transport[0].pubsub[0].subscription
}

//output "event-topic" {
// value = google_eventarc_trigger.startpipeline-topic-trigger.transport[0].pubsub[0].topic
//}

0 comments on commit 33c858c

Please sign in to comment.