TransformDataset doesn't process the data in parallel (uses only a single worker) #146

wsuchy · 2019-11-01T18:16:07Z

When using multiple input files and the FnApiRunner with the SUBPROCESS_SDK environment:

pipeline_options = PipelineOptions(['--direct_num_workers', str(workers)])
return beam.Pipeline(options=pipeline_options,
                     runner=fn_api_runner.FnApiRunner(
                         default_environment=beam_runner_api_pb2.Environment(
                             urn=python_urns.SUBPROCESS_SDK,
                             payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
                                     % sys.executable.encode('ascii'))))

tft_beam.AnalyzeAndTransformDataset does use all workers and generates multiple output files, which makes processing quite fast (gist: analyze_and_transform()).

tft_beam.TransformDataset, however, uses only one worker and produces only one output file (gist: transform_only()). This makes it almost impossible to process the test and validation datasets within a reasonable amount of time.
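One quick way to compare the two cases is to count the shard files each pipeline wrote. The sketch below is only a diagnostic aid and assumes the outputs use Beam's default shard naming (`<prefix>-SSSSS-of-NNNNN<suffix>`); the helper name `count_output_shards` is hypothetical and does not come from the gist.

```python
import glob
import re

# Beam's default shard naming appends "-SSSSS-of-NNNNN" to the output prefix,
# e.g. "transformed-00003-of-00008.tfrecord". (Assumption: the pipeline did not
# override the shard name template.)
_SHARD_RE = re.compile(r"-\d{5}-of-\d{5}")


def count_output_shards(file_path_prefix):
    """Count files on disk that look like shards written under the given prefix."""
    # glob.escape keeps special characters in the prefix from being read as wildcards.
    candidates = glob.glob(glob.escape(file_path_prefix) + "-*-of-*")
    # Keep only names whose part after the prefix starts with a real shard marker.
    return sum(
        1 for path in candidates
        if _SHARD_RE.match(path[len(file_path_prefix):])
    )
```

Calling it on whatever output prefix each pipeline writes to should show several shards for the analyze_and_transform() run and, if the behaviour described above holds, a single shard for the transform_only() run.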

Is there a problem with my code or is it a bug?

GIST: https://gist.github.com/wsuchy/0c89b27a72b457ae6c904d8786658d2e
The dataset comes from https://www.kaggle.com/generall/oneshotwikilinks and has been processed using the prepare_dataset function.


schmidt-jake · 2020-04-29T20:05:57Z

I'm also running into this issue.

wsuchy changed the title ~~TransformDataset doesn't process the data in parallel (uses only a single core)~~ TransformDataset doesn't process the data in parallel (uses only a single worker) Nov 1, 2019

rmothukuru self-assigned this Nov 4, 2019

rmothukuru added stat:awaiting tensorflower type:bug labels Nov 4, 2019

rmothukuru assigned zoyahav and unassigned rmothukuru Nov 4, 2019

rmothukuru added type:performance Performance Issue and removed type:bug labels Nov 4, 2019
