
Improving performance with large number of compute_and_apply_vocabulary transforms #180

Closed

cyc opened this issue Jun 11, 2020 · 6 comments

cyc commented Jun 11, 2020

I have a dataset with a relatively large number (dozens) of string/int features that need vocabularies computed. Is there any way to do that more efficiently? Right now my preprocessing_fn just has a separate tft.compute_and_apply_vocabulary for each feature, but this blows up my Dataflow graph size, and I suspect overall performance suffers as a result.

Ordinarily, if I were applying a numeric transform like tft.bucketize or tft.scale_to_z_score, I would just concatenate my numeric features together and apply a single analyzer op in an elementwise manner, which is much more efficient. However, there seems to be no way to apply a similar optimization when computing vocabularies.
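(For context, the numeric packing referred to here can look like the following minimal sketch, assuming three hypothetical scalar float features f1, f2, f3; tft.scale_to_z_score's elementwise=True keeps per-column statistics:)

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # Stack the numeric features into one [batch, 3] tensor so a single
    # analyzer op computes statistics for all of them at once.
    stacked = tf.stack([inputs['f1'], inputs['f2'], inputs['f3']], axis=-1)
    # elementwise=True computes a separate mean/stddev per column.
    scaled = tft.scale_to_z_score(stacked, elementwise=True)
    return {
        'f1_scaled': scaled[:, 0],
        'f2_scaled': scaled[:, 1],
        'f3_scaled': scaled[:, 2],
    }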

Also, it is worth noting that if I add just 30-50 more of these string/int features that need a vocabulary computed, I believe I will quickly run into the "job graph size too large" error on Dataflow. Is there a way around that?

@rmothukuru rmothukuru self-assigned this Jun 12, 2020
@rmothukuru rmothukuru added the type:performance Performance Issue label Jun 12, 2020
@rmothukuru rmothukuru assigned zoyahav and unassigned rmothukuru Jun 12, 2020
zoyahav (Member) commented Jun 12, 2020

Unfortunately, you're right: there's no straightforward way to pack vocabulary computations as you would with other TFT analyzers.

Depending on the size of your vocabularies, you could join some of them into a single vocabulary, as in the example below. Note that with this approach the vocabulary index range for each feature will not be contiguous, and you will need to be careful with frequency_threshold/top_k to make sure specific features' vocabularies don't get filtered out entirely.

import tempfile

import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema

raw_data = [
    {'A': 'hello', 'B': 'world'},
    {'A': 'world', 'B': 'hello'},
    {'A': '!', 'B': '!'},
]

raw_data_metadata = dataset_metadata.DatasetMetadata(
    dataset_schema.from_feature_spec({
        'A': tf.io.FixedLenFeature([], tf.string),
        'B': tf.io.FixedLenFeature([], tf.string),
    }))

def preprocessing_fn(inputs):
    # Prefix each value with its feature name so tokens from different
    # features stay distinct in the merged vocabulary.
    a = tf.strings.join(['A', inputs['A']])
    b = tf.strings.join(['B', inputs['B']])
    # Concatenate both features and compute one shared vocabulary.
    a_b = tf.concat((a, b), axis=-1)
    vocab = tft.vocabulary(a_b, vocab_filename='a_b_vocab')
    return {
        'a_int': tft.apply_vocabulary(a, vocab),
        'b_int': tft.apply_vocabulary(b, vocab),
    }

# Standard analyze-and-transform boilerplate (elided in the original):
with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (
        (raw_data, raw_data_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))

# transform_fn_dir is wherever the transform_fn was written out.
tft_output = tft.TFTransformOutput(transform_fn_dir)
print(tft_output.vocabulary_by_name('a_b_vocab'))

The vocabulary contents are:
[b'Bworld', b'Bhello', b'B!', b'Aworld', b'Ahello', b'A!']

And the transformed data is:
{'a_int': 4, 'b_int': 0}
{'a_int': 3, 'b_int': 1}
{'a_int': 5, 'b_int': 2}
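To extend this trick to many features (as in the dozens-of-features case above), the prefix/concat/apply steps can be factored into a helper. This is a minimal sketch, assuming all packed features are scalar strings; the helper name pack_vocab_features and the grouping into one shared vocabulary are hypothetical:

import tensorflow as tf
import tensorflow_transform as tft

def pack_vocab_features(inputs, feature_names, vocab_filename):
    # Prefix each value with its feature name, as in the A/B example,
    # so tokens from different features cannot collide.
    prefixed = {name: tf.strings.join([name, inputs[name]])
                for name in feature_names}
    # One vocabulary analyzer covers the whole group of features.
    packed = tf.concat(list(prefixed.values()), axis=-1)
    vocab = tft.vocabulary(packed, vocab_filename=vocab_filename)
    return {name + '_int': tft.apply_vocabulary(t, vocab)
            for name, t in prefixed.items()}

The caveat above still applies: if you pass top_k or frequency_threshold to tft.vocabulary here, a rarely populated feature's entire vocabulary can be filtered out of the shared file.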

cyc (Author) commented Jun 13, 2020

@zoyahav, thanks for the tips! I think that could help in certain cases for some vocabulary features.

I currently have a dataset with 102 string features and have definitely run into the issue of not even being able to start the Dataflow job because the graph size is too large. In the short term I will try to reduce the number of tft.vocabulary analyzers (e.g. by packing them per your suggestion), but in the long term, what are the plans for making the analyzers more scalable on Dataflow?

rcrowe-google commented

@cyc, following up on this issue: Beam 2.24 has been released and should hopefully help here. Could you try it and let us know? Also, for Dataflow, the V2 runner may well help. To try it, add --experiments=use_runner_v2
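(For reference, a hedged sketch of passing that experiment through Beam's Python pipeline options; the project and region values are placeholders:)

from apache_beam.options.pipeline_options import PipelineOptions

# Placeholders: substitute your own project/region/staging settings.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--experiments=use_runner_v2',  # opt in to Dataflow Runner V2
])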

rcrowe-google commented

I'm following up since the thread went quiet, to make sure this was resolved for you.

  1. Were you able to try Beam >= 2.24, and was it an improvement?
  2. Were you able to try Dataflow Runner V2, together with Dataflow Shuffle, and was it an improvement?

arghyaganguly commented

Closing this due to inactivity. Please feel free to reopen. Thanks.

meowcakes commented

I have encountered the same issue, but with Flink. I am also using tft.compute_and_apply_vocabulary on dozens of features, and the Flink DAG produced by TFT is enormous. Unless --execution_mode_for_batch=BATCH_FORCED is used, the pipeline just hangs, and even with that option it takes an excessively long time to run. Increasing the number of task managers and the parallelism actually makes it take longer to finish.
