Improving performance with large number of compute_and_apply_vocabulary transforms #180
Comments
Unfortunately, you're right, there's no straightforward way to pack vocabulary computes as you would with other TFT analyzers. Depending on the size of your vocabularies you could join some of them (though with this method the vocabulary range for each feature will not be contiguous, and you will need to be careful with `frequency_threshold`/`top_k` to make sure specific feature vocabularies don't get completely filtered out).
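A minimal sketch of that joining approach, assuming dense scalar string features; the feature names and vocabulary filename are hypothetical, not from the original comment:

```python
import tensorflow as tf
import tensorflow_transform as tft


def preprocessing_fn(inputs):
  # Hypothetical feature names; substitute the features you want to share
  # a single vocabulary.
  shared_keys = ['feature_a', 'feature_b', 'feature_c']

  # Stack the dense string features into one [batch, num_features] tensor
  # so only one vocabulary analyzer node is added to the graph.
  stacked = tf.stack([inputs[k] for k in shared_keys], axis=-1)

  # One combined vocabulary over all stacked features. Ids from different
  # features share a single id space, so per-feature ranges are not
  # contiguous, and frequency_threshold/top_k act on the combined counts.
  integerized = tft.compute_and_apply_vocabulary(
      stacked, vocab_filename='joined_string_vocab')

  outputs = {}
  for i, key in enumerate(shared_keys):
    outputs[key + '_integerized'] = integerized[:, i]
  return outputs
```

Whether this is worthwhile depends on how large the individual vocabularies are and how much they overlap; very frequent values from one feature can crowd out rare values from another when `top_k` is applied to the joined vocabulary.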
@zoyahav, thanks for the tips! I think that could help in certain cases for some vocabulary features. I currently have a dataset with 102 string features and have definitely run into the issue of not even being able to start the Dataflow job because the graph size is too large. In the short term, I will try to do my best to reduce the number of vocabularies I compute.
@cyc, following up on this: Beam 2.24 has been released and should hopefully help here. Could you try it and let us know? Also, for Dataflow, the V2 runner may very well help. To try it, add the Runner V2 experiment flag to your pipeline options.
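For reference, a sketch of passing that experiment to a Beam pipeline; the project, region, and bucket values are placeholders, and `use_runner_v2` is the documented Dataflow experiment name for Runner V2:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/region/bucket values; the relevant part is the
# experiments entry, which enables the Dataflow Runner V2 architecture.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
    experiments=['use_runner_v2'],
)
```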
I'm following up since the thread went quiet to make sure that this was resolved for you.
Closing this due to inactivity. Please feel free to reopen. Thanks.
I have encountered the same issue, but with Flink. I am also using …
I have a dataset with a relatively large number (dozens) of string/int features that need vocabularies. Is there any way to do that more efficiently? Right now in my `preprocessing_fn` I just have a separate `tft.compute_and_apply_vocabulary` for each feature, but this blows up my Dataflow graph size and I suspect that overall performance is worse because of it.

Ordinarily, if I were applying a numeric transform like `tft.bucketize` or `tft.scale_to_z_score`, I would just concatenate my numeric features together and apply a single analyzer op elementwise, which is much more efficient. However, for computing vocabularies there seems to be no way to do an optimization like this.

Also, it is worth noting that if I add just 30-50 more of these string/int features that need a vocabulary computed, I believe I will quickly run into the "job graph size too large" error on Dataflow. Is there a way to get around that?
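For context, a minimal sketch of the two patterns contrasted above, assuming dense scalar features; the feature names and counts are hypothetical: one `tft.compute_and_apply_vocabulary` call per string feature versus a single elementwise analyzer for the numeric features.

```python
import tensorflow as tf
import tensorflow_transform as tft


def preprocessing_fn(inputs):
  # Hypothetical feature names standing in for the real schema.
  string_keys = ['str_feature_%d' % i for i in range(40)]
  numeric_keys = ['num_feature_%d' % i for i in range(40)]

  outputs = {}

  # One vocabulary analyzer per string feature: each call adds its own
  # analyzer node, which is what inflates the Beam/Dataflow job graph.
  for key in string_keys:
    outputs[key + '_integerized'] = tft.compute_and_apply_vocabulary(
        inputs[key], vocab_filename=key)

  # By contrast, numeric features can be stacked and scaled with a single
  # elementwise analyzer, so the graph cost does not grow per feature.
  stacked = tf.stack([inputs[k] for k in numeric_keys], axis=-1)
  scaled_columns = tf.unstack(
      tft.scale_to_z_score(stacked, elementwise=True), axis=-1)
  for key, column in zip(numeric_keys, scaled_columns):
    outputs[key + '_scaled'] = column

  return outputs
```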