
Support combiner packing for tft.vocabulary #259

Open · cyc opened this issue Jan 21, 2022 · 3 comments

Comments

cyc commented Jan 21, 2022

This is likely on the roadmap already, but it would be very beneficial for data transformations that require many simultaneous vocabulary computations to support combiner packing the same way it is supported for tft.experimental.approximate_vocabulary.
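To make the scenario concrete, here is a minimal sketch (the feature names and count are hypothetical) of a preprocessing_fn that computes many vocabularies at once; today each tft.vocabulary call runs as its own unpacked analyzer:

```python
import tensorflow_transform as tft

# Hypothetical feature set: many string features, each needing a vocabulary.
STRING_FEATURES = ["feature_%d" % i for i in range(100)]

def preprocessing_fn(inputs):
    outputs = dict(inputs)
    for key in STRING_FEATURES:
        # Each tft.vocabulary call is analyzed independently today;
        # combiner packing would let these share Beam stages.
        tft.vocabulary(inputs[key], vocab_filename=key)
    return outputs
```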

This may be a duplicate of #180 (comment) but I think it is worth re-raising given that this has been implemented for approximate_vocabulary.

zoyahav (Member) commented Jan 26, 2022

Due to implementation differences (i.e. vocabulary is not combiner based), applying packing to vocabulary without causing regressions is unfortunately a much more complex task than it was for approximate_vocabulary.
@iindyk has looked into this option in the past, he can comment further about feasibility here.

@cyc, could you please describe what it is you're hoping to accomplish for your pipeline through vocabulary analyzer packing?

cyc (Author) commented Jan 27, 2022

@zoyahav the application is similar to what is described in #260, and the underlying problem is essentially the same as in #180: having a large number of unpacked analyzers has negative effects on performance. I noticed that the implementation changed between tft.vocabulary and tft.experimental.approximate_vocabulary, so I was wondering whether the implementation of tft.vocabulary would eventually change in the same way.

Feel free to close this as "wontfix" if this isn't on the roadmap. I was more just inquiring about whether this has been considered or not.

iindyk (Contributor) commented Mar 7, 2022

The main reason why tft.experimental.approximate_vocabulary can be packed is that it has a limit on the number of unique tokens in the resulting vocabulary (top_k is required), so its accumulator can be pre-allocated; if the actual number of unique input values is larger than top_k, approximation kicks in, which still keeps only the top_k most frequent elements.
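For illustration, this is what the bounded computation looks like at the API level (a minimal sketch; the feature name and top_k value are made up) — note that top_k is a required argument:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # top_k is required: it bounds the vocabulary size, which lets the
    # combiner pre-allocate its accumulator and be packed with others.
    tft.experimental.approximate_vocabulary(
        inputs["tokens"],  # hypothetical string feature
        top_k=100_000,
        vocab_filename="tokens_vocab")
    return inputs
```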

The reason why we can't use this logic in tft.vocabulary is that we can't make any assumptions about the number of unique input values, and we need exact computation. It also needs to work for very large vocabularies (O(10^8) tokens), for which the pre-allocation approach is suboptimal.

I looked into some other ways of packing the vocabulary computation (different from what is done for approximate_vocabulary), but they did not provide an improvement, particularly for large vocabularies.

It is also worth noting that if top_k in approximate_vocabulary is >= the actual number of unique input values, then the computation is exact and coincides with the result of tft.vocabulary, which may allow some users to switch to it without accuracy degradation.
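As a sketch of that switch (the feature name, top_k value, and filenames below are hypothetical), the two calls produce the same vocabulary whenever top_k is at least the true number of unique values:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    tokens = inputs["tokens"]  # hypothetical string feature

    # Exact, but currently not packable:
    tft.vocabulary(tokens, vocab_filename="tokens_exact")

    # Packable; exact as long as top_k >= the actual number
    # of unique input values:
    tft.experimental.approximate_vocabulary(
        tokens, top_k=1_000_000, vocab_filename="tokens_approx")
    return inputs
```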

Sorry for the late response, I just came across this.
