Ubuntu 18.04
tensorflow 2.2.0
tfx 0.21.4
I am generating a vocabulary from my own dataset, which consists of 28 GB of TFRecords with short description strings (up to 20 words) and integer labels from 1 to 100.
Generating the vocabulary without labels works fine.
But as soon as the labels argument is passed to tft.vocabulary, memory usage grows dramatically (beyond 100 GB) until the process is killed for running out of memory.
```python
import tensorflow as tf
import tensorflow_transform as tft


def preprocessing_fn(inputs):
    label = inputs['label']
    desc = inputs['description']
    desc = tf.strings.lower(desc.values)
    # remove all numbers and punctuation
    desc = tf.strings.regex_replace(desc, "[^a-zA-Z¿]+", " ")
    tokens = tf.strings.split(desc)
    ngrams = tf.strings.ngrams(tokens, [1, 2])
    ngrams = ngrams.to_sparse()
```
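The snippet above stops just before the vocabulary step. For reference, the call itself looks roughly like this (vocab_filename is only illustrative; the relevant difference is the labels argument):

```python
# Without labels: finishes fine with reasonable memory usage.
tft.vocabulary(ngrams, vocab_filename='description_vocab')

# With labels: memory usage explodes (>100 GB) until the process is killed.
tft.vocabulary(ngrams, labels=label, vocab_filename='description_vocab')
```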
```python
from tfx.components import ImportExampleGen
from tfx.orchestration import metadata, pipeline
from tfx.utils.dsl_utils import external_input


def main():
    ### Brings data into the pipeline
    examples = external_input('directory with tfrecords')
    example_gen = ImportExampleGen(input=examples)
    examples = example_gen.outputs['examples']

    # schema_importer and transform are defined here in the full pipeline;
    # their definitions are omitted from this excerpt.

    pipe = pipeline.Pipeline(
        pipeline_name='test',
        pipeline_root='pipelines/test',
        components=[
            example_gen,
            schema_importer,
            transform,
        ],
        metadata_connection_config=metadata.sqlite_metadata_connection_config(
            'test/metadata.db'),
        enable_cache=True,
        beam_pipeline_args=['--direct_num_workers=0'])
```
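The pipeline is run locally on Beam's DirectRunner; a minimal sketch of the invocation, assuming main() returns the pipeline.Pipeline object (pipe) built above:

```python
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

if __name__ == '__main__':
    # Run the TFX pipeline locally with the Beam-based orchestrator
    # (assumes main() returns the Pipeline object constructed above).
    BeamDagRunner().run(main())
```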