Ubuntu 18.04
tensorflow 2.2.0
tfx 0.21.4
I am generating a vocabulary from my own dataset, which consists of 28 GB of TFRecords with short description strings (up to 20 words) and integer labels from 1 to 100.
Generating the vocabulary without labels works fine.
But as soon as the labels argument is passed to tft.vocabulary, memory usage grows dramatically (beyond 100 GB) until the process is killed for running out of memory.
```python
import tensorflow as tf
import tensorflow_transform as tft


def preprocessing_fn(inputs):
    label = inputs['label']
    desc = inputs['description']
    desc = tf.strings.lower(desc.values)
    # remove all numbers and punctuation
    desc = tf.strings.regex_replace(desc, "[^a-zA-Z¿]+", " ")
    tokens = tf.strings.split(desc)
    ngrams = tf.strings.ngrams(tokens, [1, 2])
    ngrams = ngrams.to_sparse()
```
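The snippet above stops just before the vocabulary step. For reference, the call itself looks roughly like this (vocab_filename is only illustrative; the relevant difference is the labels argument):

```python
# Without labels: finishes fine with reasonable memory usage.
tft.vocabulary(ngrams, vocab_filename='description_vocab')

# With labels: memory usage explodes (>100 GB) until the process is killed.
tft.vocabulary(ngrams, labels=label, vocab_filename='description_vocab')
```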
```python
from tfx.components import ImportExampleGen
from tfx.orchestration import metadata, pipeline
from tfx.utils.dsl_utils import external_input


def main():
    ### Brings data into the pipeline
    examples = external_input('directory with tfrecords')
    example_gen = ImportExampleGen(input=examples)
    examples = example_gen.outputs['examples']

    # schema_importer and transform are defined here in the full pipeline;
    # their definitions are omitted from this excerpt.

    pipe = pipeline.Pipeline(
        pipeline_name='test',
        pipeline_root='pipelines/test',
        components=[
            example_gen,
            schema_importer,
            transform,
        ],
        metadata_connection_config=metadata.sqlite_metadata_connection_config(
            'test/metadata.db'),
        enable_cache=True,
        beam_pipeline_args=['--direct_num_workers=0'])
```
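The pipeline is run locally on Beam's DirectRunner; a minimal sketch of the invocation, assuming main() returns the pipeline.Pipeline object (pipe) built above:

```python
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

if __name__ == '__main__':
    # Run the TFX pipeline locally with the Beam-based orchestrator
    # (assumes main() returns the Pipeline object constructed above).
    BeamDagRunner().run(main())
```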