
Memory leak for large strings #1539

Open
noamgai21 opened this issue May 23, 2024 · 5 comments
noamgai21 commented May 23, 2024

This snippet will cause memory usage to rise indefinitely:

from transformers import AutoTokenizer
import gc

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)
refresh_every = 100000

for i in range(100000):
  s = f'{i} {i} ' * 10000
  tokenizer.encode(s)
  gc.collect()
  if i % 100 == 0:
    print(i)
  if i % refresh_every == 0:
    tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)

If you set refresh_every to 100000 (as it is in the snippet), memory usage keeps rising; this Colab notebook crashes after about 15 minutes of execution.

If you set refresh_every to 100, memory consumption stays stable.
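
One way to quantify the growth instead of waiting for a crash is to log the process RSS inside the same loop. Here is a minimal sketch, assuming psutil is available; the model name and string pattern follow the snippet above, and the iteration count is just an example:

from transformers import AutoTokenizer
import psutil

# Same model as in the snippet above; psutil is only used to observe process RSS.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)
process = psutil.Process()

for i in range(1000):  # shorter run than the original, enough to see the trend
    s = f'{i} {i} ' * 10000
    tokenizer.encode(s)
    if i % 100 == 0:
        rss_mib = process.memory_info().rss / (1024 * 1024)
        print(f"iteration {i}: RSS {rss_mib:.1f} MiB")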

@noamgai21
Author

Related to #1495

@tomaarsen
Member
tomaarsen commented Jun 18, 2024

Hello!

I am also experiencing a memory leak with these tokenizers when processing long sequences without any spaces. This has been reported as a memory leak in Sentence Transformers, and affects some of my users: UKPLab/sentence-transformers#1795

Reproduction

import random
import string
import time
import psutil
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')

def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

for iteration in range(99999999):
    start_t = time.time()
    tokenizer.encode_batch([random_string(12345) for _ in range(200)])
    memory_usage_in_MiB = psutil.Process().memory_info().rss / (1024 * 1024)
    delta_t = time.time() - start_t
    print(f"{iteration:02d}: {memory_usage_in_MiB:.2f}MB, {delta_t:.2f}s")

Outputs

00: 353.12MB, 0.35s
01: 421.64MB, 0.51s
02: 492.77MB, 0.68s
03: 571.88MB, 0.93s
04: 623.66MB, 1.02s
05: 710.28MB, 1.35s
06: 803.41MB, 1.31s
07: 859.77MB, 1.43s
08: 912.55MB, 1.69s
09: 1014.13MB, 1.78s
10: 1081.04MB, 1.95s
11: 1133.04MB, 2.29s
12: 1208.43MB, 2.56s
13: 1413.81MB, 2.65s
14: 1495.07MB, 2.83s
15: 1575.66MB, 3.00s
16: 1646.78MB, 3.19s
17: 1720.24MB, 3.57s
18: 1793.95MB, 3.82s
19: 1862.75MB, 4.02s
20: 1939.91MB, 4.21s
21: 2008.09MB, 4.71s
22: 2084.01MB, 5.04s
23: 2157.63MB, 5.26s
24: 2228.05MB, 5.56s
25: 2304.84MB, 6.13s
26: 2374.40MB, 6.50s
27: 2445.36MB, 6.68s
28: 2517.31MB, 7.38s
29: 2590.93MB, 7.91s
30: 2432.09MB, 8.19s
31: 2645.64MB, 8.56s
32: 2720.85MB, 8.81s
33: 2801.12MB, 9.73s
34: 2874.08MB, 10.14s
35: 2949.19MB, 11.18s
36: 3017.41MB, 11.28s
37: 3094.99MB, 12.76s
38: 3164.58MB, 14.09s
39: 3232.37MB, 13.26s
40: 3309.48MB, 15.10s

This is rather severe: not only does memory usage grow massively, but tokenization also becomes much, much slower.

Notes

The memory usage is much more reasonable if the strings (see the sketch after this list):

  1. are not arbitrary, e.g. a repeated "abc", or
  2. contain spaces, e.g. by adding + " " to the list of choices.
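
A minimal sketch of those variants, assuming the same xlm-roberta-base tokenizer and psutil-based RSS measurement as in the reproduction above; the helper name and iteration counts are illustrative only:

import random
import string
import psutil
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')
process = psutil.Process()

def run(label, make_string):
    # Encode a few batches and report RSS after each one.
    for iteration in range(5):
        tokenizer.encode_batch([make_string(12345) for _ in range(200)])
        rss_mib = process.memory_info().rss / (1024 * 1024)
        print(f"{label} {iteration}: {rss_mib:.2f} MiB")

# 1. Non-arbitrary text: the same "abc" repeated to the target length.
run("repeated", lambda n: "abc" * (n // 3))
# 2. Random text that also contains spaces.
spaced = string.ascii_uppercase + string.digits + " "
run("spaced", lambda n: ''.join(random.choices(spaced, k=n)))
# 3. Random text without spaces (the leaking case from the reproduction above).
no_space = string.ascii_uppercase + string.digits
run("no-space", lambda n: ''.join(random.choices(no_space, k=n)))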

@n1t0 @Narsil @ArthurZucker

  • Tom Aarsen

@ArthurZucker
Collaborator

I will check; this might be related to FFI (Foreign Function Interface) and the way strings are passed to Rust in the background.

@SilasMarvin

+1 on facing this issue. Happy to help in any way to get this fixed!

@kczimm
kczimm commented Jun 26, 2024

FWIW, it appears to leak even if TOKENIZERS_PARALLELISM=0.
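
For reference, a minimal sketch of how that variable is typically set from Python before the tokenizer is used; the model name and input are just examples:

import os

# Disable parallel batch encoding in the tokenizers Rust backend.
os.environ["TOKENIZERS_PARALLELISM"] = "0"

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("xlm-roberta-base")
tokenizer.encode_batch(["some text without spaces"] * 4)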
