
Memory leak for large strings #1539

Open
noamgai21 opened this issue May 23, 2024 · 5 comments
noamgai21 commented May 23, 2024

This snippet will cause memory usage to rise indefinitely:

from transformers import AutoTokenizer
import gc

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)
refresh_every = 100000

for i in range(100000):
  s = f'{i} {i} ' * 10000
  tokenizer.encode(s)
  gc.collect()
  if i % 100 == 0:
    print(i)
  if i % refresh_every == 0:
    tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)

If you set refresh_every to 100000 (as it is in the snippet), memory usage keeps rising; this Colab notebook crashes after about 15 minutes of execution.

If you set refresh_every to 100, memory consumption stays stable.
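
One way to quantify the growth instead of waiting for a crash is to log the process RSS inside the same loop. Here is a minimal sketch, assuming psutil is available; the model name and string pattern follow the snippet above, and the iteration count is just an example:

from transformers import AutoTokenizer
import psutil

# Same model as in the snippet above; psutil is only used to observe process RSS.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", use_fast=True)
process = psutil.Process()

for i in range(1000):  # shorter run than the original, enough to see the trend
    s = f'{i} {i} ' * 10000
    tokenizer.encode(s)
    if i % 100 == 0:
        rss_mib = process.memory_info().rss / (1024 * 1024)
        print(f"iteration {i}: RSS {rss_mib:.1f} MiB")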

@noamgai21
Author

Related to #1495

@tomaarsen
Member
tomaarsen commented Jun 18, 2024

Hello!

I am also experiencing a memory leak with these tokenizers when processing long sequences without any spaces. This has been reported as a memory leak in Sentence Transformers, and affects some of my users: UKPLab/sentence-transformers#1795

Reproduction

import random
import string
import time
import psutil
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')

def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

for iteration in range(99999999):
    start_t = time.time()
    tokenizer.encode_batch([random_string(12345) for _ in range(200)])
    memory_usage_in_MiB = psutil.Process().memory_info().rss / (1024 * 1024)
    delta_t = time.time() - start_t
    print(f"{iteration:02d}: {memory_usage_in_MiB:.2f}MB, {delta_t:.2f}s")

Outputs

00: 353.12MB, 0.35s
01: 421.64MB, 0.51s
02: 492.77MB, 0.68s
03: 571.88MB, 0.93s
04: 623.66MB, 1.02s
05: 710.28MB, 1.35s
06: 803.41MB, 1.31s
07: 859.77MB, 1.43s
08: 912.55MB, 1.69s
09: 1014.13MB, 1.78s
10: 1081.04MB, 1.95s
11: 1133.04MB, 2.29s
12: 1208.43MB, 2.56s
13: 1413.81MB, 2.65s
14: 1495.07MB, 2.83s
15: 1575.66MB, 3.00s
16: 1646.78MB, 3.19s
17: 1720.24MB, 3.57s
18: 1793.95MB, 3.82s
19: 1862.75MB, 4.02s
20: 1939.91MB, 4.21s
21: 2008.09MB, 4.71s
22: 2084.01MB, 5.04s
23: 2157.63MB, 5.26s
24: 2228.05MB, 5.56s
25: 2304.84MB, 6.13s
26: 2374.40MB, 6.50s
27: 2445.36MB, 6.68s
28: 2517.31MB, 7.38s
29: 2590.93MB, 7.91s
30: 2432.09MB, 8.19s
31: 2645.64MB, 8.56s
32: 2720.85MB, 8.81s
33: 2801.12MB, 9.73s
34: 2874.08MB, 10.14s
35: 2949.19MB, 11.18s
36: 3017.41MB, 11.28s
37: 3094.99MB, 12.76s
38: 3164.58MB, 14.09s
39: 3232.37MB, 13.26s
40: 3309.48MB, 15.10s

This is rather severe: not only does memory usage grow massively, but tokenization also becomes much, much slower.

Notes

The memory usage is much more reasonable if the strings (see the sketch after this list):

  1. are not arbitrary, e.g. a repeated "abc", or
  2. contain spaces, e.g. by adding + " " to the list of choices.
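
A minimal sketch of those variants, assuming the same xlm-roberta-base tokenizer and psutil-based RSS measurement as in the reproduction above; the helper name and iteration counts are illustrative only:

import random
import string
import psutil
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')
process = psutil.Process()

def run(label, make_string):
    # Encode a few batches and report RSS after each one.
    for iteration in range(5):
        tokenizer.encode_batch([make_string(12345) for _ in range(200)])
        rss_mib = process.memory_info().rss / (1024 * 1024)
        print(f"{label} {iteration}: {rss_mib:.2f} MiB")

# 1. Non-arbitrary text: the same "abc" repeated to the target length.
run("repeated", lambda n: "abc" * (n // 3))
# 2. Random text that also contains spaces.
spaced = string.ascii_uppercase + string.digits + " "
run("spaced", lambda n: ''.join(random.choices(spaced, k=n)))
# 3. Random text without spaces (the leaking case from the reproduction above).
no_space = string.ascii_uppercase + string.digits
run("no-space", lambda n: ''.join(random.choices(no_space, k=n)))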

@n1t0 @Narsil @ArthurZucker

  • Tom Aarsen

@ArthurZucker
Collaborator

I will check; this might be related to FFI (Foreign Function Interface) and the way strings are passed to Rust in the background.

@SilasMarvin

+1 on facing this issue. Happy to help in any way to get this fixed!

@kczimm
kczimm commented Jun 26, 2024

FWIW, it appears to leak even if TOKENIZERS_PARALLELISM=0.
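
For reference, a minimal sketch of how that variable is typically set from Python before the tokenizer is used; the model name and input are just examples:

import os

# Disable parallel batch encoding in the tokenizers Rust backend.
os.environ["TOKENIZERS_PARALLELISM"] = "0"

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("xlm-roberta-base")
tokenizer.encode_batch(["some text without spaces"] * 4)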
