Memory leak for large strings #1539
Related to #1495
Hello! I am also experiencing a memory leak with these tokenizers when processing long sequences without any spaces. This has been reported as a memory leak in Sentence Transformers, and it affects some of my users: UKPLab/sentence-transformers#1795

Reproduction

```python
import random
import string
import time

import psutil
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('xlm-roberta-base')

def random_string(length: int) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

for iteration in range(99999999):
    start_t = time.time()
    tokenizer.encode_batch([random_string(12345) for _ in range(200)])
    memory_usage_in_MiB = psutil.Process().memory_info().rss / (1024 * 1024)
    delta_t = time.time() - start_t
    print(f"{iteration:02d}: {memory_usage_in_MiB:.2f}MiB, {delta_t:.2f}s")
```

Outputs
This is rather severe: not only does memory usage grow massively, but tokenization speed also becomes much, much lower.

Notes

The memory usage is much more reasonable if the strings:
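Since the report above specifically mentions long sequences *without any spaces*, a contrasting variant of the string generator that mixes spaces into the same alphabet may be useful for comparison. This is a hypothetical helper (not from the original comment); `random_string_with_spaces` is an assumed name:

```python
import random
import string

def random_string_with_spaces(length: int) -> str:
    # Same alphabet as the reproduction above, plus spaces, so the
    # pre-tokenizer can split the input into many short words instead
    # of one enormous token-less run.
    alphabet = string.ascii_uppercase + string.digits + " "
    return "".join(random.choices(alphabet, k=length))
```

Swapping this in for `random_string` in the reproduction loop lets you compare memory growth with and without whitespace in otherwise identical inputs.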
I will check; this might be related to FFI (Foreign Function Interface) and the way strings are passed to Rust in the background.
+1 on facing this issue. Happy to help in any way to get this fixed!
FWIW, it appears to leak even if
This snippet will cause memory usage to rise indefinitely:

If you set `refresh_every` to 100000 (like it is in the snippet), the memory usage will keep on rising. This colab notebook crashes after about 15 minutes of executing.

If you set `refresh_every` to 100, the memory consumption will be stable.
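The snippet itself was not captured in this thread, so the following is a hypothetical reconstruction based on the comment: the name `refresh_every`, the loop body, and the input generation are all assumptions. The idea it illustrates is that re-creating the `Tokenizer` every `refresh_every` iterations discards whatever state the old instance accumulated:

```python
import random
import string

from tokenizers import Tokenizer

def run(refresh_every: int, iterations: int = 100000) -> None:
    tokenizer = Tokenizer.from_pretrained("xlm-roberta-base")
    for iteration in range(iterations):
        if iteration > 0 and iteration % refresh_every == 0:
            # Replacing the tokenizer releases whatever the old instance
            # accumulated, working around the reported leak.
            tokenizer = Tokenizer.from_pretrained("xlm-roberta-base")
        # Long space-free inputs, as in the original reproduction.
        tokenizer.encode("".join(random.choices(string.ascii_uppercase, k=5000)))

# run(refresh_every=100)     # memory reportedly stays stable
# run(refresh_every=100000)  # memory reportedly keeps rising
```

This is a workaround sketch, not a fix; the underlying growth inside a single long-lived `Tokenizer` instance is the bug being reported.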