
Feature suggestion: cutoff_len could optionally drop too long examples from dataset. #3995

Open
s4s0l opened this issue May 30, 2024 · 1 comment
Labels
pending This problem is yet to be addressed

Comments

s4s0l commented May 30, 2024

Sorry if this was discussed somewhere, but it's hard to search the issues for 'cutoff_len' since it appears everywhere in the logs :/

Currently, setting a cutoff_len (at least for SFT) will trim too-long training examples using the `infer_max_len` function. It's a nice trick: it takes into account that an example is a prompt/answer pair and tries to cut the example in a way that doesn't cut away the whole answer. But it would be nice to have an option to not trim anything and simply exclude too-long examples. Trimming an example can damage it in a way that corrupts the whole fine-tuning process, especially for reasoning or math tasks, where trimming can produce an example that loses its original intent or is even simply invalid.
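
To illustrate, this is roughly the behaviour I'm asking for, sketched with the HF `datasets` filter API (the `prompt`/`response` field names and the gpt2 tokenizer are placeholders I picked for illustration, not the actual LLaMA-Factory internals):

```python
from datasets import Dataset
from transformers import AutoTokenizer

# Placeholder data and tokenizer, just to make the sketch runnable.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
cutoff_len = 64

dataset = Dataset.from_list([
    {"prompt": "Q: What is 2+2? ", "response": "A: 4"},
    {"prompt": "Q: Prove it step by step. ", "response": "A: " + "step " * 500},
])

def within_cutoff(example):
    # Count the tokens of the full prompt + response pair.
    text = example["prompt"] + example["response"]
    return len(tokenizer(text)["input_ids"]) <= cutoff_len

# Drop over-length examples entirely instead of trimming them.
dataset = dataset.filter(within_cutoff, desc="Dropping over-length examples")
print(len(dataset))  # 1: the long example was excluded, not truncated
```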

Am I missing some setting or tool?

While hacking around this I noticed that `Template` cannot control such behaviour: although `encode_multiturn` returns an array, it cannot actually return an empty or truncated list at that point, so the changes are not local enough for my Python skills / knowledge of the codebase to prepare a proper PR. My apologies :/
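
To make the idea concrete anyway, here is a self-contained sketch of the drop-instead-of-trim check I was aiming for; `encode_multiturn_stub` is a stand-in I wrote for illustration, not the real `Template.encode_multiturn`:

```python
# Hypothetical stand-in for Template.encode_multiturn, which produces
# (source_ids, target_ids) pairs for each turn of a conversation.
def encode_multiturn_stub(messages):
    # Fake token ids: one "token" per character, purely for illustration.
    return [(list(range(len(q))), list(range(len(a)))) for q, a in messages]

cutoff_len = 32
examples = [
    [("What is 2+2?", "4")],
    [("Prove it.", "Reasoning step. " * 50)],
]

kept = []
for messages in examples:
    pairs = encode_multiturn_stub(messages)
    total_len = sum(len(src) + len(tgt) for src, tgt in pairs)
    if total_len > cutoff_len:
        continue  # drop the example entirely instead of truncating it
    kept.append(messages)

print(f"kept {len(kept)} of {len(examples)} examples")  # kept 1 of 2
```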

@maiqingqiang

+1

@hiyouga hiyouga added the pending This problem is yet to be addressed label Jun 3, 2024