
Feature suggestion: cutoff_len could optionally drop too long examples from dataset. #3995

Open
s4s0l opened this issue May 30, 2024 · 1 comment
Labels
pending This problem is yet to be addressed

Comments

s4s0l commented May 30, 2024

Sorry if this was discussed somewhere, but it's hard to search the issues for 'cutoff_len' since it appears everywhere in the logs :/

Currently, setting a cutoff_len (at least for SFT) will trim too-long training examples using the `infer_max_len` function. It's a nice trick: it takes into account that an example is a prompt/answer pair and tries to cut the example in a way that doesn't cut away the whole answer. But it would be nice to have an option to not trim anything and simply exclude too-long examples. Trimming an example can damage it in a way that corrupts the whole fine-tuning process, especially for reasoning or math tasks, where trimming can produce an example that loses its original intent or is even simply invalid.
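
To illustrate, this is roughly the behaviour I'm asking for, sketched with the HF `datasets` filter API (the `prompt`/`response` field names and the gpt2 tokenizer are placeholders I picked for illustration, not the actual LLaMA-Factory internals):

```python
from datasets import Dataset
from transformers import AutoTokenizer

# Placeholder data and tokenizer, just to make the sketch runnable.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
cutoff_len = 64

dataset = Dataset.from_list([
    {"prompt": "Q: What is 2+2? ", "response": "A: 4"},
    {"prompt": "Q: Prove it step by step. ", "response": "A: " + "step " * 500},
])

def within_cutoff(example):
    # Count the tokens of the full prompt + response pair.
    text = example["prompt"] + example["response"]
    return len(tokenizer(text)["input_ids"]) <= cutoff_len

# Drop over-length examples entirely instead of trimming them.
dataset = dataset.filter(within_cutoff, desc="Dropping over-length examples")
print(len(dataset))  # 1: the long example was excluded, not truncated
```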

Am I missing some setting or tool?

While hacking around this I noticed that `Template` cannot control such behaviour: although `encode_multiturn` returns an array, it cannot actually return an empty or truncated list at that point, so the changes are not local enough for my Python skills / knowledge of the codebase to prepare a proper PR. My apologies :/
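
To make the idea concrete anyway, here is a self-contained sketch of the drop-instead-of-trim check I was aiming for; `encode_multiturn_stub` is a stand-in I wrote for illustration, not the real `Template.encode_multiturn`:

```python
# Hypothetical stand-in for Template.encode_multiturn, which produces
# (source_ids, target_ids) pairs for each turn of a conversation.
def encode_multiturn_stub(messages):
    # Fake token ids: one "token" per character, purely for illustration.
    return [(list(range(len(q))), list(range(len(a)))) for q, a in messages]

cutoff_len = 32
examples = [
    [("What is 2+2?", "4")],
    [("Prove it.", "Reasoning step. " * 50)],
]

kept = []
for messages in examples:
    pairs = encode_multiturn_stub(messages)
    total_len = sum(len(src) + len(tgt) for src, tgt in pairs)
    if total_len > cutoff_len:
        continue  # drop the example entirely instead of truncating it
    kept.append(messages)

print(f"kept {len(kept)} of {len(examples)} examples")  # kept 1 of 2
```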

@maiqingqiang

+1

@hiyouga hiyouga added the pending This problem is yet to be addressed label Jun 3, 2024