In the paper you claim:

> During training, we employ a batch size of 32 and utilize the AdamW optimizer with a constant learning rate of 5e-5. The model is trained for two epochs.

But in the specialization.py script the gradient_accumulation_steps parameter is set to 8 and you are training for three epochs. Which is the true training recipe for the deberta-10k-rank_net model?
The training recipe for deberta-10k-rank_net is as described in the paper. We train on 4 GPUs, so with gradient_accumulation_steps set to 8 the global batch size is 4 * 8 = 32. Although the script runs for three epochs, we evaluate the checkpoint saved after 2 epochs.
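To make the arithmetic explicit, here is a minimal sketch of the global batch size calculation. Only the 4 GPUs and gradient_accumulation_steps = 8 are stated in this thread; the per-device batch size of 1 is an assumption made so that the numbers match the paper's reported batch size of 32.

```python
# Minimal sketch of the effective (global) batch size calculation.
# num_gpus and gradient_accumulation_steps come from this thread;
# per_device_train_batch_size = 1 is an assumption, not confirmed here.
num_gpus = 4
per_device_train_batch_size = 1  # assumed
gradient_accumulation_steps = 8  # as set in specialization.py

global_batch_size = (
    num_gpus * per_device_train_batch_size * gradient_accumulation_steps
)
print(global_batch_size)  # 32, matching the batch size reported in the paper
```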