Issues: NVIDIA/Megatron-LM
Issues list
[BUG] Bug of expert model parallel
stale
No activity in 60 days on issue or PR
#766
opened Apr 7, 2024 by
1049451037
updated Jun 30, 2024
[BUG] When loading from checkpoint to continue training, it will hang during the validation's forward.
#647
opened Dec 28, 2023 by
young-chao
updated Jun 29, 2024
[QUESTION] bf16 Parameters and fp32 Gradients
stale
No activity in 60 days on issue or PR
#800
opened Apr 30, 2024 by
pluiez
updated Jun 29, 2024
Batch_input and elapsed time per iteration slow down during model training
#897
opened Jun 29, 2024 by
Yuhanleeee
updated Jun 29, 2024
[BUGS] Pipeline Parallelism fails/hangs with Megatron Core example
#881
opened Jun 20, 2024 by
schheda1
updated Jun 28, 2024
[REGRESSION] MoEs are obtaining higher loss than they should during training
#894
opened Jun 27, 2024 by
kiddyboots216
updated Jun 28, 2024
[BUG] @jit_fuser fails with Unknown type constructor Sequence
#880
opened Jun 20, 2024 by
Edenzzzz
updated Jun 28, 2024
[BUG]Question about helpers.cpp in version core_v0.7.0
#896
opened Jun 28, 2024 by
longzhang418
updated Jun 28, 2024
[QUESTION] Does Megatron-LM support P100?
#849
opened May 29, 2024 by
gaokaiz2
updated Jun 28, 2024
[BUG] AttributeError: module 'transformer_engine' has no attribute 'pytorch'
stale
No activity in 60 days on issue or PR
#696
opened Feb 19, 2024 by
zhentingqi
updated Jun 27, 2024
[QUESTION] Getting tools/preprocess_data.py to work is painful
#892
opened Jun 26, 2024 by
sambar1729
updated Jun 26, 2024
[QUESTION] Sample idx, bin files in public domain for trying out pretrain_gpt.py?
#891
opened Jun 26, 2024 by
sambar1729
updated Jun 26, 2024
[QUESTION] Has standalone_embedding_stage been supported yet in core?
#890
opened Jun 26, 2024 by
JiwenJ
updated Jun 26, 2024
[QUESTION] Why is TELayerNormColumnParallelLinear used instead of TEColumnParallelLinear in gpt_layer_specs
#884
opened Jun 21, 2024 by
clarence-lee-sheng
updated Jun 25, 2024
[BUG] NCCL TIMEOUT ( maybe ALLREDUCE ? )
#735
opened Mar 14, 2024 by
ZhangEnmao
updated Jun 25, 2024
[QUESTION] Zarr-based strategies will not be registered because of missing packages
#689
opened Feb 5, 2024 by
ZhangEnmao
updated Jun 24, 2024
How about supporting alternatives to fine-tuning?
stale
No activity in 60 days on issue or PR
#114
opened Jul 6, 2021 by
hwijeen
updated Jun 22, 2024
[QUESTION] What's the internal difference for training when setting only "fp8-format" or setting "fp8-format"+"bf16"
#883
opened Jun 21, 2024 by
dong-liuliu
updated Jun 21, 2024
[QUESTION] Is it expected to do grad norm on dense-optimizer and moe-optimizer respectively?
#785
opened Apr 19, 2024 by
ezioliao
updated Jun 20, 2024
[QUESTION] Why does megatron-core seem slower and use more GPU memory than legacy for gpt_pretrain?
stale
No activity in 60 days on issue or PR
#770
opened Apr 9, 2024 by
REIGN12
updated Jun 19, 2024
[QUESTION] Validation loss & PPL keep going up
stale
No activity in 60 days on issue or PR
#787
opened Apr 20, 2024 by
zhentingqi
updated Jun 19, 2024
[QUESTION] Gloo connectFullMesh failed when the number of nodes setting "export GLOO_SOCKET_IFNAME=bond4" exceeds 60
#877
opened Jun 19, 2024 by
Genlovy-Hoo
updated Jun 19, 2024
When can we have the MoE checkpoint conversion script?
#790
opened Apr 22, 2024 by
shamanez
updated Jun 19, 2024
[QUESTION] When pretraining BERT, hit a bug: cuBLAS Error: the requested functionality is not supported
#876
opened Jun 18, 2024 by
shanyuaa
updated Jun 18, 2024