NVIDIA / Megatron-LM Public

Issues: NVIDIA/Megatron-LM

302 Open 275 Closed

Issues list

[BUG] Bug of expert model parallel

Label: stale (no activity in 60 days on issue or PR)

#766 opened Apr 7, 2024 by 1049451037 updated Jun 30, 2024

[BUG] When loading from a checkpoint to continue training, it hangs during the validation's forward pass.

#647 opened Dec 28, 2023 by young-chao updated Jun 29, 2024

[QUESTION] bf16 Parameters and fp32 Gradients

Label: stale

#800 opened Apr 30, 2024 by pluiez updated Jun 29, 2024

Batch_input and elapsed time per iteration slow down during model training

#897 opened Jun 29, 2024 by Yuhanleeee updated Jun 29, 2024

[BUGS] Pipeline Parallelism fails/hangs with Megatron Core example

#881 opened Jun 20, 2024 by schheda1 updated Jun 28, 2024

[REGRESSION] MoEs are obtaining higher loss than they should during training

#894 opened Jun 27, 2024 by kiddyboots216 updated Jun 28, 2024

[BUG] @jit_fuser fails with Unknown type constructor Sequence

#880 opened Jun 20, 2024 by Edenzzzz updated Jun 28, 2024

[BUG] Question about helpers.cpp in version core_v0.7.0

#896 opened Jun 28, 2024 by longzhang418 updated Jun 28, 2024

[QUESTION] Does Megatron-LM support P100?

#849 opened May 29, 2024 by gaokaiz2 updated Jun 28, 2024

[BUG] AttributeError: module 'transformer_engine' has no attribute 'pytorch'

Label: stale

#696 opened Feb 19, 2024 by zhentingqi updated Jun 27, 2024

[QUESTION] Getting tools/preprocess_data.py to work is painful

#892 opened Jun 26, 2024 by sambar1729 updated Jun 26, 2024

[QUESTION] Sample idx, bin files in public domain for trying out pretrain_gpt.py?

#891 opened Jun 26, 2024 by sambar1729 updated Jun 26, 2024

[QUESTION] Has standalone_embedding_stage been supported yet in core?

#890 opened Jun 26, 2024 by JiwenJ updated Jun 26, 2024

[QUESTION] Why is TELayerNormColumnParallelLinear used instead of TEColumnParallelLinear in gpt_layer_specs

#884 opened Jun 21, 2024 by clarence-lee-sheng updated Jun 25, 2024

[BUG] NCCL TIMEOUT ( maybe ALLREDUCE ? )

#735 opened Mar 14, 2024 by ZhangEnmao updated Jun 25, 2024

[QUESTION] Zarr-based strategies will not be registered because of missing packages

#689 opened Feb 5, 2024 by ZhangEnmao updated Jun 24, 2024

How about supporting alternatives to fine-tuning?

Label: stale

#114 opened Jul 6, 2021 by hwijeen updated Jun 22, 2024

[QUESTION] What's the internal difference for training when setting only "fp8-format" or setting "fp8-format"+"bf16"

#883 opened Jun 21, 2024 by dong-liuliu updated Jun 21, 2024

[QUESTION] Is it expected to do grad norm on dense-optimizer and moe-optimizer respectively?

#785 opened Apr 19, 2024 by ezioliao updated Jun 20, 2024

[QUESTION] Why does megatron-core seem slower and use more GPU memory than legacy for gpt_pretrain?

Label: stale

#770 opened Apr 9, 2024 by REIGN12 updated Jun 19, 2024

[QUESTION] Validation loss & PPL keep going up

Label: stale

#787 opened Apr 20, 2024 by zhentingqi updated Jun 19, 2024

[QUESTION] Gloo connectFullMesh failed when the number of nodes setting "export GLOO_SOCKET_IFNAME=bond4" exceeds 60

#877 opened Jun 19, 2024 by Genlovy-Hoo updated Jun 19, 2024

When can we have the MoE checkpoint conversion script?

#790 opened Apr 22, 2024 by shamanez updated Jun 19, 2024

[BUG] Megatron Core example not working

#855 opened Jun 3, 2024 by schheda1 updated Jun 18, 2024

[QUESTION] When pretraining BERT, hit a bug: cuBLAS Error: the requested functionality is not supported

#876 opened Jun 18, 2024 by shanyuaa updated Jun 18, 2024
