Releases: huggingface/transformers
v4.30.1 Patch release
- Fix bnb config json serialization in #24137 by @younesbelkada
- Correctly build models and import call_context for older TF versions in #24138 by @Rocketknight1
- Fix bugs with trainer in #24134 by @pacman100
v4.30.0: 100k, Agents improvements, Safetensors core dependency, Swiftformer, Autoformer, MobileViTv2, timm-as-a-backbone
100k
Transformers has just reached 100k stars on GitHub. To celebrate, we wanted to highlight 100 projects in the vicinity of transformers, so we have created an awesome-transformers page to do just that.
We accept PRs to add projects to the list!
- Top 100 by @LysandreJik in #22912
- Add LlamaIndex to awesome-transformers.md by @ravi03071991 in #23484
- add cleanlab to awesome-transformers tools list by @jwmueller in #23440
4-bit quantization and QLoRA
By leveraging the bitsandbytes library by @TimDettmers, we add 4-bit support to transformers models!
- 4-bit QLoRA via bitsandbytes (4-bit base model + LoRA) by @TimDettmers in #23479
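A minimal sketch of what 4-bit loading might look like from user code. The helper names are hypothetical, the option values (NF4, double quantization) are assumptions chosen for illustration rather than recommendations from this release, and actually loading a model this way needs a CUDA GPU with bitsandbytes and accelerate installed.

```python
def four_bit_options():
    """Keyword arguments for BitsAndBytesConfig (illustrative values, not defaults)."""
    return {
        "load_in_4bit": True,               # store the weights in 4 bits
        "bnb_4bit_quant_type": "nf4",       # assumption: NF4 rather than plain FP4
        "bnb_4bit_use_double_quant": True,  # assumption: also quantize the quantization constants
    }


def load_in_4bit(checkpoint):
    """Load a causal LM with 4-bit weights (hypothetical helper)."""
    # Imported lazily so four_bit_options() can be inspected without the heavy dependencies.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quantization_config = BitsAndBytesConfig(**four_bit_options())
    return AutoModelForCausalLM.from_pretrained(
        checkpoint,
        quantization_config=quantization_config,
        device_map="auto",  # let accelerate place the layers
    )


# Usage (the checkpoint name is only a placeholder):
# model = load_in_4bit("facebook/opt-350m")
```

The returned model can then be combined with LoRA adapters for QLoRA-style fine-tuning; that step is outside this sketch.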
Agents
The Agents framework has been improved and continues to be stabilized. Alongside bug fixes, these are the important new features:
- Local agent capabilities, to load a generative model directly from transformers instead of relying on APIs.
- Prompts are now hosted on the Hub, which means that anyone can fork the prompts and update them with their own, so that other community contributors can re-use them.
- We add an AzureOpenAiAgent class to support Azure OpenAI agents.
- Add local agent by @sgugger in #23438
- Enable prompts on the Hub by @sgugger in #23662
- Add AzureOpenAiAgent by @sgugger in #24058
Safetensors
The safetensors library is a safe serialization framework for machine learning tensors. It has been audited and will become the default serialization framework for several organizations (Hugging Face, EleutherAI, Stability AI).
It has now become a core dependency of transformers.
New models
Swiftformer
The SwiftFormer paper introduces a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations in the self-attention computation with linear element-wise multiplications. A series of models called ‘SwiftFormer’ is built based on this, which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed. Even their small variant achieves 78.5% top-1 ImageNet1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2× faster compared to MobileViT-v2.
- Add swiftformer by @shehanmunasinghe in #22686
Autoformer
This model augments the Transformer as a deep decomposition architecture, which can progressively decompose the trend and seasonal components during the forecasting process.
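The decomposition idea can be illustrated with a moving average: the smoothed series is treated as the trend and the residual as the seasonal part. The sketch below is a plain-Python toy under that assumption. The function name, the window size, and the edge-padding choice are illustrative and do not reproduce the library's implementation.

```python
def decompose(series, window=3):
    """Split a series into (trend, seasonal) with a centered moving average.

    The ends are padded by repeating the first and last values so that the
    trend has the same length as the input. `window` must be odd.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    trend = [sum(padded[i:i + window]) / window for i in range(len(series))]
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal


series = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
trend, seasonal = decompose(series, window=3)
print(trend)     # smooth component
print(seasonal)  # what is left after removing the trend
```

By construction, adding the two components back together recovers the original series.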
MobileViTv2
MobileViTv2 is the second version of MobileViT, constructed by replacing the multi-headed self-attention in MobileViT with separable self-attention.
- Add MobileViTv2 by @shehanmunasinghe in #22820
PerSAM
PerSAM proposes a minimal modification to SAM to allow dreambooth-like personalization, making it possible to segment concepts in new images using just one example.
- Add PerSAM [bis] by @NielsRogge in #23659
Timm backbone
We add support for loading timm weights within the AutoBackbone API in transformers. timm models can be instantiated through the TimmBackbone class, and then used with any vision model that needs a backbone.
- Add TimmBackbone model by @amyeroberts in #22619
Image to text pipeline conditional support
We add conditional text generation to the image-to-text pipeline, allowing the model to continue an initial text prompt according to an image.
- [image-to-text pipeline] Add conditional text support + GIT by @NielsRogge in #23362
TensorFlow implementations
- Add TensorFlow implementation of EfficientFormer by @D-Roberts in #22620
Accelerate Migration
A major rework of the internals of the Trainer is underway, leveraging accelerate instead of redefining them in transformers. This should unify both frameworks and lead to increased interoperability and more efficient development.
- Smangrul/accelerate mp integrate by @pacman100 in #23148
- Smangrul/accelerate ddp integrate by @pacman100 in #23151
- fix trainer slow tests related to hyperparam search by @pacman100 in #24011
- remove the extra accelerator.prepare by @pacman100 in #23914
- move fsdp handling to accelerate by @pacman100 in #23158
- shift torch dynamo handling to accelerate by @pacman100 in #23168
- accelerate deepspeed and gradient accumulation integrate by @pacman100 in #23236
- fix executable batch size issue by @pacman100 in #24067
- fix accelerator prepare during eval only mode by @pacman100 in #24014
- reset accelerate env variables after each test by @pacman100 in #24107
- Fix translation no_trainer by @muellerzr in #23407
- Update error message when Accelerate isn't installed by @muellerzr in #23373
- Fix parallel mode check by @muellerzr in #23409
- Muellerzr fix deepspeed by @muellerzr in #23657
- Update all no_trainer with skip_first_batches by @muellerzr in #23664
- Fix sagemaker DP/MP by @muellerzr in #23681
- Log the right train_batch_size if using auto_find_batch_size and also log the adjusted value seperately. by @muellerzr in #23800
- Up pinned accelerate version by @muellerzr in #24089
- Move import check to before state reset by @muellerzr in #23906
- Upgrade safetensors version by @muellerzr in #23911
- Act on deprecations in Accelerate no_trainer examples by @muellerzr in #24053
- Oops, missed one by @muellerzr in #24054
Bugfixes and improvements
- chore: allow protobuf 3.20.3 requirement by @jose-turintech in #22759
- Fix link displayed for custom tools by @sgugger in #23274
- Remove missplaced test file by @sgugger in #23275
- Bring back the PR Refactor doctests + add CI to main by @ydshieh in #23271
- [gpt] Gpt2 fix half precision causal mask by @younesbelkada in #23256
- Temporary tolerance fix for flaky whipser PT-TF equiv. test by @amyeroberts in #23257
- Add top_k argument to post-process of conditional/deformable-DETR by @CreatlV in #22787
- transformers-cli -> huggingface-cli by @AlpinDale in #23276
- Temporarily increase tol for PT-FLAX whisper tests by @amyeroberts in #23288
- Added missing " in CHAT_PROMPT_TEMPLATE by @galatolofederico in #23287
- Update custom_tools.mdx: fix link by @mishig25 in #23292
- Update transformers_agents.mdx by @mishig25 in #23289
- Convert numpy arrays to lists before saving the evaluation metrics as json by @harisankar95 in #23268
- Fix doctest files fetch issue by @ydshieh in #23277
- skip test_run_squad_no_trainer for now by @ydshieh in #23302
- Better check for packages availability by @apbard in #23163
- Add gradient_checkpointing parameter to FlaxWhisperEncoder by @raghavanone in #23300
- Agents extras by @LysandreJik in #23301
- Fix broken links in the agent docs by @sgugger in #23297
- Fix typo in gradio-tools docs by @freddyaboulton in #23305
- Fix image segmentation tool test by @sgugger in #23306
- unpin tf prob by @ydshieh in #23293
- Revert "search buffers for dtype" by @sgugger in #23308
- Remove LanguageIdentificationTool in __init__.py as we don't have it yet by @ydshieh in #23326
- Fix docker image (caused by tensorflow_text) by @ydshieh in #23321
- Compute the mask in-place, with less memory reads, and on CUDA on XLNetLMHeadModel by @lezcano in #23332
- Only add files with modification outside doc blocks by @ydshieh in #23327
- [docs] Fix Agents and Tools docstring by @stevhliu in #23313
- OR am I crazy? by @hwuebben in #23295
- Handle padding warning in generation when using inputs_embeds by @zrthxn in #23131
- replaced assert with raise ValueError for t5, switch_transformers, pix2struct, mt5, longt5, gptsan_japanese. by @susnato in #23273
- Use cu118 with cudnn >= 8.6 in docker file by @ydshieh in #23339
- Removing one of the twice defined position_embeddings in LongFormer by @GregorySenay in #23343
- Fix issue introduced in PR #23163 by @ydshieh in #23363
- Typo suggestion by @richardachen in #23360
- Fix some is_xxx_available by @ydshieh in #23365
- Fix BigBirdForMaskedLM doctest by @ydshieh in #23369
- Fix OwlViTForObjectDetection.image_guided_detection doc example by @ydshieh in #23370
- Revert "Only add files with modification outside doc blocks" by @ydshieh in #23371
- [Bugfix] OPTDecoderLayer does not return attentions when gradient_checkpointing and training is enabled. by @gmlwns2000 in #23367
- Skip failing AlignModelTest::test_multi_gpu_data_parallel_forward by @ydshieh in #23374
- Fix test typos - audio feature extractors by @LWprogramming in #23310
- Added type hints for Graphormer pytorch version by @dewasahu2003 in #23073
- Replace NumPy Operations with JAX NumPy Equivalents for JIT Compilation Compatibility by @gojiteji in #23356
- Use mkstemp to replace deprecated mktemp by @ready-research in #23372
- Fix RwkvModel by @ydshieh in #23392
- Update test_batched_inference_image_captioning_conditioned by @ydshieh in #23391
- OPT/BioGPT: Improved attention mask shape exception by @gante in #23270
- Fix chat prompt in HFAgent by @IvanSedykh in #23335
- 🌐 [i18n-KO] Translated asr.mdx to Korean by @sim-so in #23106
- Minor fixes in transformers-tools by @Wauplin in #23364
- [Pix2Struct] Add conditional generation on docstring example by @younesbelkada in #23399
- Generate: faster can_generate check on TF and Flax by @gante in #23398
- [AutoModel] fix torch_dtype=auto in from_pretrained by @stas00 in #23379
- Docs: add link to assisted generation blog post by @gante in #23397
- Build with non Python files by @sgugger in #23405
- Generate: add test to check KV format by @gante in #23403
- Replace appends with list compr...
v4.29.2: Patch release
Fixes the package so non-Python files (like CUDA kernels) are properly included.
v4.29.1: Patch release
Reverts a regression in the FSDP integration.
Adds pip install transformers["agent"] to install all the dependencies that agents rely on.
Fixes the documentation about agents.
- Revert "search buffers for dtype" in #23308 by @sgugger
- Fix image segmentation tool test in #23306 by @sgugger
- Fix typo in gradio-tools docs in #23305 by @freddyaboulton
- Fix broken links in the agent docs in #23297 by @sgugger
- Agents extras in #23301 by @LysandreJik
- Update transformers_agents.mdx in #23289 by @mishig25
- Update custom_tools.mdx: fix link in #23292 by @mishig25
v4.29.0: Transformers Agents, SAM, RWKV, FocalNet, OpenLLaMa
Transformers Agents
Transformers Agent is a new API that lets you use the library and Diffusers by prompting an agent (which is a large language model) in natural language. That agent will then output code using a set of predefined tools, leveraging the appropriate (and state-of-the-art) models for the task the user wants to perform. It is fully multimodal and extensible by the community. Learn more in the docs
- Transformers Agents by @LysandreJik @patrickvonplaten and @sgugger in #23214
SAM
SAM (Segment Anything Model) was proposed in Segment Anything by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
The model can be used to predict segmentation masks of any object of interest given an input image.
- Add Segment Anything Model (SAM) by @ArthurZucker in #22654
- [SAM] Correct arxiv link by @younesbelkada in #22886
- Fix SAM example in documentation by @fxmarty in #22887
- [SAM] Change to facebook/sam-vit-base by @younesbelkada in #22891
- Small sam patch by @ArthurZucker in #22920
- [SAM] Add sam doc by @younesbelkada in #22984
- Make sam ONNX exportable by @fxmarty in #22915
- DocumentQuestionAnsweringPipeline only for fast ⚡ tokenizers by @ydshieh in #22745
- Add automatic-mask-generation pipeline for Segment Anything Model (SAM) by @ArthurZucker in #22840
- Expose AutoModelForMaskGeneration by @fxmarty in #22910
RWKV
RWKV suggests a tweak to the traditional Transformer attention that makes it linear. This way, the model can be used as a recurrent network: passing inputs for timestamp 0 and timestamp 1 together is the same as passing inputs at timestamp 0, then inputs at timestamp 1 along with the state of timestamp 0.
This can be more efficient than a regular Transformer and can deal with sentences of any length (even if the model uses a fixed context length for training).
- Add RWKV-4 by @sgugger and @younesbelkada in #22797
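The equivalence between processing inputs together and processing them one at a time with a carried state can be seen on a toy linear recurrence. This is only an illustration of the property, not RWKV's actual formulation. The function name and the decay constant are made up for the example.

```python
def run(inputs, state=0.0, decay=0.5):
    """Process a sequence with a linear recurrence; return (outputs, final_state)."""
    outputs = []
    for x in inputs:
        state = decay * state + x  # the state summarizes everything seen so far
        outputs.append(state)
    return outputs, state


# Both time steps in one call...
together, _ = run([1.0, 2.0])

# ...or one at a time, carrying the state forward.
first, state = run([1.0])
second, _ = run([2.0], state=state)

print(together == first + second)  # True
```

Because only the state has to be kept between calls, the cost of each new step does not grow with the length of the history.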
FocalNet
The FocalNet model was proposed in Focal Modulation Networks by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao. FocalNets completely replace self-attention (used in models like ViT and Swin) by a focal modulation mechanism for modeling token interactions in vision. The authors claim that FocalNets outperform self-attention based models with similar computational costs on the tasks of image classification, object detection, and segmentation.
- Add FocalNet by @NielsRogge in #21532
- Add focalnet backbone by @alaradirik in #23104
OpenLLaMa
The Open-Llama model was proposed in Open-Llama project by community developer s-JoL.
The model is mainly based on LLaMA with some modifications, incorporating memory-efficient attention from Xformers, stable embedding from Bloom, and shared input-output embedding from PaLM. The model is also pre-trained on both Chinese and English, which gives it better performance on Chinese language tasks.
Assisted Generation
Assisted generation is a new technique that lets you speed up generation with large language models by using a smaller model as an assistant. The assistant model is the one doing multiple forward passes, while the LLM merely validates the tokens proposed by the assistant. This can lead to speed-ups of up to 10x!
- Generate: Add assisted generation by @gante in #22211
- Generate: assisted generation with sample (take 2) by @gante in #22949
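The draft-and-verify loop described above can be sketched in plain Python for the greedy case. Everything here is a toy under stated assumptions: the "models" are ordinary functions mapping a token list to the next token, the names `assisted_generate`, `target_next` and `draft_next` are hypothetical, and the target is called once per position where a real model would score the whole proposal in one forward pass.

```python
def greedy_generate(target_next, prompt, max_new_tokens):
    """Baseline: one call to the large model per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def assisted_generate(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy draft-and-verify decoding; returns (tokens, verification_rounds)."""
    tokens = list(prompt)
    rounds = 0
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target model checks the proposal (counted as one round).
        rounds += 1
        accepted = []
        for candidate in proposal:
            expected = target_next(tokens + accepted)
            if candidate != expected:
                accepted.append(expected)  # replace the first wrong token and stop
                break
            accepted.append(candidate)
        else:
            # Every proposed token matched, so the target's next token comes for free.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:limit], rounds


def target_next(tokens):
    return (tokens[-1] + 1) % 10  # toy "large" model: count upwards modulo 10


def draft_next(tokens):
    # Toy "small" model: agrees with the target except after a 5.
    return 0 if tokens[-1] == 5 else (tokens[-1] + 1) % 10


tokens, rounds = assisted_generate(target_next, draft_next, [0], max_new_tokens=8)
print(tokens == greedy_generate(target_next, [0], 8), rounds)  # True 3
```

The output matches plain greedy decoding, but the large model is consulted in 3 rounds instead of 8 separate steps. In transformers itself, the assistant is passed to generate through the assistant_model argument.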
Code on the Hub from another repo
To avoid duplicating the model code in multiple repos when using the code on the Hub feature, loading such models will now save in their config the repo in which the code is. This way there is one source of ground truth for code on the Hub models.
- Use code on the Hub from another repo by @sgugger in #22698
- Use code on the Hub from another repo by @sgugger in #22814
Breaking changes
This release has three breaking changes compared to version v4.28.0.
The first one focuses on fixing training issues for Pix2Struct. This slightly affects the results, but should result in the model training much better.
- 🚨🚨🚨 [Pix2Struct] Attempts to fix training issues 🚨🚨🚨 by @younesbelkada in #23004
The second one is aligning the ignore index in the LUKE model with other models in the library. This breaks the convention that models should stick to their original implementation, but it was necessary in order to align with other transformers in the library.
Finally, the third breaking change aims to harmonize the training procedure for most of the recent additions in transformers. It is now the user's responsibility to mask the padding tokens of the labels with the correct value. This PR addresses the issue that was raised by other architectures such as Luke or Pix2Struct.
- 🚨🚨🚨 [Blip] remove labels masking by @younesbelkada in #23024
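Since masking the label padding is now left to the caller, a preprocessing step along the following lines may be needed. This is a minimal sketch, assuming the ignore index is -100 (the default ignored by PyTorch's cross-entropy loss). The helper name and the pad token id of 0 are hypothetical.

```python
IGNORE_INDEX = -100  # assumption: the default ignore_index of PyTorch's cross-entropy loss


def mask_padding(label_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Return a copy of the batch with padding positions replaced by ignore_index."""
    return [
        [ignore_index if token == pad_token_id else token for token in row]
        for row in label_ids
    ]


labels = [[5, 6, 7, 0, 0], [8, 9, 0, 0, 0]]  # 0 is a hypothetical pad_token_id
print(mask_padding(labels, pad_token_id=0))
# [[5, 6, 7, -100, -100], [8, 9, -100, -100, -100]]
```

Note that this replaces every occurrence of the pad id, so it needs care when the tokenizer reuses the same id for another special token.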
Bugfixes and improvements
- Change torch_dtype to str when saved_model=True in save_pretrained for TF models by @ydshieh in #22740
- 🌐 [i18n-KO] Translated training.mdx to Korean by @gabrielwithappy in #22670
- Remove DS_BUILD_AIO=1 by @ydshieh in #22741
- [trainer] update url by @stas00 in #22747
- fix(llama): fix LlamaTokenzier by @rockmagma02 in #22746
- Generate: handle text conditioning with multimodal encoder-decoder models by @gante in #22748
- Revert (for now) the change on Deta in #22437 by @ydshieh in #22750
- Fix serving_output for TF composite models (encoder-decoder like models) by @ydshieh in #22743
- 🌐 [i18n-KO] Translated sequence_classification.mdx to Korean by @0525hhgus in #22655
- [Examples] TPU-based training of a language model using TensorFlow by @sayakpaul in #21657
- Pix2struct: doctest fix by @gante in #22761
- Generate: pin number of beams in BART test by @gante in #22763
- Fix a mistake in Llama weight converter log output. by @aljungberg in #22764
- Fix failing torchscript tests for CpmAnt model by @ydshieh in #22766
- [WIP]🌐 [i18n-KO] Translated tutorial/proprecssing.mdx to Korean by @sim-so in #22578
- Tweak ESM tokenizer for Nucleotide Transformer by @Rocketknight1 in #22770
- Fix word_ids hyperlink by @mayankagarwals in #22765
- Seq2SeqTrainer: Evict decoder_input_ids only when it is created from labels by @gante in #22772
- Indexing fix - CLIP checkpoint conversion by @amyeroberts in #22776
- Move labels to the same device as logits for Whisper by @oscar-garzon in #22779
- Generate: add CJK support to TextStreamer by @bcol23 in #22664
- Fix test_word_time_stamp_integration for Wav2Vec2ProcessorWithLMTest by @ydshieh in #22800
- 🌐 [i18n-KO] Translated custom_models.mdx to Korean by @HanNayeoniee in #22534
- [i18n-KO] fix: docs: ko: sagemaker anchors and _toctree.yml by @jungnerd in #22549
- improve(llama): Faster apply_rotary_pos_emb by @fpgaminer in #22785
- Fix sneaky torch dependency in TF example by @Rocketknight1 in #22804
- 🌐 [i18n-KO] Translated tasks/translation.mdx to Korean by @wonhyeongseo in #22805
- Don't use LayoutLMv2 and LayoutLMv3 in some pipeline tests by @ydshieh in #22774
- Fix squeeze into torch 1.x compatible form in llama model by @DyeKuu in #22808
- Remove accelerate from tf test reqs by @muellerzr in #22777
- Simplify update metadata job by @sgugger in #22811
- Revert "Use code on the Hub from another repo" by @sgugger in #22813
- Introduce PartialState as the device handler in the Trainer by @muellerzr in #22752
- Mark auto models as important by @sgugger in #22815
- TTS fine-tuning for SpeechT5 by @hollance in #21824
- 🌐 [i18n-KO] Fix anchor links for docs auto_tutorial, training by @gabrielwithappy in #22796
- Fix Past CI not running against the latest main by @ydshieh in #22823
- Fix test_eos_token_id_int_and_list_top_k_top_sampling by @ydshieh in #22826
- Update accelerate version + warning check fix by @muellerzr in #22833
- Fix from_pretrained when model is instantiated on the meta device by @sgugger in #22837
- Raise err if minimum Accelerate version isn't available by @muellerzr in #22841
- Make ClipSeg compatible with model parallelism by @youssefadr in #22844
- fix SpeechT5 doc comments by @hollance in #22854
- move preprocess_logits_for_metrics before _nested_gather in trainer.e… by @ChenyangLiu in #22603
- feat(model parallelism): move labels to the same device as logits for M2M100 by @elabongaatuo in #22850
- use accelerate@main in CI by @ydshieh in #22859
- Remove 'main' from doc links by @amyeroberts in #22860
- Show diff between 2 CI runs on Slack reports by @ydshieh in #22798
- Remove some pipeline skip cases by @ydshieh in #22865
- Fixup multigpu local_rank by @muellerzr in #22869
- Fix to removing ESM special tokens by @Rocketknight1 in #22870
- XGLM: Fix left-padding (PT and TF) by @gante in #22828
- Patching clip model to create mask tensor on the device by @shanmugamr1992 in #22711
- fix: Correct small typo in docstring by @oscar-defelice in #22857
- Generation: only search for eos_token if set by @xloem in #22875
- Change schedule CI time by @ydshieh in #22884
- fix warning function call creating logger error (max_length and max_new_tokens) by @QuentinAmbard in #22889
- [Examples/TensorFlow] minor refactoring to allow compatible datasets to work by @sayakpaul in #22879
- moved labels to the same device as logits for OTP, CODEGEN ,gptj and pixel2struct model by @sushmanthreddy in #2...
v4.28.1: Patch release
v4.28.0: LLaMa, Pix2Struct, MatCha, DePlot, MEGA, NLLB-MoE, GPTBigCode
LLaMA
The LLaMA model was proposed in LLaMA: Open and Efficient Foundation Language Models. It is a collection of foundation language models ranging from 7B to 65B parameters. You can request access to the weights, then use the conversion script to generate a checkpoint compatible with Hugging Face.
Pix2Struct, MatCha, DePlot
Pix2Struct is a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct has been fine-tuned on various tasks and datasets, ranging from image captioning and visual question answering (VQA) over different inputs (books, charts, science diagrams) to captioning UI components, and others.
- Add Pix2Struct by @younesbelkada in #21400
- Add DePlot + MatCha on transformers by @younesbelkada in #22528
Mega
MEGA proposes a new approach to self-attention with each encoder layer having a multi-headed exponential moving average in addition to a single head of standard dot-product attention, giving the attention mechanism stronger positional biases. This allows MEGA to perform competitively with Transformers on standard benchmarks including LRA while also having significantly fewer parameters. MEGA’s compute efficiency allows it to scale to very long sequences, making it an attractive option for long-document NLP tasks.
GPTBigCode
The model is an optimized GPT2 model with support for Multi-Query Attention.
- Add GPTBigCode model (Optimized GPT2 with MQA from Santacoder & BigCode) by @jlamypoirier in #22575
NLLB-MoE
The mixture of experts version of the NLLB release has been added to the library.
- [NLLB-MoE] Adds the moe model by @ArthurZucker in #22024
Serializing 8bit models
- [bnb] Let's make serialization of int8 models possible by @younesbelkada in #22177
You can now push 8bit models and/or load 8bit models directly from the Hub, save memory and load your 8bit models faster! An example repo here
Breaking Changes
Ordering of height and width for the BLIP image processor
Notes from the PR:
The BLIP image processor incorrectly passed in the dimensions to resize in the order (width, height). This is reordered to be correct.
In most cases, this won't have an effect as the default height and width are the same. However, this is not backwards compatible for custom configurations with different height, width settings and direct calls to the resize method with different height, width values.
- 🚨🚨🚨 Fix ordering of height, width for BLIP image processor by @amyeroberts in #22466
Prefix tokens for the NLLB tokenizer
The big problem was the prefix and suffix tokens of the NLLB tokenizer.
Previous behaviour:
>>> from transformers import NllbTokenizer
>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
>>> tokenizer("How was your day?").input_ids
[13374, 1398, 4260, 4039, 248130, 2, 256047]
>>> # 2: '</s>'
>>> # 256047 : 'eng_Latn'
New behaviour:
>>> from transformers import NllbTokenizer
>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
>>> tokenizer("How was your day?").input_ids
[256047, 13374, 1398, 4260, 4039, 248130, 2]
In case you have pipelines that were relying on the old behavior, here is how you would enable it once again:
>>> from transformers import NllbTokenizer
>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", legacy_behaviour=True)
- 🚨🚨🚨 [NLLB Tokenizer] Fix the prefix tokens 🚨🚨🚨 by @ArthurZucker in #22313
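The difference between the two layouts can be reproduced with a small stand-alone function. This is a toy reconstruction built only from the ids shown in the outputs above, not the tokenizer's actual code, and the function name is hypothetical.

```python
EOS_ID = 2            # '</s>' in the outputs above
ENG_LATN_ID = 256047  # 'eng_Latn' in the outputs above


def add_special_tokens(token_ids, lang_id=ENG_LATN_ID, legacy_behaviour=False):
    """Wrap plain token ids with the special tokens of either layout."""
    if legacy_behaviour:
        # Old layout: no prefix, suffix is [eos, language code].
        return token_ids + [EOS_ID, lang_id]
    # New layout: prefix is [language code], suffix is [eos].
    return [lang_id] + token_ids + [EOS_ID]


text_ids = [13374, 1398, 4260, 4039, 248130]  # "How was your day?" per the outputs above
print(add_special_tokens(text_ids, legacy_behaviour=True))
# [13374, 1398, 4260, 4039, 248130, 2, 256047]
print(add_special_tokens(text_ids))
# [256047, 13374, 1398, 4260, 4039, 248130, 2]
```

Downstream code that indexes the language code at the end of the sequence is the kind of pipeline that would need the legacy flag.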
TensorFlow ports
The BLIP model is now available in TensorFlow.
- Add TF port of BLIP by @Rocketknight1 in #22090
Export TF Generate with a TF tokenizer
As the title says, this PR adds the possibility to export TF generate with a TF-native tokenizer -- the full thing in a single TF graph.
Task guides
A new task guide has been added, focusing on depth-estimation.
- Depth estimation task guide by @MKhalusova in #22205
Bugfixes and improvements
- Load optimizer state on CPU to avoid CUDA OOM by @sgugger in #22159
- Run all tests by default by @sgugger in #22162
- Fix: unfinished_sequences with correct device by @Stxr in #22184
- Revert 22152 MaskedImageCompletionOutput changes by @amyeroberts in #22187
- Regression pipeline device by @sgugger in #22190
- Update BridgeTowerForContrastiveLearning by @abhiwand in #22145
- t5 remove data dependency by @prathikr in #22097
- Fix DeepSpeed CI by @ydshieh in #22194
- Fix typo in Align docs by @alaradirik in #22199
- Update expected values in MgpstrModelIntegrationTest by @ydshieh in #22195
- Italian Translation of migration.mdx by @Baelish03 in #22183
- Update tiny model creation script by @ydshieh in #22202
- Temporarily fix ONNX model exporting error by @SatyaJandhyalaAtMS in #21830
- [XGLM] Add accelerate support for XGLM by @younesbelkada in #22207
- fixes a typo in WhisperFeatureExtractor docs. by @susnato in #22208
- Hotfix for natten issue with torch 2.0.0 on CircleCI by @ydshieh in #22218
- fix typos in llama.mdx by @keturn in #22223
- fix code example in mgp-str doc by @wdp-007 in #22219
- Use dash==2.8.1 for now for daily CI by @ydshieh in #22227
- LLaMA house-keeping by @sgugger in #22216
- fix AutoTP in deepspeed could not work for bloom by @sywangyi in #22196
- Add LlamaForSequenceClassification by @lewtun in #22209
- Removed .mdx extension in two links by @MKhalusova in #22230
- fix(docs): fix task guide links in model docs by @Seb0 in #22226
- Fix natten by @alihassanijr in #22229
- Revert "Use dash==2.8.1 for now for daily CI" by @ydshieh in #22233
- Fix Unnecessary move of tensors from CPU to GPU in LlamaRotaryEmbedding by @ma787639046 in #22234
- [trainer] param count for deepspeed zero3 by @stas00 in #22193
- Update training_args.py -- a nightly install is not required anymore for torch.compile by @pminervini in #22266
- [Docs] fix typos in some tokenizer docs by @yesinkim in #22256
- Italian translation perf_infer_cpu by @nickprock in #22243
- [Trainer] Add optional communication backends for torch.distributed when using GPU by @heya5 in #22247
- Fix the gradient checkpointing bug of the llama model by @yqy2001 in #22270
- Fix balanced and auto device_map by @sgugger in #22271
- Rework a bit the LLaMA conversion script by @sgugger in #22236
- Proper map location for optimizer load by @sgugger in #22273
- Fix doc links by @amyeroberts in #22274
- Move torch.compile() wrapping after DDP/FSDP wrapping to ensure correct graph breaks during training by @ani300 in #22279
- Example of pad_to_multiple_of for padding and truncation guide & docstring update by @MKhalusova in #22278
- Update vision docstring bool masked pos by @amyeroberts in #22237
- replace_8bit_linear modules_to_not_convert default value fix by @BlackSamorez in #22238
- Fix error in mixed precision training of TFCvtModel by @gcuder in #22267
- More doctests by @ydshieh in #22268
- fix more doctests by @ydshieh in #22292
- Add translation perf_infer_gpu_one for it by @davidegazze in #22296
- Restore fp16 support on xla gpu device by @ymwangg in #22300
- Correct NATTEN function signatures and force new version by @alihassanijr in #22298
- [deepspeed] offload + non-cpuadam optimizer exception doc by @stas00 in #22044
- Final update of doctest by @ydshieh in #22299
- Add MaskedImageModelingOutput by @alaradirik in #22212
- Enable traced model for text-generation task by @jiqing-feng in #22265
- add low_cpu_mem_usage option in run_clm.py example which will benefit… by @sywangyi in #22288
- fix: Allow only test_file in pytorch and flax summarization by @connor-henderson in #22293
- Fix position embeddings for GPT-J and CodeGen by @njhill in #22069
- Fixed bug to calculate correct xpath_sub_list in MarkupLMTokenizer by @silentghoul-spec in #22302
- Enforce max_memory for device_map strategies by @sgugger in #22311
- Beef up Llama tests by @gante in #22314
- docs: Resolve incorrect type typo in trainer methods by @tomaarsen in #22316
- Chunkable token classification pipeline by @luccailliau in #21771
- Fix PipelineTests skip conditions by @ydshieh in #22320
- [deepspeed zero3] need generate(synced_gpus=True, ...) by @stas00 in #22242
- [gptj] support older pytorch version by @stas00 in #22325
- Move common properties to BackboneMixin by @amyeroberts in #21855
- Backbone add mixin tests by @amyeroberts in #22542
- Backbone add out indices by @amyeroberts in #22493
- [MBart] Add accelerate support for MBart by @younesbelkada in #22309
- Fixed gradient checkpoint bug for TimeSeriesTransformer by @mollerup23 in #22272
- Mention why one needs to specify max_steps in Trainer by @lhoestq in #22333
- Fix various imports by @sgugger in #22281
- Minor typo in pipeline FillMaskPipeline's documentation. by @SamuelLarkin in #22339
- Added type hints to TFDeiTModel by @Batese2001 in #22327
- Fix --bf16 option support for Neuron after PR #22300 by @jeffhataws in #22307
- Generate: add test for left-padding support by @gante in #22322
- Enable training Llama with model or pipeline parallelism by @kooshi in #22329
- Automatically create/update tiny models by @ydshieh in #22275
- [HFTracer] Make embeddings ops take on the dtype of the weight by @jamesr66a in #22347
- Fix...