fix(deps): update dependency transformers to v4.48.1 #196
Open
renovate wants to merge 1 commit into main from renovate/transformers-4.x
Blocked by this issue: explosion/spaCy#13649
renovate bot force-pushed the renovate/transformers-4.x branch from 310a637 to 88e0cf6 on October 15, 2024 09:04
renovate bot changed the title from "fix(deps): update dependency transformers to v4.45.2" to "fix(deps): update dependency transformers to v4.46.0" on Oct 24, 2024
renovate bot force-pushed the renovate/transformers-4.x branch 2 times, most recently from a56915d to 4fb4b29 on October 29, 2024 18:39
renovate bot changed the title from "fix(deps): update dependency transformers to v4.46.0" to "fix(deps): update dependency transformers to v4.46.1" on Oct 29, 2024
renovate bot force-pushed the renovate/transformers-4.x branch from 4fb4b29 to 63d9975 on November 5, 2024 21:42
renovate bot changed the title from "fix(deps): update dependency transformers to v4.46.1" to "fix(deps): update dependency transformers to v4.46.2" on Nov 5, 2024
renovate bot force-pushed the renovate/transformers-4.x branch from 63d9975 to 4d29005 on November 6, 2024 18:33
renovate bot requested review from pigri, krichard1212 and waroca as code owners on November 6, 2024 18:33
renovate bot force-pushed the renovate/transformers-4.x branch 3 times, most recently from 935a07d to e4bca8a on November 6, 2024 20:26
renovate bot force-pushed the renovate/transformers-4.x branch from e4bca8a to 0ab21e1 on November 19, 2024 00:07
renovate bot changed the title from "fix(deps): update dependency transformers to v4.46.2" to "fix(deps): update dependency transformers to v4.46.3" on Nov 19, 2024
renovate bot force-pushed the renovate/transformers-4.x branch 2 times, most recently from 77a84e6 to d847960 on December 5, 2024 19:50
renovate bot changed the title from "fix(deps): update dependency transformers to v4.46.3" to "fix(deps): update dependency transformers to v4.47.0" on Dec 5, 2024
renovate bot force-pushed the renovate/transformers-4.x branch from d847960 to 8c48a28 on December 17, 2024 20:33
renovate bot changed the title from "fix(deps): update dependency transformers to v4.47.0" to "fix(deps): update dependency transformers to v4.47.1" on Dec 17, 2024
renovate bot force-pushed the renovate/transformers-4.x branch from 8c48a28 to 4243d97 on January 10, 2025 15:58
renovate bot changed the title from "fix(deps): update dependency transformers to v4.47.1" to "fix(deps): update dependency transformers to v4.48.0" on Jan 10, 2025
renovate bot force-pushed the renovate/transformers-4.x branch 2 times, most recently from 6339d86 to 43c36db on January 20, 2025 17:18
renovate bot changed the title from "fix(deps): update dependency transformers to v4.48.0" to "fix(deps): update dependency transformers to v4.48.1" on Jan 20, 2025
renovate bot force-pushed the renovate/transformers-4.x branch from 43c36db to e1f43d9 on January 23, 2025 12:58
Quality Gate passed
This PR contains the following updates:
transformers: 4.45.1 -> 4.48.1
Release Notes
huggingface/transformers (transformers)
v4.48.1: Patch release v4.48.1
Yet another gradient accumulation fix! The attention refactoring also let a small typo slip in; we made sure Phi is no longer broken.
Moonshine had a small issue when wrapping generate, so we removed that! 🤗
v4.48.0: ModernBERT, Aria, TimmWrapper, ColPali, Falcon3, Bamba, VitPose, DinoV2 w/ Registers, Emu3, Cohere v2, TextNet, DiffLlama, PixtralLarge, Moonshine
New models
ModernBERT
The ModernBERT model was proposed in Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard and Iacopo Poli.
It is a refresh of the traditional encoder architecture, as used in previous models such as BERT and RoBERTa.
It builds on BERT and implements many modern architectural improvements which have been developed since its original release, such as:
Aria
The Aria model was proposed in Aria: An Open Multimodal Native Mixture-of-Experts Model by Li et al. from the Rhymes.AI team.
Aria is an open multimodal-native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. It has a Mixture-of-Experts architecture, with respectively 3.9B and 3.5B activated parameters per visual token and text token.
TimmWrapper
We add a TimmWrapper set of classes so that timm models can be loaded into the library as transformers models.
Thanks to this, timm models now have access to pipelines, as well as Trainer, accelerate device maps, quantization, etc.
Pixtral-Large
Pixtral modeling and checkpoint conversion code has been updated to support the new Pixtral-Large model.
ColPali
The ColPali model was proposed in ColPali: Efficient Document Retrieval with Vision Language Models by Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution). Work led by ILLUIN Technology.
In the proposed ColPali approach, the authors leverage VLMs to construct efficient multi-vector embeddings directly from document images (“screenshots”) for document retrieval. They train the model to maximize the similarity between these document embeddings and the corresponding query embeddings, using the late interaction method introduced in ColBERT.
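The late interaction scoring mentioned above can be sketched in a few lines. This is a minimal illustration of ColBERT-style MaxSim on toy 2-d vectors, not the ColPali implementation: the function names, document names, and embedding values are all made up for the example.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """MaxSim: for each query vector keep its best-matching document vector, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_documents(query_vecs: List[Vector], docs: Dict[str, List[Vector]]) -> List[str]:
    """Return document names sorted from highest to lowest late-interaction score."""
    scores = {name: late_interaction_score(query_vecs, vecs) for name, vecs in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy multi-vector embeddings (hypothetical values): one vector per query token
# and one vector per document image patch.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "invoice_page": [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    "cover_page": [[0.1, 0.1], [0.3, 0.2]],
}
ranking = rank_documents(query, docs)
```

Because every query vector is matched independently, a document scores well only if it contains a good match for each part of the query, which is the property that makes multi-vector retrieval attractive for page images.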
Falcon3
Falcon3 represents a natural evolution from previous releases, emphasizing expanding the models’ science, math, and code capabilities. This iteration includes five base models: Falcon3-1B-Base, Falcon3-3B-Base, Falcon3-Mamba-7B-Base, Falcon3-7B-Base, and Falcon3-10B-Base. In developing these models, the authors incorporated several key innovations aimed at improving the models’ performances while reducing training costs:
One pre-training: They conducted a single large-scale pretraining run on the 7B model, using 2048 H100 GPU chips, leveraging 14 trillion tokens featuring web, code, STEM, and curated high-quality and multilingual data.
Depth up-scaling for improved reasoning: Building on recent studies on the effects of model depth, they upscaled the 7B model to a 10B-parameter model by duplicating the redundant layers and continuing pre-training with 2TT of high-quality data. This yielded Falcon3-10B-Base, which achieves state-of-the-art zero-shot and few-shot performance for models under 13B parameters.
Knowledge distillation for better tiny models: To provide compact and efficient alternatives, they developed Falcon3-1B-Base and Falcon3-3B-Base by leveraging pruning and knowledge distillation techniques, using less than 100GT of curated high-quality data, thereby redefining pre-training efficiency.
Bamba
Bamba-9B is a decoder-only language model based on the Mamba-2 architecture and is designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage training approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it undergoes additional training on 200 billion tokens, leveraging a carefully curated blend of high-quality data to further refine its performance and enhance output quality.
Check out all Bamba-9B model checkpoints here.
VitPose
ViTPose is a state-of-the-art vision transformer-based model for human pose estimation, introduced by Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao in "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation".
The model leverages the capabilities of vision transformers to accurately predict 2D human keypoints. Adopting a top-down approach, ViTPose estimates keypoint locations for each detected person, allowing it to be easily used with any object detection model.
DINOv2 with registers
The DINOv2 with Registers model was proposed in Vision Transformers Need Registers by Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski.
The Vision Transformer (ViT) is a transformer encoder model (BERT-like) originally introduced to do supervised image classification on ImageNet.
Next, people figured out ways to make ViT work really well on self-supervised image feature extraction (i.e. learning meaningful features, also called embeddings) on images without requiring any labels. Some example papers here include DINOv2 and MAE.
The authors of DINOv2 noticed that ViTs have artifacts in attention maps. It’s due to the model using some image patches as “registers”. The authors propose a fix: just add some new tokens (called “register” tokens), which you only use during pre-training (and throw away afterwards). This results in:
Emu3
The Emu3 model was proposed in Emu3: Next-Token Prediction is All You Need by Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang.
Emu3 sets a new standard in multimodal AI by using next-token prediction to handle images, text, and videos. It simplifies multimodal modeling by tokenizing all data into a unified format and training a single transformer. Visual data is tokenized using vector quantization methods based on VQ-VAE model. Discretized visual tokens are later fused with text token ids for image and text generation.
Emu3 outperforms leading models like SDXL and LLaVA-1.6 in both generation and perception tasks, without relying on diffusion or compositional methods.
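To make the tokenization step concrete, here is a small sketch of vector quantization followed by fusion with text token ids. It only illustrates the general VQ idea (nearest codebook entry); the codebook, the patch vectors, the text vocabulary size, and the offset scheme used to keep visual ids apart from text ids are assumptions for the example, not details taken from Emu3.

```python
from typing import List, Sequence

Vector = Sequence[float]

# Hypothetical size of the text vocabulary; visual codes are shifted past it.
TEXT_VOCAB_SIZE = 1000


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: squared_distance(v, codebook[i]))
        for v in vectors
    ]


def fuse(text_ids: List[int], visual_codes: List[int]) -> List[int]:
    """Build one token sequence, offsetting visual codes so they do not collide with text ids."""
    return list(text_ids) + [TEXT_VOCAB_SIZE + code for code in visual_codes]


# Toy data: a 3-entry codebook and three continuous patch features.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
patches = [[0.1, -0.1], [0.9, 1.2], [0.1, 0.8]]

codes = quantize(patches, codebook)
sequence = fuse([5, 42], codes)
```

Once images are reduced to discrete ids like this, a single transformer can be trained with ordinary next-token prediction over the combined sequence.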
Cohere2
A new Cohere update was added through a new "Cohere2" set of classes.
TextNet
TextNet is a lightweight and efficient architecture designed specifically for text detection, offering superior performance compared to traditional models like MobileNetV3. With variants TextNet-T, TextNet-S, and TextNet-B (6.8M, 8.0M, and 8.9M parameters respectively), it achieves an excellent balance between accuracy and inference speed.
DiffLlama
DiffLlama combines the Llama architecture with Differential Transformer's attention.
PixtralLarge
The conversion script needed a few updates, while the modeling code was barely changed!
Moonshine
Moonshine is an autoregressive speech recognition encoder-decoder model that improves upon Whisper's architecture. Namely, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper, which is restricted to fixed 30-second windows. It was introduced by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, and Pete Warden in Moonshine: Speech Recognition for Live Transcription and Voice Commands.
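The reason rotary embeddings are not tied to a fixed window can be seen in a short sketch: after rotation, the dot product between a query and a key depends only on their relative offset. The code below is a generic RoPE illustration on plain Python lists, assuming the common base of 10000 and an adjacent-pair layout; it is not Moonshine's implementation, which may pair dimensions differently.

```python
import math
from typing import List, Sequence


def rope(vec: Sequence[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs of `vec` by angles proportional to `position`."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    rotated: List[float] = []
    for i in range(0, dim, 2):
        # Pair i/2 rotates with frequency base^(-i/dim).
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return rotated


def attention_logit(q: Sequence[float], q_pos: int, k: Sequence[float], k_pos: int) -> float:
    """Dot product between a rotated query and a rotated key."""
    return sum(a * b for a, b in zip(rope(q, q_pos), rope(k, k_pos)))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (2) at very different absolute positions.
near_start = attention_logit(q, 3, k, 1)
far_along = attention_logit(q, 103, k, 101)
```

The two logits agree up to floating-point error, so nothing in the attention score encodes an absolute position that would have to fit inside a fixed-length table.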
Quantization methods
VPTQ Quantization
From the VPTQ contributors:
HIGGS Quantization
From the contributors:
Cleanup
We merged a cleanup for vision language models, to make sure all models are standardized.
Breaking changes
Conversion scripts
Many models in Transformers include scripts to convert the original model checkpoints into a Transformers-compatible format. These scripts can be found in the repo using the glob pattern models/**/convert_*.py. They were a recurring source of vulnerability reports and CVEs because many models were originally released using insecure formats like older PyTorch .bin weights or pickle files. The conversion scripts had to open these formats, and this meant that they were vulnerable to maliciously crafted inputs.
In practice, we do not see this as a serious vulnerability. The conversion scripts are never imported or called by the rest of the library; each script is standalone, and so the only way to exploit the vulnerability is to create a malicious checkpoint, induce a user to download it, and then also induce them to manually call a specific conversion script on it.
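As a quick illustration of what the glob pattern models/**/convert_*.py selects, the sketch below builds a throwaway directory tree and lists the matching files. The file names are hypothetical stand-ins, and the helper is only an example of applying the pattern with pathlib, not part of the Transformers build tooling.

```python
import tempfile
from pathlib import Path
from typing import List


def find_conversion_scripts(repo_root: Path) -> List[str]:
    """Return POSIX-style relative paths of files matching models/**/convert_*.py."""
    return sorted(
        path.relative_to(repo_root).as_posix()
        for path in repo_root.glob("models/**/convert_*.py")
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout: two conversion scripts and one regular modeling file.
    for rel in [
        "models/bert/convert_bert_checkpoint.py",
        "models/bert/modeling_bert.py",
        "models/nougat/convert_nougat_to_hf.py",
    ]:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()

    matches = find_conversion_scripts(root)
```

Only the two convert_*.py files are returned; the modeling file is left out, which mirrors the split between code that ships in a release and scripts that stay on the development branch.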
However, even if there is little practical risk of an exploit, we are aware that open vulnerability reports create a compliance problem for users, and so beginning with this release we will be excluding these conversion scripts from release branches and wheels. They will remain accessible to developers on the main branch.
Backtracking in Nougat
A regular expression used within the Nougat code has been modified to ensure it does not hang. The method should output the same results but we cannot guarantee it; we recommend upgrading to the latest transformers if you use this model to ensure your code is performance-optimized.
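For readers unfamiliar with why a regular expression can hang, here is a generic sketch of catastrophic backtracking. Both patterns below are hypothetical and are not the expressions used in Nougat: the first nests one quantifier inside another, so a near-miss input forces the engine to try exponentially many ways of splitting the run of characters, while the second accepts the same strings without nesting.

```python
import re

# Hypothetical patterns chosen for illustration only.
BACKTRACKING = re.compile(r"^(a+)+$")  # nested quantifiers: exponential work on near-misses
LINEAR = re.compile(r"^a+$")           # same accepted strings, no nested quantifier


def same_verdict(text: str) -> bool:
    """Check that both patterns agree on whether `text` matches."""
    return bool(BACKTRACKING.match(text)) == bool(LINEAR.match(text))


# The last sample is a short near-miss; much longer ones would make the
# nested pattern take noticeably long, so the input is kept small here.
samples = ["a", "aaaa", "", "aab", "a" * 16 + "b"]
agreement = [same_verdict(sample) for sample in samples]
```

Rewriting a pattern this way is expected to keep results identical on ordinary inputs, but as the note above says, equivalence on every edge case is hard to guarantee, so checking outputs after an upgrade is a reasonable precaution.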
Whisper decoding
This PR finalizes work that aims to enable short-form (< 30 secs) and long-form generation using temperature fallback. It is a significant improvement to the Whisper codebase, but it does result in the following breaking changes:
➡️ Previously:
• Short-form: Returned a ModelOutput or torch.LongTensor, including decoder input IDs and the EOS token ID.
• Long-form: Returned a Dict or torch.LongTensor, excluding decoder input IDs and the EOS token ID.
➡️ From now on:
Short-form and long-form generation are now treated identically, meaning output differentiation based on these modes is no longer applicable.
Decoder input IDs and EOS token IDs are never returned, except in two specific cases: when return_dict_in_generate=True and (return_timestamps=False or force_unique_generate_call=True).
In this case, the output will be a ModelOutput, which is the result of the underlying call to GenerationMixin’s generate. Indeed, return_timestamps=False ensures no seeking occurs; only a single call to generate is made. Therefore, this output includes both decoder input IDs and the EOS token ID.
Attention refactor
To have cleaner, isolated, future-proof code for the attention layers, they have been refactored so that each model's attention code stays within its own file, while attention definitions relating to SDPA, Flash Attention, and other types of attention have been moved to a common file.
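The general shape of such a refactor can be sketched as a shared registry of interchangeable attention functions that a layer looks up by name. Everything below is hypothetical and written with plain Python lists for a single query vector: the registry name, the two implementations, and the toy layer are illustrative and do not reproduce the library's actual file layout or API.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
AttentionFn = Callable[[Vector, List[Vector], List[Vector]], List[float]]


def _scaled_scores(q: Vector, keys: List[Vector]) -> List[float]:
    """Dot products between the query and each key, scaled by 1/sqrt(dim)."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, key)) for key in keys]


def eager_attention(q: Vector, keys: List[Vector], values: List[Vector]) -> List[float]:
    """Reference implementation: softmax over all scores, then a weighted sum of values."""
    scores = _scaled_scores(q, keys)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * val[j] for e, val in zip(exps, values)) for j in range(len(values[0]))]


def streaming_attention(q: Vector, keys: List[Vector], values: List[Vector]) -> List[float]:
    """Same result in one pass, keeping a running max and normaliser (online softmax)."""
    running_max, denom = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for score, val in zip(_scaled_scores(q, keys), values):
        new_max = max(running_max, score)
        correction = math.exp(running_max - new_max)  # rescales what was accumulated so far
        weight = math.exp(score - new_max)
        denom = denom * correction + weight
        acc = [a * correction + weight * x for a, x in zip(acc, val)]
        running_max = new_max
    return [a / denom for a in acc]


# Hypothetical shared registry living in one common module.
ATTENTION_FUNCTIONS: Dict[str, AttentionFn] = {
    "eager": eager_attention,
    "streaming": streaming_attention,
}


class ToyAttentionLayer:
    """Model-side code only picks an implementation by name."""

    def __init__(self, implementation: str = "eager") -> None:
        self.attention = ATTENTION_FUNCTIONS[implementation]

    def forward(self, q: Vector, keys: List[Vector], values: List[Vector]) -> List[float]:
        return self.attention(q, keys, values)


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
eager_out = ToyAttentionLayer("eager").forward(q, keys, values)
streaming_out = ToyAttentionLayer("streaming").forward(q, keys, values)
```

With this layout, adding or fixing an attention backend touches one shared place, while each model keeps only the small amount of code that is specific to it.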
Bugfixes and improvements
num_items_in_batch not being an integer by @xspirus in #35115
docs/source/ar/community.md into Arabic by @AhmedAlmaghz in #33027
AssistedCandidateGenerator for Improved Modularity and Reusability by @keyboardAnt and @jmamou in #35009
Thread for SF conversion by @ydshieh in #35236
rsfE with pytest by @ydshieh in #35119
benchmark job in push-important-models.yml by @ydshieh in #35292
benchmarks_entrypoint.py by @McPatate in #34495
text by @probicheaux in #35201
[docs] Add link to ModernBERT Text Classification GLUE finetuning script by @tomaarsen in #35347
[Mamba2] Fix caching, slow path, and multi-gpu by @vasqu in #35154
_make_causal_mask by @jiwoong-choi in #35291
weights_only=True with torch.load for transfo_xl by @ydshieh in #35241
test_generate_with_static_cache even less flaky by @ydshieh in #34995
is_causal is passed explicitly by @Cyrilvallez in #35390
PaliGemmaProcessor by @alvarobartt in #35278
.github/workflows/self-comment-ci.yml for now by @ydshieh in #35366
[GPTQ, CompressedTensors] Fix unsafe imports and metadata check by @vasqu in #34815
ACCELERATE_MIN_VERSION on error by @KSafran in #35189
model_accepts_loss_kwargs for timm model by @qubvel in #35257
sdpa_kernel by @jla524 in #35410
docs/source/ar/tasks/question_answering.md into Arabic by @AhmedAlmaghz in #35196
docs/source/ar/tasks/summarization.md into Arabic by @AhmedAlmaghz in #35195
sdpa_kernel by @jla524 in #35461
Significant community contributions
The following contributors have made significant changes to the library over the last release:
Thread for SF conversion (#35236)
rsfE with pytest (#35119)
benchmark job in push-important-models.yml (#35292)
weights_only=True with torch.load for transfo_xl (#35241)
test_generate_with_static_cache even less flaky (#34995)
.github/workflows/self-comment-ci.yml for now (#35366)
v4.47.1: Patch release v4.47.1
We waited a little bit to make sure it was stable, thanks @winglian for double checking and everyone for the fixes!
Fix GA loss bugs and add unit test (#35121)
Contributed by @techkang and @ArthurZucker.
Fix num_items_in_batch not being an integer (#35115)
Contributed by @xspirus.
Fix FSDP no longer working (#35212)
Contributed by @muellerzr.
Don't use no_sync when DeepSpeed doesn't support it for certain ZeRO configurations (#35212)
Contributed by @winglian.
Only import torch.distributed if it is available (#35133)
Contributed by @GaetanLepage.
[Whisper] Patch float type on MPS (#35295)
Contributed by @eustlb. 🔜 we should probably have MPS CIs to avoid repeating this!
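As background for the gradient accumulation (GA) loss fixes in this list, the sketch below shows the general pitfall that a batch-wide item count such as num_items_in_batch is meant to address: averaging the per-micro-batch mean losses is not the same as averaging over all items when micro-batches contain different numbers of items. This is an illustration of the arithmetic only, with made-up loss values and hypothetical function names; it does not reproduce the changes in the listed PRs.

```python
from typing import List


def mean_of_means(micro_batches: List[List[float]]) -> float:
    """Average each micro-batch first, then average those averages."""
    return sum(sum(batch) / len(batch) for batch in micro_batches) / len(micro_batches)


def normalized_by_item_count(micro_batches: List[List[float]]) -> float:
    """Sum every per-item loss and divide once by the total (integer) item count."""
    num_items_in_batch = sum(len(batch) for batch in micro_batches)
    return sum(sum(batch) for batch in micro_batches) / num_items_in_batch


# Two accumulation steps with unequal numbers of items (e.g. tokens).
micro_batches = [[2.0, 2.0, 2.0, 2.0], [6.0]]

naive = mean_of_means(micro_batches)             # (2.0 + 6.0) / 2
exact = normalized_by_item_count(micro_batches)  # 14.0 / 5
```

Here the naive value over-weights the single item in the short micro-batch, whereas the second value matches what one large batch containing all five items would produce.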
v4.47.0: PaliGemma-2, I-JEPA, OLMo-2, LayerSkip, Tensor Parallel
New models
PaliGemma-2
PaliGemma 2 and PaliGemma are lightweight open vision-language models (VLM) inspired by PaLI-3, and based on open components like the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as inputs and can answer questions about images with detail and context, meaning that PaliGemma can perform deeper analysis of images and provide useful insights, such as captioning for images and short videos, object detection, and reading text embedded within images.
PaliGemma 2 is available in 3B, 10B, and 28B parameter sizes, which are based on Gemma 2 2B, 9B, and 27B models, respectively. The original PaliGemma models are available in the 3B size. For more information on Gemma model variants, see the Gemma models list. PaliGemma model variants support different pixel resolutions for image inputs, including 224 x 224, 448 x 448, and 896 x 896 pixels.
I-JEPA
The I-JEPA model was proposed in Image-based Joint-Embedding Predictive Architecture by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas. I-JEPA is a self-supervised learning method that predicts the representations of one part of an image based on other parts of the same image. This approach focuses on learning semantic features without relying on pre-defined invariances from hand-crafted data transformations, which can bias specific tasks, or on filling in pixel-level details, which often leads to less meaningful representations.
OLMo 2
The OLMo2 model is the successor of the OLMo model, which was proposed in OLMo: Accelerating the Science of Language Models.
The architectural changes from the original OLMo model to this model are:
Commits:
Layer-Skip Llama
We add support for Meta's Layer-Skip Llama 3.2 1B model.
The Llama 3.2 1B model was continually pretrained with the LayerSkip recipe (early exit loss and layer dropout), as presented in Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, and is capable of performing self-speculative decoding: decode with earlier layers and verify with remaining layers.
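The draft-then-verify loop behind self-speculative decoding can be sketched with toy stand-ins for the two passes. The two next-token functions below are hypothetical: one plays the role of an early exit after a few layers, the other the full stack. A real system verifies all drafted tokens in a single batched forward pass and reuses cached activations, which this greedy sketch does not model.

```python
from typing import List


def full_model_next(tokens: List[int]) -> int:
    """Toy stand-in for the full layer stack: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def early_exit_next(tokens: List[int]) -> int:
    """Toy stand-in for an early exit: usually agrees, but is wrong after token 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


def greedy_decode(prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one full-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(full_model_next(tokens))
    return tokens[len(prompt):]


def self_speculative_decode(prompt: List[int], max_new_tokens: int, draft_len: int = 3) -> List[int]:
    """Draft with the early exit, then keep drafted tokens only while the full model agrees."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1) Draft a few tokens cheaply with the early-exit path.
        draft: List[int] = []
        for _ in range(draft_len):
            draft.append(early_exit_next(tokens + draft))
        # 2) Verify: the full model's token is always the one kept; stop at the
        #    first disagreement so later drafted tokens are discarded.
        for proposed in draft:
            expected = full_model_next(tokens)
            tokens.append(expected)
            if proposed != expected or len(tokens) == target_len:
                break
    return tokens[len(prompt):]


speculative = self_speculative_decode([1], max_new_tokens=8)
baseline = greedy_decode([1], max_new_tokens=8)
```

Because the verifier has the final say on every position, the output matches plain greedy decoding; the benefit in practice comes from accepting several drafted tokens per verification step.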
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.