🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy-to-use hardware optimization tools
Unify Efficient Fine-Tuning of 100+ LLMs
Faster Whisper transcription with CTranslate2
Train, Evaluate, Optimize, Deploy Computer Vision Models via OpenVINO™
The official PyTorch implementation of "LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models"; also an efficient LLM compression toolkit offering various advanced compression methods and support for multiple inference backends.
Neural Network Compression Framework for enhanced OpenVINO™ inference
🤗 Optimum Intel: Accelerate inference with Intel optimization tools
AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
SOTA weight-only quantization algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs"
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
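To make the low-bit formats above concrete, here is a minimal, hypothetical sketch of symmetric per-tensor INT8 weight quantization in plain Python. The function names and the single-scale scheme are illustrative only, not the API of any library listed here.

```python
# Hypothetical sketch: symmetric per-tensor INT8 quantization.
# One scale maps floats into [-128, 127]; dequantization recovers
# an approximation with error bounded by scale / 2 per weight.

def quantize_int8(weights):
    """Map float weights to int8 values using a single scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.03, 1.27]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
```

Real libraries such as those above refine this idea with per-channel scales, asymmetric zero points, and calibration data, but the round-and-clip core is the same.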
Sparsity-aware deep learning inference runtime for CPUs
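A sparsity-aware runtime gains speed by never touching zero weights. The toy sketch below (hypothetical names, plain Python, not the runtime's actual code) shows the idea with a compressed row and a dot product over nonzeros only.

```python
# Hypothetical sketch: skip multiplies by zero by storing only
# (index, value) pairs for the nonzero weights of a row.

def compress_row(dense_row):
    """Keep (index, value) pairs for nonzero weights only."""
    return [(i, w) for i, w in enumerate(dense_row) if w != 0.0]

def sparse_dot(compressed_row, activations):
    """Dot product that touches only the stored nonzeros."""
    return sum(w * activations[i] for i, w in compressed_row)

row = [0.0, 2.0, 0.0, 0.0, -1.0, 0.0]   # ~67% sparse
acts = [1.0, 3.0, 5.0, 7.0, 2.0, 4.0]
packed = compress_row(row)               # 2 entries instead of 6
result = sparse_dot(packed, acts)        # 2*3 + (-1)*2 = 4.0
```

Production runtimes add blocked sparsity patterns and vectorized kernels so the skipped work translates into real CPU throughput, but the arithmetic saving comes from exactly this skipping.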
Efficient computing methods developed by Huawei Noah's Ark Lab
A beginner-friendly tutorial on model compression (in Chinese)
Self-Created Tools to convert ONNX files (NCHW) to TensorFlow/TFLite/Keras format (NHWC). The purpose of this tool is to solve the massive Transpose extrapolation problem in onnx-tensorflow (onnx-tf). I don't need a Star, but give me a pull request.
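The NCHW→NHWC change such converters perform is an axis permutation. As a minimal, hypothetical illustration (plain nested lists rather than any converter's real code), moving the channel axis last looks like this:

```python
# Hypothetical sketch: permute a [N][C][H][W] nested list into
# [N][H][W][C], the layout change at the heart of ONNX-to-TFLite
# conversion (real tools do this on tensors, not Python lists).

def nchw_to_nhwc(tensor):
    """Transpose a [N][C][H][W] nested list to [N][H][W][C]."""
    return [
        [
            [
                [tensor[n][c][h][w] for c in range(len(tensor[n]))]
                for w in range(len(tensor[n][0][h]))
            ]
            for h in range(len(tensor[n][0]))
        ]
        for n in range(len(tensor))
    ]

# One image, 2 channels, a 1x2 spatial grid: shape (1, 2, 1, 2).
x = [[[[1, 2]], [[3, 4]]]]
y = nchw_to_nhwc(x)   # shape (1, 1, 2, 2): channels now innermost
```

Doing this once per tensor, instead of inserting a Transpose op around every layer, is what avoids the blow-up of Transpose nodes the description refers to.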
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
Dataflow compiler for QNN inference on FPGAs
Contrastive-LSH Embedding and Tokenization Technique for Multivariate Time Series Classification
Brevitas: neural network quantization in PyTorch