llama.cpp

GGUF
Filename extension	.gguf
Magic number	0x47 0x47 0x55 0x46
Developed by	Georgi Gerganov and community
Initial release	August 22, 2023; 11 months ago
Latest release	v3
Type of format	Machine-learning tensors

llama.cpp
Original author(s)	Georgi Gerganov
Developer(s)	Georgi Gerganov and community
Initial release	March 10, 2023; 16 months ago
Repository	github.com/ggerganov/llama.cpp
Written in	C++, C
Type	Library, CLI, and Web server for Large language models
License	MIT License

llama.cpp is an open-source software library, written mostly in C++, that performs inference on various large language models such as Llama.^[3] It is co-developed alongside the ggml library, a general-purpose tensor library.^[4]

History

Georgi Gerganov began developing llama.cpp in March 2023 as an implementation of the Llama inference code in pure C/C++ with no dependencies. This improved performance on computers without a GPU or other dedicated hardware.^[3]^[5] As of July 2024, it has 61 thousand stars on GitHub.^[6] Before llama.cpp, Gerganov worked on a similar library called whisper.cpp,^[7] which implements Whisper, a speech-to-text model by OpenAI. llama.cpp gained traction with users who lacked specialized hardware, as it could run on just a CPU, including on Android devices.^[5]

llamafile, created by Mozilla using the cosmopolitan tool created by Justine Tunney, bundles llama.cpp with the model into a single executable file.^[8] Tunney et al. introduced new optimized matrix multiplication kernels for x86 and ARM CPUs, improving prompt evaluation performance for FP16 and 8-bit quantized data types.^[9]^[10]^[11]

Architecture

llama.cpp could initially run only on CPUs but can now run on GPUs using several back-ends, including Vulkan and SYCL. These back-ends make up the GGML tensor library, which is used by the front-end, model-specific llama.cpp code.^[12] llama.cpp supports ahead-of-time model quantization as opposed to on-the-fly quantization.^[13] llama.cpp uses several CPU extensions for optimization: AVX, AVX2 and AVX-512 for x86-64, and Neon on ARM. Apple silicon is an important target for the project.^[6]^[11]

GGUF file format

The GGUF file format is a binary format used by llama.cpp that stores both tensors and metadata in a single file.^[16] It was created to better maintain backwards compatibility as llama.cpp expanded its support for other model architectures.^[17]
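
As an illustration of how the magic number identifies a GGUF file, the following minimal Python sketch checks the first four bytes and then reads the fixed-size fields that follow them. The field layout used here (a 32-bit version followed by 64-bit tensor and metadata key/value counts, all little-endian) is an assumption based on the published GGUF v2/v3 specification and is not stated in this article. The function name read_gguf_header is hypothetical, and the sample bytes are synthetic rather than taken from a real model file.

```python
import struct

# The four magic bytes 0x47 0x47 0x55 0x46 spell "GGUF" in ASCII.
GGUF_MAGIC = b"GGUF"

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size header at the start of a GGUF file.

    Assumed layout (GGUF v2/v3, little-endian):
    4-byte magic, uint32 version, uint64 tensor count,
    uint64 metadata key/value count.
    """
    if len(data) < 24:
        raise ValueError("too short to contain a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file (bad magic number)")
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }

# Synthetic header for demonstration only: version 3, 2 tensors, 5 metadata entries.
sample = GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5)
header = read_gguf_header(sample)
print(header)
```

For a file on disk, the same function could be applied to the first 24 bytes read in binary mode. The metadata entries and tensor descriptions that follow the header are not parsed by this sketch.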

GGUF files are typically created by converting models developed with a different machine learning library such as PyTorch, although fine-tuning is supported natively.^[18]

The format focuses on quantization, the act of reducing precision in the model weights. This can reduce memory usage and increase speed at the expense of model accuracy.^[19]^[17]
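
The trade-off between precision and size can be illustrated with a simplified sketch of 8-bit quantization, in which a block of floating-point weights shares one scale factor and each weight is stored as a small integer. This is an illustrative scheme written for this example, not the code or the exact block formats used by llama.cpp. The function names and sample weights are hypothetical.

```python
def quantize_block(values):
    """Map a block of floats to integers in [-127, 127] with one shared scale."""
    amax = max(abs(v) for v in values)
    # Avoid dividing by zero when every weight in the block is zero.
    scale = amax / 127 if amax > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return scale, quantized

def dequantize_block(scale, quantized):
    """Recover approximate floats from the stored integers and scale."""
    return [scale * q for q in quantized]

# Hypothetical weights chosen for illustration.
weights = [0.5, -1.27, 0.0, 0.31]
scale, quantized = quantize_block(weights)
restored = dequantize_block(scale, quantized)

# The rounding step is where accuracy is lost: each restored weight
# can differ from the original by up to half of the scale.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, max_error)
```

Each weight now occupies 8 bits instead of 32, plus a share of the block's scale factor. Using fewer bits per weight shrinks the file further but increases the rounding error.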

Supported data types

GGUF supports common floating-point data formats float32, float16, and bfloat16, as well as 1.5-bit and 2-bit to 8-bit quantized integer types.
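
A rough sense of what these data types mean for file size comes from multiplying the number of weights by the bits per weight. The sketch below does this arithmetic for a hypothetical model with 7 billion parameters. It is an estimate under simplifying assumptions: it ignores metadata and the per-block scale factors that quantized types store, so real files are somewhat larger. The function name is hypothetical.

```python
def estimated_weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, ignoring metadata and scales."""
    return num_params * bits_per_weight / 8

NUM_PARAMS = 7_000_000_000  # hypothetical model size, chosen for illustration

for name, bits in [
    ("float32", 32),
    ("float16", 16),
    ("8-bit", 8),
    ("4-bit", 4),
    ("2-bit", 2),
    ("1.5-bit", 1.5),
]:
    gigabytes = estimated_weight_bytes(NUM_PARAMS, bits) / 1e9
    print(f"{name:>8}: about {gigabytes:.2f} GB")
```

Under these assumptions, moving from float16 to a 4-bit type cuts the estimated weight storage to a quarter.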

Supported models

References

[1] "Initial release · ggerganov/llama.cpp@26c0846". GitHub. Retrieved 15 May 2024.
[2] "llama.cpp/LICENSE at master · ggerganov/llama.cpp". GitHub.
[3] Connatser, Matthew. "How this open source LLM chatbot runner hit the gas on x86, Arm CPUs". theregister.com. Retrieved 15 April 2024.
[4] Gerganov, Georgi (17 May 2024). "ggerganov/ggml".
[5] Edwards, Benj (13 March 2023). "You can now run a GPT-3-level AI model on your laptop, phone, and Raspberry Pi". arstechnica.com. Retrieved 15 April 2024.
[6] "ggerganov/llama.cpp". GitHub.
[7] "ggerganov/whisper.cpp". GitHub.
[8] Papp, Donald (3 December 2023). "Mozilla Lets Folks Turn AI LLMs Into Single-File Executables". Hackaday. Retrieved 27 July 2024.
[9] Connatser, Matthew. "Llamafile LLM driver project boosts performance on CPU cores". www.theregister.com. Retrieved 10 May 2024.
[10] Tunney, Justine. "LLaMA Now Goes Faster on CPUs". justine.lol. Retrieved 24 July 2024.
[11] Larabel, Michael. "Llamafile 0.7 Brings AVX-512 Support: 10x Faster Prompt Eval Times For AMD Zen 4". www.phoronix.com.
[12] Pounder, Les (25 March 2023). "How To Create Your Own AI Chatbot Server With Raspberry Pi 4". tomshardware.com. Retrieved 16 April 2024.
[13] Walkowiak, Bartosz; Walkowiak, Tomasz (2024). "Implementation of language models within an infrastructure designed for Natural Language Processing" (PDF). International Journal of Electronics and Telecommunications. 70 (1): 153–159. doi:10.24425/ijet.2024.149525. Retrieved 8 May 2024.
[14] "GGUF by ggerganov · Pull Request #2398 · ggerganov/llama.cpp". GitHub.
[15] "ggml/docs/gguf.md at master · ggerganov/ggml". GitHub.
[16] "GGUF". huggingface.co. Retrieved 9 May 2024.
[17] Mucci, Tim (3 July 2024). "GGUF versus GGML". www.ibm.com. Retrieved 26 July 2024.
[18] Boykis, Vicki (28 February 2024). "GGUF, the long way around". Vicki Boykis. Retrieved 26 July 2024.
[19] Labonne, Maxime (29 November 2023). "Quantize Llama models with GGUF and llama.cpp". Medium. Towards Data Science. Retrieved 9 May 2024.
