Stars
A low-latency & high-throughput serving engine for LLMs
MAGIS: Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN (ASPLOS'24)
A list of tutorials, papers, talks, and open-source projects for emerging compilers and architectures
Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
A scalable and robust tree-based speculative decoding algorithm
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
A "large" language model running on a microcontroller
[ARCHIVED] The C++ parallel algorithms library. See https://github.com/NVIDIA/cccl
Optimized primitives for collective multi-GPU communication
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
An extension of TVMScript for writing simple, high-performance GPU kernels with Tensor Cores.
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
Modern C++ Programming Course (C++03/11/14/17/20/23/26)
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
Universal LLM Deployment Engine with ML Compilation
18 Lessons, Get Started Building with Generative AI 🔗 https://microsoft.github.io/generative-ai-for-beginners/
TileFlow is a performance analysis tool based on Timeloop for fusion dataflows
FlashInfer: Kernel Library for LLM Serving
Ongoing research training transformer models at scale
🦜🔗 Build context-aware reasoning applications
Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)
LlamaIndex is a data framework for your LLM applications