KnowingNothing

A low-latency & high-throughput serving engine for LLMs

Python 61 7 Updated Jun 30, 2024

MAGIS: Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN (ASPLOS'24)

Python 28 2 Updated May 29, 2024

A list of tutorials, papers, talks, and open-source projects for emerging compilers and architectures

337 29 Updated Jun 8, 2024

Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".

Python 255 22 Updated Nov 3, 2023

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

Python 1,616 82 Updated Jan 21, 2024

A multi-level tensor algebra superoptimizer

C++ 262 16 Updated Jul 3, 2024

A scalable and robust tree-based speculative decoding algorithm

Python 280 29 Updated Jun 7, 2024

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Python 139 12 Updated Jun 26, 2024

A "large" language model running on a microcontroller

C++ 459 32 Updated Dec 9, 2023

Grok open release

Python 49,143 8,314 Updated May 29, 2024

"Airport" recommendations and reviews ("airport" here is slang for a proxy service provider)

2,567 69 Updated Jul 1, 2024

A PyTorch Native LLM Training Framework

Python 477 19 Updated May 31, 2024

Minimalist ML framework for Rust

Rust 14,260 799 Updated Jul 2, 2024

[ARCHIVED] The C++ parallel algorithms library. See https://github.com/NVIDIA/cccl

C++ 4,879 758 Updated Feb 8, 2024

Optimized primitives for collective multi-GPU communication

C++ 2,955 756 Updated Jul 1, 2024

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Cuda 218 15 Updated Jul 2, 2024

An extension of TVMScript for writing simple, high-performance GPU kernels with Tensor Cores.

Python 46 3 Updated Jun 25, 2024

Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.

Python 5,340 484 Updated Jul 3, 2024

Modern C++ Programming Course (C++03/11/14/17/20/23/26)

HTML 11,393 754 Updated May 16, 2024

FlexFlow Serve: Low-Latency, High-Performance LLM Serving

C++ 1,581 218 Updated Jul 3, 2024

Universal LLM Deployment Engine with ML Compilation

Python 17,695 1,408 Updated Jul 2, 2024

18 Lessons, Get Started Building with Generative AI 🔗 https://microsoft.github.io/generative-ai-for-beginners/

Jupyter Notebook 53,824 27,842 Updated Jul 3, 2024

Serving multiple LoRA-finetuned LLMs as one

Python 884 40 Updated May 8, 2024

TileFlow is a performance analysis tool based on Timeloop for fusion dataflows

C++ 51 5 Updated Apr 12, 2024

FlashInfer: Kernel Library for LLM Serving

Cuda 757 62 Updated Jul 3, 2024

Ongoing research training transformer models at scale

Python 9,275 2,091 Updated Jul 1, 2024

🦜🔗 Build context-aware reasoning applications

Python 88,467 13,890 Updated Jul 3, 2024

Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)

Jupyter Notebook 387 26 Updated Dec 2, 2023

LlamaIndex is a data framework for your LLM applications

Python 33,234 4,640 Updated Jul 3, 2024