[go: nahoru, domu]

Skip to content

lrw04/tinyllamas-ncnn

 
 

Repository files navigation

tinyllamas-ncnn

This is a repository hosting code for converting tinyllamas models into ncnn format and inference code using ncnn.

Changes to the model:

  • Removed batching to avoid tensors of rank 5 and up
  • Moved sampling into sample.py
  • Reverted to using manual implementation of flash attention
  • Always pad input to transformer to the full length of context length to avoid variable shape inputs
  • Applied workaround as described in Tencent/ncnn#4937

Usage

Export TorchScript

First put desired model listed in https://github.com/karpathy/llama2.c#models into /out/ckpt.pt. Then create a Python venv, enter it, and install Python dependencies:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Run sample.py and the resulting TorchScript will be in model.pt.

Convert model into ncnn format

pnnx model.pt inputshape=[xxx]i32

You need to replace xxx with the context length of your chosen model.

The resulting model.ncnn.bin and model.ncnn.param is the model in ncnn format.

Compile the inference binary

You can either compile it using CMake with ncnn as a dependency or link the library yourself. Before compiling, adjust ctx_length in tinyllamas.cpp to the context length of your model.

Compile with CMake

This follows the standard procedure for compiling CMake projects.

Link ncnn manually

c++ tinyllamas.cpp ~/ncnn/build/src/libncnn.a -I ~/ncnn/src -I ~/ncnn/build/src/ -o tinyllamas -fopenmp

Use the resulting binary

When run with a wrong number of arguments, the binary prints out usage information.

Use Meta's Llama 2 weights (untested)

We first convert Llama 2 weights into llama2.c's binary format using export_meta_llama_bin.py, then convert it into llama2.c's PyTorch format for export to TorchScript using bin2pt.py. Then go through the same steps as described above.

Note that due to the implementation being memory-inefficient, you might need more than 64 GB of memory even for converting the 7B model.

License

MIT

About

Inference TinyLlama models on ncnn

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 81.0%
  • C++ 17.6%
  • CMake 1.4%