tinyllamas-ncnn

This repository contains code for converting tinyllamas models into the ncnn format, along with inference code that runs them using ncnn.

Changes to the model:

Removed batching to avoid tensors of rank 5 and higher
Moved sampling out of the model into sample.py
Reverted to the manual attention implementation instead of flash attention
Always pad the transformer's input to the full context length to avoid variable-shape inputs
Applied the workaround described in Tencent/ncnn#4937
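The padding and sampling changes above can be illustrated with a minimal pure-Python sketch. It is not the code in sample.py: pad_to_context, toy_causal_model and sample are hypothetical names, pad_id=0 is an assumed padding token, and toy_causal_model is a stand-in for the exported model in which row i of the logits depends only on tokens 0..i, as a causal attention mask guarantees. Under that assumption, right-padding to the full context length leaves the logits at the last real position unchanged, so the next token is read from that row rather than from the final one, and sampling happens outside the model.

```python
import math
import random


def pad_to_context(tokens, ctx_length, pad_id=0):
    """Right-pad a prompt so the model always sees exactly ctx_length tokens."""
    if len(tokens) > ctx_length:
        raise ValueError("prompt is longer than the context length")
    return tokens + [pad_id] * (ctx_length - len(tokens))


def toy_causal_model(tokens, vocab_size=4):
    """Stand-in for the exported model: one row of logits per position.

    Row i depends only on tokens[0..i], mimicking a causal attention mask.
    """
    logits, running = [], 0
    for t in tokens:
        running += t
        logits.append([float((running + v) % vocab_size) for v in range(vocab_size)])
    return logits


def sample(logits_row, temperature=1.0, rng=random):
    """Pick a token id from one row of logits, outside the model."""
    if temperature == 0.0:
        # Greedy decoding: index of the largest logit.
        return max(range(len(logits_row)), key=logits_row.__getitem__)
    scaled = [x / temperature for x in logits_row]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # unnormalized softmax, numerically stable
    return rng.choices(range(len(logits_row)), weights=weights, k=1)[0]


prompt = [1, 2, 3]
padded = pad_to_context(prompt, ctx_length=8)
row = toy_causal_model(padded)[len(prompt) - 1]  # last real position, not the last row
assert row == toy_causal_model(prompt)[-1]       # the padding did not change it
print(sample(row, temperature=0.0))  # prints 1
```

Reading the row at len(prompt) - 1 is what makes a fixed-shape input usable for generation: the model always runs on ctx_length tokens, but only the prefix matters.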

Usage

Export TorchScript

First, download the model you want from the list at https://github.com/karpathy/llama2.c#models and save it as /out/ckpt.pt. Then create a Python virtual environment, activate it, and install the dependencies:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Run sample.py; the resulting TorchScript model is written to model.pt.

Convert model into ncnn format

pnnx model.pt inputshape=[xxx]i32

Replace xxx with the context length of your chosen model.
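If the checkpoint follows llama2.c's layout, the context length can be read from it instead of being looked up by hand. This sketch assumes the value is stored as checkpoint["model_args"]["max_seq_len"], as in checkpoints written by llama2.c's train.py; pnnx_command is a hypothetical helper that only builds the command string and does not run pnnx.

```python
def pnnx_command(checkpoint, model_path="model.pt"):
    """Build the pnnx invocation for a llama2.c-style checkpoint dict.

    Assumes the context length is stored as checkpoint["model_args"]["max_seq_len"],
    as in checkpoints written by llama2.c's train.py.
    """
    ctx_length = int(checkpoint["model_args"]["max_seq_len"])
    # Single unbatched input of ctx_length token ids, typed as 32-bit integers.
    return f"pnnx {model_path} inputshape=[{ctx_length}]i32"


# Example with a hand-written dict standing in for a loaded checkpoint.
print(pnnx_command({"model_args": {"max_seq_len": 256}}))
# prints: pnnx model.pt inputshape=[256]i32
```

With PyTorch installed, the real dict would come from torch.load on the checkpoint file with map_location="cpu".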

The resulting files model.ncnn.bin and model.ncnn.param contain the model in ncnn format.

Compile the inference binary

You can either compile it with CMake, using ncnn as a dependency, or link against the ncnn library manually. Before compiling, set ctx_length in tinyllamas.cpp to the context length of your model.
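If you switch between models often, this edit can be scripted. The following is a hypothetical sketch: it assumes tinyllamas.cpp assigns the context length on a single line in a form such as ctx_length = 256, so check the actual declaration before relying on it. set_ctx_length is an illustrative name, not part of the repository.

```python
import re

# Hypothetical pattern: matches an assignment such as "const int ctx_length = 256;".
# Verify the real declaration in tinyllamas.cpp before using this.
CTX_RE = re.compile(r"(\bctx_length\s*=\s*)(\d+)")


def set_ctx_length(source, ctx_length):
    """Return source with the first numeric ctx_length assignment replaced."""
    new_source, count = CTX_RE.subn(
        lambda m: m.group(1) + str(ctx_length), source, count=1
    )
    if count != 1:
        raise ValueError("no 'ctx_length = <number>' assignment found")
    return new_source


print(set_ctx_length("const int ctx_length = 256;\n", 1024), end="")
# prints: const int ctx_length = 1024;
```

To apply it in place, read tinyllamas.cpp with pathlib.Path.read_text, pass the text through set_ctx_length, and write the result back with write_text.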

Compile with CMake

This follows the standard procedure for building CMake projects, for example cmake -B build followed by cmake --build build.

Link ncnn manually

c++ tinyllamas.cpp ~/ncnn/build/src/libncnn.a -I ~/ncnn/src -I ~/ncnn/build/src/ -o tinyllamas -fopenmp
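The ~ in the command above is expanded by the shell, so a Python build script that calls the compiler directly has to expand it itself. This sketch rebuilds the same argument list, assuming ncnn is checked out at ncnn_root and was built in its build subdirectory; link_command is a hypothetical helper. The second include directory is there because ncnn's CMake configure step generates some headers (such as platform.h) into the build tree.

```python
import os


def link_command(ncnn_root="~/ncnn", source="tinyllamas.cpp", output="tinyllamas"):
    """Argument list for the manual build, with '~' expanded as the shell would.

    Assumes ncnn is checked out at ncnn_root and was built in ncnn_root/build.
    """
    root = os.path.expanduser(ncnn_root)
    return [
        "c++", source,
        os.path.join(root, "build", "src", "libncnn.a"),  # static ncnn library
        "-I", os.path.join(root, "src"),           # ncnn's own headers (net.h, mat.h, ...)
        "-I", os.path.join(root, "build", "src"),  # headers generated when configuring ncnn
        "-o", output,
        "-fopenmp",  # ncnn builds with OpenMP by default, so the OpenMP runtime must be linked
    ]


print(" ".join(link_command("/opt/ncnn")))
```

The list can be passed as-is to subprocess.run(link_command(), check=True); no shell is involved, so no quoting is needed.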

Use the resulting binary

When run with the wrong number of arguments, the binary prints usage information.

Use Meta's Llama 2 weights (untested)

First convert the Llama 2 weights into llama2.c's binary format with export_meta_llama_bin.py, then convert that file into llama2.c's PyTorch format with bin2pt.py so it can be exported to TorchScript. From there, follow the same steps as above.
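Both conversion steps pass through llama2.c's binary format. As a sketch, assuming the file uses llama2.c's original (version 0) layout and was written on a little-endian machine, it starts with seven int32 config values, and a negative vocab_size signals that the classifier weights are stored separately rather than shared with the token embedding. read_header is a hypothetical helper, not code from bin2pt.py; its seq_len field is the context length needed for the pnnx and ctx_length steps above.

```python
import struct

# Assumed layout: llama2.c's original (version 0) header, seven little-endian int32s.
HEADER_FIELDS = ("dim", "hidden_dim", "n_layers", "n_heads", "n_kv_heads", "vocab_size", "seq_len")
HEADER = struct.Struct("<7i")


def read_header(blob):
    """Parse the config header at the start of a llama2.c .bin file."""
    config = dict(zip(HEADER_FIELDS, HEADER.unpack_from(blob, 0)))
    # Negative vocab_size: classifier weights are stored separately, not shared with the embedding.
    config["shared_classifier"] = config["vocab_size"] > 0
    config["vocab_size"] = abs(config["vocab_size"])
    return config


# Example on a synthetic header with illustrative values and no weights after it.
blob = HEADER.pack(288, 768, 6, 6, 6, 32000, 256)
print(read_header(blob)["seq_len"])  # prints 256
```

For a real file, passing the first HEADER.size bytes read in binary mode is enough.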

Note that because the implementation is memory-inefficient, converting even the 7B model may require more than 64 GB of memory.
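For a sense of scale, a float32 parameter takes 4 bytes, so a single copy of a model with roughly 7 billion parameters is about 26 GiB. The sketch below does that arithmetic; fp32_gib is a hypothetical helper, and the number of copies held in memory at once is an assumption rather than a measurement of these scripts. If a step held two full copies, that alone would be about 52 GiB before any other overhead.

```python
BYTES_PER_FP32 = 4


def fp32_gib(n_params, copies=1):
    """GiB needed to hold `copies` float32 copies of a model with n_params parameters."""
    return n_params * BYTES_PER_FP32 * copies / 2**30


# Roughly 7 billion parameters; the number of copies held at once is an assumption.
for copies in (1, 2):
    print(f"{copies} copies: {fp32_gib(7e9, copies):.1f} GiB")
```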

License

MIT

Repository: lrw04/tinyllamas-ncnn

Folders and files

Latest commit

History

Repository files navigation

tinyllamas-ncnn

Usage

Export TorchScript

Convert model into ncnn format

Compile the inference binary

Compile with CMake

Link ncnn manually

Use the resulting binary

Use Meta's Llama 2 weights (untested)

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages