🌔 moondream

a tiny vision language model that kicks ass and runs anywhere

Benchmarks

moondream2 is a 1.86B parameter model initialized with weights from SigLIP and Phi 1.5.

Model	VQAv2	GQA	TextVQA	POPE	TallyQA
moondream1	74.7	57.9	35.6	-	-
moondream2 (latest)	75.4	59.8	43.1	(coming soon)	(coming soon)

Examples

Image	Example
	Q: What is the girl doing? A: The girl is eating a hamburger. Q: What color is the girl's hair? A: White
	Q: What is this? A: A rack is present in the image, containing various electronic devices. A chair is situated on the left side, and a brick wall is visible in the background. Q: What is behind the stand? A: A brick wall is visible behind the stand.

Usage

Using transformers (recommended)

pip install transformers timm einops

from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "vikhyatk/moondream2"
revision = "2024-03-06"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

image = Image.open('<IMAGE_PATH>')
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))
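Because `encode_image` and `answer_question` are separate calls, one encoded image can in principle serve several questions. The following is a minimal sketch under that assumption; `ask_many` is a hypothetical helper written for illustration, not part of moondream or transformers.

```python
def ask_many(model, tokenizer, image, questions):
    """Encode the image once, then answer each question against that encoding.

    Assumes `model` exposes encode_image(image) and
    answer_question(enc_image, question, tokenizer) as in the snippet above.
    """
    enc_image = model.encode_image(image)  # computed once, reused for every question
    return {
        question: model.answer_question(enc_image, question, tokenizer)
        for question in questions
    }
```

With the `model`, `tokenizer`, and `image` objects created above, a call such as `ask_many(model, tokenizer, image, ["What is this?", "What is behind the stand?"])` would return a dict mapping each question to its answer.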

The model is updated regularly, so we recommend pinning the model version to a specific release as shown above.

To enable Flash Attention on the text model, pass in attn_implementation="flash_attention_2" when instantiating the model.

import torch

# torch is needed for the float16 dtype passed below.
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision,
    torch_dtype=torch.float16, attn_implementation="flash_attention_2"
).to("cuda")
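Flash Attention depends on the separately installed `flash_attn` package and a CUDA device, so a script that must also run elsewhere may want to request it only when both are present. A rough sketch of that check follows; `flash_attention_kwargs` is a hypothetical helper, and treating "package importable plus CUDA available" as sufficient is an assumption rather than a guarantee.

```python
import importlib.util


def flash_attention_kwargs(cuda_available: bool) -> dict:
    """Return extra from_pretrained kwargs, or {} to keep the default attention.

    Hypothetical helper: only checks that the flash_attn package can be found
    and that the caller reports a CUDA device.
    """
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash_attn:
        return {"attn_implementation": "flash_attention_2"}
    return {}
```

The result can be splatted into the call shown above, for example `AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, revision=revision, **flash_attention_kwargs(torch.cuda.is_available()))`, adding `torch_dtype=torch.float16` and `.to("cuda")` when the returned dict is non-empty.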

Batch inference is also supported.

answers = model.batch_answer(
    images=[Image.open('<IMAGE_PATH_1>'), Image.open('<IMAGE_PATH_2>')],
    prompts=["Describe this image.", "Are there people in this image?"],
    tokenizer=tokenizer,
)
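For a long list of image and prompt pairs, it may be safer to feed `batch_answer` a few pairs at a time so that memory use stays bounded. The sketch below assumes `batch_answer` accepts the `images`, `prompts`, and `tokenizer` keyword arguments shown above and returns one answer per pair, in input order; `batch_answer_in_chunks` and its `chunk_size` parameter are hypothetical.

```python
def batch_answer_in_chunks(model, tokenizer, images, prompts, chunk_size=4):
    """Call model.batch_answer on successive slices and concatenate the answers.

    Assumes batch_answer returns a list with one answer per (image, prompt)
    pair, in the same order as its inputs.
    """
    if len(images) != len(prompts):
        raise ValueError("images and prompts must have the same length")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    answers = []
    for start in range(0, len(images), chunk_size):
        end = start + chunk_size
        answers.extend(
            model.batch_answer(
                images=images[start:end],
                prompts=prompts[start:end],
                tokenizer=tokenizer,
            )
        )
    return answers
```

A suitable `chunk_size` depends on image resolution and available memory, so the default of 4 is only a placeholder.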

Using this repository

Clone this repository and install dependencies.

pip install -r requirements.txt

sample.py provides a command-line interface for running the model. If the --prompt argument is omitted, the script lets you ask questions interactively.

python sample.py --image [IMAGE_PATH] --prompt [PROMPT]

Use the gradio_demo.py script to start a Gradio interface for the model.

python gradio_demo.py

webcam_gradio_demo.py provides a Gradio interface that uses your webcam as input and performs inference in real time.

python webcam_gradio_demo.py

Limitations

The model may generate inaccurate statements and may struggle to understand intricate or nuanced instructions.
The model may not be free from societal biases. Users should be aware of this and exercise caution and critical thinking when using the model.
The model may generate offensive, inappropriate, or hurtful content if it is prompted to do so.

