Open-GPT-4o: Democratizing Speech Foundation Models for All

We develop Open-GPT-4o with a focus on speech (audio) foundation models, drawing inspiration from the Open-Sora initiative. We will make our models and training details accessible to everyone. Open-GPT-4o will open up an avenue for contributions from the open-source community and make these technologies available in non-English languages (e.g., Asian languages such as Japanese).

Under development. We currently provide the following building blocks: audio tokenization, voice cloning, text-to-speech, and fast ASR.
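
As a rough illustration of the audio-tokenization building block, the sketch below converts a waveform into discrete codec tokens with EnCodec, the codec family from audiocraft (see Acknowledgement), via its Hugging Face transformers integration. This is a minimal sketch for illustration; Open-GPT-4o's actual tokenizer may differ.

```python
# Minimal sketch: audio tokenization with EnCodec via Hugging Face transformers.
# Illustrative only; Open-GPT-4o's tokenizer may differ.
import torch
from transformers import AutoProcessor, EncodecModel

model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

# One second of silence at 24 kHz stands in for real speech here.
waveform = torch.zeros(24_000).numpy()
inputs = processor(raw_audio=waveform, sampling_rate=24_000, return_tensors="pt")

with torch.no_grad():
    encoded = model.encode(inputs["input_values"], inputs["padding_mask"])

# Discrete token ids: (num_chunks, batch, num_codebooks, num_frames).
print(encoded.audio_codes.shape)
```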

📰 News

  • [2024.05.15] Launching Open-GPT-4o.
  • [~2024.05.15] Kotoba Technologies open-sources preliminary building blocks towards Open-GPT-4o: Kotoba-Speech (a state-of-the-art Japanese voice-cloning and text-to-speech model) and Kotoba-Whisper (fast and accurate Japanese ASR, 6.3x faster than OpenAI Whisper-large with competitive performance).

Latest Demo (To Be Updated)

We plan to create many demos using our foundation models. As a first step, we provide the 🤗 Kotoba-Speech Demo and the 🤗 Kotoba-Whisper Demo on Hugging Face. These serve as building blocks for creating speech foundation models.
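
For readers who want to try a building block locally rather than through the hosted demos, the snippet below runs Kotoba-Whisper with the standard transformers ASR pipeline. The checkpoint id and the input file name are assumptions here; this is a sketch, not official usage.

```python
# Hedged sketch: running Kotoba-Whisper locally via the transformers ASR
# pipeline. Checkpoint id and audio path are assumptions for illustration.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="kotoba-tech/kotoba-whisper-v1.0",  # assumed released checkpoint id
    torch_dtype=torch.float16,
    device="cuda:0",  # use device="cpu" (and drop torch_dtype) without a GPU
)
result = asr("sample_japanese_speech.wav")  # hypothetical input file
print(result["text"])
```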

New Features/Updates

TBA

TODO list

  • Llama-recipes adaptation for text LLM integration (@kojimano)
  • Collecting speech training data
  • Audio tokenization BPE to reduce sequence lengths (@jungokasai); see the sketch after this list
  • Support streaming generation of audio (fully autoregressive token generation) (@jungokasai)
  • Add downstream task data
    • English <> Japanese speech translation
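
To make the audio-token BPE item above concrete, here is a minimal sketch of the idea: map each codec id to a unique character, train an off-the-shelf BPE tokenizer on those sequences, and let frequent id n-grams merge into single units, shortening the sequences the model must attend over. The codebook size, Unicode offset, and toy corpus are all hypothetical; the real pipeline would train on codec ids extracted from speech.

```python
# Minimal sketch of BPE over audio codec tokens (hypothetical data throughout).
# Each codec id is mapped to a unique Unicode character so a standard BPE
# trainer can merge recurring id n-grams into single vocabulary units.
from tokenizers import Tokenizer, models, trainers

NUM_CODES = 1024  # assumed codec codebook size
BASE = 0x4E00     # arbitrary offset into a large Unicode block

def ids_to_text(ids):
    return "".join(chr(BASE + i) for i in ids)

# Toy corpus: repeating id patterns stand in for real codec sequences.
corpus = [ids_to_text(([1, 2, 3, 4] * 50) + ([7, 8] * 30)) for _ in range(100)]

tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(
    vocab_size=2048,  # room for merges beyond the 1024 base symbols
    initial_alphabet=[chr(BASE + i) for i in range(NUM_CODES)],
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

sample = corpus[0]
print(len(sample), "codec frames ->", len(tokenizer.encode(sample).ids), "BPE units")
```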

Contents

Under development.

  • Installation
  • Open-GPT-4o Model Weights
  • Data Processing
  • Training
    • Open-GPT-4o Training
  • Evaluation
  • Contribution
  • References

Acknowledgement

  • MetaVoice: an autoregressive (with a non-autoregressive component) voice-cloning/TTS model for English.
  • audiocraft: speech/audio tokenization models for preprocessing and generation.
  • llama-recipes: a distributed training library that supports continued training of Llama models.

We are grateful to the amazing open-source community.
