Omost is a project that aims to convert the coding capabilities of large language models (LLMs) into image generation or composition capabilities.
Omost provides pre-trained LLM models that can write code to compose visual contents on a virtual "Canvas" agent. This Canvas can then be rendered by image generators to create actual images.
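For a sense of what this looks like, below is an illustrative Python-style sketch of canvas code; the structure mirrors Omost's Canvas API, but the values (and some parameter names) are simplified assumptions rather than verbatim model output.

```python
# Illustrative sketch of LLM-generated canvas code; values are made up and
# the parameter set is a simplified assumption of Omost's Canvas API.
canvas = Canvas()
canvas.set_global_description(
    description='a cozy reading corner by a window',
    detailed_descriptions=['warm afternoon light falls across the room',
                           'a worn armchair with a knitted blanket'],
    tags='cozy, interior, warm light, reading',
    HTML_web_color_name='burlywood',
)
canvas.add_local_description(
    location='on the left',
    offset='no offset',
    area='a large vertical area',
    distance_to_viewer=3.0,
    description='a tall wooden bookshelf filled with books',
    detailed_descriptions=['hardcover spines in muted colors'],
    tags='bookshelf, wood, books',
    HTML_web_color_name='saddlebrown',
)
```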
The name "Omost" (pronounced "almost") has two meanings: 1) After using Omost, your image is almost there, and 2) "O" stands for "omni" (multi-modal), and "most" means getting the most out of it.
Currently, Omost offers three pre-trained LLM models based on variations of Llama3 and Phi3: omost-llama-3-8b, omost-phi-3-mini-128k, and omost-dolphin-2.9-llama3-8b.
These models were trained on a mix of data sources, including ground-truth annotations, automatically annotated images, reinforcement via Direct Preference Optimization (DPO), and a small amount of tuning data from OpenAI GPT-4o's multi-modal capability.
You can use the official Hugging Face space or deploy Omost locally by cloning the GitHub repository and following the provided instructions.
Omost provides a baseline renderer based on attention manipulation to generate images from the LLM-generated canvas code.
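The project's own renderer implementation is not reproduced here, but the following minimal sketch conveys the general idea of region-restricted cross-attention: image queries inside a region's mask receive attention output computed only from that region's text tokens. All function and variable names are hypothetical.

```python
import torch

def region_masked_cross_attention(q, ks, vs, region_masks):
    """Conceptual sketch (not Omost's actual renderer): each image query only
    receives cross-attention output from the text tokens of its own region.

    q:            (num_pixels, dim) image-side queries
    ks, vs:       lists of (num_tokens_i, dim) per-region text keys/values
    region_masks: list of (num_pixels,) boolean masks, True inside the region
    """
    dim = q.shape[-1]
    out = torch.zeros_like(q)
    for k, v, mask in zip(ks, vs, region_masks):
        scores = (q @ k.T) / dim ** 0.5        # (num_pixels, num_tokens_i)
        attn = scores.softmax(dim=-1)
        out = out + mask.unsqueeze(-1).float() * (attn @ v)
    return out
```

The actual baseline renderer manipulates attention inside the diffusion model and is more nuanced than this hard per-region mask; the sketch only illustrates why per-region text conditions can control where content appears.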
The recommended quantization settings are 4 bits for omost-llama-3-8b and 8 bits for omost-phi-3-mini-128k to fit within 8GB VRAM.
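As a rough sketch, a 4-bit load via Hugging Face transformers and bitsandbytes might look like the following; the repository id and the exact quantization settings are assumptions, and the project's own Gradio app handles model loading for you.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = 'lllyasviel/omost-llama-3-8b'  # repo id assumed from the model name

# 4-bit NF4 quantization so the 8B model fits in roughly 8GB of VRAM
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map='auto',
)
```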
The omost-llama-3-8b and omost-phi-3-mini-128k models were trained on filtered safe data without NSFW or inappropriate content, while omost-dolphin-2.9-llama3-8b was trained on unfiltered data.
Although omost-phi-3-mini-128k nominally supports a 128k context window, performance degrades on long inputs, so keeping the context to around 8k tokens is recommended.
Omost represents a novel approach to combining the capabilities of LLMs and diffusion models for controlled image generation through code composition on a virtual canvas.