Cambrian-1 is a recently introduced family of multimodal large language models (MLLMs) designed with a vision-centric approach.
Vision-centric design: While many MLLMs focus on improving language models, Cambrian-1 emphasizes exploring and optimizing visual components to enhance real-world sensory grounding.
Comprehensive study: The researchers evaluated over 20 different vision encoders, including self-supervised, strongly supervised, and combined approaches.
New benchmark: They introduced CV-Bench, a new vision-centric benchmark to address limitations in existing MLLM evaluation methods.
Spatial Vision Aggregator (SVA): Cambrian-1 features a novel dynamic, spatially aware connector that aggregates high-resolution features from multiple vision encoders and integrates them with the LLM while reducing the visual token count.
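The core idea, roughly, is that a small grid of learnable queries cross-attends to local windows of the high-resolution feature map, so many vision tokens collapse into few while spatial layout is preserved. The sketch below is a minimal single-encoder, single-head illustration of that mechanism, not the paper's implementation: the 24x24 input grid, 12x12 query grid, 2x2 windows, and random (rather than learned) queries are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_aggregate(feats, grid=24, out_grid=12, dim=64):
    """Each coarse query cross-attends only to its local window of
    high-res vision tokens, shrinking 24x24=576 tokens to 12x12=144."""
    feats = feats.reshape(grid, grid, dim)
    win = grid // out_grid  # each query covers a 2x2 window
    # Learnable latent queries in a real model; random here for the sketch.
    queries = rng.normal(size=(out_grid, out_grid, dim))
    out = np.empty((out_grid, out_grid, dim))
    for i in range(out_grid):
        for j in range(out_grid):
            window = feats[i*win:(i+1)*win, j*win:(j+1)*win].reshape(-1, dim)
            attn = softmax(window @ queries[i, j] / np.sqrt(dim))
            out[i, j] = attn @ window  # attention-weighted average of the window
    return out.reshape(out_grid * out_grid, dim)

tokens = spatial_aggregate(rng.normal(size=(576, 64)))
print(tokens.shape)  # (144, 64)
```

Because each query only looks at its own window, the cost stays linear in the number of vision tokens, and the output keeps a coarse spatial correspondence to the input grid.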
Open-source release: The team has released model weights, code, datasets, and detailed instruction-tuning and evaluation recipes to foster further research and development in the field.
Performance: Cambrian-1 achieves state-of-the-art performance across various benchmarks, particularly excelling in vision-centric tasks.
Model variants: The researchers are gradually releasing three sizes of the model: 8B, 13B, and 34B parameters.
Training approach: Cambrian-1 uses a two-stage training process, involving visual connector training followed by instruction tuning.
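This staging is commonly expressed as a freeze plan over parameter groups: in stage 1 only the connector receives gradients, and in stage 2 the LLM is unfrozen as well. The sketch below illustrates that pattern; the group names and the exact freeze plan are illustrative assumptions, so consult the released training recipes for the real configuration.

```python
def set_stage(params, stage):
    """Toggle which parameter groups receive gradients in each stage.
    Group names and the freeze plan are illustrative, not the paper's exact setup."""
    freeze_plan = {
        1: {"vision_encoder": False, "connector": True, "llm": False},
        2: {"vision_encoder": False, "connector": True, "llm": True},
    }
    for name, flags in params.items():
        flags["requires_grad"] = freeze_plan[stage][name]
    return params

params = {g: {"requires_grad": False}
          for g in ("vision_encoder", "connector", "llm")}

set_stage(params, 1)  # stage 1: only the visual connector learns
print([g for g, f in params.items() if f["requires_grad"]])  # ['connector']

set_stage(params, 2)  # stage 2: instruction tuning also updates the LLM
print([g for g, f in params.items() if f["requires_grad"]])  # ['connector', 'llm']
```

Keeping the backbone frozen in stage 1 lets the connector align visual features to the LLM's embedding space before the more expensive instruction-tuning stage.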
Data curation: The team emphasizes the importance of high-quality visual instruction-tuning data, curated from publicly available sources with careful attention to source balancing and distribution ratios.
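Source balancing typically means drawing examples from each dataset in fixed proportions rather than pooling everything uniformly, oversampling small sources if needed. A minimal sketch of that idea follows; the source names and ratios are made up for illustration and are not the paper's actual mixture.

```python
from collections import Counter

def balance_sources(datasets, target_ratios, total):
    """Build a training mix whose per-source counts follow target ratios.
    Small sources are oversampled by cycling through their examples."""
    mix = []
    for name, ratio in target_ratios.items():
        n = round(total * ratio)
        pool = datasets[name]
        mix.extend(pool[i % len(pool)] for i in range(n))
    return mix

# Hypothetical sources and ratios, purely for illustration.
datasets = {"ocr": ["o1", "o2"], "general": ["g1", "g2", "g3"], "science": ["s1"]}
ratios = {"ocr": 0.5, "general": 0.3, "science": 0.2}
mix = balance_sources(datasets, ratios, 10)
print(Counter(x[0] for x in mix))  # 5 ocr, 3 general, 2 science examples
```

Fixing the ratios up front makes the mixture reproducible and keeps any one large source from dominating the instruction-tuning distribution.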
Cambrian-1 represents a significant advancement in multimodal AI, offering a comprehensive, open-source approach to developing vision-centric language models.