Open-source Vision-Language Models for holistic understanding of complex layouts
Work in Progress: The vision analysis pipeline feature is in active development and this guide is still in draft form.
Vision-language models fuse vision and natural-language models into a single system that can holistically understand and interpret images. They are emerging as a powerful tool for document analysis, particularly for complex layouts.
Cloud model providers like OpenAI charge roughly $0.50 per image, which is prohibitive for batch processing. Fortunately, open-source alternatives provide a free, increasingly viable option you can run locally.
Two components are required:
A backend application or service - We’ve tested with the excellent open-source Ollama. Another great option is LM Studio.
An open-source vision-language model - We’ve tested with Meta’s Llama 3.2 Vision, a state-of-the-art open-source VLM; however, any Ollama-compatible VLM will work.
Running Ollama locally is the easiest way to get started, with no API costs and full privacy. It runs headless by default via a CLI, but can also be managed in the browser.
Install Ollama on MacOS with Homebrew:
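```bash
brew install ollama
```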
Start the Ollama server:
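```bash
# Serves the Ollama API on http://localhost:11434 by default
ollama serve
```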
You can transition to a more powerful machine on your local network or in the cloud once you get up and running. Ollama has a wealth of community integrations.
Confirm Ollama is running correctly on whichever machine you’ve installed it on.
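A quick check, assuming the default port of 11434, is to hit the API root and list the models Ollama has downloaded:

```bash
# The root endpoint should respond with "Ollama is running"
curl http://localhost:11434/

# Shows the models available to the server
ollama list
```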
Test the endpoint from a machine on the same network:
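```bash
# Replace <ollama-host> with the IP address or hostname of the machine running Ollama.
# Note: Ollama binds to localhost by default; start the server with OLLAMA_HOST=0.0.0.0
# if it needs to accept connections from other machines on the network.
curl http://<ollama-host>:11434/api/tags
```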
To verify the model works across your network, run this test command from the machine that will execute the pipeline:
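```bash
# Sends a minimal vision request to the Llama 3.2 Vision model.
# Replace <ollama-host> with the address of the Ollama machine and the "images"
# entry with base64-encoded image data (for example, a scanned page).
curl http://<ollama-host>:11434/api/generate -d '{
  "model": "llama3.2-vision",
  "prompt": "Describe the layout of this page.",
  "images": ["<base64-encoded image data>"],
  "stream": false
}'
```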
Enable Vision Language Model analysis for complex layouts and image-heavy pages.
The Vision Language Model to use. Llama 3.2 Vision is recommended for its balance of performance and accuracy.
Port for the local Ollama API; Ollama listens on 11434 by default.
Timeout in seconds for Ollama requests. Vision processing is more compute-intensive than text processing.
Maximum number of retry attempts for Vision Language Model requests.
Delay between retry attempts in seconds.
Full endpoint URL for Ollama API.
Configuration object controlling when Vision Language Model analysis is triggered:
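The exact option names depend on the pipeline’s configuration schema; purely as an illustrative sketch, with hypothetical key names and placeholder values, the settings described above might be written out as JSON along these lines:

```bash
# Hypothetical key names and placeholder values - match them to your pipeline's actual schema.
cat <<'EOF' > vision-analysis.json
{
  "use_vision_analysis": true,
  "vision_model": "llama3.2-vision",
  "ollama_port": 11434,
  "ollama_timeout_seconds": 120,
  "ollama_max_retries": 3,
  "ollama_retry_delay_seconds": 2,
  "ollama_endpoint": "http://localhost:11434/api/generate",
  "vision_analysis_triggers": {
    "min_images_per_page": 1,
    "max_text_coverage": 0.3
  }
}
EOF
```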
Vision Language Models are computationally intensive. You’ll need to download a model size appropriate for your hardware.
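With Ollama, downloading a model is a single pull command; Llama 3.2 Vision is published in 11B and 90B variants:

```bash
# 11B-parameter variant - the practical choice for most consumer GPUs
ollama pull llama3.2-vision

# 90B-parameter variant - needs workstation- or server-class VRAM
ollama pull llama3.2-vision:90b
```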
GPU
A model’s parameter count does not map directly onto the amount of VRAM it needs; quantization and context length change the memory footprint. This is a common source of confusion.
A rough guide:
Storage
Ollama:
Other VLMs specifically for OCR to try:
Getting Started with State-of-the-Art VLM Using the Swarms API
Excellent article on OCR with VLM and LM Studio