Work in Progress: The vision analysis pipeline feature is in active development and this guide is still in draft form.

Why VLMs?

A vision-language model (VLM) fuses a vision encoder with a language model so it can holistically understand and interpret images. VLMs are emerging as a powerful tool for document analysis, particularly for documents with complex layouts.

Cloud model providers like OpenAI charge ~$0.50 per image. This is prohibitive for batch processing. Fortunately, open-source alternatives provide a free, increasingly viable option you can run locally.

How To Run a VLM

Two components are required:

  • A backend application or service - We’ve tested with the excellent open-source Ollama. Another great option is LM Studio.

  • An open-source vision-language model - We’ve tested with Meta’s Llama 3.2 Vision, a state-of-the-art open-source VLM, but any Ollama-compatible VLM will work.

Set up Ollama Backend Locally

Running Ollama locally is the easiest way to get started, with the advantage of no API costs and full privacy. It runs headless via a CLI by default, but can also be managed in the browser through community web UIs.

Install Ollama on macOS with Homebrew:

brew install ollama

Start the Ollama server:

ollama serve
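
If you want a quick sanity check that the server came up, the Ollama API exposes a version endpoint on its default port (11434):

# Confirm the server is responding
curl http://localhost:11434/api/version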

Once you’re up and running, you can move Ollama to a more powerful machine on your local network or in the cloud. Ollama has a wealth of community integrations.
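
If you do move the Ollama server to another machine, note that it only listens on localhost by default. Setting the OLLAMA_HOST environment variable before starting the server makes it reachable from other hosts (the bind address below is only an example; restrict it to match your network’s security requirements):

# On the remote machine: listen on all interfaces instead of localhost only
OLLAMA_HOST=0.0.0.0:11434 ollama serve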

Verify Ollama Setup

Confirm Ollama is running correctly on whichever machine you’ve installed it on. (If you haven’t pulled the model yet, the /api/show call below will return a “model not found” error; getting any response at all still confirms the server is reachable.)

# Check locally
curl http://localhost:11434/api/show -d '{"name": "llama3.2-vision"}'

Test the endpoint from a machine on the same network:

curl http://[your-ollama-host]:11434/api/show -d '{"name": "llama3.2-vision"}'

Download the VLM

# Download the model
ollama pull llama3.2-vision

# Quick test
ollama run llama3.2-vision "What's in this image? /Users/[your-username]/Desktop/some-book-cover.png"
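
To confirm the download completed, list the models Ollama has installed locally; llama3.2-vision should appear in the output:

# List installed models
ollama list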

To verify the model works across your network, run this test command from the machine that will execute the pipeline:

curl -X POST http://[your-ollama-host]:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2-vision",
    "messages": [
      {
        "role": "user",
        "content": "Extract the following metadata fields from this book cover image. For each field, if you cannot find it, say \"Not visible\": \n1. Title\n2. Subtitle\n3. Author\n4. Publisher\n5. Edition\n6. Year\n7. ISBN\n8. DOI\n9. LOC (Library of Congress Control Number)",
        "images": ["'"$(base64 < 'logs/20250201_120432/[book-pdf-name]/[book-pdf-name]_page_1_original.png')"'"]
      }
    ]
  }' | jq -r 'select(.message != null) | .message.content'

# This should print the extracted metadata to the terminal.

Configuration

ANALYZE_WITH_VISION
boolean
default:"false"

Enable Vision Language Model analysis for complex layouts and image-heavy pages.

OLLAMA_MODEL
string
default:"llama3.2-vision"

The Vision Language Model to use. Llama 3.2 Vision is recommended for its balance of performance and accuracy.

OLLAMA_PORT
number
default:"11434"

Default port for local Ollama API.

OLLAMA_TIMEOUT
number
default:"60"

Timeout in seconds for Ollama requests. Vision processing is more compute-intensive than text processing.

OLLAMA_MAX_RETRIES
number
default:"3"

Maximum number of retry attempts for Vision Language Model requests.

OLLAMA_RETRY_DELAY
number
default:"1"

Delay between retry attempts in seconds.

OLLAMA_ENDPOINT
string
default:"http://localhost:11434/api/chat"

Full endpoint URL for Ollama API.

VISION_ANALYSIS_THRESHOLDS
object

Configuration object controlling when Vision Language Model analysis is triggered:

{
    // % of page area that must be images to trigger vision analysis
    "full_page_image_ratio": 0.90,
    
    // Always analyze first N pages (typically covers and TOC)
    "early_pages_cutoff": 3,

    // OCR confidence below this threshold triggers vision analysis
    "low_ocr_confidence": 30.0
}
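
If your deployment reads these settings from environment variables, the configuration might look something like the sketch below. The variable names and defaults are taken from this page; the export-style format is only an illustration of how your environment might be configured, and VISION_ANALYSIS_THRESHOLDS, being an object, is shown separately above.

# Example environment configuration (documented defaults, with vision analysis enabled)
export ANALYZE_WITH_VISION=true
export OLLAMA_MODEL=llama3.2-vision
export OLLAMA_PORT=11434
export OLLAMA_TIMEOUT=60
export OLLAMA_MAX_RETRIES=3
export OLLAMA_RETRY_DELAY=1
export OLLAMA_ENDPOINT=http://localhost:11434/api/chat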

FAQ

What are the hardware requirements for running a VLM?

Vision Language Models are computationally intensive. You’ll need to choose a model size appropriate for your hardware.

GPU

  • Recommended: NVIDIA RTX 30-series or better, or Apple Silicon M-series
  • Performance: GPU acceleration typically provides 10x or better performance compared to CPU-only inference
  • VRAM: Model size determines VRAM requirements

A model’s parameter count doesn’t translate directly into the gigabytes of VRAM it needs, which is a common source of confusion.

A rough guide:

  • 8 GB minimum for 7B parameter models
  • 16 GB for 13B parameter models
  • 32 GB for 33B parameter models
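
If you’re not sure how much memory you have to work with, you can check quickly (nvidia-smi for NVIDIA GPUs, sysctl on macOS; both assume standard vendor tooling is installed):

# NVIDIA: total VRAM per GPU
nvidia-smi --query-gpu=memory.total --format=csv

# Apple Silicon: unified memory shared with the GPU, reported in bytes
sysctl hw.memsize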

Storage

  • Disk Space: ~10GB for the Llama 3.2 Vision model

References