# Meta + Ollama

> Open-source Vision-Language Models for holistic understanding of complex layouts

<Note>
  **Work in Progress:** The vision analysis pipeline feature is in active development
  and this guide is still in draft form.
</Note>

### Why VLMs?

A vision-language model (VLM) fuses a vision model with a natural language model so that
it can holistically understand and interpret images. VLMs are emerging as a powerful tool
for document analysis, particularly for complex layouts.

Cloud model providers like OpenAI charge `~$0.50` per image. This is prohibitive for batch
processing. Fortunately, open-source alternatives provide a free, increasingly viable
option you can run locally.

### How To Run a VLM

Two components are required:

* **A backend application or service** - We've tested with the excellent open-source
  [Ollama](https://ollama.com). Another great option is [LM Studio](https://lmstudio.ai/).

* **An open-source Vision Language Model** - We've tested with Meta's
  [Llama 3.2 Vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices)
  model, a state-of-the-art open-source VLM. However, any
  [Ollama-compatible VLM](https://ollama.com/search?c=vision) will work.

### Set up Ollama Backend Locally

Running Ollama locally is the easiest way to get started, and has the advantages of no API
costs and full privacy. It runs headless by default and is driven from a CLI, but it can
also be managed [in the browser](https://openwebui.com/).

Install Ollama on macOS with Homebrew:

```bash theme={null}
brew install ollama
```

Start the Ollama server:

```bash theme={null}
ollama serve
```

<Tip>
  You can transition to a more powerful machine on your local network or in the cloud
  once you get up and running. Ollama has a wealth of [community
  integrations](https://github.com/ollama/ollama?tab=readme-ov-file#community-integrations).
</Tip>

### Verify Ollama Setup

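Confirm Ollama is running correctly on whichever machine you've installed it on. If the
model has not been downloaded yet (see the next step), `/api/show` returns a JSON error
about the missing model rather than its details; either response shows that the server is
reachable.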

```bash theme={null}
# Check locally
curl http://localhost:11434/api/show -d '{"name": "llama3.2-vision"}'
```

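Test the endpoint from a machine on the same network. Ollama binds to `127.0.0.1` by
default, so the server must be started with `OLLAMA_HOST=0.0.0.0` before it accepts
remote connections: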

```bash theme={null}
curl http://[your-local-or-remote-IP]:11434/api/show -d '{"name": "llama3.2-vision"}'
```
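The same check can be scripted, which is handy when the pipeline should fail fast if the server or model is missing. The following is a minimal sketch using only the Python standard library; the function names `build_show_request` and `model_is_available` are hypothetical helpers, not part of Ollama or of this pipeline.

```python
import json
import urllib.request

OLLAMA_PORT = 11434  # Ollama's default port, as used in the curl commands above


def build_show_request(host: str, model: str, port: int = OLLAMA_PORT) -> urllib.request.Request:
    """Build the same POST request that the curl command sends to /api/show."""
    body = json.dumps({"name": model}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/api/show",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def model_is_available(host: str, model: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers /api/show for `model` with HTTP 200."""
    try:
        with urllib.request.urlopen(build_show_request(host, model), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses, so this covers
        # refused connections, timeouts, and HTTP error statuses.
        return False
```

Calling `model_is_available("localhost", "llama3.2-vision")` returns `False` both when the server is unreachable and when the model is unknown, so treat it as a coarse readiness check rather than a diagnostic.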

### Download the VLM

```bash theme={null}
# Download the model
ollama pull llama3.2-vision

# Quick test
ollama run llama3.2-vision "What's in this image? /Users/[your-username]/Desktop/some-book-cover.png"
```

To verify the model works across your network, run this test command from the machine that
will execute the pipeline:

```bash theme={null}
# "stream": false returns one JSON object instead of a stream of chunks;
# tr -d '\n' strips the line wrapping that some base64 implementations add.
curl -X POST http://[your-ollama-host]:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2-vision",
    "stream": false,
    "messages": [
      {
        "role": "user",
        "content": "Extract the following metadata fields from this book cover image. For each field, if you cannot find it, say \"Not visible\": \n1. Title\n2. Subtitle\n3. Author\n4. Publisher\n5. Edition\n6. Year\n7. ISBN\n8. DOI\n9. LOC (Library of Congress Control Number)",
        "images": ["'"$(base64 < 'logs/20250201_120432/[book-pdf-name]/[book-pdf-name]_page_1_original.png' | tr -d '\n')"'"]
      }
    ]
  }' | jq -r 'select(.message != null) | .message.content'

# The model's reply, containing the extracted metadata, should print in the terminal.
```
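For reference, the curl request above can be reproduced in Python. This is a minimal standard-library sketch, not the pipeline's actual client: `build_chat_payload` and `ask_about_image` are hypothetical names, and `"stream": False` is set here so the reply arrives as a single JSON object.

```python
import base64
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Assemble the JSON body for /api/chat with one base64-encoded image."""
    return {
        "model": model,
        "stream": False,  # ask for one JSON object rather than a stream of chunks
        "messages": [
            {
                "role": "user",
                "content": prompt,
                "images": [base64.b64encode(image_bytes).decode("ascii")],
            }
        ],
    }


def ask_about_image(endpoint: str, payload: dict, timeout: float = 60.0) -> str:
    """POST the payload to the chat endpoint and return the reply text."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["message"]["content"]
```

To use it, read the page image with `open(path, "rb").read()`, pass the bytes to `build_chat_payload`, and send the result to `http://[your-ollama-host]:11434/api/chat` with `ask_about_image`.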

### Configuration

<ParamField query="ANALYZE_WITH_VISION" type="boolean" default="false">
  Enable Vision Language Model analysis for complex layouts and image-heavy pages.
</ParamField>

<ParamField query="OLLAMA_MODEL" type="string" default="llama3.2-vision">
  The Vision Language Model to use. Llama 3.2 Vision is recommended for its balance of
  performance and accuracy.
</ParamField>

<ParamField query="OLLAMA_PORT" type="number" default="11434">
  Default port for local Ollama API.
</ParamField>

<ParamField query="OLLAMA_TIMEOUT" type="number" default="60">
  Timeout in seconds for Ollama requests. Vision processing is more compute-intensive
  than text processing.
</ParamField>

<ParamField query="OLLAMA_MAX_RETRIES" type="number" default="3">
  Maximum number of retry attempts for Vision Language Model requests.
</ParamField>

<ParamField query="OLLAMA_RETRY_DELAY" type="number" default="1">
  Delay between retry attempts in seconds.
</ParamField>

<ParamField query="OLLAMA_ENDPOINT" type="string" default="http://localhost:11434/api/chat">
  Full endpoint URL for Ollama API.
</ParamField>
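How the retry settings interact is easiest to see in code. The sketch below is one plausible reading, not the pipeline's implementation: it assumes retries are counted in addition to the first attempt and that the delay is fixed rather than exponential. `call_with_retries` is a hypothetical helper, and `OLLAMA_TIMEOUT` would be passed to the HTTP call made inside `request_fn`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

OLLAMA_MAX_RETRIES = 3  # default from the parameter list above
OLLAMA_RETRY_DELAY = 1  # seconds, default from the parameter list above


def call_with_retries(
    request_fn: Callable[[], T],
    max_retries: int = OLLAMA_MAX_RETRIES,
    retry_delay: float = OLLAMA_RETRY_DELAY,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request_fn, retrying on OSError up to max_retries more times.

    Assumption: retries are counted in addition to the first attempt, so
    at most max_retries + 1 calls are made.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except OSError:
            if attempt == max_retries:
                raise  # out of retries: surface the last error to the caller
            sleep(retry_delay)  # fixed delay between attempts
    raise RuntimeError("unreachable: the loop always returns or raises")
```

Network failures raised by `urllib` (timeouts, refused connections, HTTP errors) are all `OSError` subclasses, which is why the sketch retries on that exception type only.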

<ParamField
  query="VISION_ANALYSIS_THRESHOLDS"
  type="object"
  default={{
    full_page_image_ratio: 0.90,
    early_pages_cutoff: 1,
    low_ocr_confidence: 30.0
}}
>
  Configuration object controlling when Vision Language Model analysis is triggered:

  ```json theme={null}
  {
      // Fraction of page area (0.0 to 1.0) that must be images to trigger vision analysis
      "full_page_image_ratio": 0.90,
      
      // Always analyze first N pages (typically covers and TOC)
      "early_pages_cutoff": 1,

      // OCR confidence below this threshold triggers vision analysis
      "low_ocr_confidence": 30.0
  }
  ```

  The default `early_pages_cutoff` of 1 restricts the always-analyze rule to the very
  first page, so cover art still triggers analysis without sending the entire front
  matter to the model. Raise it when the table of contents spans multiple pages or when
  you want a more conservative fallback. Adjust `full_page_image_ratio` alongside it when
  working with dense photo spreads.
</ParamField>
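The three thresholds can be read as independent triggers, any one of which sends a page to the VLM. The following sketch illustrates that reading; it is an assumption about how the triggers combine, not the pipeline's actual code. `VisionThresholds` and `needs_vision_analysis` are hypothetical names, page numbers are assumed to be 1-based, and the exact comparison operators (`<=`, `>=`, `<`) are guesses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisionThresholds:
    """Mirrors the default VISION_ANALYSIS_THRESHOLDS object."""
    full_page_image_ratio: float = 0.90
    early_pages_cutoff: int = 1
    low_ocr_confidence: float = 30.0


def needs_vision_analysis(
    page_number: int,
    image_ratio: float,
    ocr_confidence: float,
    thresholds: VisionThresholds = VisionThresholds(),
) -> bool:
    """Return True if any trigger fires (page_number is 1-based by assumption)."""
    if page_number <= thresholds.early_pages_cutoff:
        return True  # early pages (covers, TOC) are always analyzed
    if image_ratio >= thresholds.full_page_image_ratio:
        return True  # the page is almost entirely images
    return ocr_confidence < thresholds.low_ocr_confidence  # OCR result looks unreliable
```

With the defaults, page 1 is always analyzed, while a later page is analyzed only when images cover at least 90% of it or OCR confidence falls below 30. Passing `VisionThresholds(early_pages_cutoff=4)` would extend the always-analyze rule to a multi-page table of contents.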

### FAQ

#### What are the hardware requirements for running a VLM?

Vision Language Models are computationally intensive. You'll need to download a model size
appropriate for your hardware.

**GPU**

* **Recommended**: NVIDIA RTX 30-series or better, or Apple Silicon M-series
* **Performance**: GPU acceleration provides 10x or better performance compared to
  CPU-only inference
* **VRAM**: Model size determines VRAM requirements

<Note>
  A model's parameter count does not map one-to-one onto VRAM in gigabytes: the memory
  needed also depends on quantization and context length. This is a common source of
  confusion.

  **A rough guide:**

  * 8 GB minimum for 7B parameter models
  * 16 GB for 13B parameter models
  * 32 GB for 33B parameter models
</Note>
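A quick calculation shows why the same model can need very different amounts of memory. The sketch below computes the memory taken by the weights alone, which is the parameter count multiplied by the bytes per weight. It deliberately ignores the KV cache, activations, and runtime overhead, so real usage is higher; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 10**9 bytes)."""
    # billions of parameters * (bits / 8) bytes each = gigabytes
    return params_billions * bits_per_weight / 8


# The same 7B model at three common precisions.
for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: {weight_memory_gb(7, bits):.1f} GB of weights")
# prints:
# 7B model at 16-bit: 14.0 GB of weights
# 7B model at 8-bit: 7.0 GB of weights
# 7B model at 4-bit: 3.5 GB of weights
```

Check the quantization of the model tag you pull before comparing its size against your available VRAM.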

**Storage**

* **Disk Space**: \~10GB for the Llama 3.2 Vision model

### References

* Ollama:

  * [API Documentation](https://github.com/ollama/ollama/blob/main/docs/api.md)

  * [Windows Setup Guide](https://github.com/ollama/ollama/blob/main/docs/windows.md)

  * [REST API docs](https://www.postman.com/postman-student-programs/ollama-api/documentation/suc47x8/ollama-rest-api)

* Other VLMs specifically for OCR to try:

  * [CogVLM](https://github.com/THUDM/CogVLM)
  * [Lucid\_Vision](https://github.com/RandomInternetPreson/Lucid_Vision)
  * [MiniGPT-v2](https://minigpt-v2.github.io) and
    [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4)
  * [MMOCR](https://mmocr.readthedocs.io/en/stable/get_started/overview.html)
  * [Open VLM Leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard)
  * For Apple Silicon Macs: [MLX-VLM](https://github.com/Blaizzy/mlx-vlm)

* [Getting Started with State-of-the-Art VLM Using the Swarms API](https://medium.com/@kyeg/getting-started-with-state-of-the-art-vision-language-models-vlms-using-the-swarms-api-a26fd44c73ae)

* Excellent article on
  [OCR with VLM and LM Studio](https://danielvanstrien.xyz/posts/2024/11/local-vision-language-model-lm-studio.html)
