Document inlining: Crossing the modality gap with Compound AI

By Ray Thai|12/20/2024

Intro

Most of the world’s data, such as medical records, podcasts, and financial statements, lives as images, PDFs, audio files, or dedicated knowledge stores: formats that LLMs either cannot accept or do not process well. Accessing and processing this data is critical for AI applications to solve real-world use cases. VLMs and multi-modal models claim to have filled this gap, but in reality they are incomplete solutions that handle only limited types of inputs and lack strong reasoning capabilities on those modalities, leading to lower-quality results and higher costs. For example, due to a lack of vision training data, VLMs exhibit a “modality gap”: identical tasks can produce significantly better results when inputs are processed as text instead of images.


Bridging this quality gap requires manually setting up complex workflows and pipelines to convert multimedia sources into a format LLMs can understand: users need to parse the data, format it into plain text, and potentially chunk and embed it.
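
To make that manual route concrete, here is a minimal sketch of such a pipeline, assuming the pypdf package and a digitally generated PDF; scanned documents would additionally need OCR, and production pipelines layer on table handling, cleanup, and embeddings:

# Minimal sketch of the manual pipeline; pypdf only extracts embedded
# text, so image-only pages come back empty without an OCR step.
from pypdf import PdfReader

def pdf_to_chunks(path: str, max_chars: int = 2000) -> list[str]:
    """Extract text page by page, then split into LLM-sized chunks."""
    text = "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

Each of these steps is another component to build, tune, and maintain.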

Our vision at Fireworks is to deliver the highest quality across all modalities while abstracting away this complexity through compound AI: an automated pipeline that transforms any digital asset format into one that LLMs can process and reason over. This approach enables you to achieve higher-quality results across any input type with the ease of use of an LLM.

Today, we are excited to launch a public preview of our first use case, Document Inlining, a compound system that automatically turns any LLM into a vision model to ingest images or PDFs for document-based vision tasks. Document Inlining parses images and PDFs and pipes the content directly into an LLM of your choice to deliver:

  • Higher quality - Achieve better reasoning and generation capabilities by utilizing any LLM of choice or specialized/fine-tuned models
  • Input flexibility - Automatically transform multiple file types like PDFs and screenshots. We can also handle rich document structures like tables/charts
  • Ultra-simple usage - Our API is OpenAI-compatible. Enable this capability with a one-line edit: append “#transform=inline” to your file URL, as in the example below

import openai

# Any OpenAI-compatible client works; point it at the Fireworks endpoint
client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://storage.googleapis.com/fireworks-public/test/sample_resume.pdf#transform=inline"
                    },
                },
                {
                    "type": "text",
                    "text": "What are the candidate's BA and MBA GPAs?"
                },
            ],
        }
    ],
)

The Challenges of Processing Multimedia Content

Text-only LLMs have limited utility in handling documents. Recently, vision models have become popular, but they face two major challenges in processing this content.

  1. Quality - Compared to text models, vision models usually have worse instruction-following and reasoning capabilities due to:
    1. Modality gap - AI models are typically trained on text tokens for reasoning tasks. A little-known consequence is worse intelligence when handling images, even on identical tasks. See the evaluation section for more details.
    2. Specialization - There are many more specialized text models than vision models, ranging from DeepSeek Coder for code to Qwen QwQ for reasoning. Text models can also be more easily fine-tuned.
    3. Base capabilities - Text models tend to be updated more frequently for reasoning and instruction-following. For example, Llama 3.2 90B Vision’s stated benchmarks trail Llama 3.3 70B Instruct on text tasks.
  2. Input format - Vision models are also constrained in what they can ingest:
    1. File types and quantity - Many VLMs do not support PDF inputs natively and only support a single image per query (or model quality degrades with multiple images)
    2. Rich structure - Documents tend to have rich structures, including tables, charts, and other figures, that VLMs often struggle to transcribe, which particularly hinders RAG use cases

Introducing Document Inlining

To improve the quality and usability of document use cases, Fireworks is introducing Document Inlining - a compound system that composes prompt transformation techniques to enable LLMs to handle PDFs and multiple images of documents.

Document Inlining transcribes images and PDFs into structured text to be ingested by LLMs, using a two-step approach:

  1. Parsing: Transcribe and parse the non-textual content
  2. Ingest: Feed the resulting text into an LLM for reasoning and further processing.
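
Conceptually, the flow reduces to the sketch below, where parse_document is a hypothetical stand-in for Fireworks’ proprietary parsing service rather than a real API:

# Conceptual sketch of the two-step flow; parse_document is a
# hypothetical stand-in for the managed parsing service.
def parse_document(url: str) -> str:
    """Step 1: transcribe a PDF or image into structured text
    (e.g., markdown with tables preserved)."""
    ...

def answer_over_document(client, document_url: str, question: str) -> str:
    # Step 2: feed the transcription into any text LLM for reasoning
    transcript = parse_document(document_url)
    response = client.chat.completions.create(
        model="accounts/fireworks/models/llama-v3p3-70b-instruct",
        messages=[{"role": "user", "content": f"{transcript}\n\n{question}"}],
    )
    return response.choices[0].message.content

With Document Inlining, both steps run behind the single chat completions call shown earlier.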


Key Challenges: Under the hood, our parsing pipeline solves several key challenges:

  • Complete OCR: Simple transcription services can’t handle tables, charts, and other rich document structure. We’ve built a proprietary parsing service that provides “complete OCR”, parsing tables and figures to improve LLM reasoning.
  • Document Structuring: We enable PDFs and multiple images to be ingested while preserving the files’ original structure
  • Pipeline management: We’ve built functionality that bypasses transcription for previously seen content to avoid duplicate transcription, improving both performance and cost (a sketch of the idea follows this list)
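
One simple way to picture that bypass is a content-addressed cache. The sketch below is only illustrative (transcribe is a hypothetical parser call), not Fireworks’ actual implementation:

import hashlib

# Cache keyed by a hash of the document bytes, so identical uploads
# are transcribed only once.
_transcripts: dict[str, str] = {}

def transcribe_cached(file_bytes: bytes) -> str:
    """Reuse a prior transcription when the same bytes were seen before."""
    key = hashlib.sha256(file_bytes).hexdigest()
    if key not in _transcripts:
        _transcripts[key] = transcribe(file_bytes)  # hypothetical parser call
    return _transcripts[key]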

Advantages: This approach provides several benefits:

  • Improved quality via modularization: Tasks are decomposed and handled by specialized components, so overall quality rises as text model reasoning improves.
  • Simple, 1-line developer experience: The entire flow is abstracted into a simple, OpenAI-compatible code change, instead of requiring components to be manually stitched together
  • Input flexibility: Directly add PDFs or multiple images
  • Processing speedup: Document transcription runs in parallel across individual pages (sketched after this list)
  • Model flexibility: Use any LLM, including fine-tuned and specialized models
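
The page-level parallelism is possible because pages can be transcribed independently and reassembled in order. A rough sketch, with transcribe_page as a hypothetical per-page parser call:

from concurrent.futures import ThreadPoolExecutor

def transcribe_pdf_pages(pages: list[bytes]) -> str:
    """Transcribe pages concurrently, then reassemble them in order."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        # map() preserves input order, so the transcript stays in page order
        page_texts = pool.map(transcribe_page, pages)  # hypothetical per-page call
    return "\n\n".join(page_texts)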

We can see the benefits of Document Inlining end-to-end in the following example. Without Document Inlining, we prompted the Qwen2-VL vision model with “How many letter Ts are there in the table in total?” and received an obviously incorrect answer. With Document Inlining, we can instead use the smarter Qwen2.5-72B-Instruct text model and receive a correct response (responses vary per model run).

[Image: example responses with and without Document Inlining]

While this approach excels at typical document layouts, there are still limitations when handling highly visual (little text), spatially dependent, or layout-heavy content that does not translate well into structured text.

Quality Evaluation

To evaluate the effectiveness of Document Inlining, we conducted two experiments on a dataset of arXiv articles paired with related questions. Each article was provided in PDF form, and we randomly selected 100 article–question pairs. We then ran these pairs through the selected models, using Claude 3.5 Sonnet to judge which responses were preferred. We selected Claude as the evaluator because the Anthropic API natively supports PDF ingestion.
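
For readers who want to run a similar comparison, the judging step reduces to a pairwise-preference prompt. A rough sketch using the Anthropic SDK (the prompt wording here is ours, not the exact evaluation harness):

import anthropic

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask Claude which of two candidate answers it prefers."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    prompt = (
        f"Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}\n\n"
        "Which answer is better? Reply with exactly 'A' or 'B'."
    )
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text.strip()

In practice you would also randomize the A/B ordering to control for position bias.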

In the first experiment, we compared an open-weight, text-only LLM (Qwen2.5-72B-Instruct) with GPT-4o. The open-weight model used Document Inlining to process each PDF, while GPT-4o received each page as an image. We found that Qwen2.5-72B-Instruct’s responses were preferred over GPT-4o’s in 68% of the comparisons. Detailed results are provided in the chart below.

[Chart: judge preferences, Qwen2.5-72B-Instruct with Document Inlining vs. GPT-4o]

In the second experiment, we used a single VLM (Qwen2-VL-72B-Instruct) under two different setups: one with Document Inlining, and the other ingesting PDFs page by page as images. As illustrated in the chart below, Document Inlining led to a clear improvement in response quality.

[Chart: judge preferences, Qwen2-VL-72B-Instruct with vs. without Document Inlining]

Get Started

Get started today in our docs. Use Document Inlining with a one-line code edit for any LLM, including serverless, on-demand, or fine-tuned models. Simply follow the OpenAI API specification for vision models and append #transform=inline to the content URL. You can also use Document Inlining in our UI playground for any model by enabling the “Transform” option. See this end-to-end demo for more!


During public preview, Document Inlining incurs no added costs compared to our typical text models. You’ll pay only for input and output tokens (including transcribed content) but will NOT incur additional costs for document parsing. See docs for more info.

For early access to the dedicated Fireworks Parser for use cases that require document storage, fill out this form.

Document Inlining is also compatible with LLM features like structured output (JSON mode) for extracting structured information from documents. The code snippet below shows Document Inlining combined with JSON mode:

import openai
import json
from pydantic import BaseModel

# Point the OpenAI client at the Fireworks API
client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

# Target schema for the fields we want extracted
class Result(BaseModel):
    professional_associations: list[str]
    accomplishments: list[str]

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://storage.googleapis.com/fireworks-public/test/sample_resume.pdf#transform=inline"
                    },
                },
                {
                    "type": "text",
                    "text": "Extract the list of professional associations and accomplishments into JSON"
                },
            ],
        }
    ],
    # Fireworks JSON mode: constrain output to the Result schema
    response_format={"type": "json_object", "schema": Result.model_json_schema()},
)
json_string = response.choices[0].message.content
result = json.loads(json_string)
print(result)

Build smarter through Compound AI

Document Inlining showcases the power of compound AI systems. With Document Inlining, instead of relying on one vision model to handle all tasks, we achieve higher-quality, faster and more cost-efficient results by using a specialized parser and reasoning model. We’ll expand Document Inlining with other input transformations, including audio file inlining and inference-time search over long documents.

Fireworks makes it easy to build compound AI systems, by providing one place for:

  • Inference: Run all types of models and components quickly and cost-effectively
  • Models and modalities: Get all the models you need for your system in one place, across modalities like text, audio, image and vision understanding
  • Adaptability: Tune and optimize models for quality and speed to suit your use case
  • Compound AI: Coordinate and run components together by using Fireworks frameworks and tools like function calling and JSON mode


Keep in touch with us on Discord or Twitter. Stay tuned for more updates coming soon!
