Can you use logprobs to improve direct multimodal OCR?
Short answer: we can, for OpenAI models! Not so much for Gemini. Additionally, GPT does better at lower temps, Gemini does better at higher temps, and Gemini Flash makes significantly fewer mistakes than gpt-4o.
The core question is simple: do model confidences correlate with accuracy in OCR?
Gemini now provides token log probabilities (along with multiple candidates) in the response through Vertex AI, so let's give it a shot.
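As a rough sketch (not the exact script from this post), here's what asking for logprobs looks like against the Vertex AI generateContent REST endpoint. The responseLogprobs and logprobs fields match the Vertex API as of this writing; PROJECT_ID, REGION, and the access token are placeholders you'd fill in yourself:

```ts
// Sketch: request token logprobs from Gemini via Vertex AI's REST API.
// PROJECT_ID, REGION, and accessToken are placeholders you supply yourself.
const PROJECT_ID = "your-project";
const REGION = "us-central1";
const MODEL = "gemini-1.5-flash-002";

async function generateWithLogprobs(prompt: string, accessToken: string) {
  const url =
    `https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}` +
    `/locations/${REGION}/publishers/google/models/${MODEL}:generateContent`;

  const res = await fetch(url, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${accessToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      contents: [{ role: "user", parts: [{ text: prompt }] }],
      generationConfig: {
        temperature: 0,
        responseLogprobs: true, // return the logprob of each chosen token
        logprobs: 5, // also return the top-5 alternative tokens per step
      },
    }),
  });

  const data = await res.json();
  // The per-token log probabilities live under candidates[n].logprobsResult.
  return data.candidates?.[0]?.logprobsResult;
}
```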
Data
We're just going to use this one page from our previous work replacing OCR with multimodal LLMs:
This is the electoral table we're testing on:

This prompt is used to extract the table:
Convert the first table in this image ${imagePath} into a 2d array. Feel free to ignore formatting fluff or random things. Keep the headers. The number of columns and rows should match - there are no merged columns.
Unfortunately we can only use gemini-flash (both 001 and 002 work as of this writing). We press on!
Approach
The approach was simple but took an hour or two to get working. The idea is to force JSON output through a typespec, then visualize the output with its token logprobs.
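To sketch the typespec part: Vertex AI accepts an OpenAPI-style responseSchema alongside responseMimeType, so you can constrain the output to a 2D array of strings. This is an illustration of the idea rather than the exact config from the script:

```ts
// Sketch: constrain Gemini's output to a 2D array of cell strings so every
// generated token belongs to table content (plus JSON punctuation).
const generationConfig = {
  temperature: 0,
  responseMimeType: "application/json",
  responseSchema: {
    type: "ARRAY", // the table: an array of rows
    items: {
      type: "ARRAY", // each row: an array of cell strings
      items: { type: "STRING" },
    },
  },
  responseLogprobs: true,
  logprobs: 5,
};
```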
This script was used to get the output (there's a half-useful Vertex adapter in there as well):
We get this output for our image (at temp 0 with 002):
2024-12-09T07-00-21-027Z-gemini-1.5-flash-002-polling.png.json
This is the output for 002 at temp 1.0:
2024-12-09T07-25-29-558Z-gemini-1.5-flash-002-temp-1.0-polling.png.json
This is the output for flash 001:
2024-12-09T07-26-18-299Z-gemini-1.5-flash-001-polling.png.json
Results
We created this app (with Sonnet and o1):
https://ocr-with-logprobs.vercel.app/
Visualization
What's immediately obvious is that differences are easier to see at higher temp.

Graphing positives and negatives against the master (ground-truth) data:
We can see how we did - not very well!

High temp causes significantly fewer errors, so there's less signal to go by:

There's no easy threshold that catches the mistakes without also raising the false positives by a lot.
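To make that tradeoff concrete, here's a hypothetical flagging pass (not the app's actual code): assume tokens have already been aligned back to their cells and each cell carries its weakest token logprob, then count how many flagged cells are real errors versus false alarms at a given threshold:

```ts
// Hypothetical per-cell confidence check. Aligning tokens to cells is the
// fiddly part and is not shown; here each cell already carries the minimum
// logprob among its tokens, plus ground truth for scoring.
type Cell = { value: string; minLogprob: number; isError: boolean };

function flagCells(cells: Cell[], threshold: number) {
  const flagged = cells.filter((c) => c.minLogprob < threshold);
  const truePositives = flagged.filter((c) => c.isError).length;
  return {
    truePositives,
    falsePositives: flagged.length - truePositives,
  };
}

// Toy data to show the sweep; real cells come from the extracted table.
const cells: Cell[] = [
  { value: "1,234", minLogprob: -0.02, isError: false },
  { value: "7", minLogprob: -1.8, isError: true },
];

// Sweeping the threshold exposes the problem: loose thresholds miss errors,
// tight ones flag a large share of perfectly good cells.
for (const t of [-3, -2, -1, -0.5, -0.1]) {
  console.log(t, flagCells(cells, t));
}
```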
Comparing to GPT-4o
Interestingly, OpenAI offers the same thing. Comparing to gpt-4o, we find that the same relationships do not hold.
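For reference, the equivalent request with the official openai Node SDK looks roughly like this; imageDataUrl is a placeholder for a base64 data URL of the page image:

```ts
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

const imageDataUrl = "data:image/png;base64,..."; // placeholder page image

const completion = await client.chat.completions.create({
  model: "gpt-4o",
  temperature: 0,
  logprobs: true,
  top_logprobs: 5, // up to 5 alternative tokens per position
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "Convert the first table in this image into a 2d array...",
        },
        { type: "image_url", image_url: { url: imageDataUrl } },
      ],
    },
  ],
});

// Per-token logprobs: an array of { token, logprob, top_logprobs } entries.
const tokenLogprobs = completion.choices[0].logprobs?.content ?? [];
```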
First, high temp is markedly worse than low temp:
High temp

Low temp

Low temp causes way fewer errors, and the logprob signal is strong enough to catch most errors even at very low thresholds.
This needs more study to determine whether this is a model-based difference, something dependent on the dataset, or simply that Gemini had significantly fewer errors to begin with.
For reference, this is what gpt-4o looks like on the dataset:

Just look at that sea of red.
Try it yourself
If you'd like to try it, go here: https://ocr-with-logprobs.vercel.app/ and load your own dataset, or click Load CMS Dataset.