OCR with GOT and Sonnet

Hrishi Olickel & Minu Choi·

This is a problem that has stumped DocumentIntelligence (85%), Sonnet 3.5 (~90%) and other models. It's pretty simple:

Form 20 - Indian Electoral Results

PDF
Loading PDF...

Scroll to view all pages • Pinch to zoom

Can you OCR this table (just the first page is hard) and get back to structured data?

With GOT and Sonnet-3.5, we can build a more generic pipeline with just rendered images that achieves 98.79%.

Here's the approach:

  1. Use GOT to OCR the page.

Here's what we get:

Raw OCR

  1. Plug that into Claude along with the table and this prompt:
Text
Here is a table image and the OCR output. Can you use the OCR output and put it into a proper markdown table without any mistakes?
  1. Here's what we get:

Certainly. I'll create a markdown table using the OCR output, correcting any obvious errors and formatting it properly. Here's the markdown table:

Sr. No.Station NameAnurag Singh ThakurSatpal RaizadaHem RajArun Ankesh SyalSubedar Major Kulwant SinghJagdeep KumarSumit RanaGarib Dass KatochGopi Chand AttariNand LalRamesh Chand SarthiSurender Kumar GautamTotal valid votesRejected votesNOTATotal votesNo. of Tendered Votes
11-SAYAR39321390212000021632086400
22-KAROT31317220211001110512035150
33-RANIKOTLA64028101417000013947029490
44-SAI BRAHMA36424910110001106633066390
55-BATHOH33125031021020013612026140
66-BHOLI36431312034101243707067130
77-KHARSI KNAITTA316962014000100420034230
88-SOHRA BYUS2541160020000210375023770
99-SUI4572404133012101713057180
1010-MALOKHAR4942060002000111705077120
1111-SOLDHA4452755036000221739057440
1212-CHANDPUR2541671007110215439064450
1313-KYARAN3012432007000301557025590
1414-CHHAKOH5422614111100000811058160
1515-GALAU3173604012100121689036920
1616-SAI NOWDAN14711500013000000275032780
1717-JANLAGHAT29633620017000100652016530
1818-SOHRI2631821138200002462044660
1919-BANDU1991672001000000369013700
2020-ASHA MAJARI2241570014001001388003880
2121-JUKHALA-119519820117100112418044220
2222-GASSAUR3222621016100104598036010
2323-JUKHALA-22812380007000010527025290
2424-SYCHLA45035911148500200867098761
2525-KOTLA51031830112101100847088550
2626-HIHARKHAN BASLA30932040018101002655046590
2727-PALOG671390000000012209012100
2828-BAGKHURD21023511922200010481004810
2929-SOSAN2733234051100113612046160
3030-GHYAL1864402103000011634076410
3131-TEPARA761660021000000245012460
3232-DABAR1993150014000011521005210
3333-KOTHI1391270004000001271012720
3434-LUHARDA1691360016000000312023140
3535-KAHLI113594010000000177011780

The table above shows the corrected OCR output - 35 polling stations with 19 columns each, representing voting results for various candidates.

This is the image we're using:

The source image - Form 20 Indian electoral results showing 35 polling stations across 19 columns
Figure 1: The source image - Form 20 Indian electoral results showing 35 polling stations across 19 columns

Results

Out of 19x35 fields (665), we (@Minu Choi really) found 8 problems. That's an accuracy of 98.79%! Far better than anything else we've tried.

Next steps

  1. Fix the issues with the cleaned data so we have something to compare to.
  1. Try an ablation study with surya, docintelligence, raw LLM extraction, prompting, etc and compare results to see what the error rate is.