OCR with GOT and Sonnet

This is a problem that has stumped DocumentIntelligence (85%), Sonnet 3.5 (~90%) and other models. It's pretty simple:

Form 20 - Indian Electoral Results

Download PDFPDF

Loading PDF...

Scroll to view all pages • Pinch to zoom

Can you OCR this table (just the first page is hard) and get back to structured data?

With GOT and Sonnet-3.5, we can build a more generic pipeline with just rendered images that achieves 98.79%.

Here's the approach:

Use GOT to OCR the page.

Here's what we get:

Raw OCR

Plug that into Claude along with the table and this prompt:

Text

Here is a table image and the OCR output. Can you use the OCR output and put it into a proper markdown table without any mistakes?

Here's what we get:

Certainly. I'll create a markdown table using the OCR output, correcting any obvious errors and formatting it properly. Here's the markdown table:

Sr. No.	Station Name	Anurag Singh Thakur	Satpal Raizada	Hem Raj	Arun Ankesh Syal	Subedar Major Kulwant Singh	Jagdeep Kumar	Sumit Rana	Garib Dass Katoch	Gopi Chand Attari	Nand Lal	Ramesh Chand Sarthi	Surender Kumar Gautam	Total valid votes	NOTA	Total votes	No. of Tendered Votes
1	1-SAYAR	393	213	9	0	2	12	0	0	0	0	2	1	632	8	640	0
2	2-KAROT	313	172	20	2	1	1	0	0	1	1	1	0	512	3	515	0
3	3-RANIKOTLA	640	281	0	1	4	17	0	0	0	0	1	3	947	2	949	0
4	4-SAI BRAHMA	364	249	1	0	1	10	0	0	1	1	0	6	633	6	639	0
5	5-BATHOH	331	250	3	1	0	21	0	2	0	0	1	3	612	2	614	0
6	6-BHOLI	364	313	12	0	3	4	1	0	1	2	4	3	707	6	713	0
7	7-KHARSI KNAITTA	316	96	2	0	1	4	0	0	0	1	0	0	420	3	423	0
8	8-SOHRA BYUS	254	116	0	0	2	0	0	0	0	2	1	0	375	2	377	0
9	9-SUI	457	240	4	1	3	3	0	1	2	1	0	1	713	5	718	0
10	10-MALOKHAR	494	206	0	0	0	2	0	0	0	1	1	1	705	7	712	0
11	11-SOLDHA	445	275	5	0	3	6	0	0	0	2	2	1	739	5	744	0
12	12-CHANDPUR	254	167	1	0	0	7	1	1	0	2	1	5	439	6	445	0
13	13-KYARAN	301	243	2	0	0	7	0	0	0	3	0	1	557	2	559	0
14	14-CHHAKOH	542	261	4	1	1	1	1	0	0	0	0	0	811	5	816	0
15	15-GALAU	317	360	4	0	1	2	1	0	0	1	2	1	689	3	692	0
16	16-SAI NOWDAN	147	115	0	0	0	13	0	0	0	0	0	0	275	3	278	0
17	17-JANLAGHAT	296	336	2	0	0	17	0	0	0	1	0	0	652	1	653	0
18	18-SOHRI	263	182	1	1	3	8	2	0	0	0	0	2	462	4	466	0
19	19-BANDU	199	167	2	0	0	1	0	0	0	0	0	0	369	1	370	0
20	20-ASHA MAJARI	224	157	0	0	1	4	0	0	1	0	0	1	388	0	388	0
21	21-JUKHALA-1	195	198	2	0	1	17	1	0	0	1	1	2	418	4	422	0
22	22-GASSAUR	322	262	1	0	1	6	1	0	0	1	0	4	598	3	601	0
23	23-JUKHALA-2	281	238	0	0	0	7	0	0	0	0	1	0	527	2	529	0
24	24-SYCHLA	450	359	1	1	1	48	5	0	0	2	0	0	867	9	876	1
25	25-KOTLA	510	318	3	0	1	12	1	0	1	1	0	0	847	8	855	0
26	26-HIHARKHAN BASLA	309	320	4	0	0	18	1	0	1	0	0	2	655	4	659	0
27	27-PALOG	67	139	0	0	0	0	0	0	0	0	1	2	209	1	210	0
28	28-BAGKHURD	210	235	1	1	9	22	2	0	0	0	1	0	481	0	481	0
29	29-SOSAN	273	323	4	0	5	1	1	0	0	1	1	3	612	4	616	0
30	30-GHYAL	186	440	2	1	0	3	0	0	0	0	1	1	634	7	641	0
31	31-TEPARA	76	166	0	0	2	1	0	0	0	0	0	0	245	1	246	0
32	32-DABAR	199	315	0	0	1	4	0	0	0	0	1	1	521	0	521	0
33	33-KOTHI	139	127	0	0	0	4	0	0	0	0	0	1	271	1	272	0
34	34-LUHARDA	169	136	0	0	1	6	0	0	0	0	0	0	312	2	314	0
35	35-KAHLI	113	59	4	0	1	0	0	0	0	0	0	0	177	1	178	0

The table above shows the corrected OCR output - 35 polling stations with 19 columns each, representing voting results for various candidates.

This is the image we're using:

Results

Out of 19x35 fields (665), we (@Minu Choi really) found 8 problems. That's an accuracy of 98.79%! Far better than anything else we've tried.

Next steps

Fix the issues with the cleaned data so we have something to compare to.

Try an ablation study with surya, docintelligence, raw LLM extraction, prompting, etc and compare results to see what the error rate is.

Sr. No.

Station Name

Anurag Singh Thakur

Satpal Raizada

Hem Raj

Arun Ankesh Syal

Subedar Major Kulwant Singh

Jagdeep Kumar

Sumit Rana

Garib Dass Katoch

Gopi Chand Attari

Nand Lal

Ramesh Chand Sarthi

Surender Kumar Gautam

Total valid votes

Rejected votes

NOTA

Total votes

No. of Tendered Votes

1-SAYAR

393

213

632

640

2-KAROT

313

172

512

515

3-RANIKOTLA

640

281

947

949

4-SAI BRAHMA

364

249

633

639

5-BATHOH

331

250

612

614

6-BHOLI

364

313

707

713

7-KHARSI KNAITTA

316

420

423

8-SOHRA BYUS

254

116

375

377

9-SUI

457

240

713

718

10-MALOKHAR

494

206

705

712

11-SOLDHA

445

275

739

744

12-CHANDPUR

254

167

439

445

13-KYARAN

301

243

557

559

14-CHHAKOH

542

261

811

816

15-GALAU

317

360

689

692

16-SAI NOWDAN

147

115

275

278

17-JANLAGHAT

296

336

652

653

18-SOHRI

263

182

462

466

19-BANDU

199

167

369

370

20-ASHA MAJARI

224

157

388

21-JUKHALA-1

195

198

418

422

22-GASSAUR

322

262

598

601

23-JUKHALA-2

281

238

527

529

24-SYCHLA

450

359

867

876

25-KOTLA

510

318

847

855

26-HIHARKHAN BASLA

309

320

655

659

27-PALOG

139

209

210

28-BAGKHURD

210

235

481

29-SOSAN

273

323

612

616

30-GHYAL

186

440

634

641

31-TEPARA

166

245

246

32-DABAR

199

315

521

33-KOTHI

139

127

271

272

34-LUHARDA

169

136

312

314

35-KAHLI

113

177

178