I am working on turning raw OCR outputs from dots.ocr into nice structured JSONs for 10+ logistics document types. The OCR outputs all look different because the structure of the original docs differs (even within the same doc type). There is one JSON schema per document type; the schemas include fields like number, date, products (weight/price), etc.
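For a sense of what a target record looks like, here is a rough illustration as a Python dict (field names and values are hypothetical, not taken from my actual schemas):

```python
# Hypothetical target record for one document type -- illustrative only,
# not one of the actual schemas behind this market.
example_record = {
    "number": "INV-2024-0042",
    "date": "2024-05-17",
    "products": [
        {"name": "Pallet of widgets", "weight_kg": 450.0, "price": 1200.00},
    ],
}
```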
I have 120 docs recognized by dots and JSONified by Gemini 3 Pro; they act as my ground truth. Now I am using a wannabe server with a 5090 and 32 GB of RAM to evaluate different local models in the same scaffolding. The current best result is about 65% accuracy (by exact field matching).
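To be clear about what "exact field matching" means, the scoring is roughly like the sketch below (my real scaffolding may differ in details like value normalization or how nested product lists are compared):

```python
def exact_match_score(pred: dict, truth: dict) -> float:
    """Fraction of ground-truth fields reproduced exactly.

    Rough sketch of the metric, assuming leaves are compared after
    flattening nested dicts/lists into dotted paths.
    """
    def flatten(obj, prefix=""):
        # Turn nested dicts/lists into (path, value) pairs so every leaf
        # becomes one comparable field, e.g. "products.0.weight_kg".
        if isinstance(obj, dict):
            for k, v in obj.items():
                yield from flatten(v, f"{prefix}{k}.")
        elif isinstance(obj, list):
            for i, v in enumerate(obj):
                yield from flatten(v, f"{prefix}{i}.")
        else:
            yield prefix.rstrip("."), obj

    truth_fields = dict(flatten(truth))
    pred_fields = dict(flatten(pred))
    hits = sum(1 for k, v in truth_fields.items() if pred_fields.get(k) == v)
    return hits / len(truth_fields) if truth_fields else 1.0
```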
The aim of this market is to improve that score, and I commit to trying every answer I am able to test within these 2 days (if I even get any).
Possible answers include names of models under 25 GB (I will test them in the same scaffolding) + modifications to the existing scaffolding (use some VLM and feed the image along with the OCR tokens, make a single call per field with an insanely fast model (sketched below), etc.).
Do note that I am using LM Studio, so by default the models should be compatible with it. That said, "use vLLM bro" or something like that can count as an answer if I end up using it for the best score.
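For example, the "single call per field" modification could look roughly like this against LM Studio's OpenAI-compatible local server (the model name and prompt here are placeholders, not what I actually run):

```python
from openai import OpenAI

# LM Studio exposes an OpenAI-compatible server, by default at http://localhost:1234/v1.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def extract_field(ocr_text: str, field: str,
                  model: str = "some-small-fast-model") -> str:
    """One request per field: ask for a single value and nothing else."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Extract the requested field from the OCR text. "
                        "Reply with the value only, or null if absent."},
            {"role": "user",
             "content": f"Field: {field}\n\nOCR text:\n{ocr_text}"},
        ],
    )
    return resp.choices[0].message.content.strip()

# e.g. loop over ["number", "date", ...], call extract_field per field,
# then assemble the per-document JSON from the answers.
```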
I will resolve everything that is used in the final best score to yes, so in the best case the resolution looks like 1 model + multiple compatible approaches.