Best way to convert varied scanned invoices to text and extract info?

Self-Hosted Alternatives to Popular Services

2 July 2024 at 13:09

Hey Reddit,

What would be the best way to convert scanned documents into text and extract specific information? The problem is that the documents (e.g., invoices) come from thousands of banks and each one is structured differently; some are digital, and some are simply scanned. I tried invoice2data (open source on GitHub), but if I have to write a regex for each bank and invoice format, I'll be doing it forever. On top of that, Tesseract OCR sometimes misreads characters, for example confusing 'o' with '0'. Is there a better solution, perhaps something that learns? The goal is to extract the key information from the invoices and put it into an XML file that can be imported elsewhere.
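To make the last two points concrete, here is a minimal sketch of what I mean, using only the Python standard library. It is an illustration, not a solution: the lookalike table, the field names (`number`, `total`), and the flat `<invoice>` layout are all assumptions made up for this example, and the sample values are invented rather than real Tesseract output.

```python
import xml.etree.ElementTree as ET

# Letters that OCR may confuse with digits (an assumed, incomplete list).
DIGIT_LOOKALIKES = str.maketrans(
    {"o": "0", "O": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def fix_numeric_field(raw: str) -> str:
    """Replace letter lookalikes with digits in a field known to be numeric."""
    return raw.strip().translate(DIGIT_LOOKALIKES)


def invoice_to_xml(fields: dict[str, str]) -> str:
    """Serialize extracted fields into a flat <invoice> element (hypothetical schema)."""
    root = ET.Element("invoice")
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


# Invented OCR output for one invoice.
ocr_fields = {"number": "INV-2O24-0l7", "total": "1O4.5O"}
cleaned = {
    # Mixed alphanumeric field: left untouched, since a blanket swap could corrupt it.
    "number": ocr_fields["number"],
    # Purely numeric field: safe to map lookalike letters to digits.
    "total": fix_numeric_field(ocr_fields["total"]),
}
print(invoice_to_xml(cleaned))
```

The correction only works when a field is already known to be numeric, which is exactly the part that does not scale: finding the fields in the first place is what still needs a regex per layout.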

submitted by /u/MisterCookie1234