LLM-centric pipeline for information extraction from invoices - Laboratoire d’Informatique, Systèmes, Traitement de l’Information et de la Connaissance
Conference paper, Year: 2024

Abstract

Extracting information from digital documents is an evolving area of research, especially given recent advances in artificial intelligence and computer vision. Large Language Models (LLMs) have recently shown remarkable performance in various natural language processing tasks, including data extraction from documents. However, the accuracy of these models can degrade significantly on large or complicated documents because of the inherent complexity and variability of rich formats. In this paper, we target a specific type of complex document: financial invoices. Optical Character Recognition (OCR) technology extracts editable and searchable data from documents that have been converted into images, e.g., scanned documents and PDFs. However, OCR is highly sensitive to noise and image misalignment, which frequently results in incorrect text extraction. Moreover, OCR cannot understand the structure of a document or leverage it to interpret the semantics of the document's content and extract structured information. OCR is therefore considered a preprocessing step that needs to be complemented by further processing. In this paper, we use text-based LLMs to enrich OCR outputs in order to extract structured information from financial invoices. We show that by combining OCR engines, including Tesseract and DocTR, with two open-source LLMs, Llama3 and Mistral, we significantly improve the accuracy and reliability of information extraction on two datasets of business documents: SROIE and FATURA.
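
The abstract does not spell out how the OCR output is handed to the LLM, so the following is only a minimal sketch of one plausible OCR-then-LLM flow, not the authors' implementation. It assumes the OCR text is placed in a prompt and the model is asked to answer in JSON. Every name here (`FIELDS`, `build_prompt`, `parse_response`, `extract_invoice_fields`) is hypothetical, and the `ocr` and `llm` arguments are plain callables that you would wrap around an OCR engine such as Tesseract or DocTR and a model such as Llama3 or Mistral. The field list is an illustrative choice, and the stubs at the end return canned data only.

```python
import json
import re
from typing import Callable, Dict, Optional, Sequence

# Illustrative field list (an assumption, not taken from the paper).
FIELDS = ("company", "date", "address", "total")


def build_prompt(ocr_text: str, fields: Sequence[str]) -> str:
    """Wrap raw OCR text in an instruction asking for a JSON answer."""
    return (
        "Extract the following fields from the invoice text below: "
        + ", ".join(fields)
        + ".\nAnswer with a single JSON object using exactly these keys; "
        "use null for any field that is not present.\n\n"
        "Invoice text:\n" + ocr_text
    )


def parse_response(response: str, fields: Sequence[str]) -> Dict[str, Optional[str]]:
    """Pull the first-to-last brace span out of the reply and read it as JSON."""
    result: Dict[str, Optional[str]] = {name: None for name in fields}
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        return result  # the model returned no JSON object at all
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return result  # malformed JSON: leave every field empty
    for name in fields:
        value = data.get(name)
        if value is not None:
            result[name] = str(value).strip()
    return result


def extract_invoice_fields(
    image_path: str,
    ocr: Callable[[str], str],
    llm: Callable[[str], str],
    fields: Sequence[str] = FIELDS,
) -> Dict[str, Optional[str]]:
    """Run OCR on the image, prompt the LLM, and parse its structured reply."""
    ocr_text = ocr(image_path)
    return parse_response(llm(build_prompt(ocr_text, fields)), fields)


# Stand-in callables with canned data, so the sketch runs without any model.
def fake_ocr(image_path: str) -> str:
    return "ACME SDN BHD\nDate: 25/12/2018\nTOTAL 9.00"


def fake_llm(prompt: str) -> str:
    return 'Here is the result: {"company": "ACME SDN BHD", "date": "25/12/2018", "total": 9.0}'


result = extract_invoice_fields("invoice.png", fake_ocr, fake_llm)
print(result)
```

Passing the OCR engine and the model in as callables keeps the sketch independent of any particular library, which makes it easy to swap engine and model combinations when comparing them.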
Main file
2024323105.pdf (1.4 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-04772570, version 1 (08-11-2024)

Identifiers

  • HAL Id: hal-04772570, version 1

Cite

Faiza Loukil, Sarah Cadereau, Hervé Verjus, Mattéo Galfré, Kavé Salamatian, et al. LLM-centric pipeline for information extraction from invoices. International Conference on Foundation and Large Language Models (FLLM2024), Nov 2024, Dubai, United Arab Emirates. ⟨hal-04772570⟩

Collections

UNIV-SAVOIE LISTIC