How to Extract Invoice Key Parameters Using Tesseract and an End-to-End Sequential Model

May 21, 2020

Key invoice parameters are highlighted in different colours. Source: SROIE dataset for invoices.

What is AIESI?

Abstract Information Extraction from Scanned Invoices (AIESI) is an architecture for extracting key information such as company name, invoice number, address, tax amount, total amount, and date. Simple OCR only detects the text on an invoice. The task of AIESI is not only to detect the text but also to extract the useful key information from it.

What are the advantages of an AIESI system?

1. AIESI helps to streamline data from scanned documents.

2. It is helpful for applications and systems that need to extract key information and store it in a database.

3. The streamlined data gives complete insight into scanned documents and can be used for fast indexing, archiving, and analytics.

4. With recent advances in deep learning, the accuracy and processing time of AIESI have improved significantly.

5. For industries like banking, medical, and insurance, this system is very helpful.

What is the pipeline?

1. Detect bounding boxes in the invoice

2. Extract the text from each bounding box

3. Extract key information from the extracted text from the invoice (ETFI)

What is Tesseract?

Tesseract is an OCR engine. It started as a research project at HP and was released as open source in 2005. Google has since sponsored its development, and it remains open source. More information can be found in its official paper, https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33418.pdf

For detecting bounding boxes and extracting text from scanned documents, we use the Tesseract engine. Many APIs and frameworks are available for this task, such as the Google Mobile Vision API and Amazon Textract. We can also use our own model to detect bounding boxes and extract the text enclosed in them. CTPN can be used to detect bounding boxes in scanned documents. Various architectures are also available for recognizing the text inside a bounding box. Most methods combine a CNN and an RNN: the CNN finds visual features in the document and the RNN extracts sequential features, and by combining both we can recognize the text in a bounding box.

For this post, I am using Tesseract for pipeline steps 1 and 2.
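As a rough illustration of steps 1 and 2, here is a minimal sketch that assumes the pytesseract wrapper and Pillow are installed (the repo may call Tesseract differently). It uses `pytesseract.image_to_data`, which returns one row per detected element, and merges the word boxes into text lines. The helper names `group_words_into_lines` and `ocr_lines` are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def group_words_into_lines(data, min_conf=0.0):
    """Group Tesseract word boxes into text lines with a merged bounding box.

    `data` is a dict shaped like the output of
    pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT).
    """
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        # Rows for pages, blocks and lines carry empty text and conf == -1.
        if not text.strip() or float(data["conf"][i]) < min_conf:
            continue
        key = (data["page_num"][i], data["block_num"][i],
               data["par_num"][i], data["line_num"][i])
        box = (data["left"][i], data["top"][i],
               data["width"][i], data["height"][i])
        lines[key].append((text, box))

    result = []
    for key in sorted(lines):
        words = lines[key]
        # Merge the word boxes (left, top, width, height) into one line box.
        x0 = min(b[0] for _, b in words)
        y0 = min(b[1] for _, b in words)
        x1 = max(b[0] + b[2] for _, b in words)
        y1 = max(b[1] + b[3] for _, b in words)
        result.append({"text": " ".join(t for t, _ in words),
                       "box": (x0, y0, x1, y1)})
    return result


def ocr_lines(image_path):
    """Run Tesseract on an image file and return the grouped lines."""
    # Imported lazily so the grouping logic works without Tesseract installed.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return group_words_into_lines(data)
```

Calling `ocr_lines("invoice.jpg")` (a placeholder path) would return a list of dicts, each with the line text and its `(x0, y0, x1, y1)` box, which is the input the key information step needs.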

Key information extraction from the text extracted from scanned documents

Many methods are available for extracting key-value pairs. The simplest is rule-based string matching using regular expressions. Here we use character-level classification with a Bi-LSTM. This method is simple yet gives satisfactory results. I have used the pre-trained model by Niansong Zhang, which can be found at https://github.com/zzzDavid/ICDAR-2019-SROIE/tree/master/task3/src
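For comparison, the rule-based approach mentioned above can be sketched in a few lines. This is only an illustrative baseline, not the method used in this post: the two patterns (a numeric date and an amount following the word "total") and the function name `extract_fields` are assumptions made for the example, and real invoices would need many more rules.

```python
import re

# Numeric dates such as 21/05/2020 or 21-05-20.
DATE_RE = re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b")
# An amount with two decimals on the same line as the standalone word "total".
# The leading \b keeps "SUBTOTAL" from matching.
TOTAL_RE = re.compile(r"\btotal\b[^\d\n]*?(\d[\d,]*\.\d{2})", re.IGNORECASE)


def extract_fields(text):
    """Return the first date and total found in OCR text, if any."""
    fields = {}
    date = DATE_RE.search(text)
    if date:
        fields["date"] = date.group(1)
    total = TOTAL_RE.search(text)
    if total:
        fields["total"] = total.group(1)
    return fields
```

Rules like these break as soon as the layout or wording changes, which is the motivation for learning the labels per character instead.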

Model for character-level classification using a BiLSTM: https://gist.github.com/shreeshiv/379ff34d700c1c1e1481d5d44edc03e8
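A character-level classifier outputs one class per character, so the predictions still have to be turned back into key-value pairs. The sketch below shows one way to do that decoding. It assumes the four SROIE key fields (company, date, address, total) plus a "none" class at index 0; the index order and the function name `decode_fields` are assumptions for illustration and may not match the pre-trained model.

```python
from collections import defaultdict
from itertools import groupby

# Assumed label set: index 0 means "not part of any key field".
LABELS = ["none", "company", "date", "address", "total"]


def decode_fields(text, char_labels):
    """Turn one predicted class index per character into a field -> string dict."""
    if len(text) != len(char_labels):
        raise ValueError("need exactly one label per character")
    spans = defaultdict(list)
    pos = 0
    # Walk over runs of identical consecutive labels.
    for label, run in groupby(char_labels):
        length = len(list(run))
        if label != 0:
            spans[LABELS[label]].append(text[pos:pos + length].strip())
        pos += length
    # Join separate runs of the same class, e.g. an address split over lines.
    return {name: " ".join(p for p in parts if p) for name, parts in spans.items()}
```

In practice `char_labels` would come from taking the argmax over the model's per-character class scores.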

The complete pipeline is available in the GitHub repo shreeshiv/AIESI (Abstract Information Extraction from Scanned Invoices): github.com/shreeshiv/AIESI

How to run the complete pipeline

1. Fork the complete project from the link above.

2. Download the pre-trained model file, model.pth.

3. Run it.

Alternatively, you can run it on Google Colab: just run the Jupyter notebook [available in the repo]. Change the paths of the output JSON file and the input image file.

If you like this post, HIT clap! Thanks for reading. Hope you like it. Cheers! 😊.

Meanwhile, I am working on a product I call Expense.AI. It is a platform for storing all invoices and bills in one place. We often lose invoices or cannot find them when we need them. With Expense.AI we can store all invoices in one place. Download our FREE application from the Google Play Store. Thank you.

Expense.AI, Smart Expense Manager, Receipt Bank - Apps on Google Play

Expense.AI is a FREE and SAFE smart expense and receipt manager. Expense.AI is completely free to use and always. We…

play.google.com

Written by Shreeshiv Patel

No responses yet