How to Extract Key Invoice Parameters Using Tesseract and an End-to-End Sequential Model

What is AIESI?
Abstract Information Extraction from Scanned Invoice (AIESI) is an architecture for extracting key information such as company name, invoice number, address, tax amount, total amount, and date. Simple OCR can only detect the text in an invoice. The task of AIESI is not only to detect the text but also to extract the useful key information from it.
What are the advantages of an AIESI system?
1. AIESI helps to streamline data from scanned documents.
2. It is helpful for applications and systems that need to extract key information and store it in a database.
3. The extracted data gives a more complete insight into scanned documents and can be used for fast indexing, archiving, and analytics.
4. With recent advances in deep learning, the accuracy and processing time of AIESI have improved significantly.
5. The system is very helpful for industries such as banking, medical, and insurance.
What is the pipeline?
1. Detect bounding boxes in the invoice
2. Extract the text from each bounding box
3. Extract key information from the extracted text from invoice (ETFI)
What is Tesseract?
Tesseract is an OCR engine. It started as a research project at HP and was released as open source in 2005. Google began sponsoring its development in 2006, and it remains open source. More information can be found in the paper at https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33418.pdf
For detecting bounding boxes and extracting text from scanned documents, we use the Tesseract engine. Many APIs and frameworks are available for this task, such as the Google Mobile Vision API and Amazon Textract. We can also use our own models to detect bounding boxes and extract the text they enclose. CTPN (Connectionist Text Proposal Network) can be used to detect bounding boxes in scanned documents. Various architectures are also available for recognizing the text inside a bounding box. Most of them combine a CNN and an RNN: the CNN finds visual features in the image and the RNN extracts sequential features, and by combining both we can recognize the text in the bounding box.
For this post, I am using Tesseract for steps 1 and 2 of the pipeline.
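As a rough illustration of steps 1 and 2, here is a minimal sketch that asks Tesseract for word-level boxes and groups them into text lines. It assumes the `pytesseract` wrapper and Pillow are installed along with the Tesseract binary; the helper names `run_tesseract` and `group_words_into_lines` are hypothetical and are not taken from the repo used in this post.

```python
from collections import defaultdict


def run_tesseract(image_path):
    """Run Tesseract on an image and return word-level data as a dict of lists."""
    # Imported lazily so the grouping helper below works without these packages.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )


def group_words_into_lines(data, min_conf=0.0):
    """Group Tesseract word boxes into text lines, one bounding box per line."""
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        # Rows for pages, blocks, and lines carry empty text and a conf of -1.
        if not word.strip() or float(data["conf"][i]) < min_conf:
            continue
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines[key].append(i)

    result = []
    for key in sorted(lines):
        idx = lines[key]
        left = min(data["left"][i] for i in idx)
        top = min(data["top"][i] for i in idx)
        right = max(data["left"][i] + data["width"][i] for i in idx)
        bottom = max(data["top"][i] + data["height"][i] for i in idx)
        text = " ".join(data["text"][i] for i in idx)
        result.append({"text": text, "box": (left, top, right, bottom)})
    return result
```

With an image on disk, `group_words_into_lines(run_tesseract("invoice.jpg"))` would return one entry per text line, where `invoice.jpg` is a placeholder file name. Joining the `text` values with newlines gives the ETFI that the next step works on.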
Key information extraction from the text extracted from scanned documents
Many methods are available for extracting key-value pairs. The simplest is rule-based string matching with regular expressions. Here we use character-level classification with a Bi-LSTM, where each character of the extracted text is assigned either to one of the key fields or to no field. This method is simple yet gives satisfactory results. I have used the pre-trained model by Niansong Zhang, which can be found at https://github.com/zzzDavid/ICDAR-2019-SROIE/tree/master/task3/src
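For comparison, the rule-based approach can be sketched in a few lines. The patterns below are hypothetical examples written for this illustration, not rules from any library or from the repo above, and real invoices vary enough that they would need tuning for each layout.

```python
import re

# Hypothetical patterns for illustration; tune them for your own invoices.
PATTERNS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:no|number|#)\s*[.:#]?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b"),
    # \b keeps "SUBTOTAL" from matching as "TOTAL".
    "total": re.compile(r"\btotal\b[^\d\n]*(\d[\d,]*(?:\.\d{2})?)", re.IGNORECASE),
}


def extract_fields(text):
    """Return the first match of each pattern found in the OCR text."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    return fields
```

Rules like these are easy to write but break as soon as a vendor uses a different label or layout, which is the main reason to prefer a learned character-level classifier.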
Model for character-level classification using a Bi-LSTM
<script src="https://gist.github.com/shreeshiv/379ff34d700c1c1e1481d5d44edc03e8.js"></script>
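Once a model like this has assigned a class to every character, the per-character predictions still have to be turned back into field values. The sketch below shows one simple way to do that. The class list and its order are an assumption made for this example and may not match the pre-trained model, and keeping only the longest run per field is a heuristic chosen here, not necessarily what the repo does.

```python
from itertools import groupby

# Assumed label set and order; check the model you use for the real mapping.
CLASSES = ["none", "company", "date", "address", "total"]


def decode_fields(text, char_classes):
    """Turn one predicted class index per character into field strings."""
    if len(text) != len(char_classes):
        raise ValueError("need exactly one class index per character")

    runs_by_field = {}
    pos = 0
    # groupby yields maximal runs of consecutive characters with the same class.
    for cls, run in groupby(char_classes):
        length = len(list(run))
        name = CLASSES[cls]
        if name != "none":
            runs_by_field.setdefault(name, []).append(text[pos:pos + length].strip())
        pos += length

    # Heuristic: if a field was predicted in several places, keep the longest run.
    return {name: max(runs, key=len) for name, runs in runs_by_field.items()}
```

The resulting dictionary can be written out with `json.dump` to produce the output JSON file for an invoice.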
The GitHub repo for the complete pipeline can be found here
How to run the complete pipeline
- Fork the complete project from the link above.
- Download the pre-trained model, the model.pth file.
- Run it.
Alternatively, you can run it on Google Colab: just run the Jupyter notebook [available in the repo]. Change the paths of the output JSON file and the input image file.
If you liked this post, hit clap! Thanks for reading. Cheers! 😊
Meanwhile, I am working on a product I call Expense.AI. It is a platform for storing all your invoices and bills in one place. We often lose invoices or cannot find them when we need them. With Expense.AI, we can keep all invoices in one location. Download our free application from the Google Play Store. Thank you.