Data Extraction from scanned pdf files

Hints, tips and tricks for newbies

Moderators: Dorian (MJT support), JRL

lucasc
Newbie
Posts: 1
Joined: Mon Feb 13, 2017 3:26 am

Data Extraction from scanned pdf files

Post by lucasc » Mon Feb 13, 2017 3:39 am

Hello everyone!

I'd like to describe a problem I'm facing and ask whether Macro Scheduler can handle it. My office has a large volume of paper documents that need to be digitized. Beyond that, the data has to be transferred into a specific Excel sheet to form a database. I've been looking at Optical Character Recognition (OCR) programs for the first part, i.e. converting scanned files into searchable files, and I believe those are easy to find online. However, I'm clueless about the second part. Does anyone have a hint on how I can transfer information from a form into an Excel sheet using Macro Scheduler?

PS: The scanned documents follow a fixed template, so each piece of information should always be in the same position on the page. My goal is to extract information from this form so that I can create a database containing Client/Type of Expenses/Expense Amount.

PS2: If Macro Scheduler can do this without first converting the scanned files into searchable files (maybe it has built-in OCR?), that would be the best option.

Dorian (MJT support)
Automation Wizard
Posts: 1350
Joined: Sun Nov 03, 2002 3:19 am

Re: Data Extraction from scanned pdf files

Post by Dorian (MJT support) » Wed Feb 15, 2017 11:51 pm

Hi,

I've written a few scripts for customers wishing to extract information from PDFs.

It usually goes something like this:

Use GetFileList to create a list of all the PDFs in a folder.

Open each PDF in turn using ExecuteFile.

Select all and copy the text, then assign it to a variable using GetClipBoard.

Now we use Regex to extract exactly the data we're looking for.

Once that's done, we use the native Excel functions to write it all to an Excel file.
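
For readers who want to see the extraction step in isolation, here is a minimal sketch of the Regex and write-out stages. It is written in Python rather than Macro Scheduler script, and it writes CSV (which Excel opens directly) instead of using the native Excel functions. The field labels, the patterns, and the sample text are all assumptions made up for illustration; the real patterns depend on what the copied text from your form actually looks like.

```python
import csv
import io
import re

# Hypothetical labels: the thread only says the form holds Client,
# Type of Expenses and Expense Amount; the exact wording is assumed.
FIELD_PATTERNS = {
    "Client": re.compile(r"^Client:\s*(.+?)\s*$", re.MULTILINE),
    "Type of Expenses": re.compile(r"^Type of Expenses:\s*(.+?)\s*$", re.MULTILINE),
    "Expense Amount": re.compile(r"^Expense Amount:\s*([\d.,]+)\s*$", re.MULTILINE),
}


def extract_fields(text):
    """Return one row (dict) of field values; missing fields become ''."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else ""
    return row


def rows_to_csv(rows):
    """Serialise rows to CSV text, a format Excel opens directly."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample standing in for the text copied from one PDF.
SAMPLE_TEXT = """Expense Form
Client: Acme Ltd
Type of Expenses: Travel
Expense Amount: 123.45
"""
```

Calling `rows_to_csv([extract_fields(SAMPLE_TEXT)])` gives a header line followed by one data line per document. Because the template is fixed, one pattern per field is usually enough, but OCR output can be noisy, so expect to loosen the patterns after testing on real scans.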

Hopefully this will give you a good starting point once you've got the documents scanned. We're happy to help you, if you need us to. We can do this via our regular support department, or we can help you via our custom scripting service. Whichever works best for you.
Yes, we have a Custom Scripting Service. Message me or go here

Marcus Tettmar
Site Admin
Posts: 7380
Joined: Thu Sep 19, 2002 3:00 pm
Location: Dorset, UK

Re: Data Extraction from scanned pdf files

Post by Marcus Tettmar » Thu Feb 16, 2017 7:56 am

You can also try using pdftotext. Loop through the files, shell to pdftotext to extract to text files, then you can open the text up and manipulate in script. This obviously requires the PDFs to actually contain text rather than just flat images. Try it and see.
Marcus Tettmar
http://mjtnet.com/blog/ | http://twitter.com/marcustettmar
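
The pdftotext loop described above could be sketched roughly as follows. This is Python, not Macro Scheduler script, and it assumes pdftotext is installed and on the PATH. The function names are hypothetical, and the `-layout` option (which asks pdftotext to preserve the page layout) is a choice made here because the forms follow a fixed template, not something suggested in the thread. The last helper flags PDFs whose extracted text is empty, which is the "flat image" case that would still need OCR first.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path):
    """Build the command line: pdftotext -layout <input.pdf> <output.txt>."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_folder(folder):
    """Run pdftotext on every PDF in `folder`; return the text files written."""
    outputs = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        # check=True raises CalledProcessError if pdftotext fails on a file.
        subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
        outputs.append(txt_path)
    return outputs


def needs_ocr(txt_path):
    """True when the extracted text is blank, i.e. the PDF was image-only."""
    text = Path(txt_path).read_text(encoding="utf-8", errors="replace")
    return not text.strip()
```

A typical run would call `convert_folder` on the scan folder, then pass each resulting text file through `needs_ocr` before attempting any pattern matching.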


Dorian (MJT support)
Automation Wizard
Posts: 1350
Joined: Sun Nov 03, 2002 3:19 am

Re: Data Extraction from scanned pdf files

Post by Dorian (MJT support) » Thu Feb 16, 2017 11:24 am

That sounds like it would streamline things wonderfully. I'm downloading that now so I can have a play with it. I've found command line tools combined with Macro Scheduler to be a very powerful combination in the past.

CyberCitizen
Automation Wizard
Posts: 721
Joined: Sun Jun 20, 2004 7:06 am
Location: Adelaide, South Australia

Re: Data Extraction from scanned pdf files

Post by CyberCitizen » Fri Feb 17, 2017 11:13 am

Marcus Tettmar wrote: You can also try using pdftotext. Loop through the files, shell to pdftotext to extract to text files, then you can open the text up and manipulate in script. This obviously requires the PDFs to actually contain text rather than just flat images. Try it and see.
I was going to suggest this as well; it's a great little application.
FIREFIGHTER
