I am a fairly new user but am convinced that MS can help me automate this. I have an excel file that contains thousands of reference numbers and a PDF file that contains those reference numbers plus a piece of information for each reference number that I want to get into my excel file (in the same row as the reference number).
I know I can copy the reference number in the excel file to clipboard and use it to find the corresponding reference number in the PDF file using Adobe reader search function but that's where I run out of ideas to grab the piece of information that I want. The piece of information is always pre-fixed with the text "Declared Value" and follows the reference number but not always in the same position. There is other information in-between the reference number and the "Declared Value" which appears maybe 2 or 3 lines below the reference number., never the same place.
My mind is trying to understand whether the image recognition functionality would be of any use to me here - I'm not sure how though.
Any help would be appreciated.
Seasons Greetings to all !!
Finding and extracting data from PDF to excel file
Moderators: JRL, Dorian (MJT support)
-
- Newbie
- Posts: 3
- Joined: Wed Dec 20, 2006 11:43 pm
- Bob Hansen
- Automation Wizard
- Posts: 2475
- Joined: Tue Sep 24, 2002 3:47 am
- Location: Salem, New Hampshire, US
- Contact:
The Search you are looking for is usually done with Regular Expressions.
Macro Scheduler uses VBScript which, I think, includes Regular Expression support.
Is there any way you can use the Search in PDF to find the code, then grab n lines of date, where n is large enough to include what you want.
Then use VBEVAL with RegEx to parse out what you need?
This is obviously untested, just tossing something on the table to trigger the correct solution.
Macro Scheduler uses VBScript which, I think, includes Regular Expression support.
Is there any way you can use the Search in PDF to find the code, then grab n lines of date, where n is large enough to include what you want.
Then use VBEVAL with RegEx to parse out what you need?
This is obviously untested, just tossing something on the table to trigger the correct solution.
Hope this was helpful..................good luck,
Bob
A humble man and PROUD of it!
Bob
A humble man and PROUD of it!
-
- Newbie
- Posts: 3
- Joined: Wed Dec 20, 2006 11:43 pm
This sounds interesting - any suggestions on how I can grab the data from the pdf file? When I use the Adobe search function it highlights the found text but as it is not a text file I am at a loss as to how to grab data from the PDF. I think the answer will come easier once I know how to get the data.
- Bob Hansen
- Automation Wizard
- Posts: 2475
- Joined: Tue Sep 24, 2002 3:47 am
- Location: Salem, New Hampshire, US
- Contact:
In Adobe Reader, change mouse icon from hand to Text Select Tool (Just press "V".).
Now do a Find, section will be selected.
Make sure what you want copied is highlighted.
Right click selected section, select Copy. That highlighted section is now in clipboard.
When done copying in the PDF file, press "H" to change icon back to the Hand if you want to move around the file again.
Now you can use Macro Scheduler GetClipBoard> command.
The steps above are all manual, am not sure how you can automate it.
Now do a Find, section will be selected.
Make sure what you want copied is highlighted.
Right click selected section, select Copy. That highlighted section is now in clipboard.
When done copying in the PDF file, press "H" to change icon back to the Hand if you want to move around the file again.
Now you can use Macro Scheduler GetClipBoard> command.
The steps above are all manual, am not sure how you can automate it.
Hope this was helpful..................good luck,
Bob
A humble man and PROUD of it!
Bob
A humble man and PROUD of it!
Another possibility would be to use Gsview and GhostScript rather than Adobe. Gsview has a function to write all the text in a pdf to a text file. Depending upon how your data is formatted in the pdf and how it writes out to the text file, you might be able to grab your "Declared Values" by reading and parsing the text file.
Just a thought,
Dick
Just a thought,
Dick
-
- Automation Wizard
- Posts: 1101
- Joined: Fri Jan 07, 2005 5:55 pm
- Location: Somewhere else on the planet
You may well be able to do this "by hand" but it may be more cost efficient to look at solutions like able2extract.
I recently had this same problem, a consulting client wanted old pdf reports from an extinct system reformatted to excel. I tried several conversion programs and able2extract worked best for me, requiring minimal manual tidy up. They do have a trial period so can try before you buy.
I recently had this same problem, a consulting client wanted old pdf reports from an extinct system reformatted to excel. I tried several conversion programs and able2extract worked best for me, requiring minimal manual tidy up. They do have a trial period so can try before you buy.
Getting the text out of the PDF
There are some utilities that just pull the text out of of pdfs and convert them to text files. I know there is one that is a Command line version.
With a Command line version you just use Macro Scheduler to get the file list, run the command line application with the file name and then you have all of your text files. The example in the help file of MS gives a good example of getting the file name. Instead of displaying the name, just use it in the Run Program command
Once this is done you go through the text files with the getfilelist command as once you have the file name you can use it with the readln command to parse out the data you want into a csv file with macro scheduler and that's it.
You really only need a learn a few commands in MS to do this such as Readln, Position, ConCat, Length to parse the data and getfilelist to get a list of files to be processed and the Run Program command.
With a Command line version you just use Macro Scheduler to get the file list, run the command line application with the file name and then you have all of your text files. The example in the help file of MS gives a good example of getting the file name. Instead of displaying the name, just use it in the Run Program command
Once this is done you go through the text files with the getfilelist command as once you have the file name you can use it with the readln command to parse out the data you want into a csv file with macro scheduler and that's it.
You really only need a learn a few commands in MS to do this such as Readln, Position, ConCat, Length to parse the data and getfilelist to get a list of files to be processed and the Run Program command.
-
- Newbie
- Posts: 3
- Joined: Wed Dec 20, 2006 11:43 pm
Many Thanks to each of the people who responded. I have aborted trying to locate and manipulate the text in the PDF and now converting the PDF files to RTF files. I had spent time tying to use the built-in word find & replace command (Ctrl+F) and the cursor manipulation available using keyboard commands but kept coming unstuck with losing focus of the document body.
I have the program PDF Converter (from Nuance) and it appears to process the PDF files very efficiently. The suggestion about using Readln, Position, Concat and length seems the way to go so I will learn those commands and try that approach.
Thanks again to all for putting me on the right track !
Barry Bowles
I have the program PDF Converter (from Nuance) and it appears to process the PDF files very efficiently. The suggestion about using Readln, Position, Concat and length seems the way to go so I will learn those commands and try that approach.
Thanks again to all for putting me on the right track !

Barry Bowles