Home > Forum Home > Developing Business Administration Solutions > Importing Data from PDF Files Share

Importing Data from PDF Files

Excel Help for Importing Data From Pdf Files in Developing Business Administration Solutions


Forum TopicPost Reply Login

Importing Data From Pdf Files

Rate this:
(4.2/5 from 24 votes)
Happy Business Spreadsheets has developed a free Excel program to extract and import PDF data into Excel which can be downloaded and used without restriction.

There is a common need to extract and import specific data from PDF files into Excel. Since Excel does not natively support the reading of PDF content, utilities are needed to convert the PDF file content for the Excel format. Several commercial applications accomplish this; however it is often the case where only specific data is required to be imported from multiple PDF files into one structured format.

We created such an application by using VBA code in conjunction with an open source PDF to Text conversion utility, which can be found at Foolabs.

[Download the free PDF data import Excel program here]

The program relies on the conversion utility (included in the download) and all PDF files to reside in the same directory as the Excel application. Text or data to extract are defined in the Control sheet by specifying start text, end text and multiple replacements routines with wildcard support. This enables flexibility to obtain comparable data from multiple PDF files based on patterns independent of different PDF file structures.

As many extraction rules as required can be set in order to create a table of information imported by extraction rule and PDF file name. Information on how to set up rules is available within the Excel application with a help icon and cell comments. The VBA code is commented and open for modification.

Any improvements or new features to the code are welcome to be posted here so that we can update the download version to the benefit of everyone.
 Excel Business Forums Administrator
 Posted by on
 
Replies - Displaying 61 to 70 of 88Order Replies By: Most recent | Chronological | Highest Rated
Oops
Rate this:
(4/5 from 2 votes)
OK. The separation chaacter string used in the code to build the arrays of matching text is the '^^^' on the assumption that this is unlikely to be found within the content itself.

The pipe character '|' is still used to specify replacement pairs as this is independent of the source text extraction and used only to define find and replace patterns in order to clean the output as necessary.

I hope that this clarifies and if you have any ideas for improvement then we can look to implement.
 Excel Business Forums Administrator
 Posted by on
Confused
Rate this:
(3/5 from 1 vote)
Thanks for your reply!
I was looking through the comments, and it has been mentioned that the pipe character has been replaced with ^^^. Which does not seem to be the case for the version mentioned above. (Eventhough this is not very difficult to change..).

So I thought it was not the updated version, and was afraid it might be missing some other functions.

Thanks!
 Posted by on
Sad
Rate this:
(3/5 from 1 vote)
I just checked the download and it is the latest version.  What seems to be missing?

The Xpdf program is unmodified and used only to convert PDF conent to text format before the VBA code analyzes and extracts from it. 
 Excel Business Forums Administrator
 Posted by on
Confused
Rate this:
(3/5 from 1 vote)
Hello!
Thanks for this program, it runs pretty great. But the downloadable version doesn't seem to  be the newest version as describe in the comments. Could anybody verify this? And if not, is the newest  version still available?

Also one more question:
How much has been changed in the Xpdf program? or the program reliace on the original unedited version?

Thanks in advance,
Johhny!
 Posted by on
Oops
Rate this:
(3/5 from 1 vote)
Since the text search relies on the starting text and ending text,  a row of data underneath column headings can be extracted by specifying the last heading as the start text and the first text below the row required as the end text.

Rows of extracted data can be then transformed to columns via the 'Text to Columns' command in Excel.
 Excel Business Forums Administrator
 Posted by on
Confused
Rate this:
(4/5 from 2 votes)
Hi,

I love the file and it works for the pdf's that you gave as an example.  I am trying to use it for pdf's of bank statements.  I have another file with macros that reconciles bank activity to GL ledger activity and marks with an x anything that reconciles leaving your reconciling items.  The issue I have is we only get bank statement activity in PDF format.  Your file and code would work perfectly if I could use the control text tab and place the column headings as text to search for in the pdf and it drops in all the text details on the pdf below that column heading?  I am not sure if it is possible since it is searching for text and this is on the lines below the text.  Any suggestions?
 Anthony
 Posted by on
Grateful
Rate this:
(4/5 from 2 votes)
What can I say...

CONGRATS!!!

It's Excellent! 
It is working correctly for the test file, i'll be conducting more tests tomorrow when i get more test files.

THANK'S!

 Alexandre
 Posted by on
Grateful
Rate this:
(3/5 from 1 vote)
WOW ... Congrats !

very good... the tool is PERFECT ! ;)

just to notice: when "Keep Text Files" is checked, the tool is not working perfectly... its not going to the end of the file... for. ex. instead 120 pages, it founds the matches just in 20 or 21 pages ...

BUT, for me it is not problem at all ... i think also for the others, when they now know it ... ;)


 Posted by on
Shocked
Rate this:
(3/5 from 1 vote)
Thank you oteacher for this feedback.  The pipe character was used to separate combined results in memory during the process.  This obviously does not work when there are pipe characters in the content.

We changed this separator to three characters ^^^ in the now updated download.  This is under the assumptions that this string would be highly unlikely in the content.  We look forward to your testing results.
 Excel Business Forums Administrator
 Posted by on
Confused
Rate this:
(3/5 from 1 vote)
Almost there! :)
The results are displayed inline in the results.
The records are correctly identified.
The fields work ok with latin caracters like áéõç... and special caracters like $,x²,(),: (thats a big plus).
I've found a bug, if one of the fields has a pipe "|" in the value, the script seems to shift fields.
Sample: 
Field1 abc
Field2 123|456
Field3 xyz
Result:
  field1 field2 field3
Expected Result  abc 123|456  xyz 
Actual Result  abc 123  456
In this particular case, when "|" is found the replacement pairs doesnt work after the "|".
The 19/02 version displayed 123|456 as two separated lines in Multiple Instance Data table, and in the "Combined Last Instances" its displayed correctly as "123|456".
 Alexandre
 Posted by on
 Displaying page 7 of 9 

Excel templates and solutions matched for Importing Data from PDF Files:

Solutions: Export MapPoint Waypoints Survey Data Analysis