Home > Forum Home > Developing Business Administration Solutions > Importing Data from PDF Files Share

Importing Data from PDF Files

Excel Help for Importing Data From Pdf Files in Developing Business Administration Solutions


Forum TopicPost Reply Login

Importing Data From Pdf Files

Rate this:
(4.2/5 from 24 votes)
Happy Business Spreadsheets has developed a free Excel program to extract and import PDF data into Excel which can be downloaded and used without restriction.

There is a common need to extract and import specific data from PDF files into Excel. Since Excel does not natively support the reading of PDF content, utilities are needed to convert the PDF file content for the Excel format. Several commercial applications accomplish this; however it is often the case where only specific data is required to be imported from multiple PDF files into one structured format.

We created such an application by using VBA code in conjunction with an open source PDF to Text conversion utility, which can be found at Foolabs.

[Download the free PDF data import Excel program here]

The program relies on the conversion utility (included in the download) and all PDF files to reside in the same directory as the Excel application. Text or data to extract are defined in the Control sheet by specifying start text, end text and multiple replacements routines with wildcard support. This enables flexibility to obtain comparable data from multiple PDF files based on patterns independent of different PDF file structures.

As many extraction rules as required can be set in order to create a table of information imported by extraction rule and PDF file name. Information on how to set up rules is available within the Excel application with a help icon and cell comments. The VBA code is commented and open for modification.

Any improvements or new features to the code are welcome to be posted here so that we can update the download version to the benefit of everyone.
 Excel Business Forums Administrator
 Posted by on
 
Replies - Displaying 71 to 80 of 88Order Replies By: Most recent | Chronological | Highest Rated
Sad
Rate this:
(3/5 from 1 vote)
Thanks again for the feedback. There is another new version from the same link which should now list multiple instances for text found in each PDF file without the need for a separate table.  This was actually quite some change to the underlying code with the upside that it should now also be faster by processing each PDF file text only once and holding the patterns and replacements in memory.

One thing to note id that the beginning text for each replicated pattern should be first in the list of patterns. This has been specified in the cell comment for Start Text.

This all worked in the tests but you feedback will again be greatly appreciated.
 Excel Business Forums Administrator
 Posted by on
Confused
Rate this:
(3/5 from 1 vote)
I think this example will help you ...

in PDF:                  Bank Account                  1111111
                            Transaction Amount           369 EUR
                            Descripton                          789456123
                            Name/Adress                     Paris, 86 AV.
-------------------------------------------
                             Bank Account                  22222222
                            Transaction Amount           258EUR
                            Descripton                          147258369
                            Name/Adress                     Amsterdam, 86 AV.
-------------------------------------------
                             Bank Account                  333333333
                            Transaction Amount           147 EUR
                            Descripton                          321456849
                            Name/Adress                     Berlin, 86 AV.
---------------------------------------------
                            Bank Account                  444444444
                            Transaction Amount           789 EUR
                            Descripton                          7539514862
                            Name/Adress                     
-----------------------------------------------
OUTPUT=>
eadsheet:
  Bank Account Transaction Description Name/Adress
1  111111  369  789456123  Paris
2  222222  258  147258369  Amsterdam
3  333333  147  321456849  Berlin
4  444444  789  7539514862  

as you see, we always have an reference that is allways different and

the code should look something like this:

Do while bank account is new
     -the code what we already have(for multiple data), only if possible modified to show the results as at the table abowe
...

after doing this, think you (we also) have an very good product, which will be helpfull to a lot of people....
 Posted by on
Applaud
Rate this:
(3/5 from 1 vote)
  • Should they be left blank or filled with the previous match for that pattern.
Like the previous versions, if the new file didn't found any match it would appear as blank. I think thats a good aproach since a missing item often mean its "blank".
  • What determines whether a whole set should be deemed ready for output? At the moment we're looking at each time the first pattern is found, we look to see if all patterns have been found already, output the results, reset the cache and continue in the file.  Maybe this should be the last pattern.
I guess that the first pattern could work, if it finds a new first pattern or EOF consider it to be a complete record and starts a new one.

 Alexandre
 Posted by on
Grateful
Rate this:
(3/5 from 1 vote)
Many thanks to oteacher and zlattko for highlighting and making essentially the same observation on the multiple data results. It is annoying when the post gets lost due to the session timeout.  We'll look at extending it.

We are in the process right now of taking a fresh approach with the code based on these observations. The new code attempts to do away with the multiple instance table and list multiple content found in the combined table as before.  The issue is when one or more of the patterns is not found for any set of matches:
  • Should they be left blank or filled with the previous match for that pattern?
  • What determines whether a whole set should be deemed ready for output? At the moment we're looking at each time the first pattern is found, we look to see if all patterns have been found already, output the results, reset the cache and continue in the file.  Maybe this should be the last pattern.
Your input on these 2 issues will really help us create something usable.
 Excel Business Forums Administrator
 Posted by on
Confused
Rate this:
(3/5 from 1 vote)
Sh*t, I hate when a complete post gets lost... let's try again :/

The problem with the pivot table, is that if any of the records has a missing field you wouldn't be able to tell which field belongs to a record.

Sample Data:
Customer Enquire
Name: John Doe
Phone: 555-5555
Email: [email protected]
Question: What time is it?

Customer Enquire
Name: Jane Doe
Phone: 666-6666
Question: Whats 2+2?

Customer Enquire
Name: Larry Moe
Email: [email protected]
Question: To be or not to be?
---
Note that there could be missing fields, like Phone or Email. In this case we would have this rows: 3 names, 2 phones, 2 email and 3 question it would be hard to tell who has a phone, an email or both.
If one could use a special keyword, like 'Customer Enquire', as a record separator maybe the script could deal with this as beeing a different file.
 Alexandre
 Posted by on
Confused
Rate this:
(3/5 from 1 vote)

i think that it will be much easier to manage with tha multiple data, if they are shown in order how they are found.

for Example:
I have in PDF: Bank Account, Transaction Amount, Transaction Description, Name/Adres...

and now they are in order of key Word
Transaction Amount
TransactionAmount
Transaction Amount ....
Bank Account
Bank Acount
Bank Account .... ect

so if they are shown in order how they are found, i can use an reference(in my case -Bank Account) to manage them

I think its clear what i mean ?!
 Posted by on
Oops
Rate this:
(3/5 from 1 vote)
Thanks for the feedback so far.  In terms of the layout for the new multiple instance data, the reason why it is output in a flat format is that otherwise every combination of pattern matches would need to be output in order to replicate the combined table layout.  Since this could be very large when multiplied out by each multiple instance, we decided to keep the data as a flat output in the order that the data was found in the PDF files.

One way to reformat this this as column is to use the multiple instance data table as a source for a pivot table and then layout the results in the desired fashion for analysis.  Any ideas would be appreciated.
 Excel Business Forums Administrator
 Posted by on
Confused
Rate this:
(3/5 from 1 vote)
Great!

Thanks!

if i get some new idea ... i will come back - to ask or to propose

 Posted by on
Applaud
Rate this:
(4/5 from 2 votes)
Great addition!
Testing the new version i've discovered that another problem is now fixed. :)
Here in Brasil we use a comma as a decimal separator, in the old version that caused the script to end the field when a comma was found inside the text. Now its working fine.
The multiple instances of text is now found and allocated in the "Multiple Instances Data" table.
Is it possible get the results in a table like the Combined Last Instances, with the columns in order? 
 Alexandre
 Posted by on
Happy
Rate this:
(4/5 from 2 votes)
It seems that a frequently required solution is to extract multiple instances of the same text pattern from one or more PDF files.  The logic behind this is slightly different than the original setup which attempts to line up multiple data for the same pattern from multiple PDF files.

The new version, which can be downloaded from the original link above, adds a second table of results in the Output sheet which lists multiple instances of data matched within and across PDF files processed. This approach retains the benefit of the original consolidation approach while adding the support for multiple data instances within each file.

Another feature added is the ability to retain the text files generated for each PDF file to assists with pattern matching setup and for alternative use of the content. The new version has cell comments with detailed information on the logic.

Please post your feedback here so that we can continue to improve this free and open source solution for importing PDF data into Excel.
 Excel Business Forums Administrator
 Posted by on
 Displaying page 8 of 9 

Excel templates and solutions matched for Importing Data from PDF Files:

Solutions: Export MapPoint Waypoints Survey Data Analysis