Home > Forum Home > Developing Business Administration Solutions > Importing Data from PDF Files Share

Importing Data from PDF Files

Excel Help for Importing Data From Pdf Files in Developing Business Administration Solutions


Forum TopicPost Reply Login

Importing Data From Pdf Files

Rate this:
(4.2/5 from 24 votes)
Happy Business Spreadsheets has developed a free Excel program to extract and import PDF data into Excel which can be downloaded and used without restriction.

There is a common need to extract and import specific data from PDF files into Excel. Since Excel does not natively support the reading of PDF content, utilities are needed to convert the PDF file content for the Excel format. Several commercial applications accomplish this; however it is often the case where only specific data is required to be imported from multiple PDF files into one structured format.

We created such an application by using VBA code in conjunction with an open source PDF to Text conversion utility, which can be found at Foolabs.

[Download the free PDF data import Excel program here]

The program relies on the conversion utility (included in the download) and all PDF files to reside in the same directory as the Excel application. Text or data to extract are defined in the Control sheet by specifying start text, end text and multiple replacements routines with wildcard support. This enables flexibility to obtain comparable data from multiple PDF files based on patterns independent of different PDF file structures.

As many extraction rules as required can be set in order to create a table of information imported by extraction rule and PDF file name. Information on how to set up rules is available within the Excel application with a help icon and cell comments. The VBA code is commented and open for modification.

Any improvements or new features to the code are welcome to be posted here so that we can update the download version to the benefit of everyone.
 Excel Business Forums Administrator
 Posted by on
 
Replies - Displaying 61 to 70 of 88Order Replies By: Most Recent | Chronological | Highest Rated
Happy
Rate this:
(3/5 from 1 vote)
To prevent the output from being cleared each time the process is run, we can comment out the line:
Call clearoutput 
This subroutine call is near the beginning of the Run_Extraction VBA routine. 
 Excel Business Forums Administrator
 Posted by on
Confused
Rate this:
(3/5 from 1 vote)
To recuprate the first line, one would ned to define a common text below this as the end text to then loop files and extract whatever content appears before it.
 Excel Business Forums Administrator
 Posted by on
Happy
Rate this:
(3/5 from 1 vote)
The VBA code is open for modification and integration into your projects.
 Excel Business Forums Administrator
 Posted by on
Surprised
Rate this:
(3/5 from 1 vote)
The handling of differing column widths depends a lot on the PDF content. 

If you're able to get the data into the Excel output cleanly, then the best option would be to tidy it thereafter.  Options for this include the Text to Columns feature in Excel using spaces or tabs as the delimiter, and the TRIM function to remove white space. 
 Excel Business Forums Administrator
 Posted by on
Confused
Rate this:
(3/5 from 1 vote)
Once you have established the start and end text, the row of numbers will be extracted as one text string. Often in this case, we need to clean the extracted data to reproduce the columns.  The Text to Columns menu item in Excel can accomplish this.
 Excel Business Forums Administrator
 Posted by on
Oops
Rate this:
(3/5 from 1 vote)
OK - if you can attach an example PDF by replying to this post notification email, we can take a look and see what is going on.
 Excel Business Forums Administrator
 Posted by on
Oops
Rate this:
(3/5 from 1 vote)
The executable file in Pdftotext.exe should be in the same folder as the Excel application and PDF files.  This is launched via command shell from the Excel application to import data.
 Excel Business Forums Administrator
 Posted by on
Oops
Rate this:
(3/5 from 1 vote)
To recuperate multiple instances of the same rule would require a modification to the code.  The only solution currently is to define unique rules for each instance.
 Excel Business Forums Administrator
 Posted by on
Oops
Rate this:
(3/5 from 1 vote)
Thanks for the feedback so far.  In terms of the layout for the new multiple instance data, the reason why it is output in a flat format is that otherwise every combination of pattern matches would need to be output in order to replicate the combined table layout.  Since this could be very large when multiplied out by each multiple instance, we decided to keep the data as a flat output in the order that the data was found in the PDF files.

One way to reformat this this as column is to use the multiple instance data table as a source for a pivot table and then layout the results in the desired fashion for analysis.  Any ideas would be appreciated.
 Excel Business Forums Administrator
 Posted by on
Grateful
Rate this:
(3/5 from 1 vote)
Many thanks to oteacher and zlattko for highlighting and making essentially the same observation on the multiple data results. It is annoying when the post gets lost due to the session timeout.  We'll look at extending it.

We are in the process right now of taking a fresh approach with the code based on these observations. The new code attempts to do away with the multiple instance table and list multiple content found in the combined table as before.  The issue is when one or more of the patterns is not found for any set of matches:
  • Should they be left blank or filled with the previous match for that pattern?
  • What determines whether a whole set should be deemed ready for output? At the moment we're looking at each time the first pattern is found, we look to see if all patterns have been found already, output the results, reset the cache and continue in the file.  Maybe this should be the last pattern.
Your input on these 2 issues will really help us create something usable.
 Excel Business Forums Administrator
 Posted by on
 Displaying page 7 of 9 

Excel templates and solutions matched for Importing Data from PDF Files:

Solutions: Export MapPoint Waypoints Survey Data Analysis