Home > Forum Home > Developing Business Administration Solutions > Importing Data from PDF Files Share

Importing Data from PDF Files

Excel Help for Importing Data From Pdf Files in Developing Business Administration Solutions


Forum TopicPost Reply Login

Importing Data From Pdf Files

Rate this:
(4.2/5 from 24 votes)
Happy Business Spreadsheets has developed a free Excel program to extract and import PDF data into Excel which can be downloaded and used without restriction.

There is a common need to extract and import specific data from PDF files into Excel. Since Excel does not natively support the reading of PDF content, utilities are needed to convert the PDF file content for the Excel format. Several commercial applications accomplish this; however it is often the case where only specific data is required to be imported from multiple PDF files into one structured format.

We created such an application by using VBA code in conjunction with an open source PDF to Text conversion utility, which can be found at Foolabs.

[Download the free PDF data import Excel program here]

The program relies on the conversion utility (included in the download) and all PDF files to reside in the same directory as the Excel application. Text or data to extract are defined in the Control sheet by specifying start text, end text and multiple replacements routines with wildcard support. This enables flexibility to obtain comparable data from multiple PDF files based on patterns independent of different PDF file structures.

As many extraction rules as required can be set in order to create a table of information imported by extraction rule and PDF file name. Information on how to set up rules is available within the Excel application with a help icon and cell comments. The VBA code is commented and open for modification.

Any improvements or new features to the code are welcome to be posted here so that we can update the download version to the benefit of everyone.
 Excel Business Forums Administrator
 Posted by on
 
Replies - Displaying 41 to 50 of 88Order Replies By: Most recent | Chronological | Highest Rated
Confused
Rate this:
(3/5 from 1 vote)
Hello, I tried to comment out various parts myself but I didnt succeed on what I'm after. What parts I need to comment out for it to work so that it wont overwrite old data?
 Posted by on
Shocked
Rate this:
(3/5 from 1 vote)
The output data is emptied before each extraction process simply under the assumption that the data has already been used. This can either be turned off by commenting out the VBA code that does this, or data can be copied out to a separate workbook for saving after each process is run. 
 Excel Business Forums Administrator
 Posted by on
Grateful
Rate this:
(3/5 from 1 vote)
Hello,

This really is "One of a kind" program, thanks for that to the developers.

I have managed to input the commands quite succesfully, but I have a problem with how the program overwrites the old data every time I run the extraction. Is that supposed to be happening? How it can be changed?
 Posted by on
Confused
Rate this:
(3/5 from 1 vote)
I couldn't get the post to recognize the spaces like I thought it would. I know that was painful to read. Let me try again.

Example:
------------------------------
(Name of 1st School) (*5-10 spaces in between*) (full URL of school website) (*line break*)

Address: (School Address) (*5-10 spaces in between*) County: (County) (*5-10 spaces in between*) Phone: (Phone #) (*line break*)

Grade Range: (Grade Range) (*5-10 spaces in between*) Enrollment: (Enrollment #) (*5-10 spaces in between*) District: (District) (*5-10 spaces in between*) Fax:(Fax) (*line break*)

Personnel: (Person's name Person's position) (There also may be additional people listed below or just one person listed with a line break for each personnel and position title) (*line break*)

(Name of 2nd School)

etc.

----------------------------------------------------- There is only a single line break between the last entry in the "Personnel" section and the name of the following school.

Update: the actual number of spaces between the personnel name and the personnel position is between 37-40. So it's not a consistent number of spaces.
 Posted by on
Oops
Rate this:
(3/5 from 1 vote)
There are several issues to address here so let's cover them individually.
  1. The date formatting from number ranges.  One way is to retain some common preceding text such as "Range:". If this is not possible then we can add some replacement pairs from 1 to 9 with adding an apostrophe to stop Excel converting the text to a date.
  2. Extra spaces. We can remove multiple spaces by using replacement pairs specifying a double space to be found and a single space to be the replacement.
  3. The name of the school.  Sometimes we can get the initial text in a row by specifying a common text which occurs at the end of the row above. Sometimes this can be further above with wildcard replacement pairs to remove everything in between.  Text formatting is irrelevant as the conversion creates plain text before being interrogated.
  4. Personnel. It is usually best to first extract as much text as needed and then clean using rules defined in the replacement pairs.  Sometimes we need to be quite creative here.
I hope that this helps.
 Excel Business Forums Administrator
 Posted by on
ConfusedHi, this is an excellent tool. Thank you for developing it. I am having a problem with the auto-formatting features for dates in excel. My data in excel looks like this: (Name of 1st School) (full URL of school website) Address: (School Address) County: (County) Phone: (Phone #) Grade Range: (Grade Range) Enrollment: (Enrollment #) District: (District) Fax:(Fax) Personnel: (Person's name Person's position) (There also may be additional people listed below or just one person listed) (Name of 2nd School) ...etc When I extract the "Grade Range", Excel is formatting what should be "1-8" or "9-12" as January 8th or September 12th. When I try to reformat after conversion back to text, the number is like 45193. Do you have a solution to keep it as "1-8", "7-12", "9-12", etc? Also, I obviously have problems with there being spaces between everything, including extra spaces within the "Personnel" line. And there is nothing unique preceding the "Name of n-th School" field. They aren't even numbered. It basically just says "Happy Mountain Example School". I would really like to extract that "Name of n-th School" field, if possible. It also creates problems with the "Personnel" field. Because if I want to get all of the "Personnel" (because there may be more than one), I have to put the end text as "Address:" because it is the only next unique text. I could use "full URL of school website", but not every school has one and nothing precedes that field. Will it help that the "Name of the n-th School" field is always in bold? A solution to either of these issues would be much appreciated. Again, thank you for this great utility.
 Posted by on
Fedup
Rate this:
(3/5 from 1 vote)
You can attach a sample PDF by replying to the notification email with details of desired extraction of this post and we can look into it.
 Excel Business Forums Administrator
 Posted by on
Confused
Rate this:
(3/5 from 1 vote)
Hi ,Thanks for your reply, but i couldnt still understand it completely. Do you have a mail id. I can send a sample file so that you can get a better idea. Moreover the requirement of extracting specific data from a pdf file is huge for us. If you have any tool, we can buy it too. To get started with. Here are few things. I have the insurer name at the beginning of the pdf file.   but if i set the start text as name, i get everything that has the text "name" befor it. For e.g :name of bank, etc. I am not able to understand how to set the new line things and replacement pairs data.Kindly suggest. Thanks for your time.  
 Posted by on
Shocked
Rate this:
(4/5 from 2 votes)
If data within the PDF files is tabulated we need to extract the data by rows and then use the text to columns function in Excel in order to replicate the table structure. 

To extract rows we need to specify the start and end text. End text is simple as it will be a new line by using the [new line] option.  The start text is more complicated as the extraction will return whatever is found after the starting text.  We can specify the first column text and replace it afterward is there is a common pattern. Alternatively if a common pattern exists the the last column of the prevoius row or text then we can use that with [new line] to extract the entire subsequent row. 
 Excel Business Forums Administrator
 Posted by on
Confused
Rate this:
(3/5 from 1 vote)
Hi All,My pdf files look similar to the below format. I couldnt attach  sample file as i couldnt see the option. Can you please help, with the below set up of pdf files, I need the fields underlined in a new excel sheet. How can i use your tool for the same. Kindly help me. As i work on around 80 - 90 pdf files on a daily basis to extract these underlined rows manually from each one.Thanks a lot for your help.The fields are 1.Reference number2. member id3.   Name4.Policy Number 5.Date of Birth6.mm/dd/yyyy7. Drug Name                                                      Reference Number : XXXXXXX         Member ID   XXXXXXXXXX              Telephone Number            xxxxxxxxxxxxx   Name                         XXXXXXXX         Fax Number   Policy Number         XXXXXXX               E-mail    xxxxxxxxx                            Primary Insured          Gender                     NA                    Date of Birth    NA          Marital Status             NA              Dependent Insured          Name   xxxxxxx                   (mm/dd/yyyy)     xxxxxxx          Gender  xxxxxxx                                                Relation         Spouse          Current Country of Residence           Switzerland             Date of Purchase:                                         10/01/2012     Drug Name:   XXXXXXXXX
 Posted by on
 Displaying page 5 of 9 

Excel templates and solutions matched for Importing Data from PDF Files:

Solutions: Export MapPoint Waypoints Survey Data Analysis