PDF with Python

In data science, you have probably seen people read CSV and Excel files to work with data, but what about PDFs? Python is a high-level language with libraries for almost everything, which is one reason it is so widely used in machine learning and artificial intelligence, so working with PDFs in Python is fairly easy. In this article, we will explore Python for PDF: I will show you some methods for extracting data from a PDF and working with it using Python.

PDF is one of the most widely used formats for sharing information: presentations, links, buttons, audio and video files and, most importantly, data.

PDF with Python

If you are learning data science or machine learning, or planning to, keep one thing in mind: although most of your data will arrive as Excel files, one day you will be handed a PDF and asked to apply your data science skills to it.

If you don't know how to extract data from a PDF file and work with it, how will you even get started? This is where Python for PDF skills help. Now let's work with a PDF file using Python. You can download all the PDF files that I use in this article from here.

Extract Text from PDF with Python

To extract text from a PDF using Python, you need a library called PyPDF2, which you can install with pip:

pip install PyPDF2

# importing required modules
import PyPDF2

# opening the PDF in binary mode; the with block closes the file automatically
with open('example.pdf', 'rb') as pdfFileObj:
    # creating a pdf reader object
    # (PdfReader replaces PdfFileReader, which PyPDF2 3.0 removed)
    pdfReader = PyPDF2.PdfReader(pdfFileObj)
    # printing number of pages in pdf file
    print(len(pdfReader.pages))
    # creating a page object for the first page
    pageObj = pdfReader.pages[0]
    # extracting text from page
    print(pageObj.extract_text())
20
PythonBasics
S.R.Doty
August27,2008
Contents

1Preliminaries
4
1.1WhatisPython?...................................
..4
1.2Installationanddocumentation....................
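
The example above reads only the first page. To collect the text of every page, a loop over the reader's pages is enough. The following is a minimal sketch, assuming a recent PyPDF2 release that provides PdfReader, its pages attribute, and extract_text(); the helper names extract_all_text and read_pdf_text are made up for this illustration and are not part of PyPDF2.

```python
def extract_all_text(pages, separator="\n"):
    """Join the text of every page into one string.

    `pages` can be any iterable of objects exposing extract_text(),
    such as the `pages` attribute of a PyPDF2 PdfReader.
    """
    chunks = []
    for page in pages:
        # Fall back to an empty string for pages that yield no text
        # (for example, pages that contain only scanned images).
        chunks.append(page.extract_text() or "")
    return separator.join(chunks)


def read_pdf_text(path):
    """Open a PDF file and return the text of all its pages (hypothetical helper)."""
    # Imported here so that extract_all_text can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    with open(path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        # Pages are read lazily, so extract the text while the file is still open.
        return extract_all_text(reader.pages)
```

With PyPDF2 installed, calling read_pdf_text("example.pdf") should return the whole document as a single string, one page after another.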

Reading a Table from a PDF with Python

To read a table from a PDF with Python, you need a library called tabula-py, which can be installed with pip. tabula-py is a wrapper around tabula-java, so a Java runtime must also be installed:

pip install tabula-py

import tabula
# reading the PDF file that contains the table data
# you can find the PDF file along with the complete code below
# read_pdf loads each table it finds into a pandas DataFrame;
# in tabula-py 2.0 and later it returns a list of DataFrames
tables = tabula.read_pdf("offense.pdf")
df = tables[0]
# printing the first 5 rows of the table
df.head()
     1         Abuse and Other Offensive  1-1639 CR, §3-  Unnamed: 3  Felony  Unnamed: 5  Person   II  Unnamed: 8
0  NaN                           Conduct   601(b)(2)(ii)         NaN     NaN         NaN     NaN  NaN         NaN
1  NaN  Child Abuse—physical, with death             NaN         NaN     NaN         NaN     NaN  NaN         NaN
2    2         Abuse and Other Offensive  1-0334 CR, §3-         NaN  Felony         NaN  Person   II         NaN
3  NaN                           Conduct    601(b)(2)(i)         NaN     NaN         NaN     NaN  NaN         NaN
4  NaN  Child Abuse—physical, 1st degree             NaN         NaN     NaN         NaN     NaN  NaN         NaN

If your PDF file contains multiple tables:

df = tabula.read_pdf("data.pdf", multiple_tables=True)
df
[                     0     1    2      3    4     5      6      7   8   9   \
 0                   NaN   mpg  cyl   disp   hp  drat     wt   qsec  vs  am   
 1             Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1   
 2         Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1   
 3            Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1   
 4        Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0   
 5     Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0   
 6               Valiant  18.1    6  225.0  105  2.76  3.460  20.22   1   0   
 7            Duster 360  14.3    8  360.0  245  3.21  3.570  15.84   0   0   
 8             Merc 240D  24.4    4  146.7   62  3.69  3.190  20.00   1   0   
 9              Merc 230  22.8    4  140.8   95  3.92  3.150  22.90   1   0   
 10             Merc 280  19.2    6  167.6  123  3.92  3.440  18.30   1   0   
 11            Merc 280C  17.8    6  167.6  123  3.92  3.440  18.90   1   0   
 12           Merc 450SE  16.4    8  275.8  180  3.07  4.070  17.40   0   0   
 13           Merc 450SL  17.3    8  275.8  180  3.07  3.730  17.60   0   0   
 14          Merc 450SLC  15.2    8  275.8  180  3.07  3.780  18.00   0   0   
 15   Cadillac Fleetwood  10.4    8  472.0  205  2.93  5.250  17.98   0   0   
 16  Lincoln Continental  10.4    8  460.0  215  3.00  5.424  17.82   0   0   
 17    Chrysler Imperial  14.7    8  440.0  230  3.23  5.345  17.42   0   0   
 18             Fiat 128  32.4    4   78.7   66  4.08  2.200  19.47   1   1   
 19          Honda Civic  30.4    4   75.7   52  4.93  1.615  18.52   1   1   
 20       Toyota Corolla  33.9    4   71.1   65  4.22  1.835  19.90   1   1   
 21        Toyota Corona  21.5    4  120.1   97  3.70  2.465  20.01   1   0   
 22     Dodge Challenger  15.5    8  318.0  150  2.76  3.520  16.87   0   0   
 23          AMC Javelin  15.2    8  304.0  150  3.15  3.435  17.30   0   0   
 24           Camaro Z28  13.3    8  350.0  245  3.73  3.840  15.41   0   0   
 25     Pontiac Firebird  19.2    8  400.0  175  3.08  3.845  17.05   0   0   
 26            Fiat X1-9  27.3    4   79.0   66  4.08  1.935  18.90   1   1   
 27        Porsche 914-2  26.0    4  120.3   91  4.43  2.140  16.70   0   1   
 28         Lotus Europa  30.4    4   95.1  113  3.77  1.513  16.90   1   1   
 29       Ford Pantera L  15.8    8  351.0  264  4.22  3.170  14.50   0   1   
 30         Ferrari Dino  19.7    6  145.0  175  3.62  2.770  15.50   0   1   
 31        Maserati Bora  15.0    8  301.0  335  3.54  3.570  14.60   0   1   
 32           Volvo 142E  21.4    4  121.0  109  4.11  2.780  18.60   1   1   
 
       10    11  
 0   gear  carb  
 1      4     4  
 2      4     4  
 3      4     1  
 4      3     1  
 5      3     2  
 6      3     1  
 7      3     4  
 8      4     2  
 9      4     2  
 10     4     4  
 11     4     4  
 12     3     3  
 13     3     3  
 14     3     3  
 15     3     4  
 16     3     4  
 17     3     4  
 18     4     1  
 19     4     2  
 20     4     1  
 21     3     1  
 22     3     2  
 23     3     2  
 24     3     4  
 25     3     2  
 26     4     1  
 27     5     2  
 28     5     2  
 29     5     4  
 30     5     6  
 31     5     8  
 32     4     2  ]
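
The square brackets around the output above show that read_pdf returned a list of DataFrames rather than a single one. Before picking a table out of that list, it can help to check how many tables were found and how large each one is. Here is a small sketch of that idea; summarize_tables and read_all_tables are hypothetical helper names, and passing pages="all" is an assumption about what you want to scan.

```python
def summarize_tables(tables):
    """Return (position, rows, columns) for each table in the list.

    Each item only needs a pandas-style `shape` attribute,
    which every DataFrame returned by tabula.read_pdf has.
    """
    return [(i, table.shape[0], table.shape[1]) for i, table in enumerate(tables)]


def read_all_tables(path):
    """Read every table on every page of a PDF (hypothetical helper)."""
    # Imported here because tabula-py also needs a Java runtime to work.
    import tabula

    return tabula.read_pdf(path, pages="all", multiple_tables=True)
```

For the list printed above, which holds one table with rows 0 to 32 and columns 0 to 11, summarize_tables would report a single entry with 33 rows and 12 columns.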

To extract a specific region from a specific page of your PDF, pass the area argument as (top, left, bottom, right), measured in points, together with the page number:

tabula.read_pdf("data.pdf", area=(126,149,212,462), pages=1)
             Unnamed: 0   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear
0             Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4
1         Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4
2            Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4
3        Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3
4     Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3
5               Valiant  18.1    6  225.0  105  2.76  3.460  20.22   1   0     3
6            Duster 360  14.3    8  360.0  245  3.21  3.570  15.84   0   0     3
7             Merc 240D  24.4    4  146.7   62  3.69  3.190  20.00   1   0     4
8              Merc 230  22.8    4  140.8   95  3.92  3.150  22.90   1   0     4
9              Merc 280  19.2    6  167.6  123  3.92  3.440  18.30   1   0     4
10            Merc 280C  17.8    6  167.6  123  3.92  3.440  18.90   1   0     4
11           Merc 450SE  16.4    8  275.8  180  3.07  4.070  17.40   0   0     3
12           Merc 450SL  17.3    8  275.8  180  3.07  3.730  17.60   0   0     3
13          Merc 450SLC  15.2    8  275.8  180  3.07  3.780  18.00   0   0     3
14   Cadillac Fleetwood  10.4    8  472.0  205  2.93  5.250  17.98   0   0     3
15  Lincoln Continental  10.4    8  460.0  215  3.00  5.424  17.82   0   0     3
16    Chrysler Imperial  14.7    8  440.0  230  3.23  5.345  17.42   0   0     3
17             Fiat 128  32.4    4   78.7   66  4.08  2.200  19.47   1   1     4
18          Honda Civic  30.4    4   75.7   52  4.93  1.615  18.52   1   1     4
19       Toyota Corolla  33.9    4   71.1   65  4.22  1.835  19.90   1   1     4
20        Toyota Corona  21.5    4  120.1   97  3.70  2.465  20.01   1   0     3
21     Dodge Challenger  15.5    8  318.0  150  2.76  3.520  16.87   0   0     3
22          AMC Javelin  15.2    8  304.0  150  3.15  3.435  17.30   0   0     3
23           Camaro Z28  13.3    8  350.0  245  3.73  3.840  15.41   0   0     3
24     Pontiac Firebird  19.2    8  400.0  175  3.08  3.845  17.05   0   0     3
25            Fiat X1-9  27.3    4   79.0   66  4.08  1.935  18.90   1   1     4
26        Porsche 914-2  26.0    4  120.3   91  4.43  2.140  16.70   0   1     5
27         Lotus Europa  30.4    4   95.1  113  3.77  1.513  16.90   1   1     5
28       Ford Pantera L  15.8    8  351.0  264  4.22  3.170  14.50   0   1     5
29         Ferrari Dino  19.7    6  145.0  175  3.62  2.770  15.50   0   1     5
30        Maserati Bora  15.0    8  301.0  335  3.54  3.570  14.60   0   1     5
31           Volvo 142E  21.4    4  121.0  109  4.11  2.780  18.60   1   1     4
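
The numbers passed to area are (top, left, bottom, right) offsets from the top-left corner of the page, in PDF points, where 72 points make one inch. If you measured the region in inches, for instance with a ruler tool in a PDF viewer, a small conversion helper saves some arithmetic. This is only a sketch: area_from_inches and read_region are hypothetical names, not tabula-py functions.

```python
POINTS_PER_INCH = 72  # PDF coordinates use 72 points per inch


def area_from_inches(top, left, bottom, right):
    """Convert a region measured in inches to the point-based tuple tabula expects."""
    if bottom <= top or right <= left:
        raise ValueError("bottom must be below top and right must be past left")
    return tuple(round(value * POINTS_PER_INCH, 2) for value in (top, left, bottom, right))


def read_region(path, area, page=1):
    """Read the tables inside `area` on one page (hypothetical helper)."""
    # Imported here because tabula-py also needs a Java runtime to work.
    import tabula

    return tabula.read_pdf(path, area=area, pages=page)
```

For example, a region that starts 1.75 inches from the top and 2 inches from the left, and ends 3 inches from the top and 6.5 inches from the left, becomes (126.0, 144.0, 216.0, 468.0) in points.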

Export PDF to Excel

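Exporting data from a PDF file to Excel is also easy with Python. tabula.convert_into only writes CSV, TSV, or JSON files, so for an .xlsx file, read the table into a DataFrame and save it with pandas (to_excel needs an Excel engine such as openpyxl):

tables = tabula.read_pdf("offense.pdf")
tables[0].to_excel("offense_testing.xlsx", index=False)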
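Tabula can also return the extracted cells as JSON. Output like the following appears to come from a call such as tabula.read_pdf("data.pdf", output_format="json"):
[{'extraction_method': 'stream',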
  'top': 0.0,
  'left': 0.0,
  'width': 564.8800048828125,
  'height': 528.8800048828125,
  'data': [[{'top': 0.0, 'left': 0.0, 'width': 0.0, 'height': 0.0, 'text': ''},
    {'top': 128.59,
     'left': 253.13,
     'width': 20.36700439453125,
     'height': 4.980000019073486,
     'text': 'mpg'},
    {'top': 128.59,
     'left': 283.9,
     'width': 13.915863037109375,
     'height': 4.980000019073486,
     'text': 'cyl'},
    {'top': 128.59,
     'left': 313.23,
and many more....
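
When a PDF holds several tables, one option is to write each of them to its own sheet of a single Excel workbook. The sketch below assumes pandas with an Excel engine such as openpyxl is installed; export_tables and pdf_to_excel are hypothetical helper names, and the table_1, table_2 sheet naming is just one possible convention.

```python
def export_tables(tables, writer, prefix="table"):
    """Write each table to its own sheet and return the sheet names used.

    `writer` is a pandas ExcelWriter (or a file path for a single table);
    each table only needs a pandas-style to_excel() method.
    """
    sheet_names = []
    for number, table in enumerate(tables, start=1):
        sheet_name = f"{prefix}_{number}"
        table.to_excel(writer, sheet_name=sheet_name, index=False)
        sheet_names.append(sheet_name)
    return sheet_names


def pdf_to_excel(pdf_path, xlsx_path):
    """Extract every table from a PDF into one workbook (hypothetical helper)."""
    # Imported here so export_tables can be used without these packages installed.
    import pandas as pd
    import tabula

    tables = tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)
    if not tables:
        # A workbook needs at least one sheet, so stop early when nothing was found.
        raise ValueError("no tables found in " + pdf_path)
    with pd.ExcelWriter(xlsx_path) as writer:
        return export_tables(tables, writer)
```

Calling pdf_to_excel("offense.pdf", "offense_testing.xlsx") would then create one sheet per extracted table.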

I hope you liked this article on Python for PDF.

Aman Kharwal