In data science, you have probably seen people read CSV and Excel files to work with data, but what about a PDF? Python is a high-level language with libraries for almost every task, which is one reason it is so widely used in machine learning and artificial intelligence, and working with PDFs in Python is fairly easy too. In this article, we will explore Python for PDF: I will show you some methods for extracting data from a PDF and working with it using Python.
PDF is one of the most widely used formats for sharing information: presentations, links, buttons, audio and video files, and, most importantly, data.
Python for PDF Processing

If you are learning data science or machine learning, or planning to, keep in mind that although most of your data will arrive as Excel files (the most common format), one day you will be handed a PDF and expected to apply your data science skills to it.
If you don't know how to extract and work with the data in a PDF file, how will you even start? This is where Python for PDF skills will help you. Now let's work with a PDF file using Python. You can download all the PDF files that I use in this article from here.
Extract Text from PDF with Python
To extract text from a PDF using Python, you need a library known as PyPDF2, which you can easily install using the pip command:
pip install PyPDF2
# importing the required module
import PyPDF2
# opening the PDF in binary mode; the with block closes the file automatically
with open('example.pdf', 'rb') as pdf_file:
    # creating a PDF reader object (the older PdfFileReader was removed in PyPDF2 3.0)
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # printing the number of pages in the PDF file
    print(len(pdf_reader.pages))
    # creating a page object for the first page
    page = pdf_reader.pages[0]
    # extracting text from the page
    print(page.extract_text())
20 PythonBasics S.R.Doty August27,2008 Contents 1Preliminaries 4 1.1WhatisPython?................................... ..4 1.2Installationanddocumentation....................
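As the output above shows, extracted text often runs words together and keeps the dot leaders from a table of contents. Here is a minimal sketch of how one such line could be split into its parts with a regular expression. It assumes contents entries follow the pattern "section number, title, dots, page number"; `parse_toc_entry` is a hypothetical helper of my own, not part of PyPDF2.

```python
import re

# Section number, title, a run of dot leaders (possibly broken by spaces), page number.
TOC_ENTRY = re.compile(
    r"^(?P<number>\d+(?:\.\d+)*)\s*(?P<title>.+?)\s*\.{2,}[.\s]*(?P<page>\d+)$"
)


def parse_toc_entry(line):
    """Return (section_number, title, page) for a contents line, or None if it does not match."""
    match = TOC_ENTRY.match(line.strip())
    if match is None:
        return None
    return match["number"], match["title"], int(match["page"])


entry = parse_toc_entry("1.1WhatisPython?................................... ..4")
print(entry)  # ('1.1', 'WhatisPython?', 4)
```

Lines without dot leaders, or without a page number, simply return None, so you can filter them out before building a list of sections.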
Reading a Table from a PDF with Python
To read a table from a PDF using Python, you need a library known as tabula-py (it wraps tabula-java, so Java must be installed as well), which can be easily installed using the pip command:
pip install tabula-py
import tabula
# reading the PDF file that contains the table data
# you can find the PDF file with the complete code below
# read_pdf returns a list of pandas DataFrames, one per table it detects
# (tabula-py versions before 2.0 returned a single DataFrame instead)
tables = tabula.read_pdf("offense.pdf")
df = tables[0]
# printing the first 5 rows of the table
df.head()
| | 1 | Abuse and Other Offensive | 1-1639 CR, §3- | Unnamed: 3 | Felony | Unnamed: 5 | Person | II | Unnamed: 8 |
---|---|---|---|---|---|---|---|---|---|
0 | NaN | Conduct | 601(b)(2)(ii) | NaN | NaN | NaN | NaN | NaN | NaN |
1 | NaN | Child Abuse—physical, with death | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2 | Abuse and Other Offensive | 1-0334 CR, §3- | NaN | Felony | NaN | Person | II | NaN |
3 | NaN | Conduct | 601(b)(2)(i) | NaN | NaN | NaN | NaN | NaN | NaN |
4 | NaN | Child Abuse—physical, 1st degree | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
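In the output above, long cells were split across several rows: a row whose first column is NaN only carries overflow text for the row before it. The following is a rough sketch of how such continuation rows could be folded back together. To keep it self-contained it works on plain Python lists and uses None for empty cells, which is an assumption of this sketch (a DataFrame marks them as NaN); `merge_continuation_rows` is a hypothetical helper, not a tabula function.

```python
def merge_continuation_rows(rows):
    """Fold rows whose first cell is empty into the row above them."""
    merged = []
    for row in rows:
        if row[0] is None and merged:
            previous = merged[-1]
            for i, cell in enumerate(row):
                if cell is None:
                    continue
                # Append the overflow text to the matching cell of the previous row.
                previous[i] = cell if previous[i] is None else f"{previous[i]} {cell}"
        else:
            # Copy the row so the input list is left untouched.
            merged.append(list(row))
    return merged


rows = [
    ["2", "Abuse and Other Offensive", "1-0334 CR, §3-", "Felony"],
    [None, "Conduct", "601(b)(2)(i)", None],
]
print(merge_continuation_rows(rows))
# [['2', 'Abuse and Other Offensive Conduct', '1-0334 CR, §3- 601(b)(2)(i)', 'Felony']]
```

Whether a space is the right separator depends on the table; a statute reference split after a hyphen, as here, may read better joined without one.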
If your PDF file contains multiple tables:
# with multiple_tables=True, read_pdf returns a list with one DataFrame per table
tables = tabula.read_pdf("data.pdf", multiple_tables=True)
tables
[                      0     1    2      3    4     5      6      7   8   9  \
 0                  NaN   mpg  cyl   disp   hp  drat     wt   qsec  vs  am
 1            Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1
 2        Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1
 3           Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1
 4       Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0
 5    Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0
 6              Valiant  18.1    6  225.0  105  2.76  3.460  20.22   1   0
 7           Duster 360  14.3    8  360.0  245  3.21  3.570  15.84   0   0
 8            Merc 240D  24.4    4  146.7   62  3.69  3.190  20.00   1   0
 9             Merc 230  22.8    4  140.8   95  3.92  3.150  22.90   1   0
10             Merc 280  19.2    6  167.6  123  3.92  3.440  18.30   1   0
11            Merc 280C  17.8    6  167.6  123  3.92  3.440  18.90   1   0
12           Merc 450SE  16.4    8  275.8  180  3.07  4.070  17.40   0   0
13           Merc 450SL  17.3    8  275.8  180  3.07  3.730  17.60   0   0
14          Merc 450SLC  15.2    8  275.8  180  3.07  3.780  18.00   0   0
15   Cadillac Fleetwood  10.4    8  472.0  205  2.93  5.250  17.98   0   0
16  Lincoln Continental  10.4    8  460.0  215  3.00  5.424  17.82   0   0
17    Chrysler Imperial  14.7    8  440.0  230  3.23  5.345  17.42   0   0
18             Fiat 128  32.4    4   78.7   66  4.08  2.200  19.47   1   1
19          Honda Civic  30.4    4   75.7   52  4.93  1.615  18.52   1   1
20       Toyota Corolla  33.9    4   71.1   65  4.22  1.835  19.90   1   1
21        Toyota Corona  21.5    4  120.1   97  3.70  2.465  20.01   1   0
22     Dodge Challenger  15.5    8  318.0  150  2.76  3.520  16.87   0   0
23          AMC Javelin  15.2    8  304.0  150  3.15  3.435  17.30   0   0
24           Camaro Z28  13.3    8  350.0  245  3.73  3.840  15.41   0   0
25     Pontiac Firebird  19.2    8  400.0  175  3.08  3.845  17.05   0   0
26            Fiat X1-9  27.3    4   79.0   66  4.08  1.935  18.90   1   1
27        Porsche 914-2  26.0    4  120.3   91  4.43  2.140  16.70   0   1
28         Lotus Europa  30.4    4   95.1  113  3.77  1.513  16.90   1   1
29       Ford Pantera L  15.8    8  351.0  264  4.22  3.170  14.50   0   1
30         Ferrari Dino  19.7    6  145.0  175  3.62  2.770  15.50   0   1
31        Maserati Bora  15.0    8  301.0  335  3.54  3.570  14.60   0   1
32           Volvo 142E  21.4    4  121.0  109  4.11  2.780  18.60   1   1

      10    11
 0  gear  carb
 1     4     4
 2     4     4
 3     4     1
 4     3     1
 5     3     2
 6     3     1
 7     3     4
 8     4     2
 9     4     2
10     4     4
11     4     4
12     3     3
13     3     3
14     3     3
15     3     4
16     3     4
17     3     4
18     4     1
19     4     2
20     4     1
21     3     1
22     3     2
23     3     2
24     3     4
25     3     2
26     4     1
27     5     2
28     5     2
29     5     4
30     5     6
31     5     8
32     4     2
]
To extract only a specific region of a specific page of your PDF, pass its bounds as area=(top, left, bottom, right) together with the page number:
tabula.read_pdf("data.pdf", area=(126,149,212,462), pages=1)
| | Unnamed: 0 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 |
1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 |
2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 |
3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 |
4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 |
5 | Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 |
6 | Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 |
7 | Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 |
8 | Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 |
9 | Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 |
10 | Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 |
11 | Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 |
12 | Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 |
13 | Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 |
14 | Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 |
15 | Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 |
16 | Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 |
17 | Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 |
18 | Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 |
19 | Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 |
20 | Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 |
21 | Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 |
22 | AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 |
23 | Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 |
24 | Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 |
25 | Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 |
26 | Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 |
27 | Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 |
28 | Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 |
29 | Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 |
30 | Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 |
31 | Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 |
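The numbers in the area tuple used above are easy to get wrong if you measured the region in a PDF viewer that reports inches. Here is a small sketch of a conversion helper. It assumes that area is expressed in PDF points (72 per inch), which is my understanding of tabula's default but is worth checking against your version; `area_from_inches` is a hypothetical helper, not part of tabula-py.

```python
POINTS_PER_INCH = 72  # PDF user-space unit; assumed to be what `area` expects


def area_from_inches(top, left, bottom, right):
    """Convert a box measured in inches into a (top, left, bottom, right) tuple in points."""
    if bottom <= top or right <= left:
        raise ValueError("bottom/right must be greater than top/left")
    return tuple(round(v * POINTS_PER_INCH, 2) for v in (top, left, bottom, right))


area = area_from_inches(1.75, 2.0, 3.0, 6.5)
print(area)  # (126.0, 144.0, 216.0, 468.0)
```

The resulting tuple can then be passed straight through, for example tabula.read_pdf("data.pdf", area=area, pages=1).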
Export PDF to Excel
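Exporting data from a PDF file to Excel is also easy with Python. Note that tabula.convert_into only writes CSV, TSV, or JSON, so for an .xlsx file, read the table into a DataFrame and let pandas write it (this needs an Excel writer such as openpyxl installed):
tables = tabula.read_pdf("offense.pdf")
# writing the first table to an Excel file without the index column
tables[0].to_excel("offense_testing.xlsx", index=False)
If you ask tabula for JSON instead (output_format="json"), each cell comes back with its text and its position on the page:
[{'extraction_method': 'stream', 'top': 0.0, 'left': 0.0, 'width': 564.8800048828125, 'height': 528.8800048828125, 'data': [[{'top': 0.0, 'left': 0.0, 'width': 0.0, 'height': 0.0, 'text': ''}, {'top': 128.59, 'left': 253.13, 'width': 20.36700439453125, 'height': 4.980000019073486, 'text': 'mpg'}, {'top': 128.59, 'left': 283.9, 'width': 13.915863037109375, 'height': 4.980000019073486, 'text': 'cyl'}, {'top': 128.59, 'left': 313.23, and many more....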
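In the JSON-style output above, each table is a dictionary whose data key holds a list of rows, and each row is a list of cells with a text key plus position fields. A minimal sketch of pulling just the text out of that structure follows. The sample is abbreviated from the output above with rounded numbers, and `table_text` is a hypothetical helper name.

```python
def table_text(table):
    """Return the table's cells as a list of rows of plain strings."""
    return [[cell["text"] for cell in row] for row in table["data"]]


# Abbreviated sample shaped like the output above (values rounded for readability).
sample = [{
    "extraction_method": "stream",
    "top": 0.0, "left": 0.0, "width": 564.88, "height": 528.88,
    "data": [[
        {"top": 0.0, "left": 0.0, "width": 0.0, "height": 0.0, "text": ""},
        {"top": 128.59, "left": 253.13, "width": 20.37, "height": 4.98, "text": "mpg"},
        {"top": 128.59, "left": 283.9, "width": 13.92, "height": 4.98, "text": "cyl"},
    ]],
}]

for table in sample:
    print(table_text(table))  # [['', 'mpg', 'cyl']]
```

Keeping the position fields around is useful when you want to work out an area for a later, more targeted read of the same page.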
I hope you liked this article on Python for PDF. Feel free to ask your questions in the comments section.