Wikipedia is one of the largest platforms providing information on almost any topic for free. From school presentations to professional research, it helps everybody, and you have almost certainly visited it at least once. In this article, I will take you through how to scrape Wikipedia articles with Python.
Unlike many other information websites, Wikipedia has its own API for retrieving data from its articles. Python, being a general-purpose programming language, has packages for almost every task, including one named wikipedia that we can use to scrape Wikipedia articles.
Scrape Wikipedia with Python
To scrape useful information from Wikipedia, you need to install a package named wikipedia, which can be installed with the pip command: pip install wikipedia. Once the package is installed, let’s start by importing it:
import wikipedia as wiki
To explain the use of this package, I will scrape information about Python. The code below returns the search results for our input. In our case, it returns the titles of the articles that match Python:
print(wiki.search("Python"))
['Python (programming language)', 'Python', 'Monty Python', 'Ball python', 'Setuptools', 'PYTHON', 'Burmese python', 'Python (missile)', 'History of Python', 'Reticulated python']
Now let’s see whether Wikipedia’s search engine suggests python if we type only the first few letters of its spelling. The suggest function returns a string, or None when there is no good suggestion:
print(wiki.suggest("Pyth"))
python
Yes, it works. Now let’s have a look at how we can get the summary of an article on Wikipedia:
print(wiki.summary("Python"))
Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.
Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released in 2000, introduced features like list comprehensions and a garbage collection system with reference counting. Python 3.0, released in 2008, was a major revision of the language that is not completely backward-compatible, and much Python 2 code does not run unmodified on Python 3. The Python 2 language was officially discontinued in 2020 (first planned for 2015), and "Python 2.7.18 is the last Python 2.7 release and therefore the last Python 2 release." No more security patches or other improvements will be released for it. With Python 2's end-of-life, only Python 3.5.x and later are supported. Python interpreters are available for many operating systems. A global community of programmers develops and maintains CPython, a free and open-source reference implementation. A non-profit organization, the Python Software Foundation, manages and directs resources for Python and CPython development.
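A title such as Python can refer to several articles, and in that case the package may raise a DisambiguationError instead of returning a summary. The sketch below shows one possible way to handle that by falling back to the first candidate title. The helper name summarize and its optional wiki argument are my own assumptions, not part of the package; the argument is only there so the function can be tried with a stand-in object instead of the real module.

```python
def summarize(title, sentences=2, wiki=None):
    """Return a short summary, or None if no article is found.

    Hypothetical helper: on an ambiguous title it retries with the
    first candidate that Wikipedia proposes.
    """
    if wiki is None:
        # Imported lazily; this is the same package used in the article.
        import wikipedia as wiki
    try:
        return wiki.summary(title, sentences=sentences)
    except wiki.exceptions.DisambiguationError as err:
        # err.options holds the candidate article titles.
        if not err.options:
            return None
        return wiki.summary(err.options[0], sentences=sentences)
    except wiki.exceptions.PageError:
        # Raised when no article matches the title.
        return None
```

Calling summarize("Python") then returns a short summary whenever one can be resolved. Note that the retry itself can still fail if the first candidate is ambiguous too, so treat this as a starting point rather than a complete solution.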
If you want to read the summary in a language other than English, you can do that too. I will get the same summary in French:
wiki.set_lang("fr")
print(wiki.summary("Python"))
Python (/ˈpaɪ.θɑn/) est un langage de programmation interprété, multi-paradigme et multiplateformes. Il favorise la programmation impérative structurée, fonctionnelle et orientée objet. Il est doté d'un typage dynamique fort, d'une gestion automatique de la mémoire par ramasse-miettes et d'un système de gestion d'exceptions ; il est ainsi similaire à Perl, Ruby, Scheme, Smalltalk et Tcl. Le langage Python est placé sous une licence libre proche de la licence BSD et fonctionne sur la plupart des plates-formes informatiques, des smartphones aux ordinateurs centraux, de Windows à Unix avec notamment GNU/Linux en passant par macOS, ou encore Android, iOS, et peut aussi être traduit en Java ou .NET. Il est conçu pour optimiser la productivité des programmeurs en offrant des outils de haut niveau et une syntaxe simple à utiliser. Il est également apprécié par certains pédagogues qui y trouvent un langage où la syntaxe, clairement séparée des mécanismes de bas niveau, permet une initiation aisée aux concepts de base de la programmation.
Now let’s change the language back to English and look at more details from the article. Here I will scrape all the information we would get by reading about Python on Wikipedia:
wiki.set_lang("en")
p = wiki.page("Python")
To get the title:
print(p.title)
Python (programming language)
To get the URL of the article:
print(p.url)
https://en.wikipedia.org/wiki/Python_(programming_language)
To scrape the full article:
print(p.content)
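The full article comes back as one long plain-text string in which section headings are written like == History ==. If you want to work with one section at a time, a small helper can split the text on those heading lines. This is only a sketch: the function name split_sections and the "Introduction" key used for the text before the first heading are my own choices, and sections that share a heading will overwrite each other.

```python
import re

# Heading lines look like "== History ==" or "=== Early versions ===".
HEADING = re.compile(r"^(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)

def split_sections(content):
    """Split article text into a {heading: text} dictionary."""
    sections = {}
    title = "Introduction"  # label for the text before the first heading
    start = 0
    for match in HEADING.finditer(content):
        # Store the text collected so far under the previous heading.
        sections[title] = content[start:match.start()].strip()
        title = match.group(2)
        start = match.end()
    sections[title] = content[start:].strip()
    return sections
```

With the page object from above, split_sections(p.content) gives a dictionary whose keys are the section headings of the article.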
To get all the images in the article:
print(p.images)
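The images attribute is a list of image URLs, and it usually includes icons and logos in formats you may not want. As a rough sketch, you could keep only the URLs whose file extension you care about. The helper name filter_images and the default extensions are assumptions made for this example.

```python
from urllib.parse import urlparse

def filter_images(urls, extensions=(".jpg", ".jpeg", ".png")):
    """Keep only the URLs whose path ends with one of the extensions."""
    wanted = tuple(ext.lower() for ext in extensions)
    # Compare against the URL path so query strings are ignored.
    return [url for url in urls if urlparse(url).path.lower().endswith(wanted)]
```

For example, filter_images(p.images) drops the SVG files from the list, while filter_images(p.images, extensions=(".svg",)) keeps only those.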
And to get the titles of all the Wikipedia pages linked from the article:
print(p.links)
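Once you have a page object, you will probably want to store what you scraped instead of just printing it. Here is a minimal sketch that collects the attributes used above into a dictionary and writes it to a JSON file. The names page_to_record and save_record, the max_links limit, and the file name in the usage note are all hypothetical.

```python
import json

def page_to_record(page, max_links=20):
    """Collect the page attributes used above into a plain dictionary."""
    return {
        "title": page.title,
        "url": page.url,
        "content": page.content,
        "images": list(page.images),
        # Articles can link to hundreds of pages, so keep only the first few.
        "links": list(page.links)[:max_links],
    }

def save_record(record, path):
    """Write the record to a UTF-8 encoded JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, ensure_ascii=False, indent=2)
```

With the page object p from above, save_record(page_to_record(p), "python.json") would store the article in a file named python.json.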
I hope you liked this article on how to scrape Wikipedia with Python. Feel free to ask your questions in the comments section below. You can also follow me on Medium to learn more about Machine Learning.