When we don’t have data, we either create it or extract it from the internet. The three most frequent tasks when retrieving data from the internet (web scraping) are extracting text, URLs and tables from web pages.
Beautiful Soup and lxml are the Python libraries we will use today to obtain data from HTML files. To install them, run the following commands in your command prompt:
pip install beautifulsoup4
pip install lxml
Beautiful Soup is a library which can parse anything and everything for you. Valuable data that was once locked up in poorly designed websites is now within your reach. Projects that would have taken hours take only minutes with Beautiful Soup.
With that, let’s import the libraries we will need today:
import bs4 as bs  # Beautiful Soup
import urllib.request  # To read the URLs
import pandas as pd  # Will be used for reading the tables
Libraries exist so that we don’t have to write every function we need from scratch. Importing libraries into a program is a cost-effective way of avoiding the high-risk work of designing, developing, and testing software that has already gone through the development cycle.
urllib.request.urlopen() is a function that opens a URL, which may be a string or a Request object. It returns a response object containing the page’s content along with meta information such as headers. Today we are going to read the Wikipedia filmography page of my all-time favourite director, Martin Scorsese.
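As a minimal sketch, fetching the page could look like this (it requires an internet connection; the variable name sauce matches the one used below):

import urllib.request

# Fetch the filmography page (requires an internet connection)
url = 'https://en.wikipedia.org/wiki/Martin_Scorsese_filmography'
sauce = urllib.request.urlopen(url)
print(type(sauce))  # an HTTPResponse object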
Here, sauce is an object of type HTTPResponse which holds all the information about the page. Reading this response is our next step. Beautiful Soup supports several parser libraries, each with its own merits and limitations.
Most of the time, I use the lxml parser to read HTML responses because it is swift and very lenient. We provide the response object and the type of parser we want to use to Beautiful Soup.
Beautiful Soup will convert the given webpage into a complex tree of Python objects which can be traversed to extract and manipulate the data.
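A minimal sketch of building the soup object; a small inline HTML snippet stands in for the fetched page here so the example runs offline — in the article you would pass the response object instead:

import bs4 as bs

# Illustrative stand-in for the downloaded page source
html = ('<html><head><title>Martin Scorsese filmography - Wikipedia</title>'
        '</head><body><p>Martin Scorsese is an American director.</p></body></html>')
soup = bs.BeautifulSoup(html, 'lxml')
print(type(soup))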
In layman’s terms, soup now holds the parsed source code of the webpage. From here we will use different functions to pull out the data we need.
Title of the page
Information in HTML pages is stored in tags such as <title>, <p>, <a>, etc. To get the information as pure text, without any tags attached, we use the string or text attribute.
Since a paragraph in HTML lives under the <p> tag, whenever we have to trace a paragraph in the soup (the Beautiful Soup object), we use the p attribute to find it.
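For illustration, using a small hypothetical snippet in place of the real page, the title and paragraph attributes work like this:

import bs4 as bs

html = ('<html><head><title>Martin Scorsese filmography</title></head>'
        '<body><p>First paragraph.</p><p>Second paragraph.</p></body></html>')
soup = bs.BeautifulSoup(html, 'lxml')

print(soup.title)         # the whole tag, markup included
print(soup.title.string)  # just the text inside the tag
print(soup.p)             # the first <p> tag in the document
print(soup.p.text)        # its text without the tag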
To fetch all the paragraphs from the webpage we will use the function find_all(). Beautiful Soup provides many methods for traversing the tree when you are searching for one or more tags; find_all() and find() are the two most used.
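The difference between the two can be sketched with a small hypothetical snippet:

import bs4 as bs

html = '<p>First paragraph.</p><p>Second paragraph.</p>'
soup = bs.BeautifulSoup(html, 'lxml')

# find() returns only the first matching tag; find_all() returns every match
print(soup.find('p').text)
print(len(soup.find_all('p')))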
No human would like to read such clunky text. Let’s make it more readable by removing the tags and printing each paragraph separately. For eliminating the tags we will use the text attribute, and for printing separately we will use a for loop:
for paragraph in soup.find_all('p'):
    print(paragraph.text)
The tag used for storing URLs in HTML is <a>. Similar to how we retrieved paragraphs, we can retrieve links too:
for url in soup.find_all('a'):
    print(url.get('href'))
Any webpage has some internal links which lead to other pages of the same website, and some external links which take us to pages on a different site altogether.
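One way to tell the two apart, sketched with a hypothetical snippet (on Wikipedia, internal links are relative paths while external ones are absolute URLs):

import bs4 as bs

# Illustrative links: one relative (internal), one absolute (external)
html = ('<a href="/wiki/Taxi_Driver">Taxi Driver</a>'
        '<a href="https://www.imdb.com/name/nm0000217/">IMDb</a>')
soup = bs.BeautifulSoup(html, 'lxml')

for url in soup.find_all('a'):
    href = url.get('href')
    # Relative paths stay inside the site; absolute URLs lead elsewhere
    kind = 'internal' if href.startswith('/') else 'external'
    print(kind, href)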
Though we can extract data from an HTML table using the Beautiful Soup object, which is the factory-standard way, I will show you a way that is more comfortable. We all know that pandas is an excellent tool for managing data, but few people know that we can also scrape tables from a website with it.
dfs = pd.read_html('https://en.wikipedia.org/wiki/Martin_Scorsese_filmography')
len(dfs)  # 11
There are a total of 11 tables on this webpage, and all of them are stored in the dfs object as a list of pandas DataFrames.
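Each table can then be picked out by its index. A minimal offline sketch, where a small illustrative HTML table stands in for the Wikipedia page (the column names and rows are made up for the example):

import io
import pandas as pd

html = '''<table>
  <tr><th>Year</th><th>Title</th></tr>
  <tr><td>1976</td><td>Taxi Driver</td></tr>
  <tr><td>1990</td><td>Goodfellas</td></tr>
</table>'''

dfs = pd.read_html(io.StringIO(html))  # one DataFrame per <table> found
df = dfs[0]                            # pick a table by its index
print(df)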
In a matter of seconds, we have all the data on the entire page, which we can use any way we want. On International Women’s Day, I wrote an article which was a by-product of my small project on web scraping. I hope you find this post helpful.