How to extract data from web pages for analysis


When we don’t have data, we either create it or extract it from the internet. The three most frequent tasks we all do when retrieving data from the internet (web scraping) are extracting text, URLs, and tables from web pages.

installation

Beautiful Soup and ‘lxml’ are Python libraries which we will use today to obtain data from HTML files. To install them, run the following commands in your command prompt:

pip install beautifulsoup4
pip install lxml
importing

Beautiful Soup is a library which can parse almost anything for you. Valuable data that was once locked up in poorly designed websites is now within your reach. Projects that would have taken hours take only minutes with Beautiful Soup.

With that, let’s import the libraries we will need today:

import bs4 as bs        # Beautiful Soup
import urllib.request   # to read the URLs
import pandas as pd     # will be used for reading the tables

Libraries exist so that we don’t have to write every function we need from scratch. Importing libraries into a program is a cost-effective way of minimising the high-risk aspects of designing, developing, and testing software that has already gone through the development cycle.



parsing
sauce = urllib.request.urlopen('https://en.wikipedia.org/wiki/Martin_Scorsese_filmography')

urlopen() is a function to open a URL, which can be either a string or a Request object. It returns a response object that carries the page’s contents along with meta information such as headers. Today we are going to read the Wikipedia filmography page of my all-time favourite director, Martin Scorsese.
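If you want to inspect that metadata yourself, the response object exposes it directly; a quick optional check using standard HTTPResponse methods:

print(sauce.status)                     # HTTP status code, e.g. 200 on success
print(sauce.getheader('Content-Type'))  # the page's content-type header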

printing the sauce object

Printing sauce shows that it is an object of type HTTPResponse, which holds all the information about the page. Reading this response is our next step. Beautiful Soup supports several parser libraries, each with its own merits and limitations.

soup = bs.BeautifulSoup(sauce, 'lxml')

Most of the time I use the lxml parser to read HTML responses because it is swift and very lenient. We give Beautiful Soup the response object and the name of the parser we want it to use.
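As an aside, if lxml is not installed, Python’s built-in ‘html.parser’ is a drop-in alternative. One caveat in this sketch: a response can only be read once, so we would have to re-open the URL first.

sauce = urllib.request.urlopen('https://en.wikipedia.org/wiki/Martin_Scorsese_filmography')  # re-open; responses are single-use
soup = bs.BeautifulSoup(sauce, 'html.parser')  # ships with Python; slower than lxml but dependency-free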

Beautiful Soup converts the given webpage into a complex tree of Python objects, which we can then traverse to extract and manipulate the data.
printing the soup object
In layman’s terms, soup holds the source code of the webpage. From here we will use different functions to pull out the data we need.
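One way to see that source in a readable form is prettify(), which returns the parsed HTML as an indented string (sliced here just to keep the output short):

print(soup.prettify()[:500])    # first 500 characters of the indented HTML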

title of the page
print(soup.title)

title of the page

Information in HTML pages lives inside tags such as <title>, <p>, <a>, etc. To get the information as pure text, without any tags attached, we use the string or text attributes.

print(soup.title.text)

title without tags
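As a side note, the two attributes differ slightly: string works only when a tag has a single piece of text inside it, while text gathers all nested text. A small sketch:

print(soup.title.string)      # the single text child of <title>
print(soup.title.get_text())  # .text is shorthand for get_text()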



paragraph

Paragraphs in HTML sit under the <p> tag, so whenever we need to trace a paragraph in the soup (the Beautiful Soup object), we use the ‘p’ attribute, which returns the first match.

print(soup.p)

paragraph of webpage with tags

To fetch all the paragraphs from the webpage, we use the function find_all(). Beautiful Soup provides many methods for traversing the tree when you are searching for one or more tags; find_all() and find() are the two most used.

print(soup.find_all('p'))

all paragraph with tags

No human would like to read such clunky text. Let’s shape it to our taste by removing the tags and printing each paragraph separately. To eliminate the tags we will again use the text attribute, and to print each paragraph separately we will use a for loop.

for paragraph in soup.find_all('p'):
    print(paragraph.text)

each paragraph separately, without tags
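Since find() was mentioned alongside find_all(), here is a quick sketch of it: it returns only the first match instead of a list, and find_all() accepts a limit argument when you want just a few results:

first_paragraph = soup.find('p')           # first <p> tag, or None if absent
print(first_paragraph.text)

first_three = soup.find_all('p', limit=3)  # stop after three matches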

links

The tag used for storing URLs in HTML is <a>. Similar to how we retrieved paragraphs, we can retrieve links too.

for url in soup.find_all('a'):
    print(url.get('href'))

URLs

Every webpage has some internal links that lead to other pages of the same website, and some external ones that take us to a different site altogether.
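To tell the two apart, one approach (my own sketch, not part of the walkthrough above) is to resolve every href against the page URL and compare domains:

from urllib.parse import urljoin, urlparse

base = 'https://en.wikipedia.org/wiki/Martin_Scorsese_filmography'
internal, external = [], []
for tag in soup.find_all('a'):
    href = tag.get('href')
    if not href:
        continue                    # some <a> tags carry no href at all
    full = urljoin(base, href)      # resolves relative links like '/wiki/...'
    if urlparse(full).netloc == urlparse(base).netloc:
        internal.append(full)
    else:
        external.append(full)
print(len(internal), len(external))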

tables

Though we can extract data from an HTML table using the Beautiful Soup object, which is the standard way, I will show you a more comfortable route. We all know that pandas is an excellent tool for managing data, but few people know that it can also scrape tables straight off a website.

dfs = pd.read_html('https://en.wikipedia.org/wiki/Martin_Scorsese_filmography')

len(dfs)
#11

There are a total of 11 tables on this webpage, all stored in dfs as a list of pandas DataFrames.
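To locate the table you want, a quick loop over the list shows each DataFrame’s index and shape:

for i, table in enumerate(dfs):
    print(i, table.shape)    # (rows, columns) of each extracted table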

dfs[1]

dataframe
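To keep a table around for later analysis, write it to disk; the filename here is just an example of mine, not anything prescribed:

dfs[1].to_csv('scorsese_films.csv', index=False)   # index=False drops the row numbers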

In a matter of seconds, we have all the data from the entire page, and we can use it any way we want. On International Women’s Day, I wrote an article that was a by-product of a small web-scraping project of mine. I hope you find this post helpful.