“Data science is only possible with data, and in the real world, the data is usually not waiting for you in a CSV file. You have to go after it. This is where the technique called web scraping comes into the picture.”
What is Web Scraping?
The term web scraping might seem intimidating if you’re an absolute beginner.
Let’s try to understand it with a simple example.
Imagine you have to fetch a large amount of data from websites and you want to do it as quickly as possible.
How would you do it without manually going to each website and getting the data?
Well, “Web Scraping” is the answer. It just makes this job easier and faster.
Web scraping is an automated method used to extract large amounts of data from websites.
The data on websites is mostly unstructured. Web scraping helps collect this unstructured data and store it in a structured form.
There are different ways to scrape websites, such as online services, APIs, or writing your own code.
Today I’m going to show you how to do it with Python.
Is Web Scraping Legal?
Before we get our hands dirty with the practical part, let’s first ask: is it legal?
The answer is that it depends on the website: some allow it and some don’t. To find out whether a website allows web scraping, look at the website’s “robots.txt” file, which is served from the site’s root at /robots.txt.
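Here is a minimal sketch of how you could check those rules from Python with the standard library’s urllib.robotparser. The robots.txt content, the "my-scraper" user agent, and the is_allowed helper are all made up for illustration; a real site’s rules will differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "my-scraper" is a hypothetical user agent name.
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://www.example.com/index.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://www.example.com/private/data.html"))
```

For a live site, you would instead point the parser at the real file with set_url() and load it with read() before calling can_fetch().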
Why Is Python Good for the Job?
The answer is quite straightforward… it’s just good for the job because I’m saying so. Just believe me and start learning Python… Lol.
Well, I’ve already made this case in my last post, so check it out.
So without much chit-chat let’s jump to the fun part…
There are a few libraries you will need, so fire up your terminal or command prompt and install the following libraries.
pip install requests
pip install lxml
pip install bs4
As always, I recommend that you visit the official documentation of these libraries, especially Beautiful Soup (bs4).
Now that our setup is ready with all the required libraries, let’s get started.
Example Task 0 – Grabbing the title of a page
Let’s start very simple: we will grab the title of a page. Remember that this is the HTML block with the title tag.
For this task we will use www.example.com which is a website specifically made to serve as an example domain. Let’s go through the main steps:
import requests

# Step 1: Use the requests library to grab the page
# Note, this may fail if you have a firewall blocking Python
# Note sometimes you need to run this twice if it fails the first time
res = requests.get("http://www.example.com")

The res object is a requests.models.Response object, and it contains the information returned by the website. The page’s HTML is available as a string in res.text.
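Since the request can fail on the first try, here is one way you might wrap it in a small retry helper. This is only a sketch: fetch_html, its attempts and timeout parameters, and the swappable get argument are names I made up for this example, not part of the requests library.

```python
import requests


def fetch_html(url, attempts=2, timeout=10, get=requests.get):
    """Return the HTML text of url, retrying if the request fails.

    fetch_html and its parameters are hypothetical names for this sketch.
    The get parameter defaults to requests.get and can be swapped out.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            res = get(url, timeout=timeout)
            res.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
            return res.text
        except requests.RequestException as error:
            last_error = error  # remember the failure and try again
    raise last_error


# Usage (makes a real network request, so it is left commented out):
# html = fetch_html("http://www.example.com")
```

The timeout keeps the script from hanging forever on an unresponsive server, and raise_for_status() turns error pages into exceptions instead of silently returning their HTML.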
Now we use BeautifulSoup to analyze the extracted page.
Technically we could use our own custom script to look for items in the string of res.text but the BeautifulSoup library already has lots of built-in tools and methods to grab information from a string of this nature (basically an HTML file).
Using BeautifulSoup we can create a “soup” object that contains all the “ingredients” of the webpage. Don’t ask me about the weird library names, I didn’t choose them! 🙂
import bs4

soup = bs4.BeautifulSoup(res.text, "lxml")

# Now you can print and check the soup object content, but this is an
# optional step
print(soup)

Now, let’s use the .select() method to grab elements. We are looking for the ‘title’ tag, so we will pass in ‘title’.
title_tag = soup.select('title')

# This statement returns a list of all the title elements, including their tags.
# That means you can perform all the list operations, like indexing and
# looping, on this object.
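To go from that list to the title text itself, you can index into it and call get_text(). Here is a small sketch that uses a hard-coded stand-in page instead of a live request, and Python’s built-in "html.parser" instead of lxml, so it runs with nothing but bs4 installed. Both of those are my substitutions for the sake of a self-contained example.

```python
import bs4

# A tiny stand-in page so the example runs without a network request.
html = "<html><head><title>Example Domain</title></head><body></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

title_tags = soup.select("title")    # a list of matching Tag objects
first_title = title_tags[0]          # index into it like any other list
title_text = first_title.get_text()  # the text without the surrounding tags

print(title_tags)
print(title_text)  # Example Domain
```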
Example Task 1 – Grabbing all elements of a class
Let’s try to grab all the section headings of the Wikipedia Article on Grace Hopper from this URL: https://en.wikipedia.org/wiki/Grace_Hopper
# Similar to the last example, get the request object
res = requests.get('https://en.wikipedia.org/wiki/Grace_Hopper')

# Create a soup object from the request
soup = bs4.BeautifulSoup(res.text, "lxml")

Now it’s time to figure out what we are actually looking for.
Inspect an element on the page and you will see that the section headers have the class “mw-headline”. Because this is a class and not a plain tag, we need to follow CSS selector syntax.
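|Syntax to pass to the .select() method|Match Results|
|soup.select('div')|All elements with the <div> tag|
|soup.select('#some_id')|The HTML element containing the id attribute some_id (a placeholder name)|
|soup.select('.notice')|All the HTML elements with the CSS class named notice (a placeholder name)|
|soup.select('div span')|Any elements named <span> that are within an element named <div>|
|soup.select('div > span')|Any elements named <span> that are directly within an element named <div>, with no other element in between|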
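# ".toctext" selects the entries in the table of contents;
# pass ".mw-headline" instead to select the headings in the article body.
for item in soup.select(".toctext"):
    print(item.text)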
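If you want to try the different selector forms without hitting a real website, here is a small sketch on a made-up HTML snippet. The ids, class names, and text are all hypothetical, and I’m using the built-in "html.parser" so it runs without lxml.

```python
import bs4

# Hypothetical HTML, made up for this example.
html = """
<div id="intro" class="notice">
  <span>direct child</span>
  <p><span>nested deeper</span></p>
</div>
<p class="notice">a second notice</p>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

by_tag = soup.select("div")            # every <div> element
by_id = soup.select("#intro")          # the element with id="intro"
by_class = soup.select(".notice")      # every element with class "notice"
descendants = soup.select("div span")  # <span> anywhere inside a <div>
children = soup.select("div > span")   # <span> directly inside a <div>

print(len(by_tag), len(by_id), len(by_class))
print([span.get_text() for span in descendants])
print([span.get_text() for span in children])
```

Notice that "div span" picks up both spans, while "div > span" only picks up the one that sits directly inside the div.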
Example Task 2 – Getting an Image from a Website
Let’s attempt to grab the image of the Deep Blue computer from this Wikipedia article: https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)
# As usual, get the request object and create a soup with it.
res = requests.get("https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)")
soup = bs4.BeautifulSoup(res.text, 'lxml')

# select() returns a list, so take the first match
image_info = soup.select('.thumbimage')
computer = image_info[0]

# You can make dictionary-like calls for parts of the Tag. In this case, we are
# interested in the src, or "source", of the image, which should be its own
# .jpg or .png link:
print(computer['src'])
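One thing to watch for: the src value may come back without a scheme (starting with "//") or as a path relative to the site. A minimal sketch of turning it into a full URL with the standard library’s urljoin follows. The absolute_url helper is a name I made up, and the "/static/images/logo.png" path is a hypothetical example.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)"

# A scheme-less src, plus a hypothetical site-relative one for comparison.
protocol_relative = "//upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/220px-Deep_Blue.jpg"
site_relative = "/static/images/logo.png"


def absolute_url(page_url: str, src: str) -> str:
    """Resolve an image src against the URL of the page it was found on."""
    return urljoin(page_url, src)


print(absolute_url(page_url, protocol_relative))
# https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/220px-Deep_Blue.jpg
print(absolute_url(page_url, site_relative))
# https://en.wikipedia.org/static/images/logo.png
```

An src that is already a full URL passes through urljoin unchanged, so it is safe to apply to every value you scrape.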
Now that you have the actual src link, you can download the image with requests and read its raw bytes from the .content attribute.
image_link = requests.get('https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Deep_Blue.jpg/220px-Deep_Blue.jpg')

# Let's write this to a file. Note the 'wb' mode, which denotes binary
# writing of the file.
with open('my_new_file_name.jpg', 'wb') as f:
    f.write(image_link.content)

Now check the directory and you will find a file named my_new_file_name.jpg, which is our desired picture.
Well, this was tiring. I hope you enjoyed reading this article and learned something from it. Don’t forget to check out my other awesome articles.
Keep coming back for new updates 🙂