Web Scraping for Data Science

Dec 24, 2020

As a data scientist, you don't always get data in CSV files or databases. Sometimes you have to gather data from various sources yourself, and that is where web scraping comes into the picture. Web scraping, also known as web harvesting or web data mining, is the process of automating data extraction from websites, and it is one of the handiest tools for collecting data efficiently and quickly.

Keep the following points in mind before scraping the web:

Robots.txt: This file is how a website communicates with web crawlers and other web robots; it tells them which areas of the site should not be processed or scanned. You can check it programmatically, as shown in the sketch after this list.

Terms of Service: Before scraping, check the website's "Terms of Use". If a website clearly states that web scraping is not allowed, you must respect that.


Denial of Service: Whether you are a hacker or just a researcher, overwhelming a site with requests and causing a denial of service can result in legal action against you.
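
As a minimal sketch of the robots.txt point, Python's standard-library urllib.robotparser can check whether a given user agent may fetch a page; the site URL and the "my-scraper" user agent below are hypothetical placeholders:

    from urllib.robotparser import RobotFileParser

    # Hypothetical site used for illustration.
    parser = RobotFileParser()
    parser.set_url("https://example.com/robots.txt")
    parser.read()  # fetch and parse the robots.txt file

    # Ask whether our user agent may fetch a given path.
    if parser.can_fetch("my-scraper", "https://example.com/some/page"):
        print("Allowed to scrape this page")
    else:
        print("Disallowed by robots.txt")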


Web Scraping Applications in Data Science:

Machine Learning: You can collect data from various websites and use it to train models for tasks such as regression, classification, and clustering.

Deep Learning: You can collect a large number of images to train a convolutional neural network for image classification.

Natural Language Processing: You can collect reviews, tweets, and comments from various websites to analyze the sentiments of different users.

Tools for Web Scraping:

  1. Beautiful Soup: Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

    pip install beautifulsoup4
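
    A minimal sketch of typical Beautiful Soup usage; the HTML snippet is inlined so the example runs without a network request:

    from bs4 import BeautifulSoup

    # Inline HTML keeps the example self-contained.
    html = """
    <html><body>
      <h1>Product list</h1>
      <ul>
        <li class="item">Widget A</li>
        <li class="item">Widget B</li>
      </ul>
    </body></html>
    """

    soup = BeautifulSoup(html, "html.parser")  # use Python's built-in parser
    print(soup.h1.text)  # -> Product list
    for li in soup.find_all("li", class_="item"):
        print(li.get_text())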

  2. MechanicalSoup: A Python library for automating interaction with websites. MechanicalSoup automatically stores and sends cookies, follows redirects, and can follow links and submit forms. It does not execute JavaScript. The project was unmaintained for several years while it lacked Python 3 support, but it has since been revived.

    pip install MechanicalSoup
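
    A small sketch of filling and submitting a form with MechanicalSoup; the URL and the form field name "q" are hypothetical placeholders for a page with a search form:

    import mechanicalsoup

    # StatefulBrowser keeps cookies and lets us navigate like a lightweight browser.
    browser = mechanicalsoup.StatefulBrowser(user_agent="my-scraper")
    browser.open("https://example.com")  # hypothetical page with a search form

    # Select the first <form> on the page, fill a field named "q", and submit.
    browser.select_form("form")
    browser["q"] = "web scraping"
    response = browser.submit_selected()
    print(response.status_code)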

  3. lxml: The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible with but superior to the well-known ElementTree API.

    pip install lxml
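
    A minimal sketch of parsing HTML and querying it with XPath in lxml; the HTML is inlined so the example is self-contained:

    from lxml import html

    page = html.fromstring("""
    <html><body>
      <a href="/docs">Docs</a>
      <a href="/blog">Blog</a>
    </body></html>
    """)

    # XPath is lxml's main query language.
    for href in page.xpath("//a/@href"):
        print(href)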

  4. Scrapy: Scrapy is a fast, high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

    pip install Scrapy
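
    A minimal spider sketch, modeled on Scrapy's own tutorial; it targets the public practice site quotes.toscrape.com:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        """Scrapes quote text and authors, following pagination links."""
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # CSS selectors extract each quote's text and author.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow the "next page" link recursively, if present.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

    Save this as quotes_spider.py and run it with scrapy runspider quotes_spider.py -o quotes.json to write the results to a JSON file.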

  5. Selenium: Selenium is a free, open-source automated testing framework used to validate web applications across different browsers and platforms. You can use multiple programming languages such as Java, C#, and Python to create Selenium test scripts, and testing done with these tools is usually referred to as Selenium testing. Selenium is not a single tool but a suite of software, with each piece catering to different testing needs of an organization. For scraping, it is valuable because it drives a real browser and can therefore handle pages that render content with JavaScript.

    pip install selenium
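
    A minimal scraping sketch with Selenium, assuming Chrome and a matching ChromeDriver are installed and on your PATH; the URL is a hypothetical placeholder:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # launches a real Chrome browser
    try:
        driver.get("https://example.com")  # hypothetical JavaScript-rendered page
        heading = driver.find_element(By.TAG_NAME, "h1")
        print(heading.text)  # text as rendered after scripts run
    finally:
        driver.quit()  # always release the browser process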

  6. Urllib: The urllib module is Python's built-in URL handling package. It is used to fetch URLs over a variety of protocols, chiefly through the urlopen function. It bundles several modules for working with URLs: urllib.request for opening and reading URLs (mostly HTTP); urllib.error, which defines the exception classes raised by urllib.request; and urllib.robotparser, whose RobotFileParser answers questions about whether a particular user agent may fetch a URL on a site that publishes a robots.txt file (as used in the sketch earlier in this post).

    Urllib ships with the Python standard library, so it needs no installation. The similarly named urllib3 is a separate third-party HTTP client, installed with:

    pip install urllib3
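
    A minimal sketch of fetching a page with urllib.request; the URL is a hypothetical placeholder:

    from urllib.request import urlopen
    from urllib.error import URLError

    try:
        # urlopen returns a file-like HTTP response object.
        with urlopen("https://example.com") as response:
            body = response.read().decode("utf-8")
            print(response.status, len(body))
    except URLError as exc:
        print("Request failed:", exc.reason)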
