A beginner’s guide to web scraping with Python

There are lots of great books to help you learn Python, but who actually reads them A to Z? (Spoiler: not me.)

Many people find tutorial books useful, but I do not typically learn by reading a book front to back. I learn by doing a project, struggling, figuring some things out, and then reading another book. So, throw away your book (for now), and let's learn some Python.

What follows is a guide to my first scraping project in Python. It assumes very little knowledge of Python and HTML. It is intended to illustrate how to access web page content with the Python library requests and how to parse that content using BeautifulSoup4, as well as JSON and pandas. I will briefly introduce Selenium, but I will not delve deeply into how to use that library; that topic deserves its own tutorial. Ultimately, I hope to show you some tricks and tips to make web scraping less overwhelming.

Installing our dependencies

Your total resources from this handbook come in at my GitHub repo. While you happen to need serve installing Python 3, investigate cross-take a look at the tutorials for Linux, Windows, and Mac.

$ python3 -m venv venv

$ source venv/bin/activate

$ pip install requests bs4 pandas

If you prefer using JupyterLab, you can run all of the code using this notebook. There are many ways to install JupyterLab, and here is one of them:

# from the same virtual environment as above, run:

$ pip install jupyterlab

Setting a goal for our web scraping project

Now we have our dependencies installed, but what does it take to scrape a webpage?

Let's take a step back and be sure to clarify our goal. Here is my list of requirements for a successful web scraping project.

  • We are gathering information that is worth the effort it takes to build a working web scraper.
  • We are downloading information that can be legally and ethically gathered by a web scraper.
  • We have some knowledge of how to find the target information in HTML code.
  • We have the right tools: in this case, the libraries BeautifulSoup and requests.
  • We know (or are willing to learn) how to parse JSON objects.
  • We have enough data skills to use pandas.

A comment on HTML: While HTML is the beast that runs the internet, what we mostly need to understand is how tags work. A tag is a collection of information sandwiched between angle-bracket-enclosed labels. For example, here is a pretend tag, called "pro-tip":

<pro-tip> All you need to know about html is how tags work </pro-tip>

We can access the information in there ("All you need to know…") by calling its tag, "pro-tip." How to find and access a tag will be addressed further on in this tutorial. For more of a look at HTML fundamentals, check out this article.
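To make that concrete, here is a minimal sketch (assuming BeautifulSoup4 is installed, as described above) that parses the pretend tag and pulls out the text it encloses:

from bs4 import BeautifulSoup

# parse the pretend "pro-tip" tag and grab the text inside it
fake_html = "<pro-tip> All you need to know about html is how tags work </pro-tip>"
demo_soup = BeautifulSoup(fake_html, 'html.parser')
print(demo_soup.find('pro-tip').string)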

What to look for in a web scraping project

Some goals for gathering data are better suited to web scraping than others. My guidelines for what qualifies as a good project are as follows.

There is no public API available for the data. It would be much easier to grab structured data through an API, and an API would help clarify both the legality and ethics of gathering the data. There needs to be a sizable amount of structured data with a regular, repeatable format to justify the effort, because web scraping can be a pain. BeautifulSoup (bs4) makes it easier, but there is no avoiding the individual idiosyncrasies of websites, which will require customization. Identical formatting of the data is not required, but it does make things easier. The more "edge cases" (departures from the norm) are present, the more complicated the scraping will be.

Disclaimer: I have zero legal training; the following is not intended to be formal legal advice.

On the note of legality, accessing vast troves of information can be intoxicating, but just because it is possible does not mean it should be done.

There is, thankfully, public information that can guide our morals and our web scrapers. Most websites have a robots.txt file associated with the site, indicating which scraping activities are permitted and which are not. It is largely there for interacting with search engines (the ultimate web scrapers). However, much of the information on websites is considered public data. As such, some treat the robots.txt file as a set of recommendations rather than a legally binding document. The robots.txt file does not address topics such as the ethical gathering and use of the data.
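If you want to check a robots.txt file programmatically before scraping, the standard library can help. Here is a rough sketch using urllib.robotparser; the Family Dollar URL is only an illustration, and the result depends on whatever rules the site currently publishes:

from urllib.robotparser import RobotFileParser

# read the site's robots.txt and ask whether a generic crawler may fetch a given path
rp = RobotFileParser()
rp.set_url("https://www.familydollar.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.familydollar.com/locations/"))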

Questions I ask myself before beginning a scraping project:

  • Am I scraping copyrighted material?
  • Will my scraping activity compromise individual privacy?
  • Am I making a large number of requests that may overload or damage a server?
  • Is it possible the scraping will expose intellectual property I do not own?
  • Are there terms of service governing use of the website, and am I following those?
  • Will my scraping activities diminish the value of the original data? (For example, do I plan to repackage the data as-is and perhaps siphon off website traffic from the original source?)

When I scrape a site, I make sure I can answer "no" to all of those questions.

For a deeper look at the legal concerns, see the 2018 publications Legality and Ethics of Web Scraping by Krotov and Silva and Twenty Years of Web Scraping and the Computer Fraud and Abuse Act by Sellars.

Now it's time to scrape!

After assessing the above, I came up with a project. My goal was to extract addresses for all Family Dollar stores in Idaho. These stores have an outsized presence in rural areas, so I wanted to understand how many there are in a rather rural state.

The starting point is the location page for Family Dollar.

To begin, let's load up our prerequisites in our Python virtual environment. The code from here on is meant to be added to a Python file (scraper.py if you're looking for a name) or run in a cell in JupyterLab.

import requests # for making standard html requests

from bs4 import BeautifulSoup # magical tool for parsing html data

import json # for parsing data

from pandas import DataFrame as df # premier library for data organization

Next, we request data from our target URL.

page = requests.get("https://locations.familydollar.com/id/")

soup = BeautifulSoup(page.text, 'html.parser')

BeautifulSoup will take HTML or XML content and transform it into a complex tree of objects. Here are several common object types that we will use.

  • BeautifulSoup: the parsed content
  • Tag: a standard HTML tag, the main type of bs4 element you will encounter
  • NavigableString: a string of text within a tag
  • Comment: a special type of NavigableString
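As a quick illustration of those types (a sketch using a throwaway snippet, not anything from the real site), you can inspect what bs4 hands back:

from bs4 import BeautifulSoup

snippet = "<p><!-- a hidden note --><b>bold text</b></p>"
demo = BeautifulSoup(snippet, 'html.parser')

print(type(demo))                     # bs4.BeautifulSoup: the parsed content
print(type(demo.find('b')))           # bs4.element.Tag
print(type(demo.find('b').string))    # bs4.element.NavigableString
print(type(demo.p.contents[0]))       # bs4.element.Comment (the hidden note)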

There is more to consider when we look at the requests.get() output. I have only used page.text to translate the requested page into something readable, but there are other output types:

  • page.text for text (the most common)
  • page.content for byte-by-byte output
  • page.json() for JSON objects
  • page.raw for the raw socket response (no thank you)
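For a quick feel for the difference, here is a small sketch against httpbin.org (my choice of endpoint is just an assumption for illustration; any JSON-returning URL would do):

import requests

resp = requests.get("https://httpbin.org/json")
print(resp.text[:60])      # decoded text (a str)
print(resp.content[:60])   # raw bytes
print(resp.json().keys())  # parsed JSON, only sensible when the response really is JSON
print(resp.raw)            # the underlying raw response object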

I have only worked on English-only websites that use the Latin alphabet. The default encoding settings in requests have worked fine for that. However, there is a rich internet world beyond English-only websites. To ensure that requests correctly parses the content, you can set the encoding for the text:

page = requests.get(URL)

page.encoding = 'ISO-8859-1'

soup = BeautifulSoup(page.text, 'html.parser')

Taking a closer look at BeautifulSoup tags, we see:

  • The bs4 element tag captures an HTML tag.
  • It has both a name and attributes that can be accessed like a dictionary: tag['someAttribute'].
  • If a tag has multiple attributes with the same name, only the first instance is accessed.
  • A tag's children are accessed via tag.contents.
  • All tag descendants can be accessed with tag.descendants.
  • You can always search the contents as a string using re.compile("your_string") instead of navigating the HTML tree.
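A small sketch of those access patterns, using a made-up anchor tag similar to what we will meet below:

from bs4 import BeautifulSoup

demo = BeautifulSoup('<a class="itemlist" href="https://example.com/boise">Boise</a>', 'html.parser')
tag = demo.find('a')

print(tag.name)      # 'a'
print(tag.attrs)     # {'class': ['itemlist'], 'href': 'https://example.com/boise'}
print(tag['href'])   # dictionary-style access to one attribute
print(tag.contents)  # ['Boise'], the tag's children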

Determine how to extract relevant content

Warning: this process can be frustrating.

Extraction during web scraping can be a daunting process filled with missteps. I think the best way to approach it is to start with one representative example and then scale up (this principle is true for any programming task). Viewing the page's HTML source code is essential. There are a number of ways to do this.

You can view the entire source code of a page using Python in your terminal (not recommended), and you should run such code at your own risk.
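A minimal sketch of what that looks like, assuming the soup object created above:

print(soup.prettify())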


While printing out the entire source code for a page might work for the toy examples shown in some tutorials, most modern websites have a massive amount of content on any one of their pages. Even the 404 page is likely to be filled with code for headers, footers, and so on.

It is usually easiest to browse the source code via View Page Source in your favorite browser (right-click, then select "view page source"). That is the most reliable way to find your target content (I will explain why in a moment).

In this instance, I need to find my target content (an address, city, state, and zip code) in this vast HTML ocean. Often, a simple search of the page source (ctrl + F) will yield the section where my target location is situated. Once I can actually see an example of my target content (the address for at least one store), I look for an attribute or tag that sets this content apart from the rest.

It would appear that, first, I need to gather web addresses for the different cities in Idaho with Family Dollar stores and visit those pages to get the address information. These web addresses all appear to be enclosed in an href tag. Great! I will try searching for that using the find_all command:

dollar_tree_list = soup.find_all('href')

dollar_tree_list

Searching for href did not yield anything, darn. This might have failed because href is nested inside the class itemlist. For the next attempt, search on itemlist. Because "class" is a reserved word in Python, class_ is used instead. The bs4 function soup.find_all() turned out to be the Swiss army knife of bs4 functions.

dollar_tree_list = soup.find_all(class_ = 'itemlist')

for i in dollar_tree_list[:2]:

  print(i)

Anecdotally, I found that searching for a specific class was often a successful approach. We can learn more about the object by finding out its type and length.

type(dollar_tree_list)

len(dollar_tree_list)

The content from this BeautifulSoup "ResultSet" can be extracted using .contents. This is also a good time to create a single representative example.

example = dollar_tree_list[2] # a representative example

example_content = example.contents

print(example_content)

Use .attrs to find what attributes are present in the contents of this object. Note: .contents usually returns a list of exactly one item, so the first step is to index that item using the bracket notation.

example_content = example.contents[0]

example_content.attrs

Now that I can see that href is an attribute, it can be extracted like a dictionary item:

example_href = example_content['href']

print(example_href)

Putting together our web scraper

All that exploration has given us a path forward. Here is the cleaned-up version of the logic we figured out above.

city_hrefs = []  # initialise empty list

for i in dollar_tree_list:
    cont = i.contents[0]
    href = cont['href']
    city_hrefs.append(href)

# check to be sure all went well
for i in city_hrefs[:2]:
    print(i)

The output is a list of URLs of Family Dollar stores in Idaho to scrape.

That said, I still don't have address information! Now, each city URL needs to be scraped to get this information, so we restart the process using a single, representative example.

page2 = requests.get(city_hrefs[2]) # again establish a representative example

soup2 = BeautifulSoup(page2.text, 'html.parser')

The address information is nested within type="application/ld+json". After doing a lot of geolocation scraping, I have come to recognize this as a common structure for storing address information. Fortunately, soup2.find_all() also enables searching by type.

arco = soup2.find_all(type="application/ld+json")

print(arco[1])

The address information is in the second list member! Finally!

I extracted the contents (from the second list item) using .contents (this is a good default action after filtering the soup). Again, since the output of .contents is a list of one, I indexed that list item:

arco_contents = arco[1].contents[0]

arco_contents

Wow, looking good. The format presented here is consistent with the JSON format (also, the type did have "json" in its name). A JSON object can act like a dictionary with nested dictionaries inside. It is actually a nice format to work with once you become familiar with it (and it is certainly much easier to program against than a long series of RegEx commands). Although this structurally looks like a JSON object, it is still a bs4 object and needs a formal programmatic conversion to JSON to be accessed as a JSON object:

arco_json =  json.loads(arco_contents)

type(arco_json)

print(arco_json)

In that content is a key called address that holds the desired address information in a smaller nested dictionary. It can be retrieved like this:

arco_address = arco_json['address']

arco_address

Okay, we're serious this time. Now I can iterate over the list of store URLs in Idaho:

locs_dict = []  # initialise empty list

for link in city_hrefs:
    locpage = requests.get(link)  # request page info
    locsoup = BeautifulSoup(locpage.text, 'html.parser')  # parse the page's content
    locinfo = locsoup.find_all(type="application/ld+json")  # extract specific element
    loccont = locinfo[1].contents[0]  # get contents from the bs4 element set
    locjson = json.loads(loccont)  # convert to json
    locaddr = locjson['address']  # get address
    locs_dict.append(locaddr)  # add address to list

Cleaning our web scraping results with pandas

We have loads of data in a dictionary, but we also have some extra crud that will make reusing our data more complex than it needs to be. For some final data organization steps, we convert to a pandas data frame, drop the unneeded columns ("@type" and "addressCountry"), and check the top five rows to ensure that everything looks alright.

locs_df = df.from_records(locs_dict)

locs_df.drop(['@type', 'addressCountry'], axis = 1, inplace = True)

locs_df.head(n = 5)

Make sure to save your results!!

df.to_csv(locs_df, "family_dollar_ID_locations.csv", sep = ",", index = False)

We did it! There is a comma-separated list of all the Idaho Family Dollar stores. What a wild ride.
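If you want a quick sanity check on the saved file, a short sketch like this works (the exact counts will depend on when you run the scrape):

from pandas import read_csv

saved = read_csv("family_dollar_ID_locations.csv")
print(saved.shape)   # (number of stores, number of columns)
print(saved.head())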

A few words on Selenium and data scraping

Selenium is a common utility for automating interaction with a webpage. To explain why it is essential at times, let's go through an example using Walgreens' website. Inspect Element provides the code for what is displayed in a browser:

While View Page Source provides the code for what requests will obtain:

When these two don't agree, there are plugins modifying the source code, so it should be accessed after the page has loaded in a browser. requests cannot do that, but Selenium can.

Selenium requires a web driver to retrieve the content. It actually opens a web browser, and the page content is collected. Selenium is powerful: it can interact with loaded content in many ways (read the documentation). After getting data with Selenium, continue to use BeautifulSoup as before:

from selenium import webdriver  # selenium must be installed separately (pip install selenium)

url = "https://www.walgreens.com/storelistings/storesbycity.jsp?requestType=locator&state=ID"

driver = webdriver.Firefox(executable_path = 'mypath/geckodriver.exe')

driver.get(url)

soup_ID = BeautifulSoup(driver.page_source, 'html.parser')

store_link_soup = soup_ID.find_all(class_ = 'col-xl-4 col-lg-4 col-md-4')

I didn't need Selenium for Family Dollar, but I do keep it on hand for those times when rendered content differs from the source code.
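As a rough sketch of how I check whether Selenium is needed (assuming the selenium package and a geckodriver binary are installed; the driver path is the same placeholder used above), compare what requests sees with what a real browser renders:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.walgreens.com/storelistings/storesbycity.jsp?requestType=locator&state=ID"

static_soup = BeautifulSoup(requests.get(url).text, 'html.parser')

driver = webdriver.Firefox(executable_path = 'mypath/geckodriver.exe')
driver.get(url)
rendered_soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# if the rendered page has far more links, JavaScript is adding content after load
print(len(static_soup.find_all('a')), len(rendered_soup.find_all('a')))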

Wrapping up

In conclusion, when using web scraping to accomplish a meaningful task:

  • Be patient
  • Consult the manuals (they are very helpful)

If you are curious about the answer:

There are many, many Family Dollar stores in America.

The full source code is:

import requests
from bs4 import BeautifulSoup
import json
from pandas import DataFrame as df

page = requests.get("https://www.familydollar.com/locations/")
soup = BeautifulSoup(page.text, 'html.parser')

# find all state links
state_list = soup.find_all(class_ = 'itemlist')

state_links = []

for i in state_list:
    cont = i.contents[0]
    attr = cont.attrs
    hrefs = attr['href']
    state_links.append(hrefs)

# find all city links
city_links = []

for link in state_links:
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'html.parser')
    familydollar_list = soup.find_all(class_ = 'itemlist')
    for store in familydollar_list:
        cont = store.contents[0]
        attr = cont.attrs
        city_hrefs = attr['href']
        city_links.append(city_hrefs)

# get individual store links
store_links = []

for link in city_links:
    locpage = requests.get(link)
    locsoup = BeautifulSoup(locpage.text, 'html.parser')
    locinfo = locsoup.find_all(type="application/ld+json")
    for i in locinfo:
        loccont = i.contents[0]
        locjson = json.loads(loccont)
        try:
            store_url = locjson['url']
            store_links.append(store_url)
        except:
            pass

# get address and geolocation information
stores = []

for store in store_links:
    storepage = requests.get(store)
    storesoup = BeautifulSoup(storepage.text, 'html.parser')
    storeinfo = storesoup.find_all(type="application/ld+json")
    for i in storeinfo:
        storecont = i.contents[0]
        storejson = json.loads(storecont)
        try:
            store_addr = storejson['address']
            store_addr.update(storejson['geo'])
            stores.append(store_addr)
        except:
            pass

# final data parsing
stores_df = df.from_records(stores)
stores_df.drop(['@type', 'addressCountry'], axis = 1, inplace = True)
stores_df['Store'] = "Family Dollar"

df.to_csv(stores_df, "family_dollar_locations.csv", sep = ",", index = False)



Author's note: This article is an adaptation of a talk I gave at PyCascades in Portland, Oregon, on February 9, 2020.
