Making web requests and parsing the results is simple in Python, and several must-have modules help with it.

Urllib

Urllib is the built-in Python module for HTTP requests; the main article is Python Programming/Internet.

try:
    import urllib2  # Python 2
except ImportError:  # ModuleNotFoundError (3.6+) is a subclass of ImportError
    import urllib.request as urllib2  # Python 3: urlopen lives in urllib.request

url = 'https://www.google.com'
u = urllib2.urlopen(url)
content = u.read()  # content now holds all of the HTML from google.com
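
In Python 3, read() returns bytes; decode them if you need a string (a minimal sketch, assuming the page is UTF-8 encoded):

text = content.decode('utf-8')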

Requests

requests
Python HTTP for Humans
PyPI link: https://pypi.python.org/pypi/requests
Pip command: pip install requests

The Python requests library simplifies HTTP requests. It provides a function for each HTTP method (a short POST example follows the list):

  • GET (requests.get)
  • POST (requests.post)
  • HEAD (requests.head)
  • PUT (requests.put)
  • DELETE (requests.delete)
  • OPTIONS (requests.options)
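
For example, a POST sends form data in the request body; the other functions work the same way. A minimal sketch (httpbin.org is a public echo service, used here only for illustration):

import requests

# POST form data; httpbin echoes the request back
r = requests.post('https://httpbin.org/post', data={'key': 'value'})
print(r.status_code)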

Basic request

import requests

url = 'https://www.google.com'
r = requests.get(url)

The response object

The response object returned by these functions has many useful attributes and methods; a short example follows the list below.

>>> import requests
>>> r = requests.get('https://www.google.com')
>>> print(r)
<Response [200]>
>>> dir(r) # dir() lists every attribute and method available on the object
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']
  • r.content returns the response body as bytes, while r.text returns it decoded as a string; r.text is usually preferred for HTML.
  • r.encoding holds the encoding used to decode r.text.
  • r.headers shows the headers returned by the website.
  • r.is_redirect and r.is_permanent_redirect show whether the response was a redirect.
  • r.iter_content iterates over the response body in chunks of bytes (one byte at a time by default). To convert the bytes to strings, decode them with the encoding in r.encoding.
  • r.iter_lines is like r.iter_content, but iterates over the body line by line. It also yields bytes.
  • r.json() parses the response body into a Python dict if the server returned JSON.
  • r.raw returns the underlying urllib3.response.HTTPResponse object.
  • r.status_code returns the HTTP status code sent by the server. Code 200 means success, while 4xx and 5xx codes indicate errors. r.raise_for_status() raises an exception if the status code indicates an error.
  • r.url returns the final URL of the request.
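
Here is a minimal sketch exercising a few of these attributes:

import requests

r = requests.get('https://www.google.com')
r.raise_for_status()              # raises an exception on 4xx/5xx responses
print(r.status_code)              # e.g. 200
print(r.encoding)                 # encoding used to decode r.text
print(r.headers['Content-Type'])  # one of the response headers
print(r.text[:100])               # first 100 characters of the decoded body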

Authentication

Requests has built-in support for authentication. Here is an example using basic authentication.

import requests

r = requests.get('http://example.com', auth=requests.auth.HTTPBasicAuth('username', 'password'))

For basic authentication you can simply pass a tuple instead.

import requests

r = requests.get('http://example.com', auth=('username', 'password'))

All of the other authentication types are described in the requests documentation; digest authentication, for example, follows the same pattern, as sketched below.
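
A minimal sketch using requests.auth.HTTPDigestAuth (example.com and the credentials are placeholders):

import requests
from requests.auth import HTTPDigestAuth

# digest authentication works just like basic auth
r = requests.get('http://example.com', auth=HTTPDigestAuth('username', 'password'))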

Queries

Query strings pass values in a URL. For example, when you make a Google search, the search URL has the form https://www.google.com/search?q=My+Search+Here&.... Anything after the ? is the query string, which takes the form url?name1=value1&name2=value2.... Requests can build these query strings automatically.

>>> import requests
>>> query = {'q':'test'}
>>> r = requests.get('https://www.google.com/search', params = query)
>>> print(r.url) #prints the final url
https://www.google.com/search?q=test

The real convenience shows with multiple parameters.

>>> import requests
>>> query = {'name':'test', 'fakeparam': 'yes', 'anotherfakeparam': 'yes again'}
>>> r = requests.get('http://example.com', params = query)
>>> print(r.url) #prints the final url
http://example.com/?name=test&fakeparam=yes&anotherfakeparam=yes+again

Not only does requests pass these values, it also converts special characters and whitespace into URL-safe percent-encoded form, as sketched below.
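
For instance, in this hypothetical query the & is encoded as %26 and the space becomes +:

>>> import requests
>>> query = {'q': 'a&b c'}
>>> r = requests.get('http://example.com', params = query)
>>> print(r.url)
http://example.com/?q=a%26b+c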

BeautifulSoup4

beautifulsoup4
Screen-scraping library
PyPI link: https://pypi.python.org/pypi/beautifulsoup4
Pip command: pip install beautifulsoup4
Import command: import bs4

BeautifulSoup4 is a powerful HTML parsing library. Let's try it with some example HTML.

>>> import bs4
>>> example_html = """<!DOCTYPE html>
... <html>
... <head>
... <title>Testing website</title>
... <style>.b{color: blue;}</style>
... </head>
... <body>
... <h1 class='b' id='hhh'>A Blue Header</h1>
... <p> I like blue text, I like blue text... </p>
... <p class = 'b'> This text is blue, yay yay yay!</p>
... <p class = 'b'>Check out the <a href = '#hhh'>Blue Header</a></p>
... </body>
... </html>
... """
>>> bs = bs4.BeautifulSoup(example_html, 'html.parser') # name a parser to avoid a warning
>>> print(bs)
<!DOCTYPE html>
<html><head><title>Testing website</title><style>.b{color: blue;}</style></head><body><h1 class="b" id="hhh">A Blue Header</h1><p> I like blue text, I like blue text... </p><p class="b"> This text is blue, yay yay yay!</p><p class="b">Check out the <a href="#hhh">Blue Header</a></p></body></html>
>>> print(bs.prettify()) # adds newlines and indentation
<!DOCTYPE html>
<html>
 <head>
  <title>
   Testing website
  </title>
  <style>
   .b{color: blue;}
  </style>
 </head>
 <body>
  <h1 class="b" id="hhh">
   A Blue Header
  </h1>
  <p>
   I like blue text, I like blue text...
  </p>
  <p class="b">
   This text is blue, yay yay yay!
  </p>
  <p class="b">
   Check out the
   <a href="#hhh">
    Blue Header
   </a>
  </p>
 </body>
</html>

Getting elements

There are two ways to access elements. The first is to walk down the tree manually, naming each tag in order until you reach the one you want.

>>> print(bs.html)
<html><head><title>Testing website</title><style>.b{color: blue;}</style></head><body><h1 class="b" id="hhh">A Blue Header</h1><p> I like blue text, I like blue text... </p><p class="b"> This text is blue, yay yay yay!</p><p class="b">Check out the <a href="#hhh">Blue Header</a></p></body></html>
>>> print(bs.html.body)
<body><h1 class="b" id="hhh">A Blue Header</h1><p> I like blue text, I like blue text... </p><p class="b"> This text is blue, yay yay yay!</p><p class="b">Check out the <a href="#hhh">Blue Header</a></p></body>
>>> print(bs.html.body.h1)
<h1 class="b" id="hhh">A Blue Header</h1>

However, this is inconvenient with large HTML documents. The find_all function finds all instances of a given element: it takes an HTML tag name, such as h1 or p, and returns every matching tag.

>>> p = bs.find_all('p')
>>> p
[<p> I like blue text, I like blue text... </p>, <p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]

On a large website this is still inconvenient, because there may be thousands of matches. You can narrow the search by class or id.

>>> blue = bs.find_all('p', class_='b')
>>> blue
[<p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]

Note the trailing underscore in class_: class is a reserved word in Python, so BeautifulSoup uses class_ for this argument. Alternatively, you can filter the results yourself.

>>> p = bs.find_all('p')
>>> p
[<p> I like blue text, I like blue text... </p>, <p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]
>>> blue = [tag for tag in p if 'b' in tag.get('class', [])]
>>> blue
[<p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]

Here tag.get('class', []) returns each element's list of classes (or an empty list if it has none), and the comprehension keeps only the tags whose classes include b. From the resulting list we can work with each element, for example retrieving the text inside.

>>> b = blue[0].text
>>> print(b)
 This text is blue, yay yay yay!
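
Tag attributes can be read by subscripting a tag like a dictionary. Continuing the session above, this pulls the link out of the second blue paragraph:

>>> link = blue[1].a
>>> link['href']
'#hhh'
>>> link.text
'Blue Header'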