Python Programming/Web
Making web requests and parsing the results in Python is straightforward, and several modules help with this.
Urllib
Urllib is the built-in Python module for HTTP requests; the main article is Python Programming/Internet.
try:
    import urllib2  # Python 2
except ImportError:  # ModuleNotFoundError (3.6+) is a subclass of ImportError
    import urllib.request as urllib2  # Python 3: urlopen lives in urllib.request
url = 'https://www.google.com'
u = urllib2.urlopen(url)
content = u.read()  # content now holds all of the HTML from google.com
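In Python 3 alone, the same fetch can be written directly with urllib.request. Here is a minimal sketch (the helper name fetch is our own, not part of the library):

```python
import urllib.request

def fetch(url, timeout=10):
    """Download a URL and return its body decoded with the server's declared charset."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # HTTPResponse headers behave like an email.message.Message
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset)

# content = fetch('https://www.google.com')  # performs a real network request
```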
Requests
Python HTTP for Humans
PyPi link: https://pypi.python.org/pypi/requests
Pip command: pip install requests
The Python requests library simplifies making HTTP requests. It provides a function for each HTTP method:
- GET (requests.get)
- POST (requests.post)
- HEAD (requests.head)
- PUT (requests.put)
- DELETE (requests.delete)
- OPTIONS (requests.options)
Basic request
import requests
url = 'https://www.google.com'
r = requests.get(url)
The response object
The response object returned by these functions exposes many attributes and methods.
>>> import requests
>>> r = requests.get('https://www.google.com')
>>> print(r)
<Response [200]>
>>> dir(r) # dir() lists the attributes and methods available on the object
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']
- r.content and r.text provide similar HTML content; r.content is the raw bytes, while r.text is the decoded text and is usually preferred.
- r.encoding is the encoding used to decode r.text.
- r.headers contains the headers returned by the server.
- r.is_redirect and r.is_permanent_redirect indicate whether the original URL was a redirect.
- r.iter_content iterates over the body in chunks of bytes. To convert bytes to a string, decode them with the encoding in r.encoding.
- r.iter_lines is like r.iter_content, but iterates over each line of the body; it also yields bytes.
- r.json() parses the body into a Python dict if the response is JSON.
- r.raw returns the underlying urllib3.response.HTTPResponse object.
- r.status_code is the HTTP status code sent by the server. Code 200 means success, while 4xx and 5xx codes indicate errors.
- r.raise_for_status() raises an exception if the status code indicates an error (4xx or 5xx).
- r.url is the final URL of the request.
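The status-code and JSON handling above can be sketched in plain Python. This is a simplified illustration of the checks requests performs, not the library's actual implementation:

```python
import json

class HTTPError(Exception):
    """Raised for 4xx/5xx status codes, mirroring requests.exceptions.HTTPError."""

def raise_for_status(status_code, reason=''):
    # Simplified sketch of the check raise_for_status performs internally
    if 400 <= status_code < 500:
        raise HTTPError(f'{status_code} Client Error: {reason}')
    if 500 <= status_code < 600:
        raise HTTPError(f'{status_code} Server Error: {reason}')

raise_for_status(200)             # success: no exception raised
body = '{"ok": true, "count": 3}' # imagine this is a response body
data = json.loads(body)           # what r.json() does with the body
print(data['count'])              # → 3
```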
Authentication
Requests has built-in authentication. Here is an example using HTTP Basic authentication.
import requests
r = requests.get('http://example.com', auth=requests.auth.HTTPBasicAuth('username', 'password'))
For Basic authentication, you can simply pass a tuple.
import requests
r = requests.get('http://example.com', auth=('username', 'password'))
All of the other authentication types are described in the requests documentation.
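Under the hood, Basic authentication simply base64-encodes "username:password" into an Authorization header (per RFC 7617). A minimal sketch of the header the auth= argument produces:

```python
import base64

def basic_auth_header(username, password):
    # HTTP Basic auth: base64-encode "username:password" (RFC 7617)
    token = base64.b64encode(f'{username}:{password}'.encode('utf-8')).decode('ascii')
    return {'Authorization': f'Basic {token}'}

print(basic_auth_header('username', 'password'))
# {'Authorization': 'Basic dXNlcm5hbWU6cGFzc3dvcmQ='}
```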
Queries
Query strings pass values in a URL. For example, when you make a Google search, the search URL has the form https://www.google.com/search?q=My+Search+Here&... Anything after the ? is the query string, which takes the form url?name1=value1&name2=value2... Requests can build these query strings automatically.
>>> import requests
>>> query = {'q':'test'}
>>> r = requests.get('https://www.google.com/search', params = query)
>>> print(r.url) #prints the final url
https://www.google.com/search?q=test
The true power is noticed in multiple entries.
>>> import requests
>>> query = {'name':'test', 'fakeparam': 'yes', 'anotherfakeparam': 'yes again'}
>>> r = requests.get('http://example.com', params = query)
>>> print(r.url) #prints the final url
http://example.com/?name=test&fakeparam=yes&anotherfakeparam=yes+again
Not only does it pass these values, it also percent-encodes special characters and turns whitespace into URL-safe form.
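The same encoding is available in the standard library via urllib.parse.urlencode, which is handy when requests is not installed:

```python
from urllib.parse import urlencode

query = {'name': 'test', 'fakeparam': 'yes', 'anotherfakeparam': 'yes again'}
qs = urlencode(query)  # spaces become '+', special characters are percent-encoded
print('http://example.com/?' + qs)
# http://example.com/?name=test&fakeparam=yes&anotherfakeparam=yes+again
```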
BeautifulSoup4
Screen-scraping library
PyPi link: https://pypi.python.org/pypi/beautifulsoup4
Pip command: pip install beautifulsoup4
Import command: import bs4
BeautifulSoup4 is a powerful HTML parsing library. Let's try it with some example HTML.
>>> import bs4
>>> example_html = """<!DOCTYPE html>
... <html>
... <head>
... <title>Testing website</title>
... <style>.b{color: blue;}</style>
... </head>
... <body>
... <h1 class='b' id='hhh'>A Blue Header</h1>
... <p> I like blue text, I like blue text... </p>
... <p class = 'b'> This text is blue, yay yay yay!</p>
... <p class = 'b'>Check out the <a href = '#hhh'>Blue Header</a></p>
... </body>
... </html>
... """
>>> bs = bs4.BeautifulSoup(example_html, 'html.parser')
>>> print(bs)
<!DOCTYPE html>
<html><head><title>Testing website</title><style>.b{color: blue;}</style></head><body><h1 class="b" id="hhh">A Blue Header</h1><p> I like blue text, I like blue text... </p><p class="b"> This text is blue, yay yay yay!</p><p class="b">Check out the <a href="#hhh">Blue Header</a></p></body></html>
>>> print(bs.prettify()) # adds newlines and indentation
<!DOCTYPE html>
<html>
<head>
<title>
Testing website
</title>
<style>
.b{color: blue;}
</style>
</head>
<body>
<h1 class="b" id="hhh">
A Blue Header
</h1>
<p>
I like blue text, I like blue text...
</p>
<p class="b">
This text is blue, yay yay yay!
</p>
<p class="b">
Check out the
<a href="#hhh">
Blue Header
</a>
</p>
</body>
</html>
Getting elements
There are two ways to access elements. The first is to type the tags manually, descending in order until you reach the tag you want.
>>> print(bs.html)
<html><head><title>Testing website</title><style>.b{color: blue;}</style></head><body><h1 class="b" id="hhh">A Blue Header</h1><p> I like blue text, I like blue text... </p><p class="b"> This text is blue, yay yay yay!</p><p class="b">Check out the <a href="#hhh">Blue Header</a></p></body></html>
>>> print(bs.html.body)
<body><h1 class="b" id="hhh">A Blue Header</h1><p> I like blue text, I like blue text... </p><p class="b"> This text is blue, yay yay yay!</p><p class="b">Check out the <a href="#hhh">Blue Header</a></p></body>
>>> print(bs.html.body.h1)
<h1 class="b" id="hhh">A Blue Header</h1>
However, this is inconvenient with large HTML documents. The find_all function finds all instances of a given element: it takes an HTML tag name, such as h1 or p, and returns every instance of it.
>>> p = bs.find_all('p')
>>> p
[<p> I like blue text, I like blue text... </p>, <p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]
This is still inconvenient on a large website, where there can be thousands of entries. You can narrow the search by class or id. Because class is a reserved word in Python, find_all takes the keyword argument class_.
>>> blue = bs.find_all('p', class_='b')
>>> blue
[<p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]
Alternatively, you can filter the results yourself.
>>> p = bs.find_all('p')
>>> p
[<p> I like blue text, I like blue text... </p>, <p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]
>>> blue = [tag for tag in p if 'b' in tag.get('class', [])]
>>> blue
[<p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]
This checks each element for a class attribute and keeps those whose classes include b. From the resulting list, we can work with each element, for example retrieving the text inside.
>>> b = blue[0].text
>>> print(b)
 This text is blue, yay yay yay!
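When BeautifulSoup is not available, the standard library's html.parser can handle simple extractions like this one. Here is a minimal sketch (the class name BlueText is our own) that collects the text of every p tag whose class list contains b:

```python
from html.parser import HTMLParser

class BlueText(HTMLParser):
    """Collect the text of every <p> whose class list contains 'b'."""
    def __init__(self):
        super().__init__()
        self.in_blue_p = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'p' and 'b' in classes:
            self.in_blue_p = True
            self.results.append('')  # start accumulating text for this <p>

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_blue_p = False

    def handle_data(self, data):
        if self.in_blue_p:
            self.results[-1] += data  # includes text inside nested tags like <a>

parser = BlueText()
parser.feed("<p>plain</p><p class='b'>blue one</p><p class='b'>blue <a href='#'>two</a></p>")
print(parser.results)  # → ['blue one', 'blue two']
```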