Python Programming/Web

beautifulsoup4
Screen-scraping library
PyPI link	https://pypi.python.org/pypi/beautifulsoup4
Pip command	pip install beautifulsoup4
Import command	import bs4

requests
Python HTTP for Humans
PyPI link	https://pypi.python.org/pypi/requests
Pip command	pip install requests

Requesting and parsing web pages in Python is straightforward, and a few essential modules help with the job.

Urllib

Urllib is Python's built-in module for HTTP requests; the main article is Python Programming/Internet.

try:
    import urllib2  # Python 2
except ImportError:  # Python 3: urlopen lives in urllib.request, not urllib.parse
    import urllib.request as urllib2

url = 'https://www.google.com'
u = urllib2.urlopen(url)
content = u.read()  # content now holds the HTML of google.com (as bytes on Python 3)
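On Python 3 alone, the compatibility import is unnecessary. The following is a minimal sketch assuming Python 3 only; the helper name fetch_text, the 10-second timeout, and the UTF-8 fallback are illustrative choices, not part of the original example.

```python
import urllib.request


def fetch_text(url, timeout=10):
    """Return the body of `url` decoded to str (hypothetical helper)."""
    # The with-block closes the connection even if read() raises.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes, not str
        # Use the charset declared in the response headers, if any;
        # falling back to UTF-8 is an assumption of this sketch.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Example call (needs network access):
# html = fetch_text("https://www.google.com")
```

Decoding explicitly makes the difference between the raw bytes and the text visible, which urlopen alone does not do for you.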

Requests

The Python requests library simplifies HTTP requests. It provides a function for each HTTP method:

GET (requests.get)
POST (requests.post)
HEAD (requests.head)
PUT (requests.put)
DELETE (requests.delete)
OPTIONS (requests.options)
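Each helper differs only in the HTTP method it uses. The sketch below builds the requests without sending them, so it runs offline; http://example.com is a placeholder URL, and building a request by hand like this is only for illustration.

```python
import requests

url = "http://example.com"  # placeholder URL
methods = ["GET", "POST", "HEAD", "PUT", "DELETE", "OPTIONS"]

# Request(...).prepare() builds the request that would be sent,
# but nothing goes over the network until a Session sends it.
prepared = [requests.Request(method, url).prepare() for method in methods]

for p in prepared:
    print(p.method, p.url)
```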

Basic request

import requests

url = 'https://www.google.com'
r = requests.get(url)

Response object

The response returned by the previous call exposes many attributes and methods for retrieving data.

>>> import requests
>>> r = requests.get('https://www.google.com')
>>> print(r)
<Response [200]>
>>> dir(r)  # dir lists every attribute and method of r, i.e. everything reachable as r.<name>
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']

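r.content and r.text both give the body of the response: r.content is the raw bytes, while r.text is the same data decoded to a string. r.text is usually preferred for HTML.
r.encoding is the encoding used to decode r.text, taken from the response headers.
r.headers holds the headers the server sent back.
r.is_redirect and r.is_permanent_redirect tell whether this response is itself a redirect. Redirects that were already followed are listed in r.history.
r.iter_content iterates over the response body in chunks of bytes. To turn the bytes into a string, decode them with the encoding in r.encoding.
r.iter_lines is similar to r.iter_content, but iterates over the body one line at a time. It also yields bytes.
r.json parses the body into a Python object (typically a dictionary or list) when the server returns JSON.
r.raw returns the underlying urllib3.response.HTTPResponse object.
r.status_code is the HTTP status code sent by the server. Codes in the 200 range mean success, 300-range codes are redirects, and 400- and 500-range codes are errors. r.raise_for_status raises an exception if the status code is a 4xx or 5xx error.
r.url is the final URL of the response.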
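The relationship between the raw bytes and the decoded text, and the way status codes fall into classes by their first digit, can be sketched with the standard library alone. The byte string and the helper name status_class are made up for illustration; no request is sent.

```python
from http import HTTPStatus

# Stand-in for r.content: the raw bytes a server might send.
content = "<p>café</p>".encode("utf-8")

# r.text is essentially r.content decoded with r.encoding.
encoding = "utf-8"
text = content.decode(encoding)


def status_class(code):
    """Map an HTTP status code to its broad category (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "informational or unknown"


for code in (200, 301, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```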

Authentication

Requests has authentication built in. Here is an example using basic authentication.

import requests

r = requests.get('http://example.com', auth = requests.auth.HTTPBasicAuth('username', 'password'))

For basic authentication, you can simply pass a tuple.

import requests

r = requests.get('http://example.com', auth = ('username', 'password'))

All other types of authentication are covered in the requests documentation.
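Basic authentication itself is simple: the credentials are sent base64-encoded in an Authorization header. A standard-library sketch of that encoding follows, using the same placeholder credentials as above; the helper name basic_auth_header is hypothetical.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of the Authorization header for HTTP basic auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return "Basic " + token


header = basic_auth_header("username", "password")
print(header)
```

Base64 is an encoding, not encryption, so basic authentication is only reasonably safe over HTTPS.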

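Queries

Queries pass values in a URL. For example, when you do a Google search, the search URL looks like https://www.google.com/search?q=My+Search+Here&.... Everything after the question mark is the query, which has the form url?name1=value1&name2=value2.... Requests can build these queries automatically.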

>>> import requests
>>> query = {'q':'test'}
>>> r = requests.get('https://www.google.com/search', params = query)
>>> print(r.url) #prints the final url
https://www.google.com/search?q=test

The real power shows with multiple entries.

>>> import requests
>>> query = {'name':'test', 'fakeparam': 'yes', 'anotherfakeparam': 'yes again'}
>>> r = requests.get('http://example.com', params = query)
>>> print(r.url) #prints the final url
http://example.com/?name=test&fakeparam=yes&anotherfakeparam=yes+again

It not only passes the values, but also converts special characters and spaces into URL-safe forms (URL encoding).
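The same kind of encoding can be reproduced with the standard library, which helps when checking what a query will look like before sending it. This is a sketch using urllib.parse rather than requests itself; the variable names are illustrative.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

query = {"name": "test", "fakeparam": "yes", "anotherfakeparam": "yes again"}
encoded = urlencode(query)  # spaces become '+'
url = "http://example.com/?" + encoded

# Characters with a special meaning in URLs are percent-encoded.
tricky = urlencode({"q": "a&b=c"})

# parse_qs reverses the encoding; each value comes back as a list.
decoded = parse_qs(urlsplit(url).query)

print(url)
print(tricky)
print(decoded)
```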

BeautifulSoup4

BeautifulSoup4 is a powerful HTML parsing library. Let's try it on some sample HTML.

>>> import bs4
>>> example_html = """<!DOCTYPE html>
... <html>
... <head>
... <title>Testing website</title>
... <style>.b{color: blue;}</style>
... </head>
... <body>
... <h1 class='b' id = 'hhh'>A Blue Header</h1>
... <p> I like blue text, I like blue text... </p>
... <p class = 'b'> This text is blue, yay yay yay!</p>
... <p class = 'b'>Check out the <a href = '#hhh'>Blue Header</a></p>
... </body>
... </html>
... """
>>> bs = bs4.BeautifulSoup(example_html, 'html.parser')  # name the parser explicitly to avoid a warning
>>> print(bs)
<!DOCTYPE html>
<html><head><title>Testing website</title><style>.b{color: blue;}</style></head><body><h1 class="b" id="hhh">A Blue Header</h1><p> I like blue text, I like blue text... </p><p class="b"> This text is blue, yay yay yay!</p><p class="b">Check out the <a href="#hhh">Blue Header</a></p></body></html>
>>> print(bs.prettify()) #adds in newlines
<!DOCTYPE html>
<html>
 <head>
  <title>
   Testing website
  </title>
  <style>
   .b{color: blue;}
  </style>
 </head>
 <body>
  <h1 class="b" id="hhh">
   A Blue Header
  </h1>
  <p>
   I like blue text, I like blue text...
  </p>
  <p class="b">
   This text is blue, yay yay yay!
  </p>
  <p class="b">
   Check out the
   <a href="#hhh">
    Blue Header
   </a>
  </p>
 </body>
</html>

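Getting elements

There are two ways to access elements. The first is to type out the tags by hand, walking down the tree in order until you reach the tag you want.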

>>> print(bs.html)
<html><head><title>Testing website</title><style>.b{color: blue;}</style></head><body><h1 class="b" id="hhh">A Blue Header</h1><p> I like blue text, I like blue text... </p><p class="b"> This text is blue, yay yay yay!</p><p class="b">Check out the <a href="#hhh">Blue Header</a></p></body></html>
>>> print(bs.html.body)
<body><h1 class="b" id="hhh">A Blue Header</h1><p> I like blue text, I like blue text... </p><p class="b"> This text is blue, yay yay yay!</p><p class="b">Check out the <a href="#hhh">Blue Header</a></p></body>
>>> print(bs.html.body.h1)
<h1 class="b" id="hhh">A Blue Header</h1>

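However, this is inconvenient for large HTML documents. The find_all function finds every instance of a given element. It takes an HTML tag name such as h1 or p and returns all matching elements.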

>>> p = bs.find_all('p')
>>> p
[<p> I like blue text, I like blue text... </p>, <p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]
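Each item that find_all returns is a tag object whose attributes and text can be read directly. A short sketch on a cut-down copy of the sample HTML follows; the variable names are illustrative, and the parser is named explicitly as 'html.parser'.

```python
import bs4

html = """<html><body>
<h1 class="b" id="hhh">A Blue Header</h1>
<p class="b">Check out the <a href="#hhh">Blue Header</a></p>
</body></html>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# find_all returns a list-like collection of Tag objects.
links = soup.find_all("a")
hrefs = [a["href"] for a in links]      # attributes are read like dict keys
labels = [a.get_text() for a in links]  # text inside each tag

# find returns only the first match, or None when nothing matches.
first_heading = soup.find("h1")
missing = soup.find("table")

print(hrefs, labels, first_heading.get_text(), missing)
```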

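This is still inconvenient on large sites, where there can be thousands of matches. It can be narrowed down by searching for a class or an ID.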

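>>> blue = bs.find_all('p', class_ = 'b')
>>> blue
[<p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]

Note the trailing underscore: class is a reserved word in Python, so the keyword argument is class_. Misspelling it as _class makes BeautifulSoup filter on a nonexistent attribute named _class, which returns an empty list. The same filtering can also be done by hand with a list comprehension.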

>>> p = bs.find_all('p')
>>> p
[<p> I like blue text, I like blue text... </p>, <p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]
>>> blue = [tag for tag in p if 'b' in tag.get('class', [])]
>>> blue
[<p class="b"> This text is blue, yay yay yay!</p>, <p class="b">Check out the <a href="#hhh">Blue Header</a></p>]

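This checks whether each element has a class attribute and, if so, whether b is among its classes. From the resulting list we can then work with each element, for example retrieving the text inside it.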
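BeautifulSoup also has built-in ways to express this kind of filter. The sketch below shows three of them on a cut-down copy of the sample HTML; note that select relies on the soupsieve package, which recent beautifulsoup4 releases install as a dependency, so treat its availability as an assumption.

```python
import bs4

html = """<h1 class="b" id="hhh">A Blue Header</h1>
<p>I like blue text</p>
<p class="b">This text is blue</p>
<p class="b">Check out the <a href="#hhh">Blue Header</a></p>"""

soup = bs4.BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all("p", class_="b")  # keyword filter on the class
by_selector = soup.select("p.b")             # CSS selector, same elements
header = soup.find(id="hhh")                 # look up a single tag by id

print([tag.get_text() for tag in by_keyword])
print(header.name)
```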

>>> b = blue[0].text
>>> print(b)
 This text is blue, yay yay yay!
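Putting the pieces together, a typical scraping step parses a page and pulls out its links. The following is a sketch rather than a complete scraper: the helper name extract_links and the sample HTML are hypothetical, and the network part is left commented out so the example runs offline.

```python
from urllib.parse import urljoin

import bs4


def extract_links(html, base_url):
    """Return (text, absolute_url) pairs for every <a href=...> in html."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):  # skip anchors without an href
        # urljoin resolves relative links against the page's own URL.
        links.append((a.get_text(strip=True), urljoin(base_url, a["href"])))
    return links


sample = '<p>See <a href="/about">About us</a> and <a href="#top"> Top </a></p>'
print(extract_links(sample, "http://example.com/index.html"))

# With network access, the HTML could come from requests instead:
# import requests
# r = requests.get("http://example.com", timeout=10)
# r.raise_for_status()
# print(extract_links(r.text, r.url))
```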