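Python Regular Expressions in Practice

Extracting the Character Encoding of an HTML Document

Web pages usually declare their character encoding in a <meta> tag inside the <head> section. So the first step is to fetch the full HTML of the site: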

import re
import requests

url = 'http://www.baidu.com/'

html = requests.get(url)

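Next, use a regular expression to grab everything inside the <head> tag. The re.DOTALL flag is needed because <head> usually spans several lines and '.' does not match a newline by default:

head = re.findall(r'<head\b[^>]*>.*?</head>', html.text, re.DOTALL | re.IGNORECASE)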
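A caveat worth checking when matching <head>: by default `.` does not match a newline, and a real <head> usually spans several lines. The following is a minimal sketch using a made-up document (`sample_html` is an assumption for illustration, not the page fetched above) that shows what the `re.DOTALL` flag changes:

```python
import re

# Hypothetical multi-line document, used only for illustration
sample_html = """<html>
<head>
<meta charset="UTF-8">
<title>demo</title>
</head>
<body></body>
</html>"""

# Without re.DOTALL, '.' stops at newlines, so nothing matches
no_flag = re.findall(r'<head>.*</head>', sample_html)

# With re.DOTALL, '.' also matches newlines; '.*?' stops at the first </head>
with_flag = re.findall(r'<head>.*?</head>', sample_html, re.DOTALL)

print(no_flag)         # []
print(len(with_flag))  # 1
```

If a page happens to put its whole <head> on one line, the pattern without the flag still works, which is why the problem can go unnoticed.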

Note that an HTML document can declare its character encoding in one of two forms:

htmlstr1 = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
htmlstr2 = '<meta charset="UTF-8">'

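The common rule is that the encoding always comes right after 'charset='. However, in htmlstr1 the value is not preceded by a double quote ", while in htmlstr2 it is, so we need a way to ignore that optional quote.

Simply using alternation in an ordinary group, such as ("|) (a double quote or nothing), does not give the result we want: the pattern still matches, but the parentheses form a capturing group, so re.findall() returns tuples that include the quote group instead of just the encoding. This is where a non-capturing group, (?:...), comes in. (An optional quote, "?, would also work.)

For example, in (?:dog|cat) the | applies to the whole expressions on either side of it, within the group that starts at ?:. So (?:"|) matches either a double quote " or the empty string, which effectively ignores the quote. (?:|") accepts the same strings and only tries the empty alternative first.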
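To make the difference between the two kinds of group concrete, here is a small sketch that runs both patterns against the two sample strings from above. The variable names `capturing` and `non_capturing` are introduced only for this illustration:

```python
import re

htmlstr1 = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
htmlstr2 = '<meta charset="UTF-8">'

# Ordinary (capturing) group: findall returns one tuple per match
capturing = r'charset=("|)([\w-]+)'
# Non-capturing group: findall returns only the encoding
non_capturing = r'charset=(?:"|)([\w-]+)'

cap1 = re.findall(capturing, htmlstr1)
cap2 = re.findall(capturing, htmlstr2)
non1 = re.findall(non_capturing, htmlstr1)
non2 = re.findall(non_capturing, htmlstr2)

print(cap1)  # [('', 'utf-8')]
print(cap2)  # [('"', 'UTF-8')]
print(non1)  # ['utf-8']
print(non2)  # ['UTF-8']
```

Both patterns match the same text. The only difference is what re.findall() hands back, which is why the non-capturing form is more convenient here.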

The complete code is as follows:

import re
import requests

url = 'http://www.baidu.com/'

html = requests.get(url)
# re.DOTALL lets '.' match newlines, since <head> usually spans several lines
head = re.findall(r'<head\b[^>]*>.*?</head>', html.text, re.DOTALL | re.IGNORECASE)
# Note: re.findall() always returns a list
charset = re.findall(r'charset=(?:|")([\w-]+)', head[0])
print(charset[0])
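The script above indexes into the lists with [0], which raises an IndexError when a page has no <head> match or no charset declaration. As a hedged sketch, the same pattern can be wrapped in a small helper that returns None instead. `extract_charset` and `CHARSET_RE` are hypothetical names, not part of the original script, and the inputs below are made-up strings so the example runs without network access:

```python
import re

# Same pattern as above, compiled once; IGNORECASE also accepts 'CHARSET='
CHARSET_RE = re.compile(r'charset=(?:"|)([\w-]+)', re.IGNORECASE)


def extract_charset(html_text):
    """Return the declared charset in lowercase, or None if none is found."""
    match = CHARSET_RE.search(html_text)
    return match.group(1).lower() if match else None


# Made-up inputs for illustration
print(extract_charset('<head><meta charset="UTF-8"></head>'))  # utf-8
print(extract_charset(
    '<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />'
))  # gb2312
print(extract_charset('<head><title>no meta</title></head>'))  # None
```

Using search() instead of findall() returns only the first declaration, which is usually the one that matters.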
