I. Installing BeautifulSoup4
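Method 1: open cmd and run easy_install beautifulsoup4 (the package name "BeautifulSoup" installs the older 3.x series)
Method 2: download the package from http://www./software/BeautifulSoup/bs4/download/, open cmd, change into the downloaded directory, and run
python setup.py install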
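A quick way to confirm that the installation worked is to import the package and parse a trivial document. This is a minimal sketch; the one-line HTML string is a made-up sample, and "html.parser" is Python's built-in parser, chosen here so that no extra dependency is assumed.

```python
import bs4
from bs4 import BeautifulSoup

# bs4.__version__ holds the version string of the installed package.
print("bs4 version:", bs4.__version__)

# Parsing a tiny (hypothetical) document confirms the install works end to end.
soup = BeautifulSoup("<p>ok</p>", "html.parser")
print(soup.p.string)
```

If the import fails with ImportError, the package was most likely installed for a different Python interpreter than the one being run.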
II. Using BeautifulSoup4
1. Import
from bs4 import BeautifulSoup
Note: with BeautifulSoup 3.x the import is instead: from BeautifulSoup import BeautifulSoup
2. Example
HTML document:
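html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p>The Dormouse's story</p>
<p>Once upon a time there were three little sisters; and their names were
<a href="http:///elsie" id="link1">Elsie</a>,
<a href="http:///lacie" id="link2">Lacie</a> and
<a href="http:///tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p>...</p>
</body></html>
"""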
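Code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")
From here you can start using the various features.
soup.X
(X is any tag name; this returns the first matching tag in full, including its attributes and contents)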
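For example:
soup.title
# <title>The Dormouse's story</title>
soup.p
# <p>The Dormouse's story</p>
soup.a
(Note: only the first match is returned)
# <a href="http:///elsie" id="link1">Elsie</a>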
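The "first match only" behaviour of soup.X can be sketched on a smaller document. The markup below is a hypothetical sample rather than the html_doc above; it also shows that asking for a tag that does not exist yields None instead of raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document used only for this illustration.
doc = "<html><head><title>Demo</title></head><body><p>one</p><p>two</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

first_p = soup.p                 # only the first <p> is returned
title_text = soup.title.string   # the text inside <title>
missing = soup.table             # there is no <table>, so this is None

print(first_p)      # <p>one</p>
print(title_text)   # Demo
print(missing)      # None
```

Because a missing tag gives None, chained access such as soup.table.tr would fail with AttributeError, so it is worth checking for None first.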
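soup.find_all('a') (find_all returns every match)
# [<a href="http:///elsie" id="link1">Elsie</a>,
#  <a href="http:///lacie" id="link2">Lacie</a>,
#  <a href="http:///tillie" id="link3">Tillie</a>]
find can also search by attribute:
soup.find(id="link3")
# <a href="http:///tillie" id="link3">Tillie</a>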
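The difference between find and find_all can be sketched as follows. The document and its id and href values are invented for the example; limit and attrs are standard find_all/find parameters.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with three links.
doc = """
<p>
<a href="/a" id="first">A</a>
<a href="/b" id="second">B</a>
<a href="/c" id="third">C</a>
</p>
"""
soup = BeautifulSoup(doc, "html.parser")

all_links = soup.find_all("a")                   # list of every <a>
first_two = soup.find_all("a", limit=2)          # stop after two matches
by_id = soup.find(id="second")                   # first tag with this id
by_attrs = soup.find("a", attrs={"href": "/c"})  # filter on any attribute
absent = soup.find(id="nope")                    # no match: None, not an error

print(len(all_links))                      # 3
print([a.get_text() for a in first_two])   # ['A', 'B']
print(by_id.get_text(), by_attrs["id"])    # B third
print(absent)                              # None
```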
To read a particular attribute from each matching tag, combine find_all with get:
for link in soup.find_all('a'):
    print(link.get('href'))
# http:///elsie
# http:///lacie
# http:///tillie
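One detail worth illustrating is how get behaves when the attribute is absent. In this sketch (the two-link markup is a made-up sample) get returns None for a tag without href, while passing href=True to find_all keeps only tags that actually carry the attribute, which makes index-style access safe.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second <a> has no href attribute.
doc = '<a href="/x">with href</a><a name="anchor">without href</a>'
soup = BeautifulSoup(doc, "html.parser")

# get() returns None when the attribute is missing.
hrefs = [link.get("href") for link in soup.find_all("a")]

# href=True restricts the search to tags that have an href,
# so link["href"] cannot raise KeyError here.
only_real = [link["href"] for link in soup.find_all("a", href=True)]

print(hrefs)      # ['/x', None]
print(only_real)  # ['/x']
```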
To extract all the text in the HTML document, use get_text():
print(soup.get_text())
# The Dormouse's story
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...
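get_text() also accepts a separator and a strip flag, which help when the raw text runs together or carries stray whitespace. A small sketch on an invented fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with padding inside the first <p>.
doc = "<div><p> Hello </p><p>world<b>!</b></p></div>"
soup = BeautifulSoup(doc, "html.parser")

# Default: all text nodes concatenated exactly as they appear.
raw = soup.get_text()

# Each text node is stripped, then the pieces are joined with the separator.
joined = soup.get_text(separator="|", strip=True)

print(repr(raw))     # ' Hello world!'
print(repr(joined))  # 'Hello|world|!'
```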
To parse an HTML file from disk instead, pass an open file handle:
soup = BeautifulSoup(open("index.html"), "html.parser")
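When reading from disk it is usually tidier to open the file in a with block so the handle is closed afterwards. The sketch below first writes a throwaway index.html into a temporary directory so that it runs on its own; the file contents and the UTF-8 encoding are assumptions made for the example.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Create a throwaway file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<html><head><title>Local page</title></head></html>")

# BeautifulSoup reads the whole file while the object is being constructed,
# so the handle can safely be closed as soon as the with block ends.
with open(path, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

title = soup.title.string
print(title)  # Local page
```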
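Objects in BeautifulSoup
tag (corresponds to a tag in the HTML)
tag.attrs
(returns all of the tag's attributes as a dictionary)
A tag's attributes can be added, deleted, and modified directly, just like dictionary entries: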
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>
del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>
tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
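The same attribute operations can be run end to end on a standalone tag. In this sketch the starting markup (a blockquote with class "boldest") is an assumption made for illustration. Note that class is treated as a multi-valued attribute, so it is parsed into a list.

```python
from bs4 import BeautifulSoup

# Hypothetical starting markup for the illustration.
soup = BeautifulSoup(
    '<blockquote class="boldest">Extremely bold</blockquote>', "html.parser"
)
tag = soup.blockquote

before = dict(tag.attrs)       # class is multi-valued, so its value is a list

tag["class"] = "verybold"      # modify an existing attribute
tag["id"] = "1"                # add a new one
after_set = str(tag)

del tag["class"]               # delete attributes like dictionary keys
del tag["id"]
after_del = str(tag)

# get() takes a default, which avoids the KeyError raised by tag["class"].
fallback = tag.get("class", "no class")

print(before["class"])  # ['boldest']
print(after_set)        # <blockquote class="verybold" id="1">Extremely bold</blockquote>
print(after_del)        # <blockquote>Extremely bold</blockquote>
print(fallback)         # no class
```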
X.contents
(X is a tag; returns a list of the tag's direct children)
e.g.
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u'The Dormouse's story']
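To show how .contents relates to its neighbours, the sketch below compares it with .children (an iterator over the same direct children) and .string on an invented list fragment. .string is only set when a tag has exactly one child, so it is None for the list itself.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a list with two items.
soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul = soup.ul

items = ul.contents                                   # list of direct children
texts = [child.get_text() for child in ul.children]   # same nodes, as an iterator
first_string = ul.contents[0].string                  # <li> has a single text child
no_single_string = ul.string                          # <ul> has two children: None

print(len(items))         # 2
print(texts)              # ['one', 'two']
print(first_string)       # one
print(no_single_string)   # None
```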
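Fixing garbled characters when parsing a web page (this snippet targets Python 2 with BeautifulSoup 3; in bs4 the keyword is spelled from_encoding):
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen('http://www.')
soup = BeautifulSoup(page, fromEncoding="gb18030")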
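For bs4 on Python 3, the equivalent idea can be sketched without a network call by encoding a sample string to GB18030 bytes by hand. The sample markup is invented; from_encoding is the bs4 spelling of the fromEncoding argument, and in real use the bytes would come from something like urllib.request.urlopen(...).read().

```python
from bs4 import BeautifulSoup

# Hypothetical page body, encoded the way a GB18030 server might send it.
raw = "<html><body><p>你好,世界</p></body></html>".encode("gb18030")

# Telling the parser the encoding up front avoids a wrong guess,
# which is the usual cause of garbled characters.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gb18030")

text = soup.p.string
print(text)  # 你好,世界
```

from_encoding only applies when the input is bytes. If the markup has already been decoded to str, the argument is ignored.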