How to Use BeautifulSoup, the Python Web-Scraping Library - 创新互联

This article explains how to use BeautifulSoup, a Python library for web scraping, in detail and with worked examples.


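1. Introduction

BeautifulSoup is a flexible and convenient library for parsing web pages. It is efficient and supports several parsers, and it lets you extract information from a page without writing regular expressions.

Commonly used parsers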

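Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only XML parser supported	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	The most tolerant; parses documents the way a browser does; produces valid HTML5	Slow; requires an external Python package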
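
As a rough sketch of how one might choose between the parsers in the table at run time, the helper below tries lxml first and falls back to the built-in html.parser when lxml is not installed. The name make_soup and its preferred parameter are hypothetical; FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    `make_soup` and `preferred` are illustrative names, not part of bs4.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p>Hello<b>world")
print(soup.b.string)  # world
```

Because different parsers can build slightly different trees from the same broken markup, it is usually safer to name one parser explicitly once you know which ones are installed.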

2. Quick start

Create a BeautifulSoup object from an HTML document

from bs4 import BeautifulSoup
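html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')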

Print the whole document, formatted

print(soup.prettify())

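<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>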

Navigating the parsed structure

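print(soup.title)  # the <title> tag and its content
print(soup.title.name)  # the name of the <title> tag
print(soup.title.string)  # the string inside <title>
print(soup.title.parent.name)  # the name of <title>'s parent tag (head)
print(soup.p)  # the first <p> tag
print(soup.p['class'])  # the class of the first <p> tag
print(soup.a)  # the first <a> tag
print(soup.find_all('a'))  # all <a> tags
print(soup.find(id="link3"))  # the first tag with id="link3"

<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>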

Extract the link from every <a> tag

for link in soup.find_all('a'):
  print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
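
The loop above uses link.get('href') rather than link['href']. A minimal sketch of why that matters, using a made-up two-link fragment and html.parser (both are assumptions for illustration): get() returns None for a tag without the attribute, while passing href=True to find_all() skips such tags altogether.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second <a> has no href attribute.
doc = '<a href="http://example.com/elsie">Elsie</a><a>no link</a>'
soup = BeautifulSoup(doc, "html.parser")

# .get() returns None when the attribute is missing.
hrefs = [a.get("href") for a in soup.find_all("a")]
print(hrefs)  # ['http://example.com/elsie', None]

# href=True keeps only tags that actually carry an href attribute,
# so indexing with a["href"] is safe here.
with_href = [a["href"] for a in soup.find_all("a", href=True)]
print(with_href)  # ['http://example.com/elsie']
```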

Get all the text in the document

print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
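
The text returned by get_text() keeps the original whitespace. As a small sketch on a made-up fragment (the markup and variable names are assumptions), the separator and strip arguments of get_text(), or the stripped_strings generator, give tidier results.

```python
from bs4 import BeautifulSoup

snippet = "<p>Hello <b>world</b></p><p>bye</p>"  # hypothetical sample markup
soup = BeautifulSoup(snippet, "html.parser")

raw = soup.get_text()                            # strings joined as they are
tidy = soup.get_text(separator=" ", strip=True)  # each string stripped, joined by a space
pieces = list(soup.stripped_strings)             # the stripped strings one by one

print(repr(raw))   # 'Hello worldbye'
print(repr(tidy))  # 'Hello world bye'
print(pieces)      # ['Hello', 'world', 'bye']
```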

Completing missing tags and formatting the output

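# Sample document: the closing </body> and </html> tags are missing on purpose,
# and name="dromouse" is just a sample attribute value.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""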
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.prettify())  # format the document; missing tags are completed
print(soup.title.string)  # the content of the <title> tag
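
To see the tag completion on its own, here is a minimal sketch with a deliberately broken fragment. It assumes html.parser, which closes the open tags but does not wrap the result in <html> and <body> the way lxml does; the fragment itself is made up.

```python
from bs4 import BeautifulSoup

broken = "<p>Unclosed paragraph<b>bold text"  # neither tag is closed
soup = BeautifulSoup(broken, "html.parser")

# Serializing the tree shows the closing tags the parser supplied.
repaired = str(soup)
print(repaired)  # <p>Unclosed paragraph<b>bold text</b></p>
```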

Tag selectors

Selecting elements

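html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""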
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title)  # select the <title> tag
print(type(soup.title))  # check its type
print(soup.head)
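
Two details of attribute-style selection are easy to miss: soup.tagname returns only the first matching tag, and it returns None when no such tag exists. A small sketch on a made-up document (the markup is an assumption):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document with two <p> tags and no <table>.
doc = "<html><head><title>Demo</title></head><body><p>one</p><p>two</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

first_p = soup.p
print(isinstance(soup.title, Tag))  # True: selected elements are Tag objects
print(first_p)                      # <p>one</p>  (only the first <p>)
print(soup.table)                   # None       (no such tag in the document)
```

Checking for None before chaining further attribute access avoids an AttributeError on pages that lack the expected tag.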

Getting a tag's name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title.name)

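Getting a tag's attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.attrs['name'])  # the value of the name attribute of the first <p> tag
print(soup.p['name'])  # a more direct way to write the same thing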
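
As a brief illustration of what attrs holds, the sketch below uses a made-up <p> tag (an assumption for the example). Note that class is a multi-valued attribute, so its value is a list, and that get() is the safe way to read an attribute that may be absent.

```python
from bs4 import BeautifulSoup

doc = '<p class="title big" name="dromouse">text</p>'  # hypothetical tag
soup = BeautifulSoup(doc, "html.parser")
p = soup.p

print(p.attrs)      # {'class': ['title', 'big'], 'name': 'dromouse'}
print(p["class"])   # ['title', 'big']  (multi-valued attribute: a list)
print(p["name"])    # dromouse
print(p.get("id"))  # None; p["id"] would raise KeyError instead
```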

Getting a tag's content

print(soup.p.string)
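
One caveat, sketched on two made-up fragments: .string only returns a value when the tag contains a single string (possibly nested inside a single child tag). When the tag has several children it returns None, and get_text() is the usual alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragments for comparison.
single = BeautifulSoup("<p><b>one</b></p>", "html.parser").p
mixed = BeautifulSoup("<p>a<b>b</b></p>", "html.parser").p

print(single.string)     # one   (a single child: .string looks inside it)
print(mixed.string)      # None  (two children: ambiguous, so None)
print(mixed.get_text())  # ab    (all the text, concatenated)
```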

Nested selection

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.head.title.string)

Children and descendants

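html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""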


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.contents)  # the direct children of the tag, as a list

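Another way, children:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.children)  # an iterator over the direct children of the tag
for i, child in enumerate(soup.p.children):  # i is the index, child is the node
    print(i, child)

The output is the same as above, with an index added. Note that children returns only an iterator, so you have to loop over it (or pass it to list()) to see the child nodes.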
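
A compact side-by-side of the two, on a made-up fragment (the markup is an assumption): contents is an ordinary list that can be indexed, while children has to be consumed by iteration. Text between tags counts as a child node too.

```python
from bs4 import BeautifulSoup

doc = "<p>Hello <a>link</a> end</p>"  # hypothetical fragment
p = BeautifulSoup(doc, "html.parser").p

as_list = p.contents             # a real list: indexing and len() work
from_iter = list(p.children)     # iterator, materialized with list()

print(len(as_list))              # 3: 'Hello ', the <a> tag, ' end'
print(as_list[1])                # <a>link</a>
print(from_iter == as_list)      # True: same nodes, same order
```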

Getting descendants:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.descendants)  # an iterator over all descendants of the tag
for i, child in enumerate(soup.p.descendants):  # i is the index, child is the node
    print(i, child)
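
The difference between children and descendants shows up as soon as tags are nested. In this sketch (the fragment is made up for illustration) the <p> tag has one direct child but three descendants: the <a> tag, the <span> inside it, and the string inside the <span>.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = "<p><a><span>Elsie</span></a></p>"  # hypothetical nested fragment
p = BeautifulSoup(doc, "html.parser").p

children = list(p.children)
descendants = list(p.descendants)

print(len(children))     # 1: only <a>
print(len(descendants))  # 3: <a>, <span>, and the string 'Elsie'

# Keep only the tags among the descendants, in document order.
tag_names = [d.name for d in descendants if isinstance(d, Tag)]
print(tag_names)         # ['a', 'span']
```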

Parent and ancestors

parent

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.a.parent)  # the parent of the first <a> tag

parents

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(list(enumerate(soup.a.parents)))  # all ancestors of the first <a> tag
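
Printing whole ancestors produces a lot of output, since each one contains everything below it. A lighter sketch, on a made-up document, prints only the ancestor names; the last entry is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

doc = "<html><body><p><a>x</a></p></body></html>"  # hypothetical document
soup = BeautifulSoup(doc, "html.parser")

print(soup.a.parent.name)  # p

# Walk upward from <a> to the top of the tree.
ancestor_names = [ancestor.name for ancestor in soup.a.parents]
print(ancestor_names)      # ['p', 'body', 'html', '[document]']
```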

Siblings

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(list(enumerate(soup.a.next_siblings)))  # the siblings after the first <a> tag
print(list(enumerate(soup.a.previous_siblings)))  # the siblings before the first <a> tag
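
Siblings include text nodes, not just tags, which is why the lists printed above contain strings and whitespace. A minimal sketch on a made-up fragment, with a filter that keeps only the tags:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = '<p>start<a id="1">A</a>, <a id="2">B</a> end</p>'  # hypothetical fragment
soup = BeautifulSoup(doc, "html.parser")
first = soup.a

after = list(first.next_siblings)       # [', ', <a id="2">B</a>, ' end']
before = list(first.previous_siblings)  # ['start']

# Keep only the sibling tags and drop the strings between them.
tags_after = [s for s in after if isinstance(s, Tag)]
print(len(after), len(tags_after))  # 3 1
print(tags_after[0]["id"])          # 2
```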

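Standard selectors

find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content. In Beautiful Soup 4.4.0 and later the text argument is called string; text still works as an older alias.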

name

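html = '''
<div>
    <h4>Hello</h4>
    <ul id="list-1" name="elements">
        <li>Foo</li><li>Bar</li><li>Jay</li>
    </ul>
    <ul id="list-2">
        <li>Foo</li><li>Bar</li>
    </ul>
</div>
'''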

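Searching with attrs by either of two attributes (for example id or name) finds the same content when both attributes belong to the same tag.

The text argument is therefore handy for matching content, but less convenient for locating elements, because it returns the matching strings rather than the tags that contain them.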
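
The three kinds of search can be sketched together as follows. The fragment, its ids, and its class names are hypothetical, html.parser is assumed, and the string argument is used in place of text (this needs Beautiful Soup 4.4.0 or later).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the ids, name, and class values are illustrative only.
doc = """
<ul id="list-1" name="elements"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul id="list-2"><li class="element">Foo</li></ul>
"""
soup = BeautifulSoup(doc, "html.parser")

by_tag = soup.find_all("li")                              # search by tag name
by_id = soup.find_all(attrs={"id": "list-1"})             # search by one attribute
by_name_attr = soup.find_all(attrs={"name": "elements"})  # another attribute, same tag
by_class = soup.find_all(class_="element")                # class_ because class is a keyword
by_text = soup.find_all(string="Foo")                     # returns strings, not tags

print(len(by_tag))                    # 3
print(by_id[0] is by_name_attr[0])    # True: both searches found the same <ul>
print(by_text)                        # ['Foo', 'Foo']
print(by_text[0].parent.name)         # li  (step up to reach the enclosing tag)
```

Going through .parent, as in the last line, is one way to get from a matched string back to the element that holds it.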

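find() is used exactly like find_all(), but it returns only the first match instead of a list of all matches.

find_next_siblings() returns all the following siblings; find_next_sibling() returns the first following sibling.

find_previous_siblings() returns all the preceding siblings; find_previous_sibling() returns the first preceding sibling.

find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.

find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.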
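
A short comparison of the two, including what each returns when nothing matches (the fragment is made up for illustration): find() gives None, while find_all() gives an empty list.

```python
from bs4 import BeautifulSoup

doc = "<ul><li>Foo</li><li>Bar</li></ul>"  # hypothetical fragment
soup = BeautifulSoup(doc, "html.parser")

first = soup.find("li")           # a single Tag
everything = soup.find_all("li")  # a list of Tags

print(first)                   # <li>Foo</li>
print(len(everything))         # 2
print(soup.find("table"))      # None
print(soup.find_all("table"))  # []
```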
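
The directional methods listed above can be tried on a small fragment. This is a sketch under assumptions: the markup and ids are made up, html.parser is used, and each call is given an explicit tag name to search for.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: three <a> tags and one <b> inside a <p>.
doc = '<p><a id="1">A</a><b>x</b><a id="2">B</a><a id="3">C</a></p>'
soup = BeautifulSoup(doc, "html.parser")
middle = soup.find(id="2")

next_a = middle.find_next_sibling("a")            # first following sibling <a>
prev_as = middle.find_previous_siblings("a")      # all preceding sibling <a> tags
nearest_b = middle.find_previous("b")             # first <b> earlier in the document
later_as = middle.find_all_next("a")              # every <a> later in the document

print(next_a["id"])                   # 3
print([t["id"] for t in prev_as])     # ['1']
print(nearest_b.string)               # x
print([t["id"] for t in later_as])    # ['3']
```

The sibling methods stay on the same level of the tree, whereas find_next(), find_previous() and their find_all variants follow document order and so can also reach nodes at other depths.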
