
python bs4 lxml

Python Html Parsing

Python parsing libraries: bs4, lxml, pyquery, html.parser

Task list: :smile:

Draft
Re-read
Notes
Done

Overview

Tools in Python for parsing HTML, XML, and XHTML. Also related: XPath.

Common choices:

bs4 (BeautifulSoup)
lxml
pyquery
html.parser, built into Python
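
As a point of comparison for the libraries listed above, here is a minimal sketch of the standard-library html.parser used on its own, without bs4. The class name LinkCollector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> start tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed('<p><a href="/one">1</a> and <a href="/two">2</a></p>')
links = collector.links
print(links)
```

The event-callback style is workable for small jobs, but it builds no tree, which is the main reason to reach for bs4 or lxml.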

BeautifulSoup

Installing Beautiful Soup

$ pip install beautifulsoup4

Installing a parser

html.parser
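- built in, decent speed, lenient; less lenient than html5lib, and noticeably less lenient on Python versions before 2.7.3 / 3.2.2
html5lib
- slow, the most lenient, produces valid HTML5, parses the page the way a browser does
lxml
- fast, lenient, requires an external C library
lxml-xml (xml)
- fast, the only parser in this list that supports XML, requires an external C library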

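Parser usage: BeautifulSoup(markup, "html.parser"). The second argument names the parser: "html.parser", "html5lib", "lxml", or "lxml-xml" (alias "xml").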
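
A small sketch of choosing a parser by name. The helper make_soup and the sample markup are hypothetical; the fallback relies on bs4 raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<p>Hello<b>world"  # deliberately unclosed tags

# html.parser ships with Python, so this needs no extra install.
soup = BeautifulSoup(markup, "html.parser")
first_b_text = soup.b.get_text()
print(first_b_text)


def make_soup(markup, preferred="lxml"):
    """Try the preferred parser; fall back to html.parser if it is missing."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        return BeautifulSoup(markup, "html.parser")


fallback_soup = make_soup(markup)
print(fallback_soup.p.get_text())
```

Different parsers can repair broken markup differently, so it is worth naming the parser explicitly rather than relying on whichever one happens to be installed.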

Kinds of objects

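bs4 has four kinds of objects:

Tag : an HTML/XML tag
- tag.name - str
- tag.attrs - dict
- tag.get(name) - the value of one attribute, or None if the attribute is absent
NavigableString : the text inside a Tag
BeautifulSoup : the parsed document as a whole
Comment : a comment in the markup (a special kind of NavigableString)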
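
The four object kinds can be seen side by side in a short sketch. The markup is invented for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

html = '<p id="intro" class="lead">Hello<!-- a note --></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
print(isinstance(soup, BeautifulSoup))  # the whole parsed document
print(isinstance(tag, Tag))             # a single element
print(tag.name)                         # tag name as a str
print(tag.attrs)                        # attributes as a dict
print(tag.get("id"))                    # one attribute value
print(tag.get("missing"))               # None when the attribute is absent

# The <p> has two children: a text node and a comment node.
text, note = tag.contents
print(isinstance(text, NavigableString))
print(isinstance(note, Comment))
```

Note that multi-valued attributes such as class come back as a list in tag.attrs, while id stays a plain string.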

In bs4, filter on the CSS class with the class_ keyword argument, because class is a reserved word in Python.
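
A minimal sketch of class_ in use, with made-up markup. The attrs dictionary form is shown as an equivalent spelling.

```python
from bs4 import BeautifulSoup

html = """
<div class="card featured">A</div>
<div class="card">B</div>
<div class="note">C</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any element that has "card" among its classes.
cards = soup.find_all("div", class_="card")
card_texts = [div.get_text() for div in cards]
print(card_texts)

# The same filter written with an attrs dict, which avoids the keyword clash too.
same = soup.find_all("div", attrs={"class": "card"})
same_texts = [div.get_text() for div in same]
print(same_texts)
```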

Ways to find data

CSS selectors: soup.select("p.strikeout.body")
Searching the tree: find, find_all
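
The two approaches can be compared on the same document. This is a sketch with invented markup; select needs the soupsieve package, which is normally installed together with beautifulsoup4.

```python
from bs4 import BeautifulSoup

html = """
<body>
  <p class="strikeout body">first</p>
  <p class="body">second</p>
  <p class="strikeout">third</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector: <p> elements that carry BOTH classes.
both = soup.select("p.strikeout.body")
both_texts = [p.get_text() for p in both]
print(both_texts)

# find returns the first match (or None); find_all returns every match.
first_p = soup.find("p")
all_p = soup.find_all("p")
missing = soup.find("table")
print(first_p.get_text(), len(all_p), missing)
```

Because find returns None when nothing matches, it is worth checking the result before chaining further calls on it.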

Searching the Tree

Searching by CSS class

Search

by id: results = soup.find(id='ResultsContainer')
by class: job_elems = results.find_all('section', class_='card-content')
Extract Text From HTML Elements: title_elem.text
Find Elements by Tag Name and Text Content: python_jobs = results.find_all('h2', string='Python Developer')
Pass a Function to a Beautiful Soup Method: python_jobs = results.find_all('h2', string=lambda text: text is not None and 'python' in text.lower())
Extract Attributes From HTML Elements: link = p_job.find('a')['href']
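
The search calls above can be put together into one runnable sketch. The HTML here is a made-up stand-in for a job-listing page, and the example.com URLs are placeholders; only the ids, class names, and variable names from the list above are reused.

```python
from bs4 import BeautifulSoup

html = """
<div id="ResultsContainer">
  <section class="card-content">
    <h2 class="title">Python Developer</h2>
    <a href="https://example.com/jobs/1">Apply</a>
  </section>
  <section class="card-content">
    <h2 class="title">Senior Python Engineer</h2>
    <a href="https://example.com/jobs/2">Apply</a>
  </section>
  <section class="card-content">
    <h2 class="title">Java Developer</h2>
    <a href="https://example.com/jobs/3">Apply</a>
  </section>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# By id, then by class within that container.
results = soup.find(id="ResultsContainer")
job_elems = results.find_all("section", class_="card-content")

# Exact text match: only headings whose text is exactly this string.
exact = results.find_all("h2", string="Python Developer")

# Function match: guard against None, which is what tags without
# a single text child pass to the function.
python_jobs = results.find_all(
    "h2", string=lambda text: text is not None and "python" in text.lower()
)
titles = [h2.text for h2 in python_jobs]

# Walk up to the enclosing card, then read the link's href attribute.
links = [h2.find_parent("section").find("a")["href"] for h2 in python_jobs]
print(titles)
print(links)
```

The exact-string form misses "Senior Python Engineer", which is why the function form is the more forgiving choice for keyword searches.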

References

Library
