Python HTML Parsing
Python parsing libraries: bs4, lxml, pyquery, html.parser
Overview
Tools in Python for parsing HTML, XML, and XHTML.
Other: XPath
Common options:
- bs4 (BeautifulSoup)
- lxml
- pyquery
- html.parser, which is built into Python
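For comparison with the third-party libraries, here is a minimal sketch of the built-in html.parser used on its own. The LinkCollector class and the sample markup are hypothetical, made up for illustration; only HTMLParser, feed, and handle_starttag come from the standard library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<p><a href="/a">A</a> and <a href="/b">B</a></p>')
links = collector.links
```

html.parser is event-driven: it calls handler methods as it reads the markup and does not build a tree, which is one reason to layer bs4 on top of it.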
BeautifulSoup
Installing Beautiful Soup
$ pip install beautifulsoup4
Installing a parser
- html.parser
- html5lib
- Slow, but the most lenient; parses the page the same way a web browser does and produces valid HTML5
- lxml
- lxml-xml (xml)
- Fast; the only parser here that supports XML; depends on a C library
Parser usage: BeautifulSoup(markup, "html.parser")
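A minimal sketch of choosing a parser by name. The markup string is invented for illustration, and only "html.parser" is actually run here because it needs no extra install.

```python
from bs4 import BeautifulSoup

markup = "<p>Hello, <b>world</b></p>"  # sample markup, made up for this note

# The second argument names the parser; "html.parser" ships with Python.
soup = BeautifulSoup(markup, "html.parser")
bold_text = soup.b.get_text()
full_text = soup.p.get_text()

# The other parsers are selected the same way once their packages are installed:
# BeautifulSoup(markup, "lxml")
# BeautifulSoup(markup, "html5lib")
# BeautifulSoup(markup, "lxml-xml")
```

Different parsers can build different trees from the same broken markup, so it is probably safest to name the parser explicitly.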
Kinds of objects
bs4 has four kinds of objects:
Tag
: a tag (element) in the markup
- tag.name - str
- tag.attrs - dict
- tag.get(name) - the attribute's value, or None if the attribute is missing
NavigableString
: the text inside a Tag
BeautifulSoup
: the parsed document as a whole
Comment
: a comment in the XHTML markup, e.g. <!-- I am a comment -->
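The four kinds of objects can be seen in one small document. This is a sketch on made-up markup; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

markup = '<p id="intro" class="lead">Hello<!-- I am a comment --></p>'
soup = BeautifulSoup(markup, "html.parser")  # BeautifulSoup: the whole document

tag = soup.p                # Tag
tag_name = tag.name         # str
tag_attrs = tag.attrs       # dict; class is multi-valued, so it is stored as a list
intro_id = tag.get("id")    # attribute value
missing = tag.get("title")  # None, because the attribute is absent

# The <p> holds two children: a text node and a comment node.
text_node, comment_node = tag.contents
```

Comment is a subclass of NavigableString, so an isinstance check against NavigableString alone will also match comments.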
In bs4, class is written as class_ to avoid clashing with the Python keyword of the same name.
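A short sketch of the class_ keyword argument on invented markup. The attrs dict form is shown as an alternative spelling of the same filter.

```python
from bs4 import BeautifulSoup

markup = '<p class="body strikeout">old</p><p class="body">new</p>'
soup = BeautifulSoup(markup, "html.parser")

# class is a reserved word in Python, so the keyword argument is class_.
body_paragraphs = soup.find_all("p", class_="body")

# The same filter written with an attrs dict, where the plain name is fine.
same_paragraphs = soup.find_all("p", attrs={"class": "body"})

body_texts = [p.get_text() for p in body_paragraphs]
same_texts = [p.get_text() for p in same_paragraphs]
```

A tag matches when any one of its classes equals the given value, so both paragraphs are returned here.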
Ways to find data
- Navigating the tree: use a CSS selector, e.g. soup.select("p.strikeout.body")
- Searching the tree: find, find_all
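The two approaches side by side, as a sketch on made-up markup: select takes a CSS selector, while find and find_all take a tag name plus filters.

```python
from bs4 import BeautifulSoup

markup = """
<div id="main">
  <p class="strikeout body">first</p>
  <p class="body">second</p>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")

# CSS selector: <p> tags that carry both classes, returned as a list
selected = soup.select("p.strikeout.body")

first_p = soup.find("p")        # first match only
all_p = soup.find_all("p")      # every match, as a list
no_table = soup.find("table")   # None when nothing matches
```

Because find returns None on a miss, it is worth checking the result before chaining further calls on it.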
Searching the Tree
Searching by CSS class
Search
- By id: results = soup.find(id='ResultsContainer')
- By class: job_elems = results.find_all('section', class_='card-content')
- Extract Text From HTML Elements: title_elem.text
- Find Elements by Class Name and Text Content: python_jobs = results.find_all('h2', string='Python Developer')
- Pass a Function to a Beautiful Soup Method: python_jobs = results.find_all('h2', string=lambda text: 'python' in text.lower())
- Extract Attributes From HTML Elements: link = p_job.find('a')['href']
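The search calls above, put together in one runnable sketch. The HTML is a hypothetical job board invented for illustration (the example.com URLs are placeholders), and the lambda adds a None check as a precaution, since an element with nested tags may have no single string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; ids and class names mirror the calls listed above.
html = """
<div id="ResultsContainer">
  <section class="card-content">
    <h2>Python Developer</h2>
    <a href="https://example.com/jobs/1">Apply</a>
  </section>
  <section class="card-content">
    <h2>Senior python engineer</h2>
    <a href="https://example.com/jobs/2">Apply</a>
  </section>
  <section class="card-content">
    <h2>Accountant</h2>
    <a href="https://example.com/jobs/3">Apply</a>
  </section>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

results = soup.find(id="ResultsContainer")                        # by id
job_elems = results.find_all("section", class_="card-content")   # by class
titles = [job.find("h2").text for job in job_elems]               # extract text

# Exact text match
exact = results.find_all("h2", string="Python Developer")

# Function match: case-insensitive substring test, guarded against None
python_jobs = results.find_all(
    "h2", string=lambda text: text is not None and "python" in text.lower()
)
python_titles = [h2.text for h2 in python_jobs]

# Extract an attribute from each job's link
links = [job.find("a")["href"] for job in job_elems]
```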
Reference links
- Scraping YouTube information with Python - illustrated course
- RealPython: Beautiful Soup: Build a Web Scraper With Python
Library
- bs4.doc
- lxml
- pyquery
- html.parser.doc
- jusText