
I found lxml.html a lot easier to work with than bs4, in case that helps anyone else.

https://lxml.de/lxmlhtml.html
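A minimal sketch of the lxml.html style (the markup here is made up for illustration):

```python
from lxml import html

# Parse an HTML fragment into an lxml element tree
doc = html.fromstring("<p>Hello <a href='/x'>world</a></p>")

print(doc.xpath("//a/@href"))  # attribute values via XPath
print(doc.text_content())      # all text content, tags stripped
```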



On the off chance you were not aware, bs4 also supports[0] getting parse events from html5lib[1], which (as its name implies) is far more likely to parse the text the same way a browser would.

0: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index....

1: https://pypi.org/project/html5lib/
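For example (markup made up; assumes html5lib is installed alongside bs4):

```python
from bs4 import BeautifulSoup

# html5lib applies the HTML5 spec's error-recovery rules, so malformed
# markup is repaired the same way a browser would repair it
soup = BeautifulSoup("<p>unclosed <b>tag", "html5lib")
print(soup.b.get_text())
```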


BeautifulSoup is an API over multiple parsers (https://beautiful-soup-4.readthedocs.io/en/latest/#installin...):

  BeautifulSoup(markup, "html.parser") 
  BeautifulSoup(markup, "lxml")
  BeautifulSoup(markup, "lxml-xml")
  BeautifulSoup(markup, "xml") 
  BeautifulSoup(markup, "html5lib")
Per "Pyquery, lxml, BeautifulSoup comparison" (https://gist.github.com/MercuryRising/4061368), lxml with XPath still looks like the fastest on Python 3.10.4; which is fine for parsing (X)HTML(5) that validates.

(EDIT: Is xml/html5 a good format for data serialization? defusedxml ... Simdjson, Apache arrow.js)
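The gist times repeated element selection; a rough sketch of the lxml + XPath variant (the markup and iteration count here are made up):

```python
import time
from lxml import html

# Build a synthetic page with a known number of matching elements
page = "<html><body>" + "<p class='row'>cell</p>" * 100 + "</body></html>"
tree = html.fromstring(page)

start = time.perf_counter()
for _ in range(1000):
    rows = tree.xpath("//p[@class='row']")
elapsed = time.perf_counter() - start
print(len(rows), f"{elapsed:.3f}s")
```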


I was curious, so I tried that performance test you linked to on my machine with the various parsers:

    ==== Total trials: 100000 =====
    bs4 lxml total time: 110.9
    bs4 html.parser total time: 87.6
    bs4 lxml-xml total time: 0.5
    bs4 xml total time: 0.5
    bs4 html5lib total time: 103.6
    pq total time: 8.7
    lxml (cssselect) total time: 8.8
    lxml (xpath) total time: 5.6
    regex total time: 13.8 (doesn't find all p)
bs4 is damn fast with the lxml-xml or xml parsers.


You want a proper HTML5 parser that can handle invalid documents. The fastest one is https://github.com/kovidgoyal/html5-parser, over 30x faster than html5lib.


Same here. I can't properly quantify it, but there was something about the soup API I didn't really like.

It may have been because I learned on the stdlib xml.etree library (I moved to lxml because it has the same API but is faster and knows about parent nodes) and had a hard time with the soup API.
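The parent-node difference in one line (stdlib ElementTree elements carry no parent pointer; lxml's do):

```python
from lxml import etree

root = etree.fromstring("<root><child/></root>")
child = root[0]
# lxml keeps a parent link; stdlib xml.etree offers no equivalent method
print(child.getparent().tag)
```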

But I think it was the way it overloaded the selectors: I didn't like the way you could magically find elements. I may have to revisit it and try to figure out why, and whether I still dislike it.



