On the off chance you were not aware, bs4 also supports[0] getting parse events from html5lib[1], which (as its name implies) is far more likely to parse the text the same way a browser would.
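For example, html5lib can be selected as the tree builder when constructing the soup; a minimal sketch, assuming beautifulsoup4 and html5lib are both installed:

    from bs4 import BeautifulSoup

    html = "<p>an <b>unclosed tag"  # deliberately malformed

    # passing "html5lib" makes bs4 build the tree with html5lib,
    # which repairs the markup the way a browser would
    soup = BeautifulSoup(html, "html5lib")
    print(soup.p)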
Looks like lxml w/ XPath is still the fastest with Python 3.10.4, per "Pyquery, lxml, BeautifulSoup comparison" https://gist.github.com/MercuryRising/4061368 ; which is fine for parsing (X)HTML(5) that validates.
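The pattern that benchmark times looks roughly like this (the document string here is just an illustration):

    import lxml.html

    doc = lxml.html.fromstring("<html><body><p>a</p><p>b</p></body></html>")
    paragraphs = doc.xpath("//p")  # XPath lookup, the fastest in the benchmark
    print([p.text for p in paragraphs])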
(EDIT: Is XML/HTML5 a good format for data serialization? defusedxml ... simdjson, Apache arrow.js)
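On the defusedxml part of that aside: it is a drop-in replacement for the stdlib XML parsers that refuses entity-expansion tricks. A minimal sketch with an illustrative payload:

    from defusedxml.ElementTree import fromstring

    # same API as xml.etree.ElementTree.fromstring, but rejects
    # billion-laughs / external-entity payloads instead of expanding them
    root = fromstring("<data><item>1</item></data>")
    print(root.find("item").text)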
I was curious, so I tried that performance test you linked to on my machine with the various parsers (a minimal sketch of such a harness follows the numbers):
===== Total trials: 100000 =====
bs4 lxml total time: 110.9
bs4 html.parser total time: 87.6
bs4 lxml-xml total time: 0.5
bs4 xml total time: 0.5
bs4 html5lib total time: 103.6
pq total time: 8.7
lxml (cssselect) total time: 8.8
lxml (xpath) total time: 5.6
regex total time: 13.8 (doesn't find all p)
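For context, the harness is shaped roughly like this; a minimal sketch rather than the gist's actual code (page.html and the two entries stand in for the full parser list):

    import time
    from bs4 import BeautifulSoup
    import lxml.html

    html = open("page.html").read()  # assumed local copy of the test page

    def bench(name, parse, trials=100000):
        start = time.time()
        for _ in range(trials):
            parse()
        print(f"{name} total time: {time.time() - start:.1f}")

    bench("bs4 lxml", lambda: BeautifulSoup(html, "lxml").find_all("p"))
    bench("lxml (xpath)", lambda: lxml.html.fromstring(html).xpath("//p"))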
Same here. I can't properly quantify it, but there was something about the soup API I never really liked.
It may be because I learned on xml.etree from the Python standard library (I moved to lxml because it has the same API but is faster and knows about parent nodes), and so had a hard time with the soup API.
But I think it was the way it overloaded the selectors: I did not like the way you could magically find elements. I may have to revisit it and figure out why, and whether I still dislike it. (The contrast is sketched below.)
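To make the contrast concrete, a small side-by-side, assuming the toy HTML shown, of the explicit lxml/ElementTree style versus soup's attribute-style traversal:

    from bs4 import BeautifulSoup
    import lxml.html

    html = "<html><body><p class='x'>hi</p></body></html>"

    # lxml / ElementTree style: every lookup is an explicit query
    tree = lxml.html.fromstring(html)
    print(tree.xpath("//p[@class='x']")[0].text)

    # soup style: tag names double as attributes, and find() is
    # overloaded to take names, keyword filters, strings, or callables
    soup = BeautifulSoup(html, "lxml")
    print(soup.body.p.string)
    print(soup.find("p", class_="x").string)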
https://lxml.de/lxmlhtml.html