
When you add unicode, it can become more complicated (or rather, cumbersome)...


Well, you just need to call unicode_next_character all the time instead of saying s++, similarly for whitespace, similarly for asking whether a character can initiate or continue an identifier, etc. It does not change the basic nature of the task at all.
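To make that concrete, here is a minimal sketch of an identifier scanner in C that steps one code point at a time instead of doing `s++`. Everything in it is an assumption for illustration: `next_cp` stands in for the unicode_next_character mentioned above, the input is taken to be well-formed, NUL-terminated UTF-8 (no validation is done), and the identifier rule (ASCII letters, `_`, digits after the first character, plus any non-ASCII code point) is a placeholder rather than any particular language's rule.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from well-formed UTF-8 at s, store it in *cp,
   and return a pointer to the byte after it. No validation is done. */
static const char *next_cp(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    if (p[0] < 0x80) {                       /* 1 byte: ASCII */
        *cp = p[0];
        return s + 1;
    }
    if (p[0] < 0xE0) {                       /* 2-byte sequence */
        *cp = ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
        return s + 2;
    }
    if (p[0] < 0xF0) {                       /* 3-byte sequence */
        *cp = ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
        return s + 3;
    }
    /* 4-byte sequence */
    *cp = ((p[0] & 0x07u) << 18) | ((p[1] & 0x3Fu) << 12)
        | ((p[2] & 0x3Fu) << 6) | (p[3] & 0x3Fu);
    return s + 4;
}

/* Placeholder rule: ASCII letter, '_' or any non-ASCII code point. */
static int is_ident_start(uint32_t c)
{
    return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c >= 0x80;
}

/* Placeholder rule: anything that can start an identifier, or an ASCII digit. */
static int is_ident_continue(uint32_t c)
{
    return is_ident_start(c) || (c >= '0' && c <= '9');
}

/* Return the length in bytes of the identifier starting at s, or 0 if none. */
static size_t scan_identifier(const char *s)
{
    const char *p = s;
    uint32_t c;
    const char *next = next_cp(p, &c);
    if (!is_ident_start(c))
        return 0;
    do {
        p = next;                 /* accept the code point just examined */
        next = next_cp(p, &c);    /* peek at the following one */
    } while (is_ident_continue(c));
    return (size_t)(p - s);
}
```

The loop has the same shape as the ASCII version; only the "advance" and "classify" steps changed. The terminating NUL decodes as code point 0, which is not an identifier character, so the scan stops there.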


Sure, but if you are using a language without Unicode support, and you don't use a dedicated library (which would already be using a kind of lexer, wouldn't it?), you also have to parse these Unicode characters yourself.


A unicode_next_character function is very simple to write regardless of unicode support in your language.

I usually write it as a small (256-byte) lookup table, indexed by the current byte, where each entry tells you how many bytes to skip to reach the next character. If you don't use a lookup table, it's 4 single-line `if` statements (and if you do, it's a one-liner, plus however many lines the table takes).
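A rough sketch of both variants in C, assuming the text is UTF-8 (the comment above doesn't say which encoding, so that is a guess). The names `utf8_len_from_lead`, `utf8_skip` and `utf8_skip_init` are made up for illustration. To keep it short, the table is filled at startup from the `if` version rather than written out as 256 literals. It assumes well-formed, NUL-terminated input: a truncated sequence at the end of the string would make it step past the terminator.

```c
#include <stddef.h>
#include <stdint.h>

/* The "4 single-line ifs" variant: byte length of the UTF-8 sequence
   introduced by lead byte b. Continuation and invalid bytes return 1
   so that a scanner always makes progress. */
static size_t utf8_len_from_lead(uint8_t b)
{
    if (b < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;                          /* stray continuation or invalid byte */
}

/* The lookup-table variant: one entry per possible byte value. */
static uint8_t utf8_skip[256];

/* Fill the table once, before the first call to unicode_next_character. */
static void utf8_skip_init(void)
{
    for (int b = 0; b < 256; b++)
        utf8_skip[b] = (uint8_t)utf8_len_from_lead((uint8_t)b);
}

/* The one-liner: advance to the start of the next character. */
static const char *unicode_next_character(const char *s)
{
    return s + utf8_skip[(uint8_t)*s];
}
```

This only skips over characters. A lexer that needs to classify them (whitespace, identifier start, and so on) also has to decode the code point, which is a few more shifts and masks per sequence length.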


There's free source code for basic Unicode operations all over the internet.


Here's a video of Rob Pike, one of the creators of Go, talking about a lexer he built for the Go text/template package:

https://www.youtube.com/watch?v=HxaD_trXwRE

It supports unicode.



