When you add unicode, it can become more complicated (or rather, cumbersome)...

jblow · on June 25, 2016

Well, you just need to call unicode_next_character all the time instead of saying s++, similarly for whitespace, similarly for asking whether a character can initiate or continue an identifier, etc. It does not change the basic nature of the task at all.
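As a rough sketch of what that looks like for UTF-8 input: the name `unicode_next_character` comes from the comment above, but its signature here is an assumption, as are the hypothetical helpers `can_start_identifier`, `can_continue_identifier`, and `scan_identifier`. The identifier rule (ASCII letters, `_`, and anything non-ASCII) is a toy rule chosen for illustration, and the decoder does not reject overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed signature: decode one UTF-8 code point starting at s, store it
   in *out, and return a pointer just past it. On a malformed sequence it
   yields U+FFFD and advances a single byte. Expects a NUL-terminated
   string. */
static const char *unicode_next_character(const char *s, uint32_t *out)
{
    assert(s != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)s;
    uint32_t c = p[0];
    int extra;

    if (c < 0x80)                { extra = 0; }            /* ASCII        */
    else if ((c & 0xE0) == 0xC0) { c &= 0x1F; extra = 1; } /* 2-byte lead  */
    else if ((c & 0xF0) == 0xE0) { c &= 0x0F; extra = 2; } /* 3-byte lead  */
    else if ((c & 0xF8) == 0xF0) { c &= 0x07; extra = 3; } /* 4-byte lead  */
    else { *out = 0xFFFD; return s + 1; }                  /* invalid lead */

    for (int i = 1; i <= extra; i++) {
        /* A NUL or any other non-continuation byte stops the read here,
           so the loop never runs past the terminator. */
        if ((p[i] & 0xC0) != 0x80) { *out = 0xFFFD; return s + 1; }
        c = (c << 6) | (p[i] & 0x3F);
    }
    *out = c;
    return s + 1 + extra;
}

/* Toy rule (an assumption): ASCII letters, '_', and any non-ASCII code point. */
static int can_start_identifier(uint32_t c)
{
    return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c >= 0x80;
}

static int can_continue_identifier(uint32_t c)
{
    return can_start_identifier(c) || (c >= '0' && c <= '9');
}

/* Length in bytes of the identifier at the start of s, or 0 if there is none. */
static size_t scan_identifier(const char *s)
{
    uint32_t c;
    const char *p = s;
    const char *next = unicode_next_character(p, &c);

    if (!can_start_identifier(c)) return 0;
    p = next;
    for (;;) {
        next = unicode_next_character(p, &c);  /* this replaces s++ */
        if (!can_continue_identifier(c)) break;
        p = next;
    }
    return (size_t)(p - s);
}
```

The loop has the same shape as the byte-at-a-time version; only the "advance" step and the character predicates change.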

programLyrique · on June 26, 2016

Sure, but if you are using a language without Unicode support, and you don't use a dedicated library (which would already be using a kind of lexer, wouldn't it?), you also have to parse these Unicode characters yourself.

yoklov · on June 26, 2016

A unicode_next_character function is very simple to write regardless of unicode support in your language.

I usually write it as a small (256 byte) lookup table where each entry tells you how many bytes to skip next. If you don't use a lookup table, it's 4 single-line `if` statements (and if you do, it's a one-liner, plus however many lines the table takes).
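A minimal sketch of both variants for UTF-8, not necessarily how the commenter writes it. The names `utf8_len_table`, `utf8_len_branchy`, and `utf8_count` are hypothetical, and the `R16` macro is just one way to keep the table short. Bytes that cannot start a sequence map to 1 so the scanner always makes progress. The table walk assumes well-formed, NUL-terminated input; a truncated sequence at the end of the string could step past the terminator.

```c
#include <assert.h>
#include <stddef.h>

/* Branching version: four single-line ifs on the lead byte. */
static int utf8_len_branchy(unsigned char b)
{
    if (b < 0x80)           return 1;  /* 0xxxxxxx: ASCII           */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx: 2-byte sequence */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx: 3-byte sequence */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx: 4-byte sequence */
    return 1;                          /* stray continuation or invalid lead */
}

/* Table version: 256 one-byte entries indexed by the lead byte. */
#define R16(n) n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n
static const unsigned char utf8_len_table[256] = {
    R16(1), R16(1), R16(1), R16(1), R16(1), R16(1), R16(1), R16(1), /* 0x00-0x7F */
    R16(1), R16(1), R16(1), R16(1),  /* 0x80-0xBF: continuation bytes   */
    R16(2), R16(2),                  /* 0xC0-0xDF: 2-byte leads         */
    R16(3),                          /* 0xE0-0xEF: 3-byte leads         */
    4,4,4,4,4,4,4,4,                 /* 0xF0-0xF7: 4-byte leads         */
    1,1,1,1,1,1,1,1                  /* 0xF8-0xFF: never valid in UTF-8 */
};

/* Counts code points by hopping from lead byte to lead byte.
   Assumes s is well-formed UTF-8. */
static size_t utf8_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (*s) {
        s += utf8_len_table[(unsigned char)*s];  /* the one-liner */
        n++;
    }
    return n;
}
```

Neither version validates continuation bytes; they only answer "how far to the next character", which is all a skip operation needs.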

jblow · on June 27, 2016

There's free source code for basic Unicode operations all over the internet.

nulltype · on June 26, 2016

Here's a video of Rob Pike, developer of Go, talking about a lexer he built for the Go text/template package:

https://www.youtube.com/watch?v=HxaD_trXwRE

It supports unicode.