March, 18, 2024 archives
Time to modernize PHP’s syntax highlighting?
This blog post about “A syntax highlighter that doesn't suck” was timely because recently I had been kicking at the code for the syntax highlighter that I use on this blog. It’s a very old JavaScript package called SHJS based on GNU Source-highlight.
I created a Git repository where I imported all of the released versions of SHJS and then tried to update the included language files to the ones from the latest GNU Source-highlight release (which was four years ago), but ran into some trouble. There are some new features to the syntax files that the old Perl code in the SHJS package can’t handle. And as you might imagine, the pile of code involved is really, really old.
That new PHP package seems like a great idea and all, but I really like the idea of leveraging work that other people have done to create syntax highlighting for other languages rather than inventing another one.
On Mastodon, Ben Ramsey brought up a start he had made at trying to port Pygments, a Python syntax highlighter, to PHP.
I ran across Chroma, which is a Go package that is built on top of the Pygments language definitions. They’ve converted the Pygments language definitions into an XML format. Those don’t completely handle 100% of the languages, but it covers most of them.
At the end of the day, both GNU Source-highlight and Pygments and variants are built on what are likely to remain imprecise parsers because they are mostly regex-based and just not the same lexing and parsing code actually being used to handle these languages.
PHP has long had it’s own built-in syntax highlighting functions (highlight_string()
and highlight_file()
) but it looks like the generation code hasn’t been updated in a meaningful way in about 25 years. It just has five colors that can be configured that it uses for <span style="color: #...;">
tags. There are many tokens that it simply outputs using the same color where it could make more distinctions. If it were to instead (or also) use CSS classes to mark every token with the exact type, you could do much finer-grained syntax highlighting.
Looks like an area ready for some experimentation.