don't plan on it staying that way. I used to toss wads of my own forth-like language into LLMs to see what kinds of horrible failure modes the latest model would have in parsing and generating such code.
at first they were hilariously bad, then just bad, then kind of okay, and now anthropic's claude4opus reads and writes it just fine.
it varied. with the earlier models I generally gave more, trying to see if some apparition of mechanical understanding would eventually click into place.
IIRC, none of the gpt3 models did well with forth-like syntax. gpt4 generally did okay with it but could still get itself confused. claude4opus doesn't seem to have any trouble with it at all, and is happy to pick up the structures contextually, without explicit documentation of any sort.
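for context, vanilla forth reads as postfix words operating over an implicit stack, roughly like the snippet below. this is plain standard forth, not my language, which differs in the details, but it gives the general flavor:

    \ define a word that squares the top of the stack
    : square ( n -- n*n ) dup * ;
    5 square .   \ prints 25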
another of my languages uses some parse-transforming 'syntactic operators' that earlier models could never quite fully 'get', even with explanation, likely because at least one of them has no analogous operator in popular languages. claude4opus, however, seems to infer them decently enough, and a single transform example is sufficient for it to generalize that understanding to the rest of the code it sees.
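to make that concrete, here's a made-up illustration of the general shape of such an operator. the `->` below is invented for this comment, it's not one of my actual operators:

    \ hypothetical: suppose `->` splices the expression on its left
    \ in as the first argument of the call on its right
    a -> f b c       \ the parser rewrites this to:  f a b c
    (g x) -> h y     \ and this to:                  h (g x) y

one before/after pair like that is roughly what claude4opus needs in order to apply the rewrite correctly everywhere else it appears.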
so far, claude has proved to be quite an impressive set of weights.
That is excellent. I am also using it to prototype language designs, and the 3.7 and 4.0 models are really quite good for this. I haven't found substantial academic research on using LLMs to build prototype language compilers.