One thing I find somewhat amusing about this is that all of the generated code is against the PySpark API. And the PySpark API is itself an interop layer to the native Scala APIs for Spark.
So you have LLM-based English prompts as an interop layer to Python + PySpark, which is itself an interop layer onto the Spark core. Also, the generated Spark SQL strings inside the DataFrame API have their own little compiler into Spark operations.
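To make that last layer concrete, here is a minimal sketch (toy data, made-up column names) of the two routes through the DataFrame API; the SQL string handed to selectExpr goes through Spark's own expression parser before it ends up as the same logical plan as the pure-Python form:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 10.0), (2, 3.5)], ["id", "amount"])

    # SQL-string route: the string is run through Spark's own expression parser
    via_sql_string = df.selectExpr("amount * 2 AS doubled")

    # Pure DataFrame route: no embedded SQL, but the same logical plan
    via_api = df.select((F.col("amount") * 2).alias("doubled"))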
When Databricks wrote PySpark, it was because many programmers knew Python but weren't willing to learn Scala just to use Spark. Now, they are offering a way for programmers to not bother learning the PySpark APIs, and leverage the interop layers all the way down, starting from English prompts.
This makes perfect sense when you zoom out and think about what their goal is -- to get your data workflows running on their cluster runtime. But it does make a programmer like me -- who developed a lot of systems while Spark was growing up -- wonder just how many layers future programmers will be forced to debug through when things go wrong. Debugging PySpark code is hard enough, even when you know Python, the PySpark APIs, and the underlying Spark core architecture well. But if all the PySpark code I had ever written had started from English prompts, it might make debugging those inevitable job crashes even more bewildering.
I haven't, in this description, mentioned the "usual" programming layers we have to contend with, like Python's interpreter, the JVM, underlying operating system, cloud APIs, and so on.
If I were to take a guess, programmers of the future are going to need more help debugging across programming language abstractions, system abstraction layers, and various code-data boundaries than they currently make do with.
As has been pointed out many times, this is similar to the steps that led to interpreted languages like Python, R, Julia, etc., which make calls into C, or run on the JVM/LLVM, and so on down to assembly or machine code.
The leap here is certainly less well defined than the previous jumps, but there is some similarity in that the more precisely a person writes the rules that define the program, the more potential there is to make something powerful and efficient (if you know what you're doing).
The next big gain in capability (beyond the obvious short-term goal of getting an LLM to output a full working code base) may be LLMs choosing better designs without being told to (for example, having 'search for the best algorithm', 'make it idempotent', etc. added automatically to each prompt), and potentially writing the program directly in something like assembly (or Rust or C for better readability) instead of defaulting to Python, as these models tend to do right now.
I don't think it's exactly the same, because an LLM-based English=>Python translator is nowhere near as deterministic as compilers and assemblers. And English, being a language whose tokens are subject to wide interpretation, may be a source of byzantine complexity. Then, of course, there is the "moving target" introduced by model upgrades and by evolution in the public crawl dataset rewiring the model's world knowledge.
There is a reason Python, as high level as it is, is still defined by an eBNF / PEG grammar[1] with only 35 or so keywords[2]. And there is a reason the Python bytecode interpreter is "just" a while loop over a minimal set of instructions[3]. All of this yields a remarkable level of determinism, and determinism is your friend when trying to get code right. I haven't yet seen the equivalent in LLMs. I don't think it's an entirely intractable problem, but I'd be hesitant to leap straight into English as a stable API today. I think code copilots are the right place to start -- and maybe even copilots that help not just with code suggestions, but also with debugging suggestions.
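To see how small and fixed that surface is, the standard library will happily show you both the keyword list and the bytecode that the eval loop executes (the exact counts can drift a little between CPython releases):

    import dis
    import keyword

    print(len(keyword.kwlist))  # 35 hard keywords in recent CPython releases

    def double(x):
        return x * 2

    # A handful of fixed instructions, dispatched by CPython's eval loop
    dis.dis(double)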
Things like Spark may already be the main place where a lot of today's programmers really have to fight the "compiler", compared to, say, writing Java for the JVM or writing plain Python. Spark code gets compiled down to parallel execution plans across distributed systems, which introduces fairly novel performance pitfalls compared to writing a linear Java or Python method that compiles down to basic native code.
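As a rough illustration (toy data, assumed names), even a tiny aggregation compiles to a distributed plan with a shuffle, which is where the surprising costs usually hide:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    events = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

    totals = events.groupBy("key").agg(F.sum("value").alias("total"))

    # The physical plan shows a partial aggregate, an Exchange (shuffle), then a final aggregate
    totals.explain()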
Adding a layer on top of it is certainly going to add more fun for the engineers who get the "hey, I need this to be fast" requests from analysts who have written their own query/notebook/English prompt to build a dashboard.
It's not a bad thing -- it's a very useful capability. If you're a generalist or a novice data analyst, having to learn Spark or SQL to do things like "get 4 week moving average sales by dept" is a big hurdle. But I think this is one of the more obvious examples of where these tools are going to result in more engineering demand in a lot of orgs, instead of less, even if things only go sideways or get crazy slow 0.1% of the time.
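For reference, here is roughly what that request has to turn into -- a minimal PySpark sketch with assumed table and column names:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    sales = spark.table("weekly_sales")  # assumed columns: dept, week_start, sales

    # Current week plus the three preceding weeks, per department
    w = Window.partitionBy("dept").orderBy("week_start").rowsBetween(-3, 0)

    moving_avg = sales.withColumn("sales_4wk_avg", F.avg("sales").over(w))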
At the meta-level, Databricks or someone should be able to build some pretty good "optimizing compilers" that feed all the info about your execution env, data, etc., into the code generator. But any time you need to override that, you're suddenly gonna need a LOT of domain knowledge.
The annoying (?) part of Scala Spark is the lack of a notebook ecosystem. Also, spark-submit requires a compiled jar for Scala, yet only the main Python script for Python. I would've loved Scala Spark if the ecosystem had been in place.