One thing I find somewhat amusing about this is that all of the generated code is against the PySpark API. And the PySpark API is itself an interop layer to the native Scala APIs for Spark.
So you have LLM-based English prompts as an interop layer to Python + PySpark, which is itself an interop layer onto the Spark core. Also, the generated Spark SQL strings inside the DataFrame API have their own little compiler into Spark operations.
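To make that last layer concrete, here is a minimal sketch (toy data, made-up column names) of the two routes through the DataFrame API; the SQL string handed to selectExpr goes through Spark's own expression parser before it ends up as the same logical plan as the pure-Python form:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 10.0), (2, 3.5)], ["id", "amount"])

    # SQL-string route: the string is run through Spark's own expression parser
    via_sql_string = df.selectExpr("amount * 2 AS doubled")

    # Pure DataFrame route: no embedded SQL, but the same logical plan
    via_api = df.select((F.col("amount") * 2).alias("doubled"))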
When Databricks wrote PySpark, it was because many programmers knew Python but weren't willing to learn Scala just to use Spark. Now, they are offering a way for programmers to not bother learning the PySpark APIs, and leverage the interop layers all the way down, starting from English prompts.
This makes perfect sense when you zoom out and think about what their goal is -- to get your data workflows running on their cluster runtime. But it does make a programmer like me -- who developed a lot of systems while Spark was growing up -- wonder just how many layers future programmers will be forced to debug through when things go wrong. Debugging PySpark code is hard enough, even when you know Python, the PySpark APIs, and the underlying Spark core architecture well. But if all the PySpark code I had ever written had started from English prompts, it might make debugging those inevitable job crashes even more bewildering.
I haven't, in this description, mentioned the "usual" programming layers we have to contend with, like Python's interpreter, the JVM, underlying operating system, cloud APIs, and so on.
If I were to take a guess, programmers of the future are going to need more help debugging across programming language abstractions, system abstraction layers, and various code-data boundaries than they currently make do with.
As has been pointed out many times, this is similar to the steps that led to interpreted languages like Python, R, Julia, etc., which make calls into C, or run on the JVM/LLVM, and so on down to assembly or machine code.
The leap here is certainly less well defined than the previous jumps, but there is some similarity in that the more precisely a person writes the rules that define the program, the more potential there is to make something powerful and efficient (if you know what you're doing).
The next big gain in capability (beyond the obvious short-term goal of getting an LLM to output a full working code base) may be LLMs choosing better designs without being told to (for example, having 'search for the best algorithm', 'make it idempotent', etc. added automatically to each prompt), and potentially writing the program directly in something like assembly (or Rust or C for better readability) instead of defaulting to Python, as these models tend to do right now.
I don't think it's exactly the same, because an LLM-based English=>Python translator is nowhere near as deterministic as compilers and assemblers. And English, being a language whose tokens are subject to wide interpretation, may be a source of byzantine complexity. Then, of course, there is the "moving target" introduced by model upgrades and by evolution in the public crawl dataset rewiring the model's world knowledge.
There is a reason Python, as high level as it is, is still defined by an eBNF / PEG grammar[1] with only 35 or so keywords[2]. And there is a reason the Python bytecode interpreter is "just" a while loop over a minimal set of instructions[3]. All of this yields a remarkable level of determinism, and determinism is your friend when trying to get code right. I haven't yet seen the equivalent in LLMs. I don't think it's an entirely intractable problem, but I'd be hesitant to leap straight into English as a stable API today. I think code copilots are the right place to start -- and maybe even copilots that help not just with code suggestions, but also with debugging suggestions.
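To see how small and fixed that surface is, the standard library will happily show you both the keyword list and the bytecode that the eval loop executes (the exact counts can drift a little between CPython releases):

    import dis
    import keyword

    print(len(keyword.kwlist))  # 35 hard keywords in recent CPython releases

    def double(x):
        return x * 2

    # A handful of fixed instructions, dispatched by CPython's eval loop
    dis.dis(double)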
Things like Spark may already be the main place where a lot of today's programmers really have to fight the "compiler", compared to, say, writing Java for the JVM or writing plain Python. Spark code gets compiled down to parallel execution plans across distributed systems, which introduces fairly novel performance pitfalls compared to writing a linear Java or Python method that compiles down to basic native code.
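As a rough illustration (toy data, assumed names), even a tiny aggregation compiles to a distributed plan with a shuffle, which is where the surprising costs usually hide:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    events = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

    totals = events.groupBy("key").agg(F.sum("value").alias("total"))

    # The physical plan shows a partial aggregate, an Exchange (shuffle), then a final aggregate
    totals.explain()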
Adding a layer on top of it is certainly going to add more fun for the engineers who get the "hey, I need this to be fast" requests from analysts who have written their own query/notebook/English prompt to build a dashboard.
It's not a bad thing -- it's a very useful capability. If you're a generalist or a novice data analyst, having to learn Spark or SQL to do things like "get 4 week moving average sales by dept" is a big hurdle. But I think this is one of the more obvious examples of where these tools are going to result in more engineering demand in a lot of orgs, instead of less, even if things only go sideways or get crazy slow 0.1% of the time.
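For reference, here is roughly what that request has to turn into -- a minimal PySpark sketch with assumed table and column names:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    sales = spark.table("weekly_sales")  # assumed columns: dept, week_start, sales

    # Current week plus the three preceding weeks, per department
    w = Window.partitionBy("dept").orderBy("week_start").rowsBetween(-3, 0)

    moving_avg = sales.withColumn("sales_4wk_avg", F.avg("sales").over(w))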
At the meta-level, Databricks or someone should be able to build some pretty good "optimizing compilers" that feed all the info about your execution env, data, etc., into the code generator. But any time you need to override that, you're suddenly gonna need a LOT of domain knowledge.
The annoying (?) part of Scala Spark is the lack of a notebook ecosystem. Also, spark-submit requires a compiled jar for Scala, yet only the main Python script for Python. I would've loved Scala Spark if the ecosystem had been in place.