Last week I helped someone organize and analyze their data in Excel. Since I use Excel only once every couple of years, I had to rewatch the wonderful "You Suck at Excel with Joel Spolsky" to be productive again. Seeing this announcement page, I was immediately reminded of the mini-rant towards the end of the video [0]:
> On average, once every three months, there's a startup that makes a thing that they say is going to be amazing, and it's just PivotTables. They're like, "It works with Excel, and it does this amazing consolidation, and slicing and dicing of all your data, and it's amazing, and we're going to make a startup. I'm going to sell this for four hundred ninety-five dollars." And that happens at least once every three months. The trouble is, the VCs usually know about PivotTables.
Of course this product goes a little further, using an LLM to suggest which columns to analyze and chart. But it's quite funny to me that this Microsoft Research product is reinventing the PivotTable (+ PivotChart) part with Python and Pandas.
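For anyone who hasn't touched pandas: the PivotTable part really is a few lines. A minimal sketch with made-up sales data (the column names are just for illustration):

    import pandas as pd

    # Hypothetical sales data, standing in for whatever you'd paste into Excel.
    df = pd.DataFrame({
        "region":  ["North", "North", "South", "South"],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "revenue": [100, 120, 90, 130],
    })

    # The pandas equivalent of a PivotTable: rows by region,
    # columns by quarter, cells summing revenue.
    pivot = df.pivot_table(index="region", columns="quarter",
                           values="revenue", aggfunc="sum")
    print(pivot)

    # And the PivotChart part, with matplotlib installed:
    pivot.plot(kind="bar")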
But nobody today is going to read, let alone promote, a blog post about pivot tables. Sprinkle in LLM references, and the fad-wave riders will sing its praises.
It's chartjunk and should be used sparingly (e.g., on covers) rather than as the actual content. In the hands of an undisciplined person, it would be used frivolously.
Thank you for the kind words. I wouldn't even know where to start. There's a lot I can code and do, but making something into a product is something I have no experience with.
What's your recommended approach? Even if open-source, curious to learn.
Continue as an open source project, building features and refining the product. Once you have enough users, you can start offering premium features and support to organizations. Maybe even apply to an accelerator? Good luck!
The technology for this type of generation (ControlNet) is already open source, and it's relatively straightforward to reproduce the charts demoed in that post without shenanigans.
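A rough sketch of what that looks like with the open source diffusers library (the checkpoints are the common public ones, and chart.png stands in for an already-rendered plain chart):

    import cv2
    import numpy as np
    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # Condition on an edge map of the plain chart, restyle via the prompt.
    chart = Image.open("chart.png").convert("L")
    edges = cv2.Canny(np.array(chart), 100, 200)
    edge_map = Image.fromarray(np.stack([edges] * 3, axis=-1))

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
        torch_dtype=torch.float16).to("cuda")

    styled = pipe("hand-drawn infographic style bar chart",
                  image=edge_map, num_inference_steps=20).images[0]
    styled.save("styled_chart.png")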
Excellent technical work, but subject to the same major question marks around the morality and legality of LLM business models. From the discussion section:
> Low Resource Grammars: ... LIDA depends on the underlying LLMs having some knowledge of visualization grammars as represented in text and code in its training dataset (e.g., examples of Altair, Vega, Vega-Lite, GGPlot, Matplotlib, represented in Github, Stackoverflow, etc.). For visualization grammars not well represented in these datasets (e.g., tools like Tableau, PowerBI, etc., that have graphical user interfaces as opposed to code representations), the performance of LIDA may be limited without additional model fine-tuning or translation.
In other words, open source programmatic visualizations are required to feed the LLM, which then can, e.g., be licensed to corporates to accelerate various internal exploratory data analyses. A win-win for corporates and LLM providers.
If I release a novel visualization library on GitHub under some open source license, I want it to be attributed to me. I don't want some specialized LLM lifting and offering the same visualization concepts to unnamed corporates for a hefty fee without me ever knowing about it, and these corporates pretending they don't know where the concept came from.
It is your choice whether you think that is a problem, and how "novel" it really is. Theft, after all, has a very long history.
Good to know that the prevailing commercial tech culture now sees plagiarism and stealing ideas without attribution as the modern way of doing business, and hopes that dressing things up under some algorithmic veil will hide the act.
I guess the pit of moral decline has no bottom. The consolation is that theft has never been the road to wealth. Once the plundering is over, the only thing left is a wasteland.
It seems that Microsoft has finally found a way to kill the open source "cancer".
As they say, people are unwilling to understand something if their monetary gain depends on not understanding it.
Let me break it down for you. If I ask for a visualization that squares the circle and there is one repo that has an example of squaring the circle, the LLM will "arrive" at a way of squaring the circle.
If (1) an LLM is able to arrive at solutions in the same class of difficulty as the solution for the target problem and (2) it's not possible to establish the provenance of the solution actually offered by the LLM, then what's the argument for assuming that the solution is based on IP rather than constructive reasoning?
I built a tool that lets you use GPT to analyze data and build interactive graphs on the browser (https://deepsheet.dylancastillo.co/). I may try to adapt it to use LIDA or a similar approach.
No, absolutely not. How can you trust the output from such a black box system? Who is to say that the LLM won't add or remove data points to make the chart "look good"? Heaven help us if decision makers start taking this output seriously. But of course they will, because the charts will look professional and plausible, because that's what the prompt requires.
> Who is to say that the LLM won't add or remove data points to make the chart "look good"?
I don't think you're thinking creatively enough here. A good system that makes use of these concepts (because it's a research project, not a product!) will likely ensure that actions the LLM takes are non-destructive and inherently undoable. For example, if the underlying data was changed by the LLM, you can verify that and show a warning, emit an error, or ... something else entirely!
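As a sketch of that idea (nothing below is LIDA's actual code; fingerprint and run_generated_code are made-up names): hash the data before running the generated code against a working copy, and warn if the plotted data no longer matches the source.

    import hashlib
    import pandas as pd

    def fingerprint(df: pd.DataFrame) -> str:
        # Stable hash of the dataframe's contents.
        return hashlib.sha256(
            pd.util.hash_pandas_object(df, index=True).values.tobytes()
        ).hexdigest()

    def run_generated_code(df: pd.DataFrame, generated_code: str) -> None:
        before = fingerprint(df)
        working = df.copy()  # the generated code never sees the original
        exec(generated_code, {"df": working})
        if fingerprint(working) != before:
            print("warning: generated code changed the data it plotted")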
Agreed. Our customers on the regulated side cannot, by law, use an unexplainable UI like that.
We take a middle ground with louie.ai of showing the database queries, data transforms, chart config, and any other decision or generation. It's nice being able to watch and check each step and then describe in natural language what you want changed, so it ends up feeling more like the easier side of pair programming than a black box.
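A toy version of that kind of step transcript (louie.ai's internals aren't public, so every name here is illustrative):

    from dataclasses import dataclass, field

    @dataclass
    class Step:
        kind: str    # "query" | "transform" | "chart_config"
        detail: str  # SQL, pandas code, or chart spec, verbatim as generated

    @dataclass
    class Session:
        steps: list[Step] = field(default_factory=list)

        def record(self, kind: str, detail: str) -> None:
            self.steps.append(Step(kind, detail))

        def transcript(self) -> str:
            # Everything the model decided, in order, for the user to inspect.
            return "\n".join(f"[{s.kind}] {s.detail}" for s in self.steps)

    session = Session()
    session.record("query", "SELECT region, revenue FROM sales")
    session.record("transform", "df.groupby('region').revenue.sum()")
    session.record("chart_config", "bar chart, x=region, y=revenue")
    print(session.transcript())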
"You are a helpful assistant highly skilled in writing PERFECT code for visualizations. Given some code template, you complete the template to generate a visualization given the dataset and the goal described. The code you write MUST FOLLOW VISUALIZATION BEST PRACTICES ie. meet the specified goal, apply the right transformation, use the right visualization type, use the right data encoding, and use the right aesthetics (e.g., ensure axis are legible). The transformations you apply MUST be correct and the fields you use MUST be correct. The visualization CODE MUST BE CORRECT and MUST NOT CONTAIN ANY SYNTAX OR LOGIC ERRORS. You MUST first generate a brief plan for how you would solve the task e.g. what transformations you would apply e.g. if you need to construct a new column, what fields you would use, what visualization type you would use, what aesthetics you would use, etc.
YOU MUST ALWAYS return code using the provided code template. DO NOT add notes or explanations." (https://github.com/microsoft/lida/blob/main/lida/components/...)
They prompted that things MUST be correct, and it reports any transformations it applies to your data, which might give you some insight into its logic to check against the data yourself.
True. This is an open area of research. Tools like guidance (or other implementations of constrained decoding with LLMs [1,2]) will likely help with this problem.
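In the meantime, part of the audit can at least be mechanized. A sketch of cheap pre-execution checks on generated plotting code (the import whitelist is illustrative, not a security boundary):

    import ast

    ALLOWED_IMPORTS = {"matplotlib", "pandas", "numpy"}

    def audit_generated_code(code: str) -> list[str]:
        # Does it even parse, and does it import only expected libraries?
        try:
            tree = ast.parse(code)
        except SyntaxError as e:
            return [f"syntax error: {e}"]
        problems = []
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                modules = [node.module or ""]
            else:
                continue
            for module in modules:
                if module.split(".")[0] not in ALLOWED_IMPORTS:
                    problems.append(f"unexpected import: {module}")
        return problems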
So instead of auditing MPL once (or never, because MPL doesn't have a habit of producing broken output), I should audit the output of this LLM for every query, because it does have a habit of hallucinating?
Honestly, that's the more interesting and more difficult part. Anyone with basic training can be coerced into slicing and dicing schemas and configs until pretty graphs are produced. LLMs might not even be the best tool for that.
But knowing _what_ to look for in the data given a problem statement - that's valuable, and hard to teach. LLMs have such a broad base of "knowledge", they can be reasonably good at this in just about any domain.
I would agree -- that's why (to me at least) the recent wave of LLMs is such a big deal. They make semantic contexts accessible for interaction with code logic.
Ironically, I've found GPT to be pretty terrible with plotting libraries like plotly/dash or even matplotlib, compared to just about anything else in Python.
I wrote a simple wrapper around Matplotlib and ChatGPT-3.5-turbo. The LLM response is Python code that is executed to produce charts. It works very nicely. Here is the repo: https://github.com/mljar/plotai - you will find two videos in the readme. Maybe you should work on your prompts?
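For anyone curious what such a wrapper boils down to, here is my sketch of the pattern using the pre-1.0 openai client (this is not plotai's actual code):

    import openai  # pre-1.0 client; openai.api_key set elsewhere

    PROMPT = ("Write matplotlib code that plots the pandas DataFrame `df` "
              "to answer: {question}. Return only code, no explanations.")

    def plot_with_llm(df, question: str) -> None:
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user",
                       "content": PROMPT.format(question=question)}],
        )
        code = resp["choices"][0]["message"]["content"]
        # You'd want to inspect/sandbox this before running it for real.
        exec(code, {"df": df})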
Huh, neat. It never bothered me enough, or was important enough, to spend time on specifically, since I was able to just hit a button to send them back for fixing, but it's good to see the extra passes aren't strictly needed.
I find the quality of the code really questionable:
system_prompt = """
You are a an experienced data analyst that can annotate datasets. Your instructions are as follows:
i) ALWAYS generate the name of the dataset and the dataset_description
ii) ALWAYS generate a field description.
iv.) ALWAYS generate a semantic_type (a single word) for each field given its values e.g. company, city, number, supplier, location, gender, longitude, latitude, url, ip address, zip code, email, etc
You return an updated JSON dictionary without any preamble or explanation.
"""
I sort of know what I'm doing with data, so I don't want LLMs building any models for me, but I do like the concept of making my lame visualizations look more professional and slicker.
Am I missing something here? From the video and examples, this looks like it's helping you make Excel charts with less suck (slightly stylized), not really building what I would consider "infographics" in the traditional marketing sense. I guess it counts as visualizations, but it's not what I was expecting.
[0]: https://youtu.be/0nbkaYsR94c?si=kkfFHZ_fyGmG3Lnj&t=2988