Diagrams AI can, and cannot, generate (ilograph.com)
214 points by billyp-rva 42 days ago | 68 comments



A mistake I see people repeating over and over is never restarting their conversations with an edited initial message.

Instead of doing what the author is doing here, sending messages back and forth so the conversation gets longer and longer and each message gets a worse and worse reply until the LLM seems like a dumb rock, rewrite your initial message with everything that went wrong or was misunderstood, and aim to have whatever you want solved in the first message; you'll get much higher quality answers. If the LLM misunderstood, don't reply "No, what I meant was..." but instead rewrite the first message so it's clearer.

This is at least true for the ChatGPT, Claude and DeepSeek models; YMMV with other models.


Yup.

Inasmuch as these are collaborative document generators at their core, "minimally ambiguous prompt and conforming reply" is a strongly represented document structure and so we benefit by setting them up to complete one.

Likewise, "tragi-comic dialog between increasingly frustrated instructor and bumbling pupil" is also a widely represented document structure that we benefit by trying to avoid.

Chatbot training works to minimize the chance of an LLM engaging in the latter, because dialog is an intuitive interface that users enjoy, but we can avoid the problem more successfully by just providing a new and less ambiguous prompt in a new session, as you suggest.


> dialog is an intuitive interface that users enjoy

Do people enjoy chat interfaces in their workflows?

I always thought that cursor/copilot/copy.ai/v0.dev were so popular because they break away from the chat UI.

Dialog is cool when exploring but, imo, really painful when trying to accomplish a task. LLMs are far too slow to make a real fluid conversation.


This means the leading UI for LLMs - the chat - is the wrong UI, at least for some of the tasks. We should instead have a single query text field, like in search engines, that you continue to edit and refine, just like in complex search queries.


I like Zed's approach, where the whole discussion is a plain text file you can edit like any other text, which gives you the ability to change anything in the "discussion" regardless of whether it was generated by you or the LLM. It makes stuff like that much simpler: you can correct simple things in the LLM's response without unnecessary back and forths, you can cut parts out of the discussion to reduce context size, or guide the discussion where you actually want it by removing distractions, etc. I don't understand why the dominant approach is an actual, realistic chat interface where you can only add a new response, or at best create "threads".


> I don't understand why the dominant approach is an actual, realistic chat interface where you can only add a new response, or at best create "threads".

I'm not 100% sure either; I think it might just be a first-iteration UX that is generally useful, but not specifically useful for use cases like coding.

To kind of work around this, I generally keep my prompts as .md files on disk and treat them like templates, with variables like $SRC that get replaced with the actual code when I "compile" them. So I write a prompt, paste it into ChatGPT, notice something is wrong, edit my template on disk, then paste it into a new conversation. Iterate until it works. I ended up putting the CLI I use for this here, in case others wanna try the same approach: https://github.com/victorb/prompta
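
In case it helps to picture it, here is a stripped-down sketch of the "compile" step (just the idea, not the actual prompta code; the paths and names are made up):

    # Fill $-style placeholders (e.g. $SRC) in a prompt template with file contents.
    from pathlib import Path
    from string import Template
    import sys

    def compile_prompt(template_path: str, **files: str) -> str:
        template = Template(Path(template_path).read_text())
        contents = {name: Path(path).read_text() for name, path in files.items()}
        return template.safe_substitute(contents)

    if __name__ == "__main__":
        # e.g. python compile_prompt.py refactor.md src/main.py
        print(compile_prompt(sys.argv[1], SRC=sys.argv[2]))

Pipe the output into your clipboard and paste it into a fresh conversation whenever the template changes.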


I wonder if people in general would have a healthier understanding of LLMs if this mode of interaction was more common. Perhaps it would be more clear that the LLM is a very souped up autocomplete, instead of some other mind to debate.


Yes, I think the same. It really demystified the experience for me when I first tried Zed, and I already had that belief about LLMs. But when you use an LLM through a normal chat interface, it feels different, and they are also tuned to feel that way, like talking to somebody. Maybe this is why this approach is avoided by those who have some stake in AI, even if it is better UX.


I've found the most useful LLM UIs for me are tree-like, with lots of branches where you go back and forth between your prompts. You can branch off anywhere and edit the top or the leaves as you go.

If one branch doesn't work out, you go back to the last node that gave good results (or to the top) and create another branch from there with a different prompt.

Or if you want to ask something in a different direction but don't want all the baggage from recent nodes.

Example: https://exoloom.io/trees
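
The structure behind these UIs is basically a tree (a simplified sketch of the idea, not how exoloom actually implements it): every prompt or reply is a node, only the root-to-leaf path you pick is sent as context, and a dead-end branch just gets abandoned instead of polluting the history.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        role: str                               # "user" or "assistant"
        text: str
        children: list["Node"] = field(default_factory=list)

        def branch(self, role: str, text: str) -> "Node":
            child = Node(role, text)
            self.children.append(child)
            return child

    def context(path: list[Node]) -> list[dict]:
        # Only the chosen root-to-leaf path is sent to the model.
        return [{"role": n.role, "content": n.text} for n in path]

    root = Node("user", "Draw a deployment diagram for service X.")
    bad = root.branch("assistant", "...unhelpful first attempt...")
    # Branch again from the root with a refined prompt; `bad` never enters the context.
    retry = root.branch("user", "Draw a deployment diagram for service X, data plane only.")
    print(context([root, retry]))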


I still think there is value in chats and retaining context. But there is also value in starting clean when necessary. Giving users control and teaching people how to use it is the way IMO.


The problem with retaining context is that it gets polluted. That pollution gets you into a latent space with errors, which is probably not where you want your next token prediction to be sourced from.

The reasonable alternative is a chat interface that lets you edit any text, the AI response or your prompts, and regenerate from any point. This is why I use the API "playground" interfaces or something like LibreChat. Deepseek at least has prompt editing/regeneration.


> This means the leading UI for LLMs - the chat - is the wrong UI

For coding, I'd agree. But people seemingly use LLMs for more than that; I don't have any experience with that myself. I do agree with the idea that we haven't found the right UX for programming with LLMs yet. I'm getting even worse results with Aider, Cursor and all of those than with just my approach outlined above, so that doesn't seem like the right way either.


I've also started adding "Ask any questions you think are relevant before starting" to the end of my prompts. It usually results in at least one question that addresses something I didn't think to add to my prompt.


I’ve been saying “stop writing code until we agree what needs to be done”.


It seems like the author in fact did do this. They asked Claude the same message. I really doubt they repeated the entire conversation to get to that point, but I may be wrong.

From personal experience, I agree with you, but I wouldn't make that critique here, as it is far from a magic bullet. Honestly, for the first examples it seems faster to learn mermaid and implement it yourself. Mermaid can be learned in a rather short time; the basic syntax is fairly trivial and essentially obvious. As an added benefit, you then get to keep this knowledge and use it later on. This will certainly feel slower than the iterative back and forth with an LLM -- either through follow-up conversations or by refining your one-shot -- but I'm not convinced it will be a huge difference in time as measured by the clock on the wall.[0]

[0] idk, going back and forth with an LLM and refining my initial messages feels slow to me. It reminds me of print statement debugging in a compiled language. Lots of empty time.


> It seems like the author in fact did do this.

It doesn't seem like that to me. At one point in the article: "There are also a few issues [...] Let’s fix with the prompt" and then a prompt that refers to the previous message. Almost all prompts after that seem to depend on the context before them.

My point is that instead of doing that, revise the original initial message so the very first response from the LLM doesn't contain any errors, because (in my experience) that's way easier and faster than trying to correct errors by adding more messages, since they all (even O1 Pro) seem to lose track of what's important in the conversation really fast.


I'm just saying I don't think they repeated that whole process for Claude.


100%

To be honest, this would help a lot with person-implemented iteration too, if it were biologically feasible to erase a conversation from a brain.


alright, time for you to go watch Eternal Sunshine of the Spotless Mind so that you can disabuse yourself of that notion


I built Plandex[1], an open source AI coding agent, partly to enable this workflow.

It has `log` and `rewind` commands that allow you to easily back up to any previous point in the conversation and start again from there with an updated prompt. Plandex also has branches, which can be helpful for not losing history when using this approach.

You’re right that it’s often a way to get superior results. Having mistakes or bad output in the conversation history tends to beget more mistakes and bad output, even if you are specifically directing the LLM to fix those things. Trial and error with a new prompt and clean context avoids this problem.

1 - https://plandex.ai

P.S. I wrote a bit about the pros and cons of this approach vs. continuing to prompt iteratively in Plandex’s docs here: https://docs.plandex.ai/core-concepts/prompts#which-is-bette...


I tried this approach when attempting to get Deepseek-r1 and GrokV3 to create a simple CUDA application. It was necessary because the iterative approach kept leading to hangs and divergent behaviors. I still wasn't able to get a working application, however.


I love Claude, but whoever works on their UI needs to be slapped a bit. Code output covering the stop button on my laptop, page lockups on iPhone/Chrome with certain artifacts (even after reload), crazy slow typing on the computer, and refusal to “continue” a chat with a cheaper model. Simply providing a summary of the chat when running out of tokens would let me start another conversation, or at least a warning that I was getting close.


Markov Chain system doesn't like Markov Chain input.


In my experience this only marginally improves things. It constantly offers new ways to be wrong.


That’s too much work. I’d rather ask the LLM to rewrite my first message for me. And the UI should then give me an option to “start new chat from suggested prompt.”


> I’d rather ask the LLM to rewrite my first message for me

I guess you can do that too, as long as you start a new conversation afterwards. Personally I found it much easier to keep prompts in .md files on disk, and paste them into the various interfaces when needed, and then I iterate on my local files if I notice the first answer misunderstood/got something wrong. Also lets you compose prompts which is useful if you deal with many different languages/technologies and so on.


We use mermaidjs as a supercharged version of chain-of-thought for generating some sophisticated decompositions of the intent.

Then we inject the generated mermaid diagrams back into subsequent requests. Reasoning performance improves for a whole variety of applications.
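
The shape of the two-pass flow is roughly this (a simplified sketch; the model name and prompts here are placeholders, not our production pipeline):

    from openai import OpenAI

    client = OpenAI()

    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder: any capable chat model
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    task = "Add rate limiting to the public API gateway."

    # Pass 1: have the model decompose the intent as a Mermaid diagram.
    diagram = ask("Decompose this task into a Mermaid flowchart of steps and "
                  "dependencies. Output only the Mermaid code.\n\n" + task)

    # Pass 2: inject the generated diagram back into the follow-up request.
    answer = ask("Given this plan:\n\n" + diagram +
                 "\n\nNow describe how to implement it: " + task)
    print(answer)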


Neat idea!

Could you go into a bit more detail on how you encode the intent?


Any simple examples?


Random thoughts:

Sketching backed by automated cleanup can be good for entering small diagrams. There used to be an iOS app based on graphviz: http://instaviz.com

Constraint-based interactive layout may be underinvested, as a consequence of too many disappointments and false starts in the 1980s.

LLMs seem ill-suited to solving the optimization of combinatorial and geometric constraints and objectives required for good diagram layout. Overall, one has to admire the directness and simplicity of mermaid. Also, it would be great to someday see a practical tool with the quality and generality of the ultra-compact grid layout prototype from the Monash group, https://ialab.it.monash.edu/~dwyer/papers/gridlayout2015.pdf (2015!!)


Oh wow, thank you for linking that paper. I've been working on an interactive tool for a while and have been musing on new constraint and layout types to add. Anecdotally it seems a lot of mainstream graph layout algorithms work well for small to mediumish complexity inputs, but then quickly start generating visual spaghetti. So this looks incredibly apropos for me.


App is unavailable in the US :(


Thanks for the link to the Monash paper.

>LLMs seem ill-suited to solving the optimization of combinatorial and geometric constraints and objectives required for good diagram layout.

I think this is where the LLM's distant NLP cousin can be of help, namely CUE, since it is fundamentally based on feature structures from the deterministic approach to NLP, unlike LLMs, which are stochastic NLP [1],[2],[3].

Based on the Monash paper, Constraint Programming (CP) is one of the popular approaches being used for automatic grid layout.

Since CUE is a constraint configuration language belonging to CP, its NLP background should make it easier and more seamless to integrate with LLMs. If someone can somehow crack this, it will be a new generation of LLM that can perform good and accurate diagramming via prompts, and it will be a boon for architects, designers and engineers. Speaking of engineers, if this approach can also be used for IC layout design (analog and digital), not only for diagrams, it will easily disrupt the multi-billion-dollar industry of very expensive IC design software and manpower.

I hope I'm not getting ahead of myself, but ultimately this combo can probably solve the "holy grail" problem mentioned towards the end of the paper's conclusions, regarding a layout model that somehow incorporates routing in a way that is efficiently solvable to optimality. After all, some people in computer science consider CP the "holy grail" of programming [4].

Please, someone, make a startup, or any existing YC startup like JITX (Hi Patrick) could look into this potentially fruitful endeavor of a hybrid LLM combo for automated IC design [5].

Perhaps your random thoughts are not so random but deterministic non-random in nature, pardon the pun.

[1] Cue – A language for defining, generating, and validating data:

https://news.ycombinator.com/item?id=20847943

[2] Feature structure:

https://en.m.wikipedia.org/wiki/Feature_structure

[3] The Logic of CUE:

https://cuelang.org/docs/concept/the-logic-of-cue/

[4] Solving Combinatorial Optimization Problems with Constraint Programming and OscaR [video]:

https://m.youtube.com/watch?v=opXBR00z_QM

[5] JITX: Automatic circuit board design:

https://www.ycombinator.com/companies/jitx


Related - a nice time saver that I've been using since they added image recognition support to ChatGPT has been taking a quick snap of my crudely drawn hand sketched diagrams (on graph paper) with my phone and asking ChatGPT to convert them to mermaid UML syntax.


Comments like these are why I come to Hacker News! I'm working on a project right now where I've been learning mermaid, but I've gotten to the point where it would be a lot easier for me to draw it out and convert it this way. I'll try this!


I was thinking about a similar topic and started to wonder if I could generate a diagram of a large codebase.

I thought that LLMs are great at compressing information and thought of putting that to good use by compressing a large codebase into a single diagram. Since the entire codebase doesn't fit in the context window, I built a recursive LLM tool that calls itself.

It takes two params:

* the current diagram state

* the new files it needs to expand the diagram

The seed set would be an empty diagram and an entry point to source code. And I also extended it to complexity analysis.
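
The recursion itself is simple (a simplified sketch, not the actual tool; the LLM call is left as a placeholder):

    from pathlib import Path

    def llm_update_diagram(diagram: str, source: str) -> str:
        # Placeholder for the LLM call: given the current diagram text and new
        # source files, return an expanded diagram (e.g. Mermaid).
        raise NotImplementedError

    def build_diagram(diagram: str, files: list[Path], batch_size: int = 5) -> str:
        if not files:
            return diagram
        batch, rest = files[:batch_size], files[batch_size:]
        source = "\n\n".join(p.read_text() for p in batch)
        # Recurse with the enriched diagram and the files not yet seen.
        return build_diagram(llm_update_diagram(diagram, source), rest)

    # Seed: an empty diagram plus the source files, starting from the entry point.
    # diagram = build_diagram("", sorted(Path("src").rglob("*.py")))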

It worked magically well. Here are a couple of diagrams it generated:

* https://gist.github.com/priyankc/27eb786e50e41c32d332390a42e...

* https://gist.github.com/priyankc/0ca04f09a32f6d91c6b42bd8b18...

If you are interested in trying out, I've blogged here: https://updates.priyank.ch/projects/2025/03/12/complexity-an...


GPT-4o is not particularly good at this kind of logic, at least compared to other current models. Trying something that is at least in the top 10 of the WebDev Arena leaderboard would help: https://web.lmarena.ai/leaderboard

Make sure it is allowed to think before doing (not necessarily in a dedicated thinking mode; it can be a regular prompt to design the graph before implementing it). Also make sure to say in the prompt who the graph is for (e.g. "a clean graph, suitable for a blog post for a technical audience").


You have got more patience than me. I have tried to use these tools to generate (basic) network diagrams and by the time I reached your third step I already knew that it was time to quit and draw it out myself. Diagrams need to be correct and accurate otherwise they're just art. I also need any amendments to be made to the same diagram, not to have it regenerated each time.

I do like the idea of another commenter here who takes a photo of their whiteboard and instructs the AI tool to turn it into a structured diagram. That seems to be well within reach of these tools.


Claude does quite alright. Over one and a half years I've made several dozen Mermaid diagrams of all kinds, and perhaps only the most complex were out of reach.

It also really depends on the printing.


Printing?


hmmm let me remember what did I want to say. hmmm hmm hmmm.

depends on the prompting I guess :D

sorry


The "AI" we have now is just a tweening algorithm on a different medium. You won't be able to get it to do anything specific, except when that's a point between two existing works. As for this blog, it's nigh unreadable for those not following the current fad web frameworks. Who's to say the user doesn't have to log in to get to the gateway? Gateway can mean different things. Why can the user choose to upload images instead of logging in? What was the purpose of the login?


I have had good success with D2 diagrams with Claude: https://victorbjorklund.com/build-diagrams-as-code-with-d2-d...

It has icons for common things like cloud services.


I've had similar results asking ChatGPT to generate input files for graphviz "dot". Pretty good. E.g. I asked it to summarize a complex article and draw the people named and their relationships. I also got it to embellish the diagrams a little bit, but it needed a lot of guidance to know what kind of nodes to add.

But it was good at arranging the elements in timeline order for example.


Thanks for writing this up. Some questions for the author:

Interesting perspective but it’s a bit incomplete without a comparison of various models and how they perform.

Kind of like Simon Willison’s now-famous “pelican on a bicycle” test, these diagrams might be done better by some models than others.

Second, this presents a static picture of things, but AI moves really fast! It’d also be great to understand how this capability is improving over time.


I talk about this, kind of, in my article about process visualization (in German, available behind a paywall and in print). It's not rigorous in the sense that I give points, but a picture emerges along the way. Based on the full set of practical examples there, I would recommend the "v1" of Claude 3.5 Sonnet. GPT 4.5 also looks good, but I haven't run the full suite.

https://www.heise.de/ratgeber/Prozessvisualisierung-mit-gene...


Try asking the LLM to generate PlantUML markup (use case, statechart, etc.), which has some other diagram types in addition to Mermaid markup. Then paste it into the free PlantUML renderer. Works pretty well.

I also experimented with bpmn markup (xml). Realized there are already repos on GitHub creating bpmn diagrams from prompt.

You can also ask llms to create svg.


BPMN diagrams from a prompt? Did you try any that were good?


Plantuml works pretty well with openai models.


plantuml also supports AWS icons


Sonnet 3.7 is particularly good at generating XML diagrams that can be imported into draw.io. If you are using Cline, Windsurf or Cursor, you can ask it to create the XML file and immediately open it up. Combine it with CONTEXT.md or ARCHITECTURE.md and you can get a very good overview of the codebase and have discussions around it.


FWIW, I think this article could just as accurately be titled "Diagrams Developers can, and cannot, generate".

I'm mainly speaking to the ability to read IaC code ([probably of any library but at LEAST in my case] cdk, pulumi, terraform, cloudformation, serverless) and be able to infer architectural flow from it. It's really not conducive to that use case.

I could also, kidding/not kidding, be speaking to the range of abilities for "mid" and "senior" developers to know and convey such flows in diagrams.

But really my point is that this feels like more validation that AI doesn't provide increased ability; it provides existing (and demonstrated) ability faster with less formalized context. The "less formalized context" is what distinguishes it from programs/code.


I wrote about the same general topic (or more narrowly: process visualization) in German iX magazine, also available here: https://www.heise.de/ratgeber/Prozessvisualisierung-mit-gene... (€)

Rather than relying on end-user products like ChatGPT or Claude.ai, this article is based on the "pure" model offerings via API and frontends that build on these. While the Ilograph blog ponders "AI's ability to create generic diagrams", I'd conclude: do it, but avoid the "open" models and low-cost offerings.


I have more success asking for a detailed workflow description first, then a d2/mermaid output. No problems with creating an ASCII diagram either, and using that for a manual d2 can be done fast enough.


Why just stick to Mermaid? I expect that there is a lot more material with regards to SVG that large models have been trained on. And it's a fairly simple format. Asking it to create diagrams in SVG format gives it much more flexibility. Of course there may be a bit less consistency, but there are ways around that (e.g. giving an example/template to follow).

Simon Willison has shown that current models aren't very good at creating an SVG of a pelican on a bicycle, but drawing a box diagram in SVG is a much simpler task.


Because few ever write diagrams in SVG. SVG is an output format, not an input format, and asking for a diagram in SVG is asking for the model to translate whatever you're asking about (e.g. names of systems you expect the model to know, like AWS, or regular code that needs to be turned into a diagram) into some hidden diagram-ish form, and then generate an SVG out of it.

Can't see it working without letting the model output an intermediary form in PlantUML or Mermaid or Dot - going straight to SVG is cramming too much work into too few tokens.

For the same reason, textual diagram formats are better for iterative work. SVG is too open-ended, and carries little to no semantics. Diagramming languages are all about semantics, have fewer degrees of freedom, and much less spurious token noise.


> Because few ever write diagrams in SVG. SVG is an output format, not an input format...

Aside from "try Inkscape", that sounds like a human problem not an LLM problem.

LLMs output what they input, and if diagrams in blog articles or docs are SVG, they merrily input SVG, and associate it with the adjacencies.

One might as well say MidJourney won't work because few ever make paintings using pixels. You're asking it to translate whatever you're asking about (e.g. scenes and names of painters you'd expect the model to know, like Escher or DaVinci), into some hidden imagined scene, render that as brush strokes of types of paint on textured media, and then generate a PNG out of it.


> Aside from "try Inkscape", that sounds like a human problem not an LLM problem.

Absolutely do not "try Inkscape", unless you like your LLM choking on kilobytes of tokens it takes to describe the equivalent of "Alice -> Bob" in PlantUML. 'robjan is correct in comparing SVG to a compiled program binary, because that's what SVG effectively is.

Most SVG is made through graphics programs (or through conversion of other formats made in graphics programs), which add tons of low-level noise to the SVG structure (Inkscape, in particular). And $deity forbid you then minify / "clean up" the SVG for publication - this process strips what little semantic content is there (very little, like with every WYSIWYG tool), turning SVG into the programming equivalent of assembly opcodes.

All this means: too many degrees of freedom in the format, and a dearth of quality examples the model could be trained on. Like with the assembly of a compiled binary, an LLM can sort of reason about it, but it won't do a very good job, and it's a stupid idea in the first place.

> One might as well say MidJourney won't work because few ever make paintings using pixels.

One might say that if asking an LLM to output a raster image (say, PPM/PBM format, which is made of tokenizer-friendly text!) - and predictably, an LLM will suck at outputting such images, and suck even worse at understanding them.

One might not say that about Midjourney. Midjourney is not an LLM; it's (backed by) a diffusion model. Those are two entirely different beasts. An LLM is a sequential next-token predictor; a diffusion model is not. It does something more like global optimization across a fixed-size output, in many places simultaneously.

In fact, I bet a textual diffusion model (there are people working on diffusion-based language models) would work better for outputting SVG than LLMs do.


With diagrams, it's still worth getting the code for the same reason we ask LLMs to write software using a programming language rather than directly giving compiled output.


I ask AI to generate diagrams in LaTeX, works well for me.


I used AI to generate some UML diagrams on a loosely coupled system - just fed it the actual classes where only names identify the actual links. It did quite a good job there.

It was a well defined domain so I guess the training data argument doesn't fit for stuff that is within a "natural" domain like graphs. LLMs can infer the behavior based on naming quite well.


I have found LLMs to be very good at the kind of code -> diagram task presented here. Fire up superwhisper[1] and stream-of-consciousness away about why you want the diagram, which bits are important, who the audience is, and so on. Then iterate a few times. Works brilliantly for even very complex things, including 5000 line CDK files.

It's disingenuous to conclude that AI is no good at diagramming after using an impotent prompt AND refusing to iterate with it. A human would do no better with the same instructions; LLMs aren't magic.

This is the same as my previous comment https://news.ycombinator.com/item?id=42524125

[1] https://superwhisper.com/


Fair point. What the author does mention is that if you have to do a lot of work getting proper results out of AI (and potentially contend with hallucinations), you may as well do the actual work yourself and be more confident about the end result.

That being said, I think part of the potential is repeatability. Once you've done the work of properly prompting for the desired result, you can often save the adjusted prompts (or a variation of them) for later use, giving you a flying start on subsequent occasions.


Although searching provides better results, it is certain that attempting to copy or directly use these images would infringe on someone's copyright. In a broader sense, using AI can also be related to copyright infringement; the court must first defeat the AI provider before it can reach the user.


Given the pace of development in this space, it is probably worth noting in the title that this is from November 2024 so the results might be a bit dated.


Author here. It's a fair point and it would be worth revisiting, if not now then within the year.

That said, I wouldn't expect things to change too drastically. TFA goes into details, but in short LLMs are already quite good at whiteboarding (where you interactively describe the diagram you want). They're also really bad at generating a diagram from an existing system. In either case, small, incremental improvements won't really help; you'd need a large change to move the needle.



[yet]



