Hacker News | pscanf's comments

The problem with using markdown for this is that it's unstructured, so when the LLM does calculations on it, data extraction is error-prone: you're never sure exactly what gets extracted, and it's then a hassle to verify (as the author had to do).

In my app Superego (https://github.com/superegodev/superego, shameless plug) I use structured JSON documents precisely for this reason, because with a well-defined schema the LLM can write TypeScript functions that must compile. This doesn't guarantee correctness, of course, but it actually goes a long way.
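To make the schema point concrete, here's a rough sketch of the idea (the field names and types are made up for illustration; they're not Superego's actual schema):

```typescript
// Hypothetical expense schema; not Superego's real types.
interface Expense {
  date: string; // ISO date
  amountCents: number;
  category: "food" | "transport" | "other";
}

// Because the shape is well-defined, the LLM can emit a function that
// must type-check before it runs, instead of regex-scraping markdown.
function totalByCategory(
  expenses: Expense[],
  category: Expense["category"]
): number {
  return expenses
    .filter((e) => e.category === category)
    .reduce((sum, e) => sum + e.amountCents, 0);
}
```

If the model references a field that doesn't exist, the compiler rejects the function outright, which catches a whole class of extraction errors before any numbers are produced.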

But doing my taxes was a use case I hadn't considered, and it's actually pretty neat! I'll be trying it myself next month (though I'm not looking forward to it).


> I use structured JSON documents precisely for this reason, because with a well-defined schema the LLM can write TypeScript functions that must compile.

The intention of this is to reduce hallucination on information extraction, right?

Also, how do you convert your docs / information into JSON documents?


> The intention of this is to reduce hallucination on information extraction, right?

Correct.

> Also, how do you convert your docs / information into JSON documents?

Right now you have to add it yourself to the database. The idea is that you use Superego as the software in which you record your expenses / income / whatever, so the data is naturally there already.

But I'm also working on an "import from csv/json/etc" feature, where you drop in a file and the AI maps the info it contains to the collections you have in the database.
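A sketch of the deterministic half of that import idea: the AI proposes a column-to-field mapping, and plain code applies it (all names here are hypothetical, not the actual feature's API):

```typescript
// Hypothetical: csv column name -> collection field name,
// as proposed by the AI after inspecting the file.
type Mapping = Record<string, string>;

// Applying the mapping is deterministic code, so the AI only has to
// get the mapping right once, not every row.
function applyMapping(
  rows: Record<string, string>[],
  mapping: Mapping
): Record<string, string>[] {
  return rows.map((row) =>
    Object.fromEntries(
      Object.entries(mapping).map(([col, field]) => [field, row[col]])
    )
  );
}
```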


Have you looked at qmd by any chance? Thoughts on that for sqlite vs qmd in this case? https://github.com/tobi/qmd

Oh, interesting, I didn't know about this project, thanks for sharing!

I tried to implement something like this for the search functionality, but ended up going with "old school" lexical search instead.

Mostly because, in my experimentation, vector search didn't perform significantly better, and in some cases it performed worse, all the while being much more expensive on several fronts: indexing time (which also requires either an API or a ~big local model), storage, search time, and implementation complexity.

And Superego's agent actually does quite well with the lexical search tool. The model usually tries a few different queries in parallel, which approximates semantic search a bit.
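For the curious, a toy version of what a lexical search tool does (real implementations like SQLite FTS5 use BM25 ranking; this is just term overlap for illustration, not Superego's actual code):

```typescript
// Toy lexical search: rank documents by how many query terms they contain.
// Returns document indices, best match first.
function lexicalSearch(docs: string[], query: string): number[] {
  const terms = query.toLowerCase().split(/\s+/);
  return docs
    .map((doc, i) => ({
      i,
      score: terms.filter((t) => doc.toLowerCase().includes(t)).length,
    }))
    .filter((d) => d.score > 0)
    .sort((a, b) => b.score - a.score)
    .map((d) => d.i);
}
```

The "few queries in parallel" trick amounts to the agent calling this with several phrasings and unioning the results, which recovers some of what embeddings would give you at near-zero cost.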


That would simplify things, but in my opinion that is still too high a hurdle. I'm all about privacy and FOSS, but I don't self-host anything (except for my personal website).

I wish that more apps would instead move the logic to the client and use files on file syncing services as databases. Taking tasks as an example, if a task board were just a file, I could share it with you on Dropbox / Drive / whatever we both use, and we wouldn't need a dedicated backend at all.

The approach has limitations (conflict resolution, authorization, and latency are the big ones), but it is feasible and actually completely fine for lots of apps.
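As a sketch of how simple the happy path can be, here's a naive last-write-wins merge for a task-board file (hypothetical shape; real conflict resolution for true concurrency would want CRDTs or similar):

```typescript
// A task board as a plain JSON file on a sync service.
interface Task {
  id: string;
  title: string;
  done: boolean;
  updatedAt: number; // unix millis
}

// Naive merge of two divergent copies: per task, the most recently
// updated version wins. Coarse, but fine for many personal apps.
function mergeBoards(a: Task[], b: Task[]): Task[] {
  const byId = new Map<string, Task>();
  for (const t of [...a, ...b]) {
    const existing = byId.get(t.id);
    if (!existing || t.updatedAt > existing.updatedAt) byId.set(t.id, t);
  }
  return [...byId.values()];
}
```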


I only use GitHub (and actions) for personal open-source projects, so I can't really complain because I'm getting everything for free¹. But even for those projects I recently had to (partially) switch actions to a paid solution² because GitHub's runners were randomly getting stuck for no discernible reason.

¹ Glossing over the "what they're getting in return" part.

² https://www.warpbuild.com/


I'm building an app that is, in a way, a modern take on Lotus Notes (https://github.com/superegodev/superego), and I couldn't feel this more:

> It is hard, today, to explain exactly what Lotus Notes was.

Whenever I try to explain what it does to a non-tech person, I'm met with confused looks that make me quickly give up and mumble something like "It's for techies and data nerds". I think to myself "they're not my target audience".

But I actually would like them to be, at some point. In the 90s "the generality and depth of its capabilities meant that it was also just plain hard to use", but now LLMs lower the barrier to entry a lot, so I think there can be a renaissance of such malleable¹ platforms.

Of course, the user still needs to "know what they need" and see software as something that can be configured and shaped to their needs, which, with "digital literacy" decreasing, might be a bigger obstacle than I think.

¹ https://www.inkandswitch.com/malleable-software


If you squint, Notion and Coda are childish versions of Lotus Notes.

The science fiction author C.J. Cherryh once said, "It is perfectly okay to write garbage --- as long as you edit brilliantly."[1] For a while I've been wondering whether this adage applies to vibe-coding, and your methodology seems a reasonable approach to get the benefits while shielding against the detriments, and to ensure that a human developer understands the code before committing.

1 - https://www.goodreads.com/quotes/398754-it-is-perfectly-okay...


> your methodology would seem to be a reasonable approach/response to get the benefits of this and to shield against the detriments

If you're referring to the sandboxing / isolation of each app, I agree. Plus, the user can change the app quite easily, so when they spot a bug, they can tell the agent to fix it (and cross their fingers!).

> ensure that a human developer understands the code before committing

Just to clarify: for Superego's apps there's no human developer oversight, though, at least for the ones users create themselves. Obviously the user will check that the app they just made works, but they might not spot subtle bugs. I employ some strategies to _decrease the likelihood of bugs_ (I wrote a bit about it here: https://pscanf.com/s/351/, if you're interested), but of course only formal verification would ensure there aren't any.


I was referring more to your commentary/explanation about it not being a Vibe-coded app.

Yeah, I can see that one is on their own recognizance when letting an LLM run unsupervised.


Ah yeah, I understand now. And I also agree with the quote then! (Though it does change the nature of the job, and it's not terribly enjoyable...)

Hopefully you can find some way to keep the enjoyment in this.

Thanks for sharing. The demo linked below looked pretty cool, I think this might be a nice complement to Glamorous Toolkit in some of my personal and work flows.

Just watched your demo here:

https://youtu.be/vB3xo2qn_g4?si=y2udkdfezSR9ktUO

Pretty cool!


I like that you can generate new programs from within the system.

That's something I miss with Notion. I basically want a Notion but extensible and malleable like Emacs.


Yes! That's more or less the angle I'm going for. I mean, I don't aim just yet for Emacs levels of malleability, but at least for something where you can create some useful day-to-day personal tools.

Is there a story behind the old guy in the logo?

It must be a nod to Freud (i.e. id, ego, and super ego)

Correct. Admittedly, graphic design is not even my passion, so there's probably lots of room for improvement. But at this point I've grown accustomed to the friendly face. :D

Many people seem to associate "ego" with a negative connotation.

The name gives a weird vibe. But, it's free and it's your project so, whatever. ¯\_(ツ)_/¯


Yeah, I agree, though it wants to be slightly provocative as well: it's all about you, your data, your software, your rights.

Ah... Ok, that makes sense.

Very cool project!

Question regarding the pluggable JS engine: I have an Electron app where I'm currently using QuickJS to run LLM-generated code. Would edge.js be able (theoretically) to use Electron's V8 to get a "sandboxed within Electron" execution environment?


Yes, this should be fully possible.

We actually believe Edge.js will be a great fit for running LLM-generated code.


Naively, based on their install.sh script, you'd pick the correct edge.js executable and shell out to it. I'm sure there's a more integral means, but if you wanted a quick test, that should be easy to set up.


I quite like the GPT models when chatting with them (in fact, they're probably my favorites), but for agentic work I only had bad experiences with them.

They're incredibly slow (via the official API or OpenRouter), but most of all they seem not to understand the instructions I give them. I'm sure I'm _holding them wrong_, in the sense that I'm not tailoring my prompt for them, but most other models have no problem with the exact same prompt.

Does anybody else have a similar experience?


These little 5.4 ones are relatively low-latency and fast, which is what I need for voice applications. But they can't quite follow instructions well enough for my task.

That's really the story of my life. Trying to find a smart model with low latency.

Qwen 3.5 9b is almost smart enough, and I assume I can run it on a 5090 with very low latency. Almost. So I'm thinking I'll fine-tune it for my application a little.


I've had quite the opposite experience, but mainly doing agentic coding and little chat.

Codex is an ice man. Every other model will have a thinking output that is meaningful and significant, that walks through its assumptions. Codex outputs only a very basic idea of what it's thinking about, and doesn't verbalize the problem or its constraints at all.

Codex also is by far the most sycophantic model. I am a capable coder and have my charms, but with every single direction change I suggest, Codex is all: "that's a great idea, and we should totally go that [very different] direction", try as I might to get it to act like more of a peer.

Opus, I think, does a better job of working with me to figure out what to build and of understanding the problem. But I find it still has a propensity for making somewhat weird suggestions. I can watch it talk itself into some weird ideas. Which at least I can stop and alter! But I find it's less reliable at kicking out good technical work.

Codex is plenty fast in ChatGPT+. Speed is not the issue. I'm also used to GLM speeds. Having parallel work open, keeping an eye on multiple terminals is just a fact of life now; work needs to optimize itself (organizationally) for parallel workflows if it wants agentic productivity from us.

I have enormous respect for Codex, and think it (by a significant measure) has the best ability to code. In some ways I think part of the reason it's so good is that it's not trying to convey complex dimensional exploration as an understandable human thought sequence. But I resent how you just have to let it work before you have a chance to talk with it and intervene. Even when discussing, it is extremely terse, and I find I have to ask it again and again to expand.

The one caveat I'll add: I've been dabbling elsewhere, but mainly I use OpenCode, and its prompt is pretty extensive and may be part of why Codex feels like an ice man to me. https://github.com/anomalyco/opencode/blob/dev/packages/open...


“every single direction change I suggest, codex is all: "that's a great idea, and we should totally go that [very different] direction", try as I might to get it to act like more of a peer.“

That’s not the model, that’s a personality setting you can change in the codex config file.

Set it to Pragmatic, and ask it (not command it) about your new direction in planning mode.

It will tell you if your idea is not good for the given project. It’s an excellent peer.


> I've had such the opposite experience

Yeah, I've actually heard many other people swear by the GPTs / Codex. I wonder what factors make one "click" with a model and not with another.

> Codex is an ice man.

That might be because OpenAI hides the actual reasoning traces, showing just a summary (if I understood correctly).


OpenClaw guy (he's Austrian, it's relevant) much prefers Codex over Claude and articulated it as being due to Claude's output feeling very "American" and Codex's output feeling very "German", and I personally really agree with the sentiment.

As an American, Claude feels much more natural to me, with the same overly-optimistic "move fast, break things" ethos that permeates our culture. It takes bigger swings (and misses) at harder-to-quantify concepts than Codex, cuts corners (not intentionally, but it feels like a human who's just moving too fast to see the forest for the trees in the moment), etc. Codex on the other hand feels more grounded, more prone to trying to aggregate blind spots, edge cases, and cover the request more thoroughly than Claude. It's far more pedantic and efficient, almost humorless. The dude also claimed that most of the Codex team is European while Claude team is American, and suggested that as an influence on why this might be.

Anyways, I've found that if I force Claude and Codex to talk to each other, I can get way better results and consistency by using Claude to generate fairly good plans from my detailed requests that it passes to Codex for review and amendment, Claude incorporates the feedback and implements the code, then Codex reviews the commit and patches anything Claude misses. Best of both worlds. YMMV


Oh, interesting perspective. I'm Italian, but from an Alpine valley not far from Austria, so I don't know what I should prefer. :D

But joking aside, putting it like that I'd think I'd prefer the German/Codex way of doing things, yet I'm in camp Claude. But I've always worked better with teammates that balance my fastidiousness, so maybe that's my answer.


Claude Code now hides thinking as well unless you turn on an undocumented setting:

https://github.com/anthropics/claude-code/issues/31326#issue...

https://x.com/nummanali/status/2032451025500528687


Opinions are my own.

For agentic work, both Gemini 3.1 and Opus 4.6 passed the bar for me. I do prefer Opus because my SIs are tuned for that, and I don't want to rewrite them.

But ChatGPT models don't pass the bar. It seems to be trained to be conversational and role-playing. It "acts" like an agent, but it fails to keep the context to really complete the task. It's a bit tiring to always have to double check its work / results.


I find both Opus 4.6 and GPT-5.4 have weaknesses but tend to support each other. Someone described it to me jokingly as "Claude has ADHD and Codex is autistic." Claude is great at doing something until it gets done and will run for hours on a task without feedback, Codex is often the opposite: it will ask for feedback often and sometimes just stop in the middle of a task saying it's done with step 1 of 5. On the other hand, Codex is a diligent reviewer and will find even subtle bugs that Claude created in its big long-running "until its done" work mode.


Seems like the diagnoses are backwards, in this case. Claude usually stays on task no matter what, but lately Opus 4.6 is showing signs of overuse. I never used to get overload/internal server error messages, but I've seen about a half-dozen of them today alone. And it has been prone to blowing off subtasks that I'd have expected it to resolve.


Yeah, absolutely. I'm using GPT 5.2 / 5.2 Codex with OpenCode and it just doesn't get what I'm doing, or loses context. Claude, on the other hand (via GitHub Copilot), has no problem, and also discovers the repository on its own in new sessions, while I basically need to spoonfeed GPT. I also agree on the speed. Earlier today I tasked GPT 5.2 Codex with a small refactor of a task in our codebase with reasoning set to high, and it took 20 minutes to move around 20 files.


I don't know any reason to use 5.2, when 5.3 is quite a bit faster.


If using OpenAI models, use the Codex desktop app, it runs circles around OpenCode.


Can you educate me as to what makes the Codex app superior when using the same GPT model in both? Thx in advance!


Usually it's the prompts, or the model is tuned to the specific first-party tools. Sometimes that gives an edge over the generic tools, unfortunately.


It's the harness.


Same, and I can't put my finger on the "why" either. Plus I keep hitting guard rails for the strangest reasons, like telling codex "Add code signing to this build pipeline, use the pipeline at ~/myotherproject as reference" and codex tells me "You should not copy other people's code signing keys, I can't help you with this"


Are you requesting reasoning via param? That was a mistake I was making. However, with the highest reasoning level I would frequently encounter cybersecurity violations when using an agent that self-modifies.

I prefer Claude models as well, or open models, for this reason, except that the Codex subscription gets you pretty hefty token space.


Yes, I think? But I was talking more specifically about using the models via API in agents I develop, not for agentic coding. Though, thinking about it, I also don't click with the GPT models when I use them for coding (using Codex). They just seem "off" compared to Claude.


I am also talking about agents I'm developing. They just happen to be self-modifying, but they're not for agentic coding. You have to explicitly send the reasoning effort parameter. If you set effort to None (the default for gpt-5.4), you get very low intelligence.
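For anyone hitting the same thing, the request shape looks roughly like this (hedged: the exact field name and nesting vary by API version, and this just builds the body without sending it; the model name is the one from this thread):

```typescript
// Sketch: explicitly setting reasoning effort instead of relying on
// the default. Field shape is an assumption; check the API reference.
function buildRequest(input: string, effort: "low" | "medium" | "high") {
  return {
    model: "gpt-5.4-mini",
    input,
    reasoning: { effort },
  };
}
```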


Ah, OK, sorry, I misinterpreted. But yes, I double-checked one case and I am indeed setting the parameter explicitly (defaulting to medium effort). But no luck. It feels like the model ignores what I'm telling it.

For example, I pass it a list of database collections and tools to search through them, ask a question that can very obviously be answered with them, and it responds with "I can’t tell yet from your current records" (just tested with GPT 5.4-mini).

But I've prodded it a bit more now, and maybe the model doesn't want to answer unless it can be very very confident of the answer it produces. So it's sort of a "soft refusal".


I like GPT models in Codex, for a fully vibecoded experience (I don't look at code) for my side-projects. In there, they really get the job done: you plan, they say what they'll do, and it shows up done. It's rare I need to push back and point out bugs. I really can't fault them for this very specific use-case.

For anything else, I can't stand them, and it genuinely feels like I am interacting with different models outside of codex:

- They act like terribly arrogant agents. It's just in the way they talk: self-assured, assertive. They don't say they think something, they say it is so. They don't really propose something, they say they're going to do it because it's right.

- If you counter them, their thinking traces are filled with what is virtually identical to: "I must control myself and speak plainly, this human is out of his fucking mind"

- They are slow. Measurably slow. Sonnet is so much faster. With Sonnet models, I can read every token as it comes, but it takes some focusing. With GPT, I can read the whole trace in real-time without any effort. It genuinely gives off this "dumb machine that can't follow me" vibe.

- Paradoxically, even though they are so full of themselves, they insist upon checking things which are obvious. They will say "The fix is to move this bit of code over there [it isn't]" and then immediately start looking at sort of random files to check...what exactly?

- I feel they make perhaps as many mistakes as Sonnet, but they are much less predictable mistakes. The kind that leaves me baffled. This doesn't have to be bad for code quality: Sonnet makes mistakes which _might_ at points even be _harder_ to catch, so might be easier to let slip by. Yet, it just imprints this feeling of distrust in the model which is counter-productive to make me want to come back to it

I didn't compare either with Gemini because Gemini is a joke that "does", and never says what it is "doing", except when it does so by leaving thinking traces in the middle of python code comments. Love my codebase to have "But wait, ..." in the middle of it. A useless model.

I've recently started saying this:

- Anthropic models feel like someone of that level of intelligence thinking through problems and solving them. Sonnet is not Opus -- it is sonnet-level intelligence, and shows it. It approaches problems from a sensible, reasonably predictable way.

- Gemini models feel like a cover for a bunch of inferior developers all cluelessly throwing shit at the wall and seeing what sticks -- yet, ultimately, they only show the final decision. Almost like you're paying a fraudulent agency that doesn't reveal its methods. The thinking is nonsensical and all over the place, and it does eventually achieve some of its goals, but you can't understand what little it shows other than "Running command X" and "Doing Y".

On a final note: when building agentic applications, I used to prefer GPT (a year ago), but I can't stand it now. Robotic, mechanic, constantly mis-using tools. I reach for Sonnet/Opus if I want competence and adherence to prompt, coupled with an impeccable use of tools. I reach for Gemini (mostly flash models) if I want an acceptable experience at a fraction of the price and latency.


> They act like terribly arrogant agents

Oh I feel that. I sometimes ask ChatGPT for "a review, pull no punches" of something I'm writing, and my god, the answers really get on my nerves! (They do make some useful points sometimes, though.)

> On a final note: when building agentic applications, I used to prefer GPT (a year ago), but I can't stand it now. Robotic, mechanic, constantly mis-using tools. I reach for Sonnet/Opus if I want competence and adherence to prompt, coupled with an impeccable use of tools. I reach for Gemini (mostly flash models) if I want an acceptable experience at a fraction of the price and latency.

Yeah, this has been almost exactly my experience as well.


A bit off topic, but reading your post I suddenly realized that if I read it three years ago I’d assume you’re either insane or joking. The world moved fast looking back.


> cyber security violation

Would you mind expanding on this? Do you mean in the resulting code? Or a security problem on your local machine?

I naively use models via our Copilot subscription for small coding tasks, but haven't gone too deep. So this kind of threat model is new to me.


No, I mean the literal API response. They think I'm using it to hack. See the related GitHub issue: https://github.com/anomalyco/opencode/issues/15776

I don't use OpenCode, but it looks like it triggered a similar refusal. My message was similar but different.


Ahhh okay, I see. Thanks!


I ran 5.4 Pro on some data analytics (admittedly it was 300+ pages). It took forever. Ran the same on Sonnet 4.6: night and day difference. I understand it's like using a V8 engine for a V4 task, but I was curious. These new models look promising, though. I'd rather use something like Haiku most of the time over the best-rated. I'm not a rocket scientist solving the mysteries of the universe, and they seem to do a great job 80% of the time.


As a user, I'm of course very excited about v7. As a developer of an app that _integrates_ TypeScript, however, I feel a bit uneasy seeing the API feature still marked as "not ready" on the roadmap.

On the other hand, I can understand leaving it as the last thing to do, after the foundation has set. Also, the TypeScript team has really done an amazing job all these years with backward compatibility, so that is extremely reassuring.

Maybe my uneasiness is just impatience to get to use the sped-up version in my app as well! :)


FWIW I tried it with the VS Code preview in two monorepos we have, one frontend and one backend (GraphQL servers with complex types): absolutely no issues (except one breaking change in tsconfig, with baseUrl being removed) and fast compile times.


Same experience here with a pnpm workspace monorepo. The baseUrl removal was the only real friction — we were using it as a path alias root, had to move everything to subpath imports.

The moduleResolution: node deprecation is the one I'd flag for anyone not paying attention yet. Switching to nodenext forced us to add .js extensions to all relative imports, which was a bigger migration than expected.

Compilation speed improvement is real though. Noticeably faster on incremental builds.
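For anyone planning the same migration, the change looks roughly like this (a sketch of the general shape; check the release notes for the exact recommended settings):

```jsonc
// tsconfig.json: "moduleResolution": "node" is deprecated;
// "nodenext" resolves like Node itself, which means relative
// imports need explicit extensions, e.g.
//   import { foo } from "./foo.js";  // even though the source is foo.ts
{
  "compilerOptions": {
    "module": "nodenext",
    "moduleResolution": "nodenext"
  }
}
```

Codemods and lint rules can add the extensions mechanically, but it's worth budgeting time for it on a large workspace.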


As a developer that integrates typescript into client-side web pages and writes lots of custom plugins, I'm extremely nervous about the port to Go.

I think the last estimate I saw was that the Go port compiled to WASM was going to be about triple the size of the minified JS bundle.


Very cool!

I'm particularly impressed by the bookmark "trick" to install it on a page. Despite having spent 15 years developing for the browser, I had somehow missed that feature of the bookmarks bar. But awesome UX for people to try out the tool. Congrats!


Thanks!

Bookmarklets are such an underrated feature. It's super convenient to inject and test scripts on any page. Seemed like the perfect low-friction entry point for people to try it out.

Spent some time on that UX because the concept is a bit hard to explain. Glad it worked!


I was also spammed (twice) by voice.ai.

You mention GDPR, which also "applies" to me, though I wonder if what they're doing is actually illegal. I mean, after all, I'm putting my email on GitHub precisely to give people a way to contact me.

Of course, I do that naïvely, assuming good faith, not expecting _companies_ to use it to spam me. So definitely what they're doing is, at the very least, in poor taste.


> I'm putting my email on GitHub precisely to give people a way to contact me.

They’re not only looking at the public email in your profile, they’re also looking at your committer email (git config user.email). You could argue that you’re not putting that out for people to contact you.

(I’ve used that trick a couple times to reach out to people, too, but never mass emailing.)


Is there any company that will take my money to solve GDPR issues? And by solve I mean sue the spammers. For the last few years I've seen them "try" to look legit by claiming the addresses are managed by some Hungarian/Spanish shell company, hoping no one will be able to afford pursuing infractions across borders.


There's probably a law against it, but I've always thought a law firm could make decent money taking cases like this in bulk for free, on the condition that they get to keep all the compensation, while the "client" still gets the satisfaction of punishing the offending party.


In the U.S., only Attorneys General can go after violators of the CAN-SPAM Act.

It needs to be modified so that individuals can go after spammers, like they can go after telemarketers.


That’s pretty much class action lawsuits!


This is hard, because private right of action in Europe is often very limited, and the damages are low.

The US basically has a "private police force" for certain laws, notably the ADA. Many people are against this; I personally think it's a great idea and something countries should be doing a lot more of.


> Is there any company that will take my money to solve GDPR issues? And by solve I mean sue the spammers?

A lawyer


They spammed me as well.


How do you sync the data out of Garmin? Something like https://github.com/matin/garth, or syncing directly from the watch?

