Hacker News | hamiltont's comments

Anecdotal tip on LLM-as-judge scoring: skip the 1-10 scale, use boolean criteria instead, then weight manually, e.g.

- Did it cite the 30-day return policy? Y/N
- Tone professional and empathetic? Y/N
- Offered clear next steps? Y/N

Then: 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps

Why: it reduces the volatility of responses while still preserving the creativity (temperature) needed for good intuition.
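This scheme fits in a few lines of Python (the criterion names and weights below are just the example ones above, not a real rubric):

```python
# Boolean rubric scoring: each criterion is judged Y/N by the LLM,
# then combined with manually chosen weights.
WEIGHTS = {
    "cited_return_policy": 0.5,  # accuracy
    "professional_tone": 0.3,    # tone
    "clear_next_steps": 0.2,     # next steps
}

def rubric_score(judgments: dict) -> float:
    """Sum the weights of the criteria the judge marked True."""
    return sum(w for name, w in WEIGHTS.items() if judgments[name])

# Example: judge said Y, N, Y -> score is 0.5 + 0.2 = 0.7
score = rubric_score({
    "cited_return_policy": True,
    "professional_tone": False,
    "clear_next_steps": True,
})
```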


I use this approach for a ticket based customer support agent. There are a bunch of boolean checks that the LLM must pass before its response is allowed through. Some are hard fails, others, like you brought up, are just a weighted ding to the response's final score.

Failures are fed back to the LLM so it can regenerate taking that feedback into account. People are much happier with it than I could have imagined, though it's definitely not cheap (but the cost difference is very OK for the tradeoff).
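That gate-and-regenerate loop might look roughly like this in Python (everything here - generate_reply, judge_criterion, the criterion names, weights, and thresholds - is a hypothetical stand-in with the LLM calls stubbed out, not the commenter's actual system):

```python
HARD_FAILS = ["no_pii_leak"]  # any False here blocks the reply outright
SOFT_WEIGHTS = {"cited_policy": 0.6, "clear_next_steps": 0.4}
MIN_SCORE = 0.6
MAX_ATTEMPTS = 3

def generate_reply(ticket, feedback):
    # Stand-in for the real LLM call; feedback from failed checks
    # would be folded into the next prompt.
    return f"reply to {ticket!r} (revision {len(feedback)})"

def judge_criterion(reply, criterion):
    # Stand-in for a boolean LLM-as-judge call.
    return True

def answer_ticket(ticket):
    feedback = []
    for _ in range(MAX_ATTEMPTS):
        reply = generate_reply(ticket, feedback)
        hard_failed = [c for c in HARD_FAILS if not judge_criterion(reply, c)]
        score = sum(w for c, w in SOFT_WEIGHTS.items()
                    if judge_criterion(reply, c))
        if not hard_failed and score >= MIN_SCORE:
            return reply  # passed all gates, let it through
        # Feed the failures back so the next generation can fix them
        feedback += hard_failed or [f"score {score:.2f} below {MIN_SCORE}"]
    return None  # give up and escalate to a human
```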


Funny, this move is exactly what YouTube did to their system of human-as-judge video scoring, which was a 1-5 scale before they made it thumbs up/thumbs down in 2010.

I hate thumbs up/down. Two values are too few. I understand that five was maybe too much, but thumbs up/down systems need an explicit third "eh, it's okay" value for things I don't hate and don't want to save to my library, but that I'd like the system to know I have an opinion on.

I know that consuming something and not thumbing it up/down sort of does that, but it's a vague enough signal (it could also mean "not close enough to the keyboard / remote to thumbs up/down") that recommendation systems can't count it as an explicit choice.


Here's the discussion from back in the day when this changed: https://news.ycombinator.com/item?id=837698

In practice, people generally didn't even vote with two options, they voted with one!

IIRC youtube did even get rid of downvotes for a while, as they were mostly used for brigading.


> IIRC youtube did even get rid of downvotes for a while, as they were mostly used for brigading.

No, they got rid of them most likely because advertisers complained that when they dropped some flop they got negative press from media going "lmao 90% dislike rate on new trailer of <X>".

Stuff disliked into oblivion was usually either just straight-up bad or wrong (in the case of bad tutorials/info); brigading was a very tiny percentage of it.


Oh, didn't they remove the dislike count after people absolutely annihilated one of their yearly rewind with dislikes?

It was removed after some presidential speeches attracted heavy dislikes.

The original sin is argued to be YouTube Rewind 2018, but it took them until 2021 to roll the change out.

Well, people annihilated every one of their Rewinds with dislikes. But yeah, that might've contributed.

YouTube never got rid of downvotes; they just hid the count. Channel admins can still see it, and it still affects the algorithm.

YouTube always kept downvotes and the 'dislike' button; the change (which still applies today) was that they stopped displaying the downvote count to users. The button never went away, though.

Visit a YouTube video today and you can still upvote and downvote with the exact same thumbs up or down; the site, however, only displays the count of upvotes. The channel owners/admins can still see the downvote count, and the downvotes presumably still inform YouTube's algorithms.


There is also an independent "Return Youtube Dislike" browser extension that shows the dislike numbers. It's very convenient.

That doesn't show the real number, only "a combination of scraped dislike stats and estimates extrapolated from extension user data."

I think the absence in the official app, plus the existence of this tool, makes this point largely irrelevant. The company in question could easily reverse this decision overnight since the data exist; absent that, people adjust to an available proxy estimate. It is interesting, though, because it shows clear intent of "we don't want to show actual sentiment".

The official YouTube stats (views, comments, upvotes) are not real/real-time either, but they're the best we have. And the dislike numbers are in the same universe of credibility and closeness to reality. It's definitely good enough.

If you want the downvote data to be more precise, do your part and install the extension! :-)


How come accuracy has only 50% weight?

“You’re absolutely right! Nice catch how I absolutely fooled you”


Yes, absolutely. This aligns with what we found. It seems to be necessary to be very clear on scoring (at least for Opus 4.5).

This actually seems like really good advice. I'm interested in how you might adapt this to things like programming-language benchmarks.

By having independent tests and checking whether each one passes (yes or no), and then weighting some (the more complicated tasks) more heavily than others? Or how exactly would you do it?


Not sure I'm fully following your question, but maybe this helps:

IME deep thinking has moved from upfront architecture to post-prototype analysis.

Pre-LLM: Think hard → design carefully → write deterministic code → minor debugging

With LLMs: Prototype fast → evaluate failures → think hard about prompts/task decomposition → iterate

When your system logic is probabilistic, you can't fully architect in advance—you need empirical feedback. So I spend most time analyzing failure cases: "this prompt generated X which failed because Y, how do I clarify requirements?" Often I use an LLM to help debug the LLM.

The shift: from "design away problems" to "evaluate into solutions."


Isn’t this just rubrics?

It's a weighted decision matrix.

Hi - I'm working on something like this because I needed it too ;-)

It's not yet ready for release, but I should be ready for beta-test within 2 months. If you're interested I would be happy to add you to my list of "people to notify when I am ready to beta test"


yes please


ok, will do


We adopted Amazon's 6-pager strategy for meetings. It has been incredibly helpful. TLDR - one person writes a document explaining the proposed feature, max length 6 pages. That one person does 80% of the work (whatever your org wants, e.g. timeline, target customers, planned KPIs and/or ROI).

Process: Start the meeting, turn on your camera (surprisingly important for us), and set a timer. Everyone quietly reads for ~20min. AFTER they finish reading, they engage via comments - positive/agreement comments are great, questions are useful, answering/responding to others' questions is encouraged. Post 20min, one speaker goes through the comments. Many (70% is common for us) will already have been resolved or will not require discussion. Long comment chains or lack of consensus on a comment chain is where we spend the discussion time. Common pitfall - the speaker DOES NOT present the document; that just wastes 20min and bores everyone to death. Small nuance - the speaker (doc preparer) is already familiar with the doc, so they watch the timer. They also check in, e.g. at 15min give a 5min warning, at 20min check if anyone needs more time, etc.

Huge net positives for us:

- No more "pre-read" (no one does it); we just say that a 45min meeting is when you BOTH get the information AND discuss it
- Post the "20min" we are already 80% on the same page
- Far, far fewer "basic" questions that kill time - those were answered by the doc
- Quiet reading gives people time to mentally shift from their last 5 meetings and come up to speed. Less "anxious energy" when talking
- Increased collaboration - "Type A" talkers cannot (as easily) dominate the comments. Leaves "airshare" for quieter team members to participate via comments. We've begun encouraging everyone to leave at least 4-5 comments

Negatives:

- REQUIRES senior team member support, or else one senior "ass" will break the rules and start talking
- Writing the document is time-consuming and not easy. This really becomes one person's "work": gathering the data from all stakeholders in a 1-1 fashion, organizing it, and summarizing it


Yes and No.

No, because the official OSM tile layer is heavily subsidized by Fastly (€720k last I checked) and the rendering by AWS (€40k).

Yes, because technically it would use fewer resources, making it easier on AWS+Fastly and also easier to self-host.

In the last risk assessment I read closely (1), OSM noted "If we lost [Fastly] sponsorship, we would likely cut off all third-party access to the standard tile layer and run a small number of Varnish servers."

As I understand it, the primary drivers for vector tiles were not cost but improving internationalization, generally enabling client-side rendering decisions, and driving a modern toolchain that would net multiple follow-on benefits.

I'm a bit behind; there is more recent info available at (2).

1. https://operations.osmfoundation.org/2024/01/25/owg_budget.o...
2. https://osmfoundation.org/wiki/Finances


I wonder if it would be feasible to distribute tile data in a DHT. There is a single source of truth that could seed the DHT, and users would share the load of serving the data. It'd have to be opt-in on individual nav devices for privacy and battery reasons, but anyone running their own OSM proxy would have an incentive to participate.


Way too much latency I am sure.


Yet you forget that tile-based maps play very nicely with even the simplest HTTP caching (even multiple layers of it), compared to vector stuff that needs caches that are aware of range requests, or some magic block storage.

I somehow prefer to stick to tile-based maps because of the caching and easy rendering, and I also care about sat images, which cannot be vectorized.

I think we need both of those.


These are tile-based maps — vector tiles, rather than raster tiles.

Any caching you do on raster tiles also works here.


Oh okay, so I confused them with PMTiles.


Anyone else wanting to see the original content of /unicorn? Cats are great and all, but we want unicorns!!

<3 wonderful project. Brings back memories of excitedly writing HTML in my drawing notebook and daydreaming what the pages would look like


> ...these are small enough to enter the body’s cells and tissues

Same. I have no training to identify genuine concern vs FUD in health news, but I'm erring on the side of caution. We are limiting exposure to plastic drink bottles in general (sports drinks in plastic during/after children's sporting events are common where I live).


My ad experience has changed dramatically after I leaned into multiple browser personas. In my personal persona, I hate ads with a passion and do everything to get rid of them (browser extensions, premium memberships, etc.). I value my personal time, and the ads are totally useless.

In my work persona, I have suddenly found ads are actually useful. I often find myself choosing to spend 30 seconds watching a YouTube ad because it is relevant to topics I need to be aware of as a CTO. It's clear my daily browsing history influences the ads I am seeing, and I see useful information. I've been looking into SIEM tools lately, and via an ad I was just made aware of some data center appliances for security. I clicked through to their website and browsed a while to learn what was available. When you have some real challenges to solve and the targeting is on point, ads can be a great news feed.

Clearly segmenting my browser history into one persona where I am actively looking for solutions vs my personal persona where I want to be left alone helped the feeds target me.

Still, surreal feeling to intentionally choose to watch an ad...


Yes, ads should be locked up inside services that users specifically choose to use if and when they want.

Those services should not have overlapping features, like providing mail, social media, or general search, for example, as that would be a clear conflict of interest.


> Been looking into SIEM tools lately, and via an ad I was just made aware of some data center appliances for security.

Would you not have come across them if you were actively searching for data center appliances for security? Were ads the only way to find them?


You would be far better served by taking advice of someone you've hired than taking the advice from YouTube ads about snake-oil...


I think I read somewhere that one of their content-production differentiators is their direct-to-consumer approach. Classically, lots of content was produced for the "average" consumer. Netflix can use their subscriber data to create low-cost content for extremely niche consumers, who might love that extremely relevant production (think super edgy, super graphic, super cartoon, etc. - the types of extremes not covered by the average).

Not sure how much this holds true anymore, as now many big players have direct-to-customer streaming, but just sharing since it was a neat thought when I first read it


This seems like a way of sugar-coating the actual content strategy Netflix deploys, which has much more to do with product placement than it does content production. Their strategy is to align their content productions with the brands that best correlate with their subscriber base. In so doing, they can create lucrative deals with brands where their products are intricately woven into the stories/narratives of the show.

As an example, Stranger Things featured an average of 9 minutes of product placement for each episode of their third season. [0] The company claims they did not receive any payments from brands for this placement [1], but they likely received other extremely valuable considerations in the form of payment instead.

[0] https://www.ama.org/marketing-news/product-placement-in-stra...

[1] https://www.fastcompany.com/90380266/more-product-placements...


I hope those advertisers feel like they got their money's worth, because I can't remember a single brand from Stranger Things


I feel like if you were able to remember those product placements that would be a failure for the advertisers.

The best product placements are the ones that go unnoticed.


How could you have missed all the Coca-Cola in that show?!


Dungeons & Dragons?!


This is saying "if an app opens a webview, the app can monitor your browsing activity inside that webview."

It is written vaguely and should be re-written to be precise, but as they are going for "end user" language here I can understand that it is hard to communicate to non-technical users that "embedded browser" and "browser" are different things given that they have similar UX and similar functionality.

A common use case of an embedded webview is an app that uses a website for some portion of a user flow, IME this is typically when there is a B2B2C business relationship. I think it can also happen for an OAuth2 integration but I'd expect there are some iOS native SDKs that are preferred. IME, many businesses use "web SDKs" instead of native libraries, and their integration guide will say something like "have your app open a webview to URL X, then user does Y as we have agreed, then we will close the webview" (occasionally, a few will use hooks in the webview to communicate result information to the native app).


That was my original assumption but how can you be so sure? I think you’re being too hopeful here.

Also calling webview “outside the app” is a bit of a stretch


Not a stretch at all; it is perfectly reasonable to consider an app and a website embedded by the app as two completely different things. First, there is no guarantee the WebView will open to a website owned/operated by the same entity that owns/operates the app, so it is definitely "outside the app". From the user's privacy perspective, you also want to communicate that even though the website might be branded "Facebook", be run by Facebook, and you might trust FB with your messages, if it's an embedded browser opened via a WebView then the app can technically snoop on the private message you are typing into the WebView.


I think you’re right but they really need to update the description for this


It doesn’t say “outside of the app”. It says “information about the content you have viewed, which is not part of the app, such as websites”, which is completely different.

This category exists for apps that embed a webview.

Safari is sandboxed. There is no way to get to its data like history.


There is some strange allure to spending time crafting Dockerfiles. IMO it's overglorified - for most situations the juice is not worth the squeeze.

As a process for getting stuff done, a standard buildpack will get you a better result than a manual Dockerfile for all but the most extreme end of advanced users. Even for those users, they are typically advanced in a single domain (e.g. image layering, but not security). While buildpacks are not available for all use cases, when available I can't see a reason to use a manual Dockerfile for prod packaging

For our team of 20+ people, we actively discourage Dockerfiles for production usage. There are just too many things to be an expert on; buildpacks get us a pretty decent (not perfect) result. Once we add the buildpack to the build toolchain, it becomes a single command to get an image that has most security considerations factored in, layer and cache optimization done far better than a human could, etc. No need for 20+ people to be trained as packaging experts, no need to hire additional build engineers who become a global bottleneck, etc. I also love that our ops team could, if they needed, write their own buildpack to participate in the packaging process, and we could slot it in without a huge amount of pain.

