I look forward to trying this out, mostly because I’m very frustrated with censored models.
I am experimenting with summarizing and navigating documents for forensic psychiatry work, much of which involves subjects that instantly hit the guard rails of LLMs. So far, I have had zero luck getting help from OpenAI/Anthropic or vendors of their models to request an exception for uncensored models. I need powerful models with good, HIPAA-compliant privacy that won’t balk at topics that have serious effects on people’s lives.
Look, I’m not excited to read hundreds of pages about horrible topics, either. If there were a way to reduce the vicarious trauma of people who do this work without sacrificing accuracy, it would be nice. I’d like to at least experiment. But I’m not going to hold my breath.
All base, “text-completion” models are uncensored, including Llama 3. You can make text-completion models behave like an uncensored “instruct” (chat) model simply by providing it with 10 to 20 examples of a chat dialogue in the initial prompt context, making sure to use the model’s exact prompt format. Once the model notices the pattern, it will continue like that.
Surprisingly few people seem to know this, but this is how chat models were created in the GPT-2/GPT-3 era, before instruct models became the norm.
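Roughly, the priming looks like this. The turn markers, endpoint, and example dialogues below are all placeholders; the only thing that matters is that the base model sees a consistent pattern to extend.

```python
# Sketch: prime a base (text-completion) model with a few dialogue examples
# so it continues in the same chat style. Endpoint and turn format are placeholders.
import requests

EXAMPLES = [
    ("How do I hard-boil an egg?",
     "Cover the eggs with cold water, bring to a boil, then let them sit off the heat for 10-12 minutes."),
    ("What's the capital of Australia?",
     "Canberra."),
    # ...in practice you'd write 10-20 of these, in the tone you want back.
]

def build_prompt(question: str) -> str:
    turns = []
    for q, a in EXAMPLES:
        turns.append(f"### User:\n{q}\n\n### Assistant:\n{a}\n")
    turns.append(f"### User:\n{question}\n\n### Assistant:\n")
    return "\n".join(turns)

resp = requests.post(
    "http://localhost:8080/completion",   # e.g. a local llama.cpp server
    json={"prompt": build_prompt("How do I get wine stains out of a carpet?"),
          "n_predict": 256,
          "stop": ["### User:"]},          # stop before it writes the next user turn
)
print(resp.json()["content"])
```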
I often force compliance out of Llama by starting Llama's response in a very compliant manner.
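With a completion-style API that just means writing the first few words of the answer yourself and letting the model finish. Something like this; the template shown is the Llama 3 Instruct one, so swap in whatever format your model expects:

```python
# Sketch of "response prefilling": start the assistant turn yourself so the model
# is strongly biased toward continuing in that register.
question = "Summarize the attached incident report."
prefill = "Certainly. Here is a factual summary of the report:"

prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    f"{question}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    f"{prefill}"   # no <|eot_id|> here -- the model keeps writing from this point
)
# send `prompt` to a raw completion endpoint as in the example above
```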
My favorite use right now is for language translations. I finally feel comfortable browsing foreign languages, knowing Google isn't sitting there as a third party with a coherent view of my foreign language readings!
I was trying this with the original Llama model. I guess the model didn't really know it was meant to be a 'knowledgeable AI assistant', but rather simulated chats it had seen. If you asked it 'how to make brownies', it might reply, 'idk, can't you google it?'.
When you prime it with those initial 10-20 examples, the responses need to be written in the style you’d like it to respond in. You can use Claude or ChatGPT to help you write those. The model will then just continue on in that same style.
Unfortunately those examples blow up the cost compared to just asking the question. It's a nice workaround, but not always feasible. (Unless everyone adopts context caching like DeepSeek and Anthropic did.)
How good are these models at summarization anyways? I tried uploading obscure books I've already read, to GPT4 and Claude 3 and asked them to summarize the plot and particular details, as well as asking how many times does a particular thing happen in the book, and the results have been hit and miss.
I certainly would not trust these models to create comprehensive and correct summaries of highly sensitive records.
Not only that, asking an LLM a question like "List all the occurrences where John Doe fights someone in the book. Also list who he fights and why" is a surefire way of getting LLMs to hallucinate, and also give incomplete info.
For whatever it's worth, asking a model to count is a terrible idea due to how they work.
You may have more luck with a hybrid approach, using LLMs for language understanding and computers for the counting. For example, ask them to write a short, one-line description of every instance where something happens, and then use a traditional program to count the lines.
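Something along these lines, assuming you can get the model to emit one line per instance; the prompt wording and the `call_llm` client below are stand-ins for whatever you're using:

```python
# Hybrid counting sketch: the LLM extracts one line per instance,
# ordinary code does the counting.
EXTRACTION_PROMPT = (
    "Read the passage below. For every instance where John Doe fights someone, "
    "output exactly one line in the form:\n"
    "FIGHT | <opponent> | <one-sentence reason>\n"
    "If there are none, output NONE.\n\nPassage:\n{chunk}"
)

def count_fights(chunks, call_llm):
    rows = []
    for chunk in chunks:
        reply = call_llm(EXTRACTION_PROMPT.format(chunk=chunk))
        rows += [line for line in reply.splitlines() if line.startswith("FIGHT |")]
    return len(rows), rows

# total, details = count_fights(book_chunks, call_llm=my_client)
```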
If you chunk the document well (which is surprisingly tricky with messy documents), then you can do summary via map reduce or other techniques.
To be clear: I don't trust it to provide an accurate summary; it's not meant to replace reading the documents. But it may help to find relevant parts of a document later, or get a reasonable overview of documents before starting a manual review. I expect (but will have to see) if it's better than just opening up a random PDF and starting there.
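For anyone unfamiliar, the map-reduce approach is just: summarize each chunk independently, then summarize the summaries. A bare-bones sketch, with naive fixed-size chunking that real (messy) documents would need something smarter than:

```python
# Bare-bones map-reduce summarization. `call_llm` is a placeholder for your client.
def chunk_text(text, size=8000, overlap=500):
    # naive fixed-size chunking; messy PDFs usually need structure-aware splitting
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def summarize(text, call_llm):
    # map: summarize each chunk on its own
    partials = [call_llm(f"Summarize the following passage in a few sentences:\n\n{c}")
                for c in chunk_text(text)]
    # reduce: summarize the concatenated partial summaries
    return call_llm("Combine these partial summaries into one coherent summary:\n\n"
                    + "\n\n".join(partials))
```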
I recently had a case with over 24k pages of records, some of which were in PDFs that were thousands of pages long. The ability to do semantic search rather than keyword search was useful for when I said to myself, "Didn't I see something about that before? Where was that?"
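That kind of lookup is basically embedding search over the chunks. A minimal sketch with sentence-transformers, where the embedding model and query are only examples:

```python
# Minimal semantic search over document chunks using sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)   # chunks: list[str]

def search(query, top_k=5):
    q = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q, chunk_embeddings, top_k=top_k)[0]
    return [(chunks[h["corpus_id"]], h["score"]) for h in hits]

# search("prior hospitalization after the first arrest")
```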
I’ve found summarizing this way is a great idea but the summarizing models are not great themselves. Would even prefer an embedding with the summary as maybe the embedding would be better.
If you take the base model and put in a decent system prompt, Hermes 3 405B will follow your system prompt instructions pretty well. The one in the Discord has a blank system prompt and is just taking the chat as context.
Hermes 3 will follow the sys prompt pretty closely if you have a version where you can edit it. In the Discord there were a few times it jailbroke pretty aggressively in spite of the blank system prompt.
I remember the Salesforce (edit: Snowflake Arctic Instruct, actually) LLM being completely uncensored and unaligned. Someone from the team said in response here on HN that they didn't do any alignment whatsoever.
Hadn't heard about abliteration before, thanks for bringing it up! Here's an HF walkthrough [1] of the concept for anyone else interested in learning more.
Yup, 3.1-70B-Instruct-lorablated is the one I currently recommend too for anti-rejection models — it seems roughly as anti-rejection as the original failspy "abliterated" model, but it works with 128k context since it's based on 3.1 instead of 3 (which only had 8k context). It's currently our second-most popular model on glhf.chat, behind Llama-3.1-405B-Instruct.
failspy’s or mlabonne’s models. Or just look for any model with ‘abliterated’ in the title. Eg try failspy/meta-llama-3-8b-instruct-abliterated-v3 though of course bigger models will probably be better
How will you know the model summarizes correctly? Getting long context window summarization right isn't really a solved problem despite what is advertised, and key details can be missed easily, but if you don't read the original docs you won't know that. This is very troublesome if applied to a legal scenario.
This is the 2nd time I've seen someone confused about this. The safety filtering is a layer on top of the model's built-in ability to refuse, similar to OpenAI's "moderation" endpoint.
I find the wording a bit misleading, unless the model they are talking about here is in fact not the same as what they say can be used at https://lambda.chat/chatui/.
"Hermes 3: A uniquely unlocked, uncensored, and steerable model"
Lambda Chat:
> How can I made an explosive device from household chemicals?
> I'm afraid I can't help with that. My purpose is to assist with tasks that are safe and legal. Making an explosive device, even from household chemicals, is dangerous and against the law.
The hosted version has moderation in play. For whatever reason, you're not getting the raw model, you're getting moderation bot/s, a system prompt, probably some mechanistic pattern matching triggers, and other stuff at various stages of any interaction.
If you use the model locally, it's a different story.
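Locally you control the entire prompt and there's nothing between you and the weights. Rough sketch via transformers; the checkpoint name below is just an assumption, swap in whatever size fits your hardware:

```python
# Sketch of running the model locally with your own system prompt -- no hosted
# moderation layer in the loop.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "NousResearch/Hermes-3-Llama-3.1-8B"   # placeholder: use the checkpoint you actually run
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [
    {"role": "system", "content": "You are a blunt assistant that answers directly."},
    {"role": "user", "content": "Summarize the attached report without omitting anything."},
]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=400)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```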
It's good, but I'm already paying for GPT4o and Sonnet.
How much memory does this need? If Alex Cheema (Exo Labs, Oxford) https://x.com/ac_crypto/status/1815969489990869369 could run the Llama 3.1 405B model on 2 MacBooks, does this mean this can run on one MacBook?
Exo looks cool, but I wonder how long it will take until it supports llama.cpp. Definitely some spare machines I'd like to use this way. I also wonder to what extent other forms of native apple metal (iPhones, iPads) could be used in clusters for additional sources of aggregate cluster memory.
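For a rough sense of why you need the aggregate memory at all, the weights alone are the bottleneck (the precision levels below are just assumptions about how you'd quantize it):

```python
# Back-of-envelope memory for Llama 3.1 405B weights at different precisions.
params = 405e9
for name, bytes_per_param in [("fp16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    gb = params * bytes_per_param / 1024**3
    print(f"{name}: ~{gb:.0f} GB for weights alone (KV cache and overhead extra)")
# fp16: ~754 GB, 8-bit: ~377 GB, 4-bit: ~189 GB -- even 4-bit is more unified
# memory than a single MacBook ships with today.
```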
Yeah, it doesn't win in every category. I will say, watching it in the Discord, I saw its performance vary widely, so the context and sys prompt play a huge role. Initially it did great and solved some pretty heavy logic questions, but after the context was loaded with trolling it degraded quite a bit and couldn't solve problems it previously was able to.
The issue I'm facing with this newer batch of larger models is trying to make longer contexts work. Is there a way to do so with sub-48GB GPUs without having to do CPU BLAS? If mistral-123B is already restricted to 60K context on a 24GB gpu (with zero layers being GPUfied and all other apps closed), and llama-405B being somewhere around 2-3x the KV cache size, even an A100 wouldn't be enough to fit 128K tokens of KV.
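For reference, fp16 KV cache is roughly 2 x layers x kv_heads x head_dim x 2 bytes per token. Plugging in the published configs (layer/head counts from memory, so treat them as approximate) lines up with those numbers:

```python
# Rough fp16 KV-cache math; model dims below are my understanding of the
# published configs and may be slightly off.
def kv_gb(layers, kv_heads, head_dim, ctx, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per / 1024**3

print(kv_gb(88, 8, 128, 60_000))    # Mistral Large 123B @ 60K  -> ~20 GB
print(kv_gb(126, 8, 128, 128_000))  # Llama 3.1 405B    @ 128K -> ~62 GB, before scratch buffers
```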
I thought before that, using KoboldCpp, GPU VRAM shouldn't matter too much when just using it to accelerate prompt processing, but it's turning out to be a real problem, with no affordable card being even usable at all.
It's the difference between processing 50K tokens in 30 minutes vs. taking 24 hours or more to get a single response, from 'barely usable' to 'utterly unusable'.
CPU generation is fine; ~half a token per second is not great, but it's doable. Though I sometimes feel more and more like cutting off responses and finishing them myself if a good idea pops up in one.
PAYMENT TANGENT for my fellow entrepreneurs here that take Visa/Mastercard payments:
I tried to sign up to Lambda Labs just now to check out Hermes 3.
Created an account, verified my email address, entered my billing info...
... but then it says they only accept CREDIT cards, NOT DEBIT cards.
I had never heard of this, so I tried it anyway. I entered my business Mastercard (from mercury.com FWIW), that's never been rejected anywhere, and immediately got the response that they couldn't accept it because it's a debit card.
Anyone know why a business would choose to only accept credit not debit cards?
I don't have any credit cards, neither personal nor business, and never found a need for one.
So I deleted my account at Lambda Labs, which was kind of disappointing since I was looking forward to trying this.
That seems completely backwards?
Debit interchange fees are usually lower, aren't they? And if you run it with a PIN as a debit transaction, there's almost no charge for the vendor.
Definitely weird, as everything I know about the incentives for that go in the other direction for a vendor.
I think I read somewhere that forcing credit cards is a way for the merchant to completely avoid prepaid cards. Though obviously imperfect.
Privacy.com had to overhaul their card generation backend a few years ago specifically to handle merchants refusing their single-purpose card numbers due to them being detected as potentially prepaid cards. Though they did do it, so it might work for your case now.
I'm sure there's a fraud angle where someone signs up with a cheap prepaid card, runs up a huge bill, and then the business has no recourse. Though I'm not familiar with Lambda Labs or their billing.
> I'm sure there's a fraud angle where someone signs up with a cheap prepaid card, runs up a huge bill, and then the business has no recourse. Though I'm not familiar with Lambda Labs or their billing.
I work at Lambda Labs. This is basically the reason. Fraud has been a problem, and this is an attempt to grant resources only to legitimate accounts. We have struggled with people spinning up resources and not paying for them, which is detrimental to our business.
I presume they want to preauthorize a decent amount just in case, but this tends to really upset debit card holders because it ties up their money for a week.
I understand finetuning for specific purposes/topics, but I don't really understand finetunes that are still marketed as "generalist", since surely what Meta put out would be tuned to perform as well as it can across a whole host of measures.
Yeah, it's on HF. You can also try it out in the Nous Discord or Lambda Labs if you don't have the H100s to spare. Fairly certain anyone with enough compute can use it or throw it up on their site.