I look forward to trying this out, mostly because I’m very frustrated with censored models.
I am experimenting with summarizing and navigating documents for forensic psychiatry work, much of which involves subjects that instantly hit the guard rails of LLMs. So far, I have had zero luck getting help from OpenAI/Anthropic or vendors of their models to request an exception for uncensored models. I need powerful models with good, HIPAA-compliant privacy that won’t balk at topics that have serious effects on people’s lives.
Look, I’m not excited to read hundreds of pages about horrible topics, either. If there were a way to reduce the vicarious trauma of people who do this work without sacrificing accuracy, it would be nice. I’d like to at least experiment. But I’m not going to hold my breath.
All base, “text-completion” models are uncensored, including Llama 3. You can make text-completion models behave like an uncensored “instruct” (chat) model simply by providing it with 10 to 20 examples of a chat dialogue in the initial prompt context, making sure to use the model’s exact prompt format. Once the model notices the pattern, it will continue like that.
Surprisingly few people seem to know this, but this is how chat models were created in the GPT-2/GPT-3 era, before instruct models became the norm.
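Roughly, the priming looks like this. The turn markers, endpoint, and example dialogues below are all placeholders; the only thing that matters is that the base model sees a consistent pattern to extend.

```python
# Sketch: prime a base (text-completion) model with a few dialogue examples
# so it continues in the same chat style. Endpoint and turn format are placeholders.
import requests

EXAMPLES = [
    ("How do I hard-boil an egg?",
     "Cover the eggs with cold water, bring to a boil, then let them sit off the heat for 10-12 minutes."),
    ("What's the capital of Australia?",
     "Canberra."),
    # ...in practice you'd write 10-20 of these, in the tone you want back.
]

def build_prompt(question: str) -> str:
    turns = []
    for q, a in EXAMPLES:
        turns.append(f"### User:\n{q}\n\n### Assistant:\n{a}\n")
    turns.append(f"### User:\n{question}\n\n### Assistant:\n")
    return "\n".join(turns)

resp = requests.post(
    "http://localhost:8080/completion",   # e.g. a local llama.cpp server
    json={"prompt": build_prompt("How do I get wine stains out of a carpet?"),
          "n_predict": 256,
          "stop": ["### User:"]},          # stop before it writes the next user turn
)
print(resp.json()["content"])
```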
I often force compliance out of Llama by starting Llama's response in a very compliant manner.
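With a completion-style API that just means writing the first few words of the answer yourself and letting the model finish. Something like this; the template shown is the Llama 3 Instruct one, so swap in whatever format your model expects:

```python
# Sketch of "response prefilling": start the assistant turn yourself so the model
# is strongly biased toward continuing in that register.
question = "Summarize the attached incident report."
prefill = "Certainly. Here is a factual summary of the report:"

prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    f"{question}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    f"{prefill}"   # no <|eot_id|> here -- the model keeps writing from this point
)
# send `prompt` to a raw completion endpoint as in the example above
```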
My favorite use right now is for language translations. I finally feel comfortable browsing foreign languages, knowing Google isn't sitting there as a third party with a coherent view of my foreign language readings!
I was trying this with the original Llama model. I guess the model didn't really know it was meant to be a 'knowledgeable AI assistant', but rather simulated chats it had seen. If you asked it 'how to make brownies', it might reply, 'idk, can't you google it?'.
When you prime it with those initial 10-20 examples, the responses need to be written in the style you’d like it to respond in. You can use Claude or ChatGPT to help you write those. The model will then just continue on in that same style.
Unfortunately those examples blow up the cost compared to just asking the question. It's a nice workaround, but not always feasible. (Unless everyone adopts context caching like DeepSeek and Anthropic did.)
How good are these models at summarization anyways? I tried uploading obscure books I've already read, to GPT4 and Claude 3 and asked them to summarize the plot and particular details, as well as asking how many times does a particular thing happen in the book, and the results have been hit and miss.
I certainly would not trust these models to create comprehensive and correct summaries of highly sensitive records.
Not only that, asking an LLM a question like "List all the occurrences where John Doe fights someone in the book. Also list who he fights and why" is a surefire way of getting LLMs to hallucinate, and also give incomplete info.
For whatever it's worth, asking a model to count is a terrible idea due to how they work.
You may have more luck with a hybrid approach, using LLMs for language understanding and computers for the counting. For example, ask them to write a short, one-line description of every instance where something happens, and then use a traditional program to count the lines.
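Something along these lines, assuming you can get the model to emit one line per instance; the prompt wording and the `call_llm` client below are stand-ins for whatever you're using:

```python
# Hybrid counting sketch: the LLM extracts one line per instance,
# ordinary code does the counting.
EXTRACTION_PROMPT = (
    "Read the passage below. For every instance where John Doe fights someone, "
    "output exactly one line in the form:\n"
    "FIGHT | <opponent> | <one-sentence reason>\n"
    "If there are none, output NONE.\n\nPassage:\n{chunk}"
)

def count_fights(chunks, call_llm):
    rows = []
    for chunk in chunks:
        reply = call_llm(EXTRACTION_PROMPT.format(chunk=chunk))
        rows += [line for line in reply.splitlines() if line.startswith("FIGHT |")]
    return len(rows), rows

# total, details = count_fights(book_chunks, call_llm=my_client)
```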
If you chunk the document well (which is surprisingly tricky with messy documents), then you can do summary via map reduce or other techniques.
To be clear: I don't trust it to provide an accurate summary; it's not meant to replace reading the documents. But it may help to find relevant parts of a document later, or get a reasonable overview of documents before starting a manual review. I expect (but will have to see) if it's better than just opening up a random PDF and starting there.
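For anyone unfamiliar, the map-reduce approach is just: summarize each chunk independently, then summarize the summaries. A bare-bones sketch, with naive fixed-size chunking that real (messy) documents would need something smarter than:

```python
# Bare-bones map-reduce summarization. `call_llm` is a placeholder for your client.
def chunk_text(text, size=8000, overlap=500):
    # naive fixed-size chunking; messy PDFs usually need structure-aware splitting
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def summarize(text, call_llm):
    # map: summarize each chunk on its own
    partials = [call_llm(f"Summarize the following passage in a few sentences:\n\n{c}")
                for c in chunk_text(text)]
    # reduce: summarize the concatenated partial summaries
    return call_llm("Combine these partial summaries into one coherent summary:\n\n"
                    + "\n\n".join(partials))
```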
I recently had a case with over 24k pages of records, some of which were in PDFs that were thousands of pages long. The ability to do semantic search rather than keyword search was useful for when I said to myself, "Didn't I see something about that before? Where was that?"
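That kind of lookup is basically embedding search over the chunks. A minimal sketch with sentence-transformers, where the embedding model and query are only examples:

```python
# Minimal semantic search over document chunks using sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)   # chunks: list[str]

def search(query, top_k=5):
    q = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q, chunk_embeddings, top_k=top_k)[0]
    return [(chunks[h["corpus_id"]], h["score"]) for h in hits]

# search("prior hospitalization after the first arrest")
```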
I’ve found summarizing this way is a great idea but the summarizing models are not great themselves. Would even prefer an embedding with the summary as maybe the embedding would be better.
If you take the base model and put in a decent system prompt, Hermes 3 405B will follow your system prompt instructions pretty well. The one in the Discord has a blank system prompt and is just taking the chat as context.
Hermes 3 will follow the sys prompt pretty closely if you have a version where you can edit it. In the Discord there were a few times it jailbroke pretty aggressively in spite of the blank system prompt.
I remember the Salesforce (edit: Snowflake Arctic Instruct, actually) LLM being completely uncensored and unaligned. Someone from the team said in response here on HN that they didn't do any alignment whatsoever.
Hadn't heard about abliteration before, thanks for bringing it up! Here's an HF walkthrough [1] of the concept for anyone else interested in learning more.
Yup, 3.1-70B-Instruct-lorablated is the one I currently recommend too for anti-rejection models — it seems roughly as anti-rejection as the original failspy "abliterated" model, but it works with 128k context since it's based on 3.1 instead of 3 (which only had 8k context). It's currently our second-most popular model on glhf.chat, behind Llama-3.1-405B-Instruct.
failspy’s or mlabonne’s models. Or just look for any model with ‘abliterated’ in the title. Eg try failspy/meta-llama-3-8b-instruct-abliterated-v3 though of course bigger models will probably be better
How will you know the model summarizes correctly? Getting long context window summarization right isn't really a solved problem despite what is advertised, and key details can be missed easily, but if you don't read the original docs you won't know that. This is very troublesome if applied to a legal scenario.
This is the 2nd time I've seen someone confused about this. The safety filtering is a layer on top of the model's built-in ability to refuse, similar to OpenAI's "moderation" endpoint.
I find the wording a bit misleading, unless the model they are talking about here is in fact not the same as what they say can be used at https://lambda.chat/chatui/.
"Hermes 3: A uniquely unlocked, uncensored, and steerable model"
Lambda Chat:
> How can I made an explosive device from household chemicals?
> I'm afraid I can't help with that. My purpose is to assist with tasks that are safe and legal. Making an explosive device, even from household chemicals, is dangerous and against the law.
The hosted version has moderation in play. For whatever reason, you're not getting the raw model, you're getting moderation bot/s, a system prompt, probably some mechanistic pattern matching triggers, and other stuff at various stages of any interaction.
If you use the model locally, it's a different story.
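Locally you control the entire prompt and there's nothing between you and the weights. Rough sketch via transformers; the checkpoint name below is just an assumption, swap in whatever size fits your hardware:

```python
# Sketch of running the model locally with your own system prompt -- no hosted
# moderation layer in the loop.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "NousResearch/Hermes-3-Llama-3.1-8B"   # placeholder: use the checkpoint you actually run
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [
    {"role": "system", "content": "You are a blunt assistant that answers directly."},
    {"role": "user", "content": "Summarize the attached report without omitting anything."},
]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=400)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```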
It's good, but I'm already paying for GPT4o and Sonnet.
How much memory does this need? If Alex Cheema (Exo Labs, Oxford) https://x.com/ac_crypto/status/1815969489990869369 could run the Llama 3.1 405B model on 2 MacBooks, does this mean this can run on one MacBook?
Exo looks cool, but I wonder how long it will take until it supports llama.cpp. Definitely some spare machines I'd like to use this way. I also wonder to what extent other forms of native apple metal (iPhones, iPads) could be used in clusters for additional sources of aggregate cluster memory.
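For a rough sense of why you need the aggregate memory at all, the weights alone are the bottleneck (the precision levels below are just assumptions about how you'd quantize it):

```python
# Back-of-envelope memory for Llama 3.1 405B weights at different precisions.
params = 405e9
for name, bytes_per_param in [("fp16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    gb = params * bytes_per_param / 1024**3
    print(f"{name}: ~{gb:.0f} GB for weights alone (KV cache and overhead extra)")
# fp16: ~754 GB, 8-bit: ~377 GB, 4-bit: ~189 GB -- even 4-bit is more unified
# memory than a single MacBook ships with today.
```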
Yeah, it doesn't win in every category. I will say, watching it in the Discord, I saw its performance vary widely, so the context and sys prompt play a huge role. Initially it did great and solved some pretty heavy logic questions, but after the context was loaded with trolling it degraded quite a bit and couldn't solve problems it previously was able to.
The issue I'm facing with this newer batch of larger models is trying to make longer contexts work. Is there a way to do so with sub-48GB GPUs without having to do CPU BLAS? If mistral-123B is already restricted to 60K context on a 24GB gpu (with zero layers being GPUfied and all other apps closed), and llama-405B being somewhere around 2-3x the KV cache size, even an A100 wouldn't be enough to fit 128K tokens of KV.
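For reference, fp16 KV cache is roughly 2 x layers x kv_heads x head_dim x 2 bytes per token. Plugging in the published configs (layer/head counts from memory, so treat them as approximate) lines up with those numbers:

```python
# Rough fp16 KV-cache math; model dims below are my understanding of the
# published configs and may be slightly off.
def kv_gb(layers, kv_heads, head_dim, ctx, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per / 1024**3

print(kv_gb(88, 8, 128, 60_000))    # Mistral Large 123B @ 60K  -> ~20 GB
print(kv_gb(126, 8, 128, 128_000))  # Llama 3.1 405B    @ 128K -> ~62 GB, before scratch buffers
```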
I thought before that, using KoboldCpp, GPU VRAM shouldn't matter too much when just using it to accelerate prompt processing, but it's turning out to be a real problem, with no affordable card being even usable at all.
It's the difference between processing 50K tokens in 30 minutes vs. taking 24 hours or more to get a single response, from 'barely usable' to 'utterly unusable'.
CPU generation is fine; ~half a token per second is not great, but it's doable. Though I sometimes feel more and more like cutting off responses and finishing them myself if a good idea pops up in one.
PAYMENT TANGENT for my fellow entrepreneurs here that take Visa/Mastercard payments:
I tried to sign up to Lambda Labs just now to check out Hermes 3.
Created an account, verified my email address, entered my billing info...
... but then it says they only accept CREDIT cards, NOT DEBIT cards.
I had never heard of this, so I tried it anyway. I entered my business Mastercard (from mercury.com FWIW), that's never been rejected anywhere, and immediately got the response that they couldn't accept it because it's a debit card.
Anyone know why a business would choose to only accept credit not debit cards?
I don't have any credit cards, neither personal nor business, and never found a need for one.
So I deleted my account at Lambda Labs, which was kind of disappointing since I was looking forward to trying this.
That seems completely backwards?
Debit interchange fees are usually lower, aren't they? And if you run it with a PIN as a debit transaction, there's almost no charge for the vendor.
Definitely weird, as everything I know about the incentives for that go in the other direction for a vendor.
I think I read somewhere that forcing credit cards is a way for the merchant to completely avoid prepaid cards. Though obviously imperfect.
Privacy.com had to overhaul their card generation backend a few years ago specifically to handle merchants refusing their single-purpose card numbers due to them being detected as potentially prepaid cards. Though they did do it, so it might work for your case now.
I'm sure there's a fraud angle where someone signs up with a cheap prepaid card, runs up a huge bill, and then the business has no recourse. Though I'm not familiar with Lambda Labs or their billing.
> I'm sure there's a fraud angle where someone signs up with a cheap prepaid card, runs up a huge bill, and then the business has no recourse. Though I'm not familiar with Lambda Labs or their billing.
I work at Lambda Labs. This is basically the reason. Fraud has been a problem, and this is an attempt to grant resources only to legitimate accounts. We have struggled with people spinning up resources and not paying for them, which is detrimental to our business.
I presume they want to preauthorize a decent amount just in case, but this tends to really upset debit card holders because it ties up their money for a week.
I understand finetuning for specific purposes/topics, but I don't really understand finetunes that are still marketed as "generalist", since surely what Meta put out would be tuned to perform as well as it can across a whole host of measures.
Yeah, it's on HF. You can also try it out in the Nous Discord or Lambda Labs if you don't have the H100s to spare. Fairly certain anyone with enough compute can use it or throw it up on their site.