Serious side effects from long-term, high-dose vitamin B6 include peripheral neuropathy. Doses exceeding 500 to 1000 mg per day pose the greatest risk, but prolonged intake of lower doses may also result in this side effect. Australia's Therapeutic Goods Administration has found that peripheral neuropathy can occur at doses of less than 50 mg.
In pregnant adult women, vitamin B6 is likely safe at a dose of up to 100 mg per day. In adolescent pregnancy, vitamin B6 is likely safe at a dose of up to 80 mg per day.
In lactating adult women, vitamin B6 is likely safe at a dose of up to 100 mg per day. In adolescents who are lactating, vitamin B6 is likely safe at a dose of up to 80 mg per day.
In children, vitamin B6 is likely safe at a daily dose of 30 mg (1-3 yrs), 40 mg (4-8 yrs), 60 mg (9-13 yrs) and 80 mg (14-18 yrs).
The likelihood of side effects increases at doses higher than 200 mg per day.
In 2023, the European Food Safety Authority set an upper limit of vitamin B6 of 12 mg per day for adults, and 2.2 to 10.7 mg per day for infants and children.
I worked in automated customer support, and I agree with you. By default, we automated 40% of all requests. It becomes harder after that, but not because the problems the next 40% face are any different, but because they are unnecessarily complex.
A customer who wants to track the status of their order will tell you a story about how their niece is visiting from Vermont and they wanted to surprise her for her 16th birthday. It's hard because her parents don't get along as they used to after the divorce, but they are hoping that this will at the very least put a smile on her face.
The AI will classify the message as order tracking correctly, and provide all the tracking info and timeline. But because of the quick response, the customer will write back to say they'd rather talk to a human and ask for a phone number they can call.
The remaining 20% can't be resolved by either human or robot.
I made my own distributed render orchestrator that supports Cycles + custom plugins. It uses Modal's cloud compute APIs to spawn jobs on up to 20 containers, each with an L40S GPU (roughly 80% as fast as a 4090, with far more VRAM). It ain't cheap, but it's absurdly fast, and much easier on cash flow than outright buying the equivalent GPUs.
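For anyone curious what that kind of fan-out can look like, here's a rough sketch against Modal's Python SDK. The app name, image setup, scene path, and Blender invocation are my own guesses for illustration, not the parent's actual orchestrator, and a real image would need a GPU-capable Blender build plus drivers:

import subprocess
import modal

app = modal.App("render-farm")  # hypothetical app name

# Simplified container image; a real one needs a CUDA/OptiX-capable Blender build.
image = modal.Image.debian_slim().apt_install("blender")

@app.function(gpu="L40S", image=image, timeout=3600)
def render_frame(frame: int) -> bytes:
    # Render one frame with Cycles; assumes scene.blend was baked into the image.
    subprocess.run(
        ["blender", "-b", "/root/scene.blend", "-E", "CYCLES",
         "-o", "/tmp/frame_####", "-F", "PNG", "-f", str(frame)],
        check=True,
    )
    with open(f"/tmp/frame_{frame:04d}.png", "rb") as f:
        return f.read()

@app.local_entrypoint()
def main():
    # .map() fans the calls out across containers, one GPU each.
    for i, png in enumerate(render_frame.map(range(1, 101)), start=1):
        with open(f"frame_{i:04d}.png", "wb") as f:
            f.write(png)

You'd kick it off with something like `modal run render_farm.py` (filename hypothetical), and you only pay while the containers are actually rendering.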
Superconductivity is fascinating. I don't know how people were able to come up with the explanations. Crudely, the reduced temperature means less jiggling of the metal lattice. This in turn makes it possible for the nuclei to be pushed around by electrons to form essentially sound waves (phonons) in the lattice (think of the lattice compressing and expanding due to interplay with electrons). At a certain temperature, and therefore a certain frequency of lattice oscillation, electrons pair up to form "Cooper pairs" - they move in concert due to the lattice movement. What's crazy is that Cooper pairs become a sort of pseudoparticle, and their quantum behaviour is different to that of regular electrons. Cooper pairs have integer spin (as opposed to half-integer spin), so they no longer obey the Pauli exclusion principle, and the paired electrons basically form one giant condensate that extends through the whole material and can all occupy the same lowest-energy quantum state.
If the H800 is a memory-constrained model that NVIDIA built, with fp8 performance equivalent to the H100, to get around the ban on exporting H100s to China, it makes zero sense to believe Elon Musk, Dario Amodei, and Alexandr Wang's claims that DeepSeek smuggled H100s.
The only reason a team would allocate time to memory optimizations and writing NVPTX code rather than focusing on post-training is if they severely struggled with memory during training.
This is a massive trick pulled by Jensen: take the H100 design, whose sales are regulated by the government, make it look 40x weaker, and call it H800, while conveniently leaving 8-bit computation as fast as the H100's. Then bring it to China and let companies stockpile it without disclosing production or sales numbers, with no export controls.
Eventually, after 7 months, the US government starts noticing the H800 sales and introduces new export controls, but it's too late. By this point, DeepSeek has started research using fp8. They slowly build bigger and bigger models, work on bandwidth and memory consumption, until they make R1 - their reasoning model.
I use supermaven and cline with my own API key, a setup superior to cursor imo. Tried to go back to gh copilot yesterday but couldn't bear it for a full workday, and reverted to my previous arrangement.
There exists an interesting connection between Boost Converters and Hydraulic Rams [1].
A Hydraulic Ram is a device that can pump water from a stream to a higher location by harnessing the kinetic energy of the stream, with no other power source required.
The equations for the two devices are essentially the same, only the units change.
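Roughly, the correspondence is that the inductor storing energy in its current plays the same role as the drive pipe storing energy in its moving water column, the switching transistor maps to the waste valve, the diode to the delivery check valve, and the output capacitor to the air chamber. In symbols (my own sketch of the mapping, not taken from [1]):

\[
  v = L\,\frac{di}{dt}
  \quad\longleftrightarrow\quad
  \Delta p = \frac{\rho\,\ell}{A}\,\frac{dQ}{dt}
\]

where v is voltage, i current, Δp the pressure across the drive pipe's water column, Q the volumetric flow rate, ρ the water density, ℓ the pipe length and A its cross-section. Voltage maps to pressure, current to flow, and inductance to the pipe's hydraulic inertance ρℓ/A, which is why the two sets of equations line up once you swap units.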
The way this works (and I'm obviously taking a high-level view here) is by comparing what is being played to what is being captured. There is an inherent latency between what is called the capture stream (the mic) and the reverse stream (what is being output to the speakers, be it people talking or music or whatever), and by finding this latency and comparing, one can cancel the music from the speech captured.
Within a single process, or tree of processes that can cooperate, this is straightforward (modulo the actual audio signal processing which isn't) to do: keep what you're playing for a few hundreds milliseconds around, compare to what you're getting in the microphone, find correlations, cancel.
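As a toy illustration of that single-process case (nothing like a production AEC, which uses adaptive filters rather than plain subtraction; the names and the 48 kHz assumption are mine):

import numpy as np

def estimate_delay(far_end: np.ndarray, capture: np.ndarray, max_delay: int) -> int:
    # Brute-force search for the lag that best aligns the reverse stream
    # (what went to the speakers) with the capture stream (what the mic heard).
    best_lag, best_score = 0, -np.inf
    for lag in range(max_delay):
        n = min(len(far_end), len(capture) - lag)
        score = float(np.dot(far_end[:n], capture[lag:lag + n]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def cancel_echo(far_end: np.ndarray, capture: np.ndarray, max_delay: int = 4800) -> np.ndarray:
    # 4800 samples is about 100 ms at 48 kHz; real systems keep a few hundred ms of history.
    lag = estimate_delay(far_end, capture, max_delay)
    out = capture.astype(float)
    n = min(len(far_end), len(capture) - lag)
    out[lag:lag + n] -= far_end[:n]  # a real AEC adaptively filters instead of subtracting
    return out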
If the processes aren't related, there are multiple ways to do this. The OS may provide a capture API that does the cancellation; this is what happens e.g. on macOS for Firefox and Safari, and you can just use it, since the OS knows what is being output. This is often available on mobile as well.
Sometimes (Linux desktop, Windows) the OS provides a loopback stream: a way to capture the audio that is being played back, and that can similarly be used for cancellation.
If none of this is available, you mix the audio output and perform cancellation yourself, and the behaviour you observe happens.
Source: I do that, but at Mozilla, and we unsurprisingly have the same problems and solutions.
On a related note: recently, I released a visualization of all MLP neurons inside the llama3 8B model. Here is an example "derivative" neuron which is triggered when talking about the derivative concept.
I think most of the interesting applications for these small models are in the form of developer-driven automations, not chat interfaces.
A common example that keeps popping up is a voice recorder app that can provide not just a transcription of the recording (which you don't need an LLM for), but also a summary of the transcription, including key topics, key findings, and action items that were discussed in a meeting. With speaker diarization (assigning portions of the transcript to different speakers automatically), it's even possible to use an LLM to assign names to each of the speakers in the transcript, if they ever identified themselves in the meeting, and then the LLM could take that and also know who is supposed to be handling each action item, if that was discussed in the meeting. That's just scratching the surface of what should be possible using small LLMs (or SLMs, as Microsoft likes to call them).
An on-device LLM could summarize notifications if you have a lot of catching up to do, or it could create a title for a note automatically once you finish typing the note, or it could be used to automatically suggest tags/categories for notes. That LLM could be used to provide "completions", like if the user is writing a list of things in a note, the user could click a button to have that LLM generate several more items following the same theme. That LLM can be used to suggest contextually-relevant quick replies for conversations. In a tightly-integrated system, you could imagine receiving a work phone call, and that LLM could automatically summarize your recent interactions with that person (across sms, email, calendar, and slack/teams) for you on the call screen, which could remind you why they're calling you.
LLMs can also be used for data extraction, where they can be given unstructured text, and fill in a data structure with the desired values. As an example, one could imagine browsing a job posting... the browser could use an LLM to detect that the primary purpose of this webpage is a job posting, and then it could pass the text of the page through the LLM and ask the LLM to fill in common values like the job title, company name, salary range, and job requirements, and then the browser could offer a condensed interface with this information, as well as the option to save this information (along with the URL to the job posting) to your "job search" board with one click.
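As a sketch of what that extraction step could look like on-device, here's a hypothetical call to a small local model through Ollama's HTTP API; the model tag, prompt wording, and field names are mine for illustration, not anything a browser actually ships:

import json
import requests

PROMPT = """Extract the following fields from the job posting below and reply
with JSON only: job_title, company_name, salary_range, requirements (a list).

{page_text}
"""

def extract_job_posting(page_text: str) -> dict:
    # Assumes an Ollama server on localhost with a small instruct model already pulled.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.2:3b",
            "prompt": PROMPT.format(page_text=page_text),
            "format": "json",   # ask the server to constrain the output to valid JSON
            "stream": False,
        },
        timeout=120,
    )
    return json.loads(resp.json()["response"])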
Now, it might be a little much to ask a browser to have special cases for just job postings, when there are so many similar things a user might want to save for later, so you could even let the user define new "boards" where they describe to a (hopefully larger) LLM the purpose of the board and the kinds of information you're looking for, and it would generate the search parameters and data extraction tasks that a smaller LLM would then do in the background as you browse, letting the browser present that information when it is available so that you can choose whether to save it to your board. The larger LLM could still potentially be on-device, but a more powerful LLM that occupies most of the RAM and processing on your device is something you'd only want to use for a foreground task, not eating up resources in the background.
LLMs are interesting because they make it possible to do things that traditional programming could not do in any practical sense. If something can be done without an LLM, then absolutely, do that. LLMs are very computationally intensive, and their accuracy is more like a human's than a computer's. There are plenty of drawbacks to LLMs if you have another valid option.
GPUs don't rasterise line-by-line like old software renderers, and they basically only know how to rasterise triangles. If you're rendering on GPU then I think the approach in the article is pretty good, despite its apparent complexity.
This is a great series of blog posts about how GPU-based rendering is structured (long but excellent if you're interested):
You can't compare a real technical field with hundreds of years of real-world use to some social science nonsense. Linguists have never made useful or testable predictions of any kind. Geometric computer vision has been refining camera calibration to the point where we use stereo to find things billions of light years away.
I remember a WAMU radio show talking about a woman who was abused by her father.
She wrote the 5 steps of forgiveness, which as I more or less recall go something like this:
The 5 steps of asking for forgiveness.
The offender to the offended should:
1. Describe in detail exactly what he/she/they did that caused the offense.
2. Describe in detail how those actions made the offended feel.
3. Explain in detail why he/she/they took such actions.
4. Describe in detail what corrective actions he/she/they should take.
5. Describe in detail what the offended could do to move forward.
Now the interesting part is that the woman's father was already dead when she wrote the steps. She wrote a letter using the 5 steps as if she was her father, who denied to his death that the abuse ever happened, asking for forgiveness.
It helped her.
I wish I could remember the name of the book or the radio show, but it was years ago.
var speed = 50 // delay (ms) to wait for the page to update after each guess

// Prime the result output: zero all three color inputs
for (const col of [rin, bin, gin]) {
  col.valueAsNumber = 0
}
// Fire one change event so the page picks up the reset
rin.dispatchEvent(new Event('change'));

async function tryit(col, incr) {
  // Increment a single color channel and submit the guess
  col.valueAsNumber = col.valueAsNumber + incr
  col.dispatchEvent(new Event('change'));
  submit.click()
  await new Promise(resolve => setTimeout(resolve, speed));
  // Pull the match percentage (or the win message) out of the result text
  var res_text = result.innerText.split(/[ ()%]/)[4]
  if (res_text === "Splendid!") {
    // Exact color found: stop everything
    throw new Error("Finished")
  }
  return parseInt(res_text)
}
async function trymany() {
  // We need to iterate at least twice due to rounding in the result
  // percentage, which sometimes makes neighbouring colors give the same result.
  var last_res = 0, max_tries = 3;
  while (--max_tries > 0) {
    for (const col of [rin, gin, bin]) {
      while (true) {
        var new_res = await tryit(col, 1)
        if (last_res >= new_res) {
          // No improvement: undo the last increment and move to the next color
          await tryit(col, -1)
          break
        }
        last_res = new_res
      }
    }
  }
}

await trymany()
It's supposed to be a completely uninformed prior. The only information, as you can see, is introduced in the formula with the above explanation.
Forecasting in general is a silly affair. As a data scientist I get all kinds of requests for all kinds of predictions, and my only real job is to part those requesting parties from their money.
I would also recommend going through Callum McDougall/Neel Nanda's fantastic Transformer from Scratch tutorial. It takes a different approach to conceptualizing the model (or at least, it implements it in a way which emphasizes different characteristics of Transformers and self-attention), which I found deeply satisfying when I first explored them.
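If you want a taste of the core operation those tutorials build up to, a minimal causal, single-head scaled dot-product self-attention fits in a few lines of numpy (the shapes and names here are mine, not the tutorial's):

import numpy as np

def self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head)
    q, keys, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ keys.T / np.sqrt(keys.shape[-1])     # (seq_len, seq_len)
    # Causal mask: each position may only attend to itself and earlier positions.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                 # (seq_len, d_head)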
Diet changes can make a huge difference. Earlier the better.
On many support forums I see a pattern of people using the various meds with mixed results.
Then people start experimenting with which foods they seem to do better with. Then they discover the autoimmune protocol diet.
A lot of people don’t wanna go to the effort, so won’t try it and complain about it.
People that try it and actually follow it often have amazing results, and all their symptoms calm down.
A lot of people just cannot give up the pizza and chips though.
I'm hardly the best person to give a point-by-point on how modern neural networks work. The original paper that kind of brought together a bunch of ideas that were floating around is called "Attention is All You Need" in 2017 (and those folks are going to win a Turing almost certainly) and built on a bunch of `seq2seq` and Neural Turing Machine stuff that was in the ether before that.
Karpathy has a great YouTube series where he gets into the details from `numpy` on up, and George Hotz is live-coding the obliteration of PyTorch as the performance champion on the more implementation side as we speak.
Altman being kind of a dubious-seeming guy who pretty clearly doesn't regard the word "charity" the same way the dictionary does is more-or-less common knowledge, though not often mentioned by aspiring YC applicants for obvious reasons.
Mistral is a French AI company founded by former big hitters at e.g. DeepMind that brought the best of the best on 2023's public domain developments into one model in particular that shattered all expectations of both what was realistic with open-weights and what was possible without a Bond Villain posture. That model is "Mixtral", an 8-way mixture of experts model using a whole bag of tricks but key among them are:
- gated mixture of experts in attention models
- sliding window attention / context
- direct preference optimization (probably the big one and probably the one OpenAI is struggling to keep up with, probably more institutionally than technically as probably a bunch of bigshots have a lot of skin in the InstructGPT/RLHF/PPO game)
It's common knowledge that GPT-4 and derivatives were mixture models but no one had done it blindingly well in an open way until recently.
SaaS companies doing "AI as a service" have a big wall in front of them called "60%+ of the TAM can't upload their data to random-ass cloud providers much less one run by a guy recently fired by his own board of directors", and for big chunks of finance (SOX, PCI, bunch of stuff), medical (HIPAA, others), defense (clearance, others), insurance, you get the idea: on-premise is the play for "AI stuff".
A scrappy group of hackers too numerous to enumerate but exemplified by `ggerganov` and collaborators, `TheBloke` and his backers, George Hotz and other TinyGrad contributors, and best exemplified in the "enough money to fuck with foundation models" sense by Mistral at the moment are pulling a Torvalds and making all of this free-as-in-I-can-download-and-run-it. This gets very little airtime all things considered because roughly no one sees a low-effort path to monetizing it in the capital-E enterprise: that involves serious work and very low shady factors, which seems an awful lot like hard work to your bog-standard SaaS hustler and offers almost no mega data-mining opportunity to the somnambulant FAANG crowd. So it's kind of a fringe thing in spite of being clearly the future.
Custom frame by Halbe Rahmen, who I think make the most magnificent frames in the world, with the most loving team behind them. Not a shill; whenever I interact with the Halbe staff for anything I feel it.
> 1. the parameters are there to make the model better at learning textual patterns, not arithmetic patterns.
There's nothing "textual" about the tokens. They are arbitrary identifiers. There's nothing "textual" about transformers! The fact that e.g. GPT-4 can accept images as input, and that its textual performance improved as a result, and that transformers are also being used for text-to-speech models should have already communicated this.
> 2. Next token prediction is a terrible way to perform arithmetic.
This is just attempting to resolve our disagreement with pure assertion. It's certainly less efficient to use an artificial intelligence to do arithmetic. But whether it's efficient is a different question than how likely it is to be possible.
> 3. Perhaps most importantly, the loss function does not incentivise being good at arithmetic at all.
This is blatantly untrue. The same argument would suggest that LLMs can't do anything that wasn't exactly in their training set already. But they can.
I was responsible for safe for flight inspections on military aircraft and the photo included in that post is completely insane to me.
Those bolts being loose (and they are BIG bolts) would mean multiple people in the installation process didn’t do their jobs, and signed their life on the line saying they did.
When I did maintenance, there was someone (QA) there to witness every torqued bolt, inspect every safety wire and installed part.
The only leaderboard this model does well on is the HuggingFace LLM Leaderboard, which is known to be manipulated and prone to gross overfitting. The LMSYS Arena leaderboard is a better representation of the best models.
This is poetry. I have been working on a personal project for the last 10 years that replaces every other app I used to use - E-mail, calendar, and all the others we all use on a daily basis - and every time someone sees me using it they ask "Wow, this is amazing, how do I download it!?", and the answer is always the same: you don't.
There's a beauty to engineering something with yourself as the target user, and no one else. I'm 100% convinced this project single-handedly keeps my mental wellbeing in check, and it provides me with a constant source of hopefulness and happiness about the future - something no company/salary could ever offer me. My exclusive, differentiating, unique characteristic against the world, my joker card.
Transformer neural networks are not smart enough for this to work. OpenAI's own research shows that LLMs don't care about contradictions in the training data[1]: if your LLM's training data has 700 sentences saying "A is true" and 300 saying "A is false," then if you ask the LLM about A, it has a 7/10 chance of saying "A is true" and a 3/10 chance of saying "A is false."
In general it's easy to prove that GPT-4 (or any other LLM) doesn't actually understand what words mean:
- ask it factual questions in English
- ask it to translate those factual questions to Swahili (it should do so with >99% technical accuracy)
- ask it to answer those Swahili-language factual questions in Swahili
- translate the answers back to English
It will do much worse at Swahili than English even though its technical accuracy at translation is almost flawless. GPT-4 has a ton of English-language sentences about the topic "cat," and a handful of Swahili-language sentences about the topic "paka," and understands that the translation of "cat" is "paka." But it has no understanding that the English-language facts about cats are automatically true if they are translated into Swahili.
This is not how humans work! I don't think gullibility will be solved unless we have an LLM that actually understands that words mean something.
[1] It is too early for me to find the papers, but I am summarizing two different things:
1) training data needs to have certain things repeated in order for them to "stick" when prompting the LLM - the more often a fact is in the training data, the more likely it is to repeat that fact when prompted
2) absent RLHF, larger LLMs are more likely to endorse conspiracy theories than smaller LLMs. This is because smaller LLMs are trained on reliable datasets like Wikipedia, whereas larger LLMs start including Reddit, bodybuilder forums, old Geocities pages belonging to weird cults, etc. So smaller LLMs have no sentences in their training data saying "Bush did 9/11," but a larger LLM might have 5% conspiracy theories and therefore a 5% chance of endorsing conspiracy theories.
I bought 4 USB hubs just to switch the keyboard and mouse; it's just so annoying how most devices are so terrible. Randomly, you have to physically detach and reattach the hub several times a day. If you plug in webcams and/or USB mics, it gets even worse.
I'm using Synergy software and it works well, but I still want a proper KVM that can handle webcams, mics, audio, etc., with features like moving only a group of plugged-in devices via keyboard shortcuts.
ollama run hf.co/unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL
or
./llama.cpp/llama-cli -hf unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL --jinja --temp 0.7 --top-k -1 --top-p 0.95 -ngl 99
Please use --jinja for llama.cpp and set temperature = 0.7, top-p = 0.95!
Also, it's best to increase Ollama's context length to at least 8K: OLLAMA_CONTEXT_LENGTH=8192 ollama serve &. Some other details are at https://docs.unsloth.ai/basics/magistral