tinygrad’s small set of operations and laziness made it easy to implement. Though my overall sense is that neural network verification is currently more of a research interest than something practical.
I wonder if the authors have tried incorporating error feedback from Lean into their models.
Work from 2023 [1] showed that general-purpose models did better when they were able to incorporate error feedback; humans incorporate error feedback too, but none of the SOTA models on miniF2F seem to.
In a way, DeepSeek Prover's subgoal decomposition is a partial step toward error/proof-state feedback. (DS Prover breaks down a proof into subgoals and attacks each subgoal separately with batched sampling, then puts the pieces back together.)
This is distinct from the approach of the previous SOTA for an open-weights model (Kimina Prover) which generated at the full-proof level.
While it was very impressive to see Kimina's ability to generate medium-length proofs (think AIME-level problems) without subgoals or feedback at intermediate steps, it's likely that at least subgoal decomposition will be required for longer proofs (think IMO-level problems).
I certainly agree that where and how error/proof state feedback is best incorporated (training data synthesis / reward function / CoT during inference / etc.) is a fascinating area of research. (It's rumored that GDM's AlphaProof does use proof state / lean feedback already.)
There’s something compelling about these, especially w.r.t. their ability to generalize. But what is the vision here? What might these be able to do in the future? Or, philosophically speaking, what do these teach us about the world? We know a 1D cellular automaton can be Turing complete, so, at least from one perspective, NCAs aren’t terribly surprising.
these are going to be the dominant lifeforms on earth exceeding bacteria, plants and humans in terms of energy consumption
cellular automata that interact with their environment, ones that interact with low-level systems and high-level institutions. to some approximation we humans are just individual cells interacting in these networks.
the future of intelligence aint llms, but systems of automata with metabolic aspects. automata that co-evolve, consume energy and produce value. ones that compete, ones that model each other.
we're not being replaced, we're just participants in a transformation where boundaries between technological and cellular systems blur and eventually dissolve. i'm very thankful to be here to witness it
I'll have what this guy is smoking. Those visualizations are pretty, though.
I can imagine this being useful for implementing classifiers and little baby GenAI-adjacent tech on an extremely tiny scale, on the order of several hundred or several thousand transistors.
Example: right now, a lot of the leading-edge biosensors have to pull data from their PPG/ECG/etc chips and run it through big fp32 matrices to get heart rate. That's hideously inefficient when you consider that your data is usually coming in as an int16 and resolution any better than 1bpm isn't necessary. But, fp32 is what the MCU can do in hardware so it's what you gotta do. Training one of these things to take incoming int16 data and spit out a heart rate could reduce the software complexity and cost of development for those products by several orders of magnitude, assuming someone like Maxim could shove it into their existing COTS biosensor chips.
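For flavor, here's a minimal sketch of the fixed-point idea in Python/NumPy. The layer shape, the weight quantization scheme, and the `shift` parameter are all invented for illustration, not taken from any real biosensor SDK:

```python
import numpy as np

def fixed_point_dense(x_int16, w_q, shift):
    """Quantized dense layer: int16 input, integer weights, int32 accumulate,
    then an arithmetic right shift in place of a float rescale.
    Hypothetical scheme: real weight w is approximated by w_q * 2**-shift."""
    acc = x_int16.astype(np.int32) @ w_q.astype(np.int32)  # exact in int32
    return (acc >> shift).astype(np.int16)

# int16 samples in, int16 outputs out; no fp32 anywhere.
x = np.array([100, -200, 300], dtype=np.int16)
w_q = np.array([[64], [32], [16]], dtype=np.int8)  # i.e. weights 1.0, 0.5, 0.25
print(fixed_point_dense(x, w_q, shift=6))  # [75], matching the float result
```

Whether a vendor could actually bake something like this into a COTS sensor chip is a separate question, but the arithmetic itself needs nothing beyond an integer MAC and a shifter.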
yes absolutely: current systems are wildly inefficient. the future is one of extreme energy efficiency.
re smoking: sorry let me clarify my statement. these things will be the dominant life forms on earth in terms of metabolism, exceeding the energy consumption of biological systems, over 1k petawatt hours per year, dwarfing everything else
the lines between us may blur metaphorically; we'll be connected to them the way we're connected to ecosystems of plants and bacteria. these systems will join and merge the same way we've merged with smartphones -- but on a much deeper level
Okay so another way to put it is that these are gonna be the software we run on lots of computers in the future. Why this particular model of intelligence and not some other one?
If "we" all got together and reduced permitting costs, "our" homes would be much more valuable because a developer can build a highrise where the house is.
This is better known as the Table Maker's Dilemma.
Say you have some computable number p, that means you can compute a (rational) approximation p' to p within any given tolerance eps > 0 (i.e. you know |p - p'| < eps). To determine whether p > 0, p = 0, or p < 0, you compute an approximation p' to a certain tolerance eps. If p' > eps then you know p > 0, if p < -eps then you know p < 0, otherwise you need a better approximation... Without further knowledge about p, there is no point where you can assert p = 0.
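The refinement loop above can be sketched in Python. Here `approx(n)` is an assumed interface returning a rational within 2^-n of the true value; the procedure halts exactly when p != 0:

```python
from fractions import Fraction

def sign(approx):
    """Semidecide the sign of a computable real p.

    approx(n) is assumed to return a rational p' with |p - p'| < 2**-n.
    Halts with +1 or -1 whenever p != 0; loops forever when p == 0, since
    no finite precision can rule out a nonzero digit further out."""
    n = 0
    while True:
        eps = Fraction(1, 2 ** n)
        p_approx = approx(n)
        if p_approx > eps:     # then p > p_approx - eps > 0
            return 1
        if p_approx < -eps:    # then p < p_approx + eps < 0
            return -1
        n += 1                 # inconclusive: refine and try again

print(sign(lambda n: Fraction(1, 3)))  # 1 (halts at n = 2, where 1/3 > 1/4)
```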
Only x=0 is undecidable. It's because you have to check an infinite number of digits past the decimal point to see if all of them are zero, and that will not halt.
If x and y are the same number, then a Turing machine that compares the digits of x and y to determine whether x < y will never terminate. Doesn’t this mean that x < y is not decidable?
> A Turing machine has an infinitely long tape, and even if it takes 10 times the time to the heat death of the universe, x<y will halt.
> Finite time is not practical time.
Sorry for my daftness but if x and y are both a transcendental number, for example pi, then x and y have infinitely many digits so the Turing machine that determines whether x < y is true will not halt, even with an infinitely long tape, or will it? What am I misunderstanding?
If you could recommend a source that explains the basics of these arguments, I’d appreciate it.
For an introduction to real analysis there is an old book, Principles of Mathematical Analysis by Walter Rudin, typically called 'Baby Rudin'.
It is tiny and expensive, as most college math textbooks are once you get past a certain level. No idea if there is something better now; it is from the 1950s, and there are used copies in most used book shops for cheap (or there were). It has no definition of 'equals' but does have a concept of equivalence relations.
Dummit and Foote's Abstract Algebra is a great reference, especially given the problem of AI pollution. Coming from a General Relativity background, it still took me two years to work through on my own.
I don't know any suggestions on a more modern approach of starting with type or category theory.
I am not really interested in constructivist math, it is a problem to get around for me, so maybe someone who is can chime in on a better path.
While not rigorous, this paper does a good job at explaining just how muddy the concept of 'equality' is.
Haven't seen the paper, but I'd guess the argument is something like "if the domain for x includes 0, then [edit: whenever x = 0] the Turing machine will get into an infinite loop looking for the first inverted bit[1]. _None_ of the three states will ever be provably true or false. If the domain _excludes_ 0 then you can prove that the machine will hit a non-zero bit in finite time, thus terminate, and either x<0 or x>0 will be shown true. But if you exclude 0 then you've also shown that the =0 state will never be true."
[1] multiple zero bits and a one bit is >0, multiple 1 bits and a zero bit is <0. Infinite zero bits (even if followed by a one bit) is +0, infinite 1 bits (even if followed by a zero bit) is -0.
An exact comparison is undecidable. You can only say if the numbers sufficiently differ. This is equivalent to saying that x > 0 + ε or x < 0 - ε, where ε > 0. If you represent numbers in binary, ε = 2^(-n), where n is the maximum number of bits you allocate to represent a number. Obviously you have n < ∞ in any finite machine.
Nope. You're thinking about this mathematically, not algorithmically. The question is essentially to decide whether the output of a TM is all zeros or whether it will eventually output a 1. If it outputs a 1 you will know that in a finite amount of time by running the program, but if it never outputs a 1 then you can't tell that by running the program, and it's impossible in general to tell any other way because that would be tantamount to solving the halting problem.
Here is a concrete example: write a program that systematically tests all numbers for counterexamples to the Goldbach conjecture. Every time a number fails to be a counterexample, output a zero; otherwise output a 1. Consider the output of that program to be a real number expressed as a binary decimal. If that number is not zero you will know in a finite amount of time, but if it is zero you will not know that until someone proves the Goldbach conjecture.
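Here's a runnable sketch of that program, truncated to a finite prefix of the bit stream (the real one never stops; the trial-division primality test is just the simplest thing that works):

```python
def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def goldbach_holds(n):
    """True if the even number n is a sum of two primes."""
    return any(is_prime(p) and is_prime(n - p) for p in range(2, n // 2 + 1))

# One output bit per even number >= 4: 0 = not a counterexample, 1 = counterexample.
# Read as a binary decimal, the infinite stream is zero iff the conjecture
# is true -- and running the program can never confirm that.
bits = [0 if goldbach_holds(n) else 1 for n in range(4, 100, 2)]
print(bits.count(1))  # 0: no counterexample below 100
```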
The order relation on computable reals is undecidable - this is well known, and the parent's argument is correct. If both x<y and x>y were decidable then you could easily decide equality too by just checking them and seeing if the answer to both was "no".
Tracing back through this thread, the term "decidable" was introduced in a very unfortunate way:
ekez: why is it impossible to decide whether x<0, x=0, or x>0 as in Example 1?
lisper: Only x=0 is undecidable. It's because you have to check an infinite number of digits past the decimal point to see if all of them are zero, and that will not halt.
teraflop: That cannot be true, because if both x<0 and x>0 are decidable then x=0 is also decidable.
Both lisper (me) and teraflop are correct. The problem is that inequality is not decidable, it's semi-decidable. It was teraflop who introduced the word "decidable" into the discussion, not me, but I didn't pay enough attention to that at the time.
Right, I agree that they are semidecidable, thanks for clarifying. My interpretation of "only x=0 is undecidable" was that you were claiming the other two were decidable (which is a reasonable reading imo). But it sounds like we don't actually disagree here.
> Nope. You're thinking about this mathematically, not algorithmically.
I don't know what you mean by this, and I'm not sure how it relates to what I said. I'm using "decidable" in the strict, computer-science sense of the word.
The statement in example 1 of the paper, which we're discussing, is about computable real numbers. A computable real number is one for which there is a Turing machine that can calculate it to any desired finite precision in finite time.
A semidecidable problem is one for which there is a program that halts if the answer is "yes", but may not halt if the answer is "no". The halting problem is in this category.
Given a computable real number x, asking whether x<0 or x>0 are semidecidable but not decidable problems.
x > 0 and x < 0 are both semidecidable. Which is to say, each can be confirmed when it's true, but not refuted when it's false. So if either of them is true, running both in parallel and halting once one halts will terminate, but if neither of them is true, it may not halt.
If you remove 0 from possible inputs, you no longer need to worry about that possibility, and the problem is then decidable.
Edit distance seems like a strange way to test whether the model understands arithmetic ([1], Figure 3). I think `1+3=3` would count as equally correct as `1+1=9`?
Why not consider how far off the model is `abs(actual-expected)`? I wonder if there is an inflection point with that metric.
It depends on how you do arithmetic. If you're a human and you do column addition, 12345+35791=58136 is just as big of a mistake as 48146 (the actual result is 48136). It's just one mistaken column in both. Binary half-adders work the same way.
We don't really know how LLMs do arithmetic. Maybe token edit distance would be interesting, but either way it doesn't really change the claim of the paper.
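For what it's worth, the two metrics do come apart on the column-addition example above. A quick check with plain Levenshtein distance on the digit strings (which may differ from the paper's exact token-level metric):

```python
def edit_distance(a, b):
    """Plain Levenshtein distance via the classic one-row DP."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        # 'prev' threads the diagonal value through the row update
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete from a
                                     dp[j - 1] + 1,      # insert into a
                                     prev + (ca != cb))  # substitute
    return dp[len(b)]

true_sum = str(12345 + 35791)  # "48136"
for guess in ("58136", "48146"):
    print(guess, edit_distance(guess, true_sum), abs(int(guess) - int(true_sum)))
# both guesses are one edit away, but their absolute errors are 10000 vs 10
```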
Yeah, and as an aside I wonder how hard it would be to train an LLM to do addition, multiplication, etc, human style? Presumably it should be possible at least in step-by-step style (as substitute for short-term memory), the same way that we do it.
Without using an algorithmic approach, it seems an LLM can only learn a bunch of partially correct heuristics, and attempt to generalize over examples.
I've played with this a bit in the past, and came to the conclusion that GPT-3 seems to have learnt to compare the size of numbers (whether accurately or via heuristics), and would get the approximate size of an answer right (depending on the task), even if not the actual value right. I seem to recall it also doing this for tasks like asking for a prime number greater than a particular value.
I mean, is it efficient to teach them addition human style instead of heuristics of when to call the right function?
Imagine you could say 'calc' in your brain, and some separate subcomponent of your brain that is far more power efficient could return an answer almost instantly. You would not focus on understanding addition/subtraction; instead you would focus on when to use addition/subtraction/multiplication or whatever.
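That division of labor is basically function calling. A toy sketch, where the routing rule and the `model` fallback are invented for illustration:

```python
import re

def answer(prompt, model):
    """Toy router: plain integer arithmetic goes to an exact 'calc'
    subcomponent; everything else falls through to the (hypothetical) model."""
    m = re.fullmatch(r"\s*(-?\d+)\s*([+*-])\s*(-?\d+)\s*", prompt)
    if m:
        a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
        return {"+": a + b, "-": a - b, "*": a * b}[op]
    return model(prompt)

print(answer("12345 + 35791", None))            # 48136, without touching the model
print(answer("what is love", lambda p: "..."))  # falls through to the model
```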
> is it efficient to teach them addition human style
Not if math is your only goal, but there'd be value in making the models more powerful so that they could learn to do simple things like this (and not so simple things) by themselves. You can't have a tool for everything, and hopefully future AGI can itself do things that are more than just a mashup of existing tool capabilities.
> ChatGPT specifically seems to absolutely adore the phrase [complex and multifaceted], using it at every opportunity to explain higher level concepts.
I'm curious why this is. Before the echo chamber, did the LLM learn some ontology of speech?
The title is an allusion to the experiment which samples the output of the 12th layer of LLaMA-7b. Surprisingly, there is a fair bit of structure in the principal components of those samples.
Usually, tagged memory at the hardware level goes along with support for tagged operations in the processor.
In the LISP machines, for example, you had an add instruction which would happily work correctly on pointers, floats, and integers, depending on the data type, at the machine-code level. That offers safety and also makes the compiler simpler.
But where this really shines is in things like, well, lists: since the tags can distinguish atoms from pairs and values like nil, fairly complex list-walking operations can be done in hardware, and pretty quickly too. It also makes hardware implementation of garbage collection possible.
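Roughly how such a generic add might dispatch on tags, modeled in Python with low-bit tags (the tag layout here is invented; real Lisp machines used wider tag fields and did the check in parallel hardware):

```python
# Toy model of hardware tag dispatch: the low two bits of each word are a type
# tag, and a single generic ADD checks tags before picking an operation.
TAG_INT, TAG_NIL, TAG_PAIR = 0b00, 0b01, 0b10

def make_int(n):
    return (n << 2) | TAG_INT   # fixnum: payload shifted past the tag bits

def tag_of(word):
    return word & 0b11

def generic_add(a, b):
    if tag_of(a) == TAG_INT and tag_of(b) == TAG_INT:
        return make_int((a >> 2) + (b >> 2))
    # on real hardware this would trap to a handler for floats, bignums, etc.
    raise TypeError("ADD trap: operands are not both fixnums")

print(generic_add(make_int(2), make_int(3)) == make_int(5))  # True
```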
This is just my intuition, but I suspect, these days, it all works out to about the same thing in the end. You use some of the cache for code that implements pointer tagging, or you can sacrifice some die area away from cache for hardwired logic doing the same thing. It probably is in the same ballpark of complexity and speed.
Speed, as is usual with hardware vs. software implementations of things. But the difference these days is smaller than it used to be, because processors spend most of their time stalled on all kinds of things, and the ALU processing capacity is underused most of the time.
An interesting question for a modern ISA design would be to figure out how to make tagged words and memory work well with SIMD.
> An interesting question for a modern ISA design would be to figure out how to make tagged words and memory work well with SIMD.
Not sure what the issue might be. Let’s say you’re doing a multiply-add: you’d call the one for the data type you want and if any operand were of the wrong type you’d get a fault. Am I missing something?
Consider that your SIMD instruction might itself take a tag mask, so the ALU need only do an equality check on the tag field. In fact it could do that in parallel with the ALU op; on a mismatch you could simply discard the current state and abort the operation. Realistically, though, you'd want the same set of SIMD instructions as an untagged architecture anyway.
Also I expect any compiler would assume that the contents of an array subject to SIMD computation would be homogeneous anyway, perhaps trying to enforce it elsewhere.
In any case this doesn’t seem like a big deal to me…but I could be wrong!
I guess to really get into it a good start would be to work with existing SIMD and take a quantitative approach to what is actually the most hot part of it. I wonder if any existing language implementations (eg Common Lisp) attempt to do these kinds of things in the first place.
https://github.com/0xekez/tinyLIRPA