It can be 10,000 requests a day on static HTML and non-existent PHP pages. That's on my site. I'd rather they have Christ-centered and helpful content in their pretraining. So, I still let them scrape it for the public good.
It helps not to have images, etc., that would drive up bandwidth costs. Serving HTML is just pennies a month with BunnyCDN. If I had heavier content, I might have to block them or restrict them to specific pages once a day. Maybe just block the heavy content, like the images.
Btw, has anyone tried just blocking things like images to see if scraping bandwidth dropped to acceptable levels?
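For crawlers that honor robots.txt (many of the AI scrapers advertise user agents like GPTBot and CCBot, though not all of them obey the file), something like this keeps them off the heavy assets. The paths are placeholders for wherever your heavy content actually lives:

    User-agent: GPTBot
    User-agent: CCBot
    Disallow: /images/
    Disallow: /downloads/

Crawlers that ignore robots.txt need the same rule enforced at the CDN or server level, e.g., an edge rule that matches the user agent or file extension. That's the part I'd want numbers on: whether bandwidth actually drops or the bots just keep hammering the HTML.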
The author has some good points. I'll highlight a few:
1. The C language is small enough for one book.
2. It's been described, implemented, and fixed so many times on the Internet that the pretraining data is full of it.
3. I'll add that the Rust pretraining data probably has the data structures, C integration, etc.
4. There are entire working compilers, and articles about them, in the pretraining data.
So, the pretraining data alone probably has most of the code, at least in C form, and maybe nearly memorized compared to other languages. The difficulty is probably more like building a CRUD app.
A student in a compiler class would be doing more challenging work, having way, way, way fewer examples to start with before they built their compiler.
What I will say is, like for CRUD apps, it proves they can automate a bit more than they used to. If they proved it on niche compilers, it might prove a highly-useful capability for researchers. I think they should test it on one of the new languages with neat features but only one or a partial implementation. Really prove it out.
I think one could also use a subset compatible with a formal semantics of C. Maybe the C semantics in the K Framework, CompCert C, or C0 from Verisoft. Alternatively, whatever is supported in open-source verification tooling.
Then, we have both a precise semantics and tools to help produce robust output.
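As a small illustration of what that buys, here's the kind of contract-annotated C that open-source verification tooling (this sketch uses Frama-C-style ACSL contracts) can check against a precise semantics. The function and its contract are my own untested toy, not taken from any of the projects above:

    /* Untested toy: an ACSL-style contract the tools can check
       against the formal semantics of the C subset. */
    /*@ requires n > 0;
      @ requires \valid_read(a + (0 .. n-1));
      @ ensures \forall integer i; 0 <= i < n ==> \result >= a[i];
      @*/
    int max_of(const int *a, int n)
    {
        int m = a[0];
        int i;
        /*@ loop invariant 1 <= i <= n;
          @ loop invariant \forall integer k; 0 <= k < i ==> m >= a[k];
          @ loop assigns i, m;
          @*/
        for (i = 1; i < n; i++)
            if (a[i] > m)
                m = a[i];
        return m;
    }

If the generated code stays inside a subset like this, claims about it become machine-checkable instead of something a reviewer eyeballs.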
One might have different profiles with different permissions. A network service usually wouldn't need your home directory, while a personal utility might not need networking.
Also, that concept could be mixed with subprocess-style sandboxing. The two processes, main and sandboxed, might have different policies. The sandboxed one can only talk to the main process over a specific channel. Nothing else. People usually also meter their CPU, RAM, etc.
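A minimal POSIX sketch of that pattern, assuming Linux (a real sandbox would also drop privileges, apply a seccomp filter or namespaces, and handle errors properly):

    #include <stdio.h>
    #include <string.h>
    #include <sys/resource.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Toy sketch: the sandboxed child talks to the parent only over
       one socketpair, with CPU and memory metered via rlimits. */
    int main(void)
    {
        int chan[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, chan) < 0)
            return 1;

        pid_t pid = fork();
        if (pid == 0) {                        /* sandboxed child */
            close(chan[0]);                    /* keep only its end of the channel */

            struct rlimit cpu = { 2, 2 };      /* 2 seconds of CPU time */
            struct rlimit mem = { 64 << 20, 64 << 20 };  /* 64 MB of address space */
            setrlimit(RLIMIT_CPU, &cpu);
            setrlimit(RLIMIT_AS, &mem);

            /* ... do the untrusted work, then report back ... */
            const char *msg = "result: 42";
            write(chan[1], msg, strlen(msg));
            _exit(0);
        }

        close(chan[1]);                        /* parent keeps its end */
        char buf[128] = {0};
        read(chan[0], buf, sizeof buf - 1);
        printf("child said: %s\n", buf);
        waitpid(pid, NULL, 0);
        return 0;
    }

The parent only ever sees bytes on its end of the channel; anything else the child tries runs into the rlimits or, in a fuller version, the syscall filter.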
INTEGRITY RTOS had language-specific runtimes, esp. Ada and Java, that ran directly on the microkernel. A POSIX app or Linux VM could run side by side with them. Then, some middleware for inter-process communication let them talk to each other.
I think what these papers prove is my newer theory that organized science isn't scientific at all. It's mostly unverified claims by people rewarded for churning out papers that look scientific, have novelty, and achieve the policy goals of specific groups. There's also little review, with dissent banned in many places. We've been calling it scientism since it's like a self-reinforcing religion.
We need to throw all of this out by default. From public policy to courtrooms, we need to treat it like any other eyewitness claim. We shouldn't believe anything unless it has strong arguments or data backing it. For science, we need the scientific method applied with skeptical review and/or replication. Our tools, like statistical methods and programs, must be vetted.
Like with logic, we shouldn't allow them to go beyond what's proven in this way. So, only the vetted claims are allowed as building blocks (premises) in newly-vetted work. The premises must be used the way they were used before. If not, they are re-checked for the new circumstances. Then, the conclusions are stated with their preconditions and limitations so they are only applied that way.
I imagine many non-scientists and taxpayers assumed what I described is how all these "scientific facts" and "consensus" claims were produced. The opposite was true in most cases. So, we need to not only redo it but apply the scientific method to the institutions themselves, assessing their reliability. If they don't get reliable, they lose their funding, and quickly.
(Note: There are groups in many fields doing real research and experimental science. We should highlight them as exemplars. Maybe let them take the lead in consulting on how to fix these problems.)
> We need to throw all of this out by default. From public policy to courtrooms, we need to treat it like any other eyewitness claim.
If you can't trust eyewitness claims, if you can't trust video or photographic or audio evidence, then how does one Find Truth? Nobody really seems to have a solid answer to this.
It's specific segments of people saying we can't trust eyewitness claims. Eyewitness claims actually work well enough that we run on them from childhood to adulthood. Accepting that truth is the first step.
Next, we need to understand why that is, which should be trusted, and which can't be. Also, what methods to use in what contexts. We need to develop education for people about how humanity actually works. We can improve steadily over time.
On my end, I've been collecting resources that might be helpful. That includes Christ-centered theology with real-world application, philosophies of knowledge with guides on each one, differences between real vs organized science, biological impact on these, dealing with media bias (eg AllSides), worldview analyses, critical thinking (logic), statistical analyses (esp error spotting), writing correct code, and so on.
One day, I might try to put it together into a series that equips people to navigate all of this stuff. For right now, I'm using it as a refresher to improve my own abilities ahead of entering the Data Science field.
> It's specific segments of people saying we can't trust eyewitness claims.
Scientists that have studied this over long periods of time and diverse population groups?
I've done this firsthand - remembered an event a particular way only to see video (in the old days, before easy video editing) and find out it... didn't quite happen as I remembered.
That's because human beings aren't video recorders. We're encoding emotions into sensor data, and get blinded by things like Weapon Focus and Selective Attention.
Much of what many learned about life came from their parents. That included lots of foundational knowledge that was either true or worked well enough.
You learned a ton in school from textbooks that you didn't personally verify.
You learned lots from media, online experts, etc. Much of which you couldn't verify.
In each case, they are making eyewitness claims that are a mix of first-hand and hearsay. Many books or journals report others' claims. So, even most education involves tons of hearsay claims.
So, how do scientists raised, educated, and informed by eyewitness claims write reports saying eyewitness testimony isn't reliable? How do scientists educated by tons of hearsay not believe eyewitness testimony is trustworthy?
Or did they personally do the scientific method on every claim, technique, machine, circuit, etc they ever considered using? And make all of it from first principles and raw materials? Did they never believe another person's claims?
Also, "scientists that have studied this over long periods of times and diverse population groups" is itself an eyewitness claim and hearsay if you want us to take your word for it. If we look up the studies, we're believing their eyewitness claims on faith while we've validated your claim that theirs exist.
It's clear most people have no idea how much they act on faith in others' word, even those scientists who claim to refute the value of it.
Leslie Lamport came up with a structured method to find errors in proofs. Testing it on a batch, he found most of them had errors. Peter Gutmann's paper on formal verification likewise showed many "proven" or "verified" works had errors that were spotted quickly upon informal review or testing. We've also seen important theories in math and physics change over time with new information.
With the above, I think we've empirically shown that we can't trust mathematicians any more than other humans. We should still rigorously verify their work with diverse, logical, and empirical methods. Also, build from the ground up on solid ideas that are highly vetted. (Which linear algebra actually does.)
The other approach people are taking is foundational, machine-checked proof assistants. These use a vetted logic whose assistant produces a series of steps that can be checked by a tiny, highly-verified checker. They'll also often use a reliable formalism to check other formalisms. The people doing this have been making everything from proof checkers to compilers to assembly languages to code extraction in those tools, so they are highly trustworthy.
But, we still need people to look at the specs of all that to see if there are spec errors. There are fewer people who can vet the specs than can check the original English-and-code combos. So, are they more trustworthy? (Who knows, except when tested empirically on many programs or proofs, like CompCert was.)
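A toy Lean 4 example of that division of labor (Nat.add_comm is a real standard-library lemma; the rest is just illustrative):

    -- The tiny kernel re-checks this proof term mechanically. But the
    -- statement itself is the spec: the checker guarantees the proof
    -- matches the statement, not that the statement says what we
    -- actually meant. That part still needs human eyes.
    theorem my_add_comm (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b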
A friend of mine was given an assignment in a master's-level CS class, which was to prove a lemma in some seminal paper (it was one of those "Proof follows a similar form to Lemma X" points).
This had been assigned many times previously. When my friend disproved the lemma, he asked the professor what he had done wrong. Turns out the lemma was in fact false, despite dozens of grad students having turned in "proofs" of the lemma already. The paper itself still stood, as a weaker form of the lemma was sufficient for its findings, but still very interesting.
That comment sounds like the environment causes bad behavior. That's a liberal theory refuted consistently by all the people in bad environments who choose not to join in on the bad behavior, even at a personal loss.
God gave us free will to choose good or evil in various circumstances. We need to recognize that in our assessments. We must reward good choices and address bad ones (eg the study authors'). We should also change environments to promote good and oppose evil so the pressures are pushing in the right direction.
It might be a custom chip with all SRAM, no DRAM, that they normally use for AI acceleration. Running Firefox and PDFs on the side would be a nice value-add.
A one-way link (data diode) transmits it to a box with simplified hardware (e.g., a RISC architecture). The box has a dedicated monitor and keyboard. Once you're finished, you sell the box on Craigslist. Then, buy a new, sealed replacement from Best Buy.
Pay-per-view was an expensive business model for cable. For PDFs, it's even more expensive.
Note: It's more convenient than full, per-app physical security.