
I'm working on a new memory-safe systems programming language that is supposed to be (almost) as fast as C, Rust, etc., but as simple and concise as Python: https://github.com/thomasmueller/bau-lang

There is a playground that uses a C compiler and WASM, so it is quite fast while running fully in the browser. There's also an online conversion tool to convert and compare source code. There are some benchmarks as well.

Writing my own (concise, simple) programming language has been a dream of mine since I was 20 or so. Feedback would be great!


The "Comparison" section feels unfair. Add some things that some languages have but Bau is missing. I have no idea what trade-offs I'd be doing when choosing Bau.

You have the "Non-Features" section of course, but I'm looking more into what I'd be losing by going from C to Bau. Bau's price for safety.


Thanks! Yes, that makes a lot of sense. I will improve this section!

> what I'd be losing by going from C to Bau

Well, you can add native C code, so in theory you do not miss much. But in practice, yes, of course many features are still missing.


I like the idea of compile-time asserting that no runtime check is needed for array access!


Thanks! I know Rust, Java etc do a good job of array bound check elimination, but I prefer a way to _ensure_ there is no check. I felt this was missing.
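
To make this concrete, here is a minimal Java sketch (the method names are made up; it only illustrates the JIT behavior, it is not Bau code):

    // The JIT can usually prove i < data.length in the first loop and drop the
    // bounds check; in the second loop it usually cannot, and there is no way
    // to *assert* at compile time that the check was removed.
    static long sumAll(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];          // bounds check typically eliminated
        }
        return sum;
    }

    static long sumIndirect(int[] data, int[] indices) {
        long sum = 0;
        for (int i = 0; i < indices.length; i++) {
            sum += data[indices[i]]; // bounds check usually remains
        }
        return sum;
    }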


> If you try to use it like your favorite functional/procedural/OO programming language, you will only get frustrated.

For me personally, the sentence could be shortened to just: "If you try to use it, you will only get frustrated."


So the sentence served its purpose and saved you the frustration :)


I appreciate the features of these languages (J, K, and Klong), but I do not understand why they have such hard-to-read syntax. I think it is a pity; I could see myself using them otherwise. Maybe it is by design, but I do not understand it.



Thanks! So here is an explanation:

> The same baseless accusations of “unreadable”, “write-only” and “impossible to learn” are leveled at all Iversonian languages, k included.

I argue that these accusations are far from baseless. Sure, I cannot cite a study, but neither can he. So I assume it's word against word.

> Readability is a property of the reader, not the language.

In my view this is an oversimplification. Having a steep learning curve clearly prevents adoption. The language design actively contributes to the problem. I'm just not sure why. Could it be that the authors of these languages do not _want_ them to be used by others?


Search for "teaching" at https://aplwiki.com/wiki/APL_conference. I count at least five papers about teaching non-APL topics using APL. The language is not only possible to read, it's designed for it.


The learning curve isn’t steep. I’d argue the opposite. K is a tiny, tiny language. Put your mind to it, and you’ll be reading it just fine in a weekend. It’s just different. It’s optimised for its wheelhouse.


You could say you enjoy toodling on the piano but don’t understand why reading music has to be so hard. It doesn’t _stay_ hard and it opens up new avenues.


I guess you could say the same about the Chinese language, but many people who have learned it understand it.


From what you describe, I think you have a somewhat special use case: it sounds like you are compiling it. The experiments done in the survey are not: instead, the hash function is used in the form of a data structure, similar to a Bloom filter ("deserialized" in your words). Do you use it for a parser? How many entries do you have? The survey uses millions of entries. "Startup time in the query time": I'm not quite sure what you mean, I'm afraid. Could you describe it?

I'm also not sure what you mean by "check for false positives"... each entry not in the set returns a random value, maybe for you this is a "false positive"? A typical solution for this is to add a "fingerprint" for each entry (just a hash value per entry) - that way there is still some false positive probability, which may or may not be acceptable (it is typically considered acceptable if the hash function is cryptographically secure, and the fingerprint size is big enough: e.g. 128 bits of the SHA-256 sum). Depending on the data, it may be faster to compare the value (e.g. in a parser).
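
A minimal sketch of the fingerprint idea in Java (the Mphf interface and the class name are just placeholders for whatever MPHF implementation is used; with f-bit fingerprints, the false positive rate is about 2^-f per lookup):

    import java.util.List;

    class FingerprintedSet {
        // Placeholder: an MPHF maps each of the n original keys to a distinct index in [0, n).
        interface Mphf { int evaluate(String key); }

        final Mphf mphf;
        final long[] fingerprints;

        FingerprintedSet(Mphf mphf, List<String> keys) {
            this.mphf = mphf;
            this.fingerprints = new long[keys.size()];
            for (String key : keys) {
                fingerprints[mphf.evaluate(key)] = fingerprint(key);
            }
        }

        // Returns the index of the key, or -1 if it is (almost certainly) not in the set.
        int indexOf(String key) {
            int i = mphf.evaluate(key);
            if (i < 0 || i >= fingerprints.length) return -1;
            return fingerprints[i] == fingerprint(key) ? i : -1;
        }

        // Toy 64-bit fingerprint; in practice use a stronger hash of the full key.
        static long fingerprint(String key) {
            long x = key.hashCode() * 0x9E3779B97F4A7C15L;
            x = (x ^ (x >>> 32)) * 0xBF58476D1CE4E5B9L;
            return x ^ (x >>> 32);
        }
    }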


using it like gperf is certainly not a special case. If your keys are fixed and known at compile time, you can also pre-compile the hash function with its data structures, and you are not forced to deserialize it at startup.

When comparing those MPHF query times, the startup time (the deserialization from disk) is 1000x higher than the actual query time. When you compile those data structures, the load time is instant. Also, memory usage is twice as low.


Many modern perfect hash functions are super close to the space lower bound (about 1.44 bits per key), often having just between 3% and 50% overhead. So your claim that the space consumption is "twice as low" is information-theoretically impossible. With gperf, the space consumption is in the machine code instead of a data structure, but it's definitely not "for free". In fact, I'm pretty sure that the size of the machine code generated by gperf is very far from optimal. The only exception is tiny input sets with just a couple of hundred keys, where gperf probably wins due to lower constant overheads.


My special use case is e.g. Unicode property checks, for which gperf is not big enough, and which have millions of keys. Other cases are integer keys.

I'm certainly not willing to load the keys and MPHF properties from disk at query time, as they are known in advance and can be compiled to C or C++ code in advance, which leads to an instant load time, as opposed to the costly deserialization times in all your tools.

Your deserialization space overhead is not calculated, and neither are the storage costs for the false-positive check. It's rather academic, not practical.


There is no deserialization time or space overhead. The measurements refer to the deserialized form. They are not loaded from disk.

About false positive checks, I think you misunderstand what a perfect hash function does.


See, everybody can see how you cheat your benchmarks. It's unrealistic to measure only the perfect hash function, when you discard the cost of the deserialization and false-positive checks.

That's always the case when dealing with you academics.

We don't care how you define your costs when we laugh about that. In reality you have to load the keys and the data structures, and check for existence. You even discard the existence checks by omitting the false-positive checks. Checking against a non-existing key will lead to a hit in your check. Only very rarely do you know in advance whether your key is in the set.


Also, you miss the space costs for the ordering.


> using it like gperf is certainly not a special case.

Well... let's put it like this: in this survey (of which I am one of the co-authors), "parsers" are not mentioned explicitly in the "Applications" section. They are a subset of "Hash Tables and Retrieval". There are many other uses: Approximate Membership, Databases, Bioinformatics, Text Indexing, Natural Language Processing. Yes, parsers are mentioned in "The Birth of Perfect Hashing". Maybe we can conclude that parsers are not the "main" use case nowadays.

> when you compile those data structures, the load time is instant.

Well, in this case I would recommend using a static lookup table in the form of source code. That way, it is available to the compiler at compile time, and doesn't need to be loaded and parsed at runtime: it is available as a data structure right away. All modern MPHF implementations in this survey support this usage. But, yes, they are not optimized for this use case: you typically don't have millions of keywords in a programming language.


A brute-force approach can quickly find a minimal perfect hash function. E.g., the RecSplit approach can be used for this case: first split into 4 sections, and then use another hash function for each section, or, in this case, the same one for each section:

    (hash(c + 4339104) & 7) * 4 + (hash(c + 201375) & 3)
Here, "hash" is a generic hash function (e.g. the Murmur hash, or the simple one below [1]):

    // Integer mixing function (see [1]); >>> is Java's unsigned right shift.
    long hash(long x) {
        x = ((x >>> 16) ^ x) * 0x45d9f3b;
        x = ((x >>> 16) ^ x) * 0x45d9f3b;
        x = (x >>> 16) ^ x;
        return x;
    }
As described in the linked paper, the fastest way to find such an MPHF for larger sets is nowadays Consensus-RecSplit.

[1] https://stackoverflow.com/questions/664014/what-integer-hash...
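
To sketch how such offsets can be found by brute force (Java, reusing the hash() above; this assumes exactly 32 keys, so that the 8 buckets of 4 slots in the formula are filled completely; the method names are made up):

    // First hash: must split the 32 keys into 8 buckets of exactly 4 keys each.
    // Second hash: must be a bijection onto {0,1,2,3} within every bucket.
    static long[] findOffsets(long[] keys) {
        for (long a = 0; ; a++) {
            if (!balanced(keys, a)) continue;
            for (long b = 0; b < 10_000_000; b++) {  // give up eventually and try the next a
                if (collisionFree(keys, a, b)) return new long[] { a, b };
            }
        }
    }

    static boolean balanced(long[] keys, long a) {
        int[] size = new int[8];
        for (long k : keys) {
            if (++size[(int) (hash(k + a) & 7)] > 4) return false;
        }
        return true;
    }

    static boolean collisionFree(long[] keys, long a, long b) {
        boolean[] used = new boolean[32];
        for (long k : keys) {
            int slot = (int) ((hash(k + a) & 7) * 4 + (hash(k + b) & 3));
            if (used[slot]) return false;
            used[slot] = true;
        }
        return true;
    }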


So I'm a bit confused: the paper says "'Randen' ... outperforming ... PCG", but now you wrote "though slower than pcg64"? In the paper itself, PCG is faster, but "these microbenchmark results are quite irrelevant in practice — which actual application repeatedly calls a random generator and ignores the results?" I wonder: was the paper peer-reviewed? (I'm not saying it should have been..., just wondering.)

My guess is that yes, Randen is fast, but it is not quite as fast as PCG or other pseudo-random number generators.

Whether or not it is relevant is another story. I cannot agree with your claim that "there are vanishingly few use cases" for insecure random number generators.


If you look inside Abseil, it seems that the authors generally agree on performance: the default random generator in Abseil is Randen, but the faster variant, whose use is discouraged, is PCG.


The paper doesn't seem peer-reviewed or published. It doesn't show up as published anywhere on Google Scholar, and papers that were published years later still reference the arXiv link, not a publication link. This is pretty strong evidence of it not being reviewed or published.

You can check via Google Scholar: search for "randen random number", note there is no version with a journal, then click on the citations and look through the few articles mentioning it. You can see which of those are published (a few are), and not one has a journal reference to it.


I was citing the micro benchmarks because they provide a pessimistic comparison in favor of other RNGs. My recollection is that it appeared as a conference paper, but I'd need to ask the author.


> the micro benchmarks because they provide a pessimistic comparison in favor of other RNGs

Why would one claim they're fast, then simply fiddle with benchmarks so they find a place where they look fast?

When someone wants a fast RNG, they're not calling it 10 times per sec. They're hammering it for actual needs.


This is a bit similar to a count-min sketch, which can be used to estimate frequencies in a stream.

Also interesting are "Invertible Bloom Lookup Tables" (IBLT), https://arxiv.org/pdf/1101.2245 , which allow adding and removing data, and allow retrieving the remaining data if "little" data remains. That means it can be used for error correction: add all the _correct_ data (let's say 1 GB of correct data) into a 1 MB IBLT. Let's say a downloader finds that the checksum doesn't match: they can download that 1 MB IBLT, and remove the data from their (invalid) download. What remains is the delta: the error correction data. I know, there are other ways you could do error correction, but this is very fast, and quite interesting technically.
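
A minimal IBLT sketch in Java (key-only cells, 3 positions per key, table size a multiple of 3; this only shows insert/remove and the peeling step, not a tuned implementation):

    import java.util.List;

    class Iblt {
        static final int K = 3;          // cells per key
        final int[] count;
        final long[] keySum;             // XOR of the keys in each cell
        final long[] checkSum;           // XOR of a key hash, to recognize "pure" cells

        Iblt(int cells) {                // cells should be a multiple of K
            count = new int[cells];
            keySum = new long[cells];
            checkSum = new long[cells];
        }

        void add(long key)    { update(key, +1); }
        void remove(long key) { update(key, -1); }

        private void update(long key, int d) {
            for (int i = 0; i < K; i++) {
                int c = cell(key, i);
                count[c] += d;
                keySum[c] ^= key;
                checkSum[c] ^= mix(key + 0x1234);
            }
        }

        // K disjoint sub-tables, so a key never hits the same cell twice.
        private int cell(long key, int i) {
            int sub = count.length / K;
            return i * sub + (int) Long.remainderUnsigned(mix(key ^ (i * 0x9E3779B97F4A7C15L)), sub);
        }

        // Peeling: repeatedly extract cells that contain exactly one key.
        // After "add the correct data, remove the downloaded data", addedOnly holds
        // the missing/overwritten blocks and removedOnly the corrupted ones.
        // Returns false if peeling gets stuck (too much data left).
        boolean listEntries(List<Long> addedOnly, List<Long> removedOnly) {
            boolean progress = true;
            while (progress) {
                progress = false;
                for (int c = 0; c < count.length; c++) {
                    if ((count[c] == 1 || count[c] == -1)
                            && checkSum[c] == mix(keySum[c] + 0x1234)) {
                        long key = keySum[c];
                        (count[c] == 1 ? addedOnly : removedOnly).add(key);
                        update(key, -count[c]);  // remove the recovered key everywhere
                        progress = true;
                    }
                }
            }
            for (int c : count) if (c != 0) return false;
            return true;
        }

        static long mix(long x) {
            x = (x ^ (x >>> 33)) * 0xff51afd7ed558ccdL;
            x = (x ^ (x >>> 33)) * 0xc4ceb9fe1a85ec53L;
            return x ^ (x >>> 33);
        }
    }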


IBLT operates over sets, though, and your downloading example is not over a set (the data has a position). You can make it work over data with positions by adding a position number to every encoding unit, but it's pure overhead. Other codes are more efficient.

IBLT is also particularly bandwidth inefficient when the amount of correction data is not very large. Ordinary codes can recover a missing block (known error location) with just the amount of data missing or with twice the amount if the location(s) are unknown.

There are codes with quasi-linear performance, though for your example application the number of errors is small, so it shouldn't really matter if the code is cubic in the number of errors to decode (which is about the worst any should be).


Yes, I know it is not designed to be an error-correction code, and other codes (turbo codes, fountain codes) are a lot more efficient. But I wanted to mention it because it's related to the topic.

> You can make it work over data with positions by adding a position number

Yes, that's what I did in my experiments. I personally found it really simple to implement; I'm not sure how easy it is to implement a "real" error correction code.


A fountain code is decoded by essentially the same mechanism; it's called peeling. It's quite fun to implement. Though a plain fountain code has similar bandwidth inefficiencies... also, the implementations that exist only correct erasures (though you can turn errors into erasures using a checksum/MAC).

You can see the fountain code as a very sparse linear system that is usually solvable through a fairly simple greedy algorithm. You can augment the sparse linear system with a dense one like an RS code, which is then solved by Gaussian elimination, which is also fun if less braindead simple.
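
A sketch of the peeling step in Java (the representation, one coverage set and one byte[] per received symbol, is mine, just to show the greedy algorithm):

    import java.util.*;

    class Peeling {
        // Each received symbol is the XOR of a known subset of source blocks.
        // Repeatedly take a symbol that covers exactly one unknown block,
        // recover that block, and substitute it into the other symbols.
        static byte[][] decode(int numBlocks, List<Set<Integer>> coverage, List<byte[]> symbols) {
            byte[][] recovered = new byte[numBlocks][];
            boolean progress = true;
            while (progress) {
                progress = false;
                for (int s = 0; s < symbols.size(); s++) {
                    Set<Integer> cover = coverage.get(s);
                    // substitute blocks that are already known into this symbol
                    for (Iterator<Integer> it = cover.iterator(); it.hasNext(); ) {
                        int b = it.next();
                        if (recovered[b] != null) {
                            xorInto(symbols.get(s), recovered[b]);
                            it.remove();
                        }
                    }
                    if (cover.size() == 1) {            // degree 1: peel it
                        int b = cover.iterator().next();
                        recovered[b] = symbols.get(s).clone();
                        cover.clear();
                        progress = true;
                    }
                }
            }
            return recovered;  // null entries: the greedy algorithm got stuck
        }

        static void xorInto(byte[] target, byte[] src) {
            for (int i = 0; i < target.length; i++) target[i] ^= src[i];
        }
    }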

This github user has a whole host of different competently implemented very high performance erasure codes:

https://github.com/catid/wirehair https://github.com/catid/leopard https://github.com/catid/fecal https://github.com/catid/siamese

As for direct error codes: there is a competent BCH code implementation in the Linux kernel, but it only works over small fields, so it doesn't support many independent blocks/packets.

Then there is https://github.com/bitcoin-core/minisketch which I am a coauthor of, which is a set corrector like IBLT that achieves perfect communication efficiency (at the trade-off of cubic decode performance instead of linear, but with good constant factors). I mention it mostly because it contains fast implementations of the relevant and tricky number theory algorithms needed for a fast RS error correcting code over big fields (any size up to 2^64 at least)... but I'm not actually aware of a performant large-field RS error correcting implementation. (IIRC there is an RS error corrector in GNU Radio, but again small field.)

(Mostly the small field stuff doesn't work for bigger fields because they use a Chien search to find the roots of a polynomial, which is linear in the field size. Minisketch finds roots by factoring in log(|field|) time. Root finding is needed in algebraic error codes because it's how you find where the errors are. IIRC the Linux kernel BCH implementation uses factoring like we do, but is constructed to use a log table to turn multiplications into xors.)



Well, what problem do you want to solve? What you describe is not a use case but a possible solution... this smells like an XY problem...


The problem is:

Represent (possibly overlapping, hence trees are tricky) subsets of size 50-100 out of a set of size 1000-2000.

Some amount of false positives is acceptable but not like 50%.


So, for this it sounds like you only need one Bloom filter (not multiple), and each subset is an _entry_ in the filter. The total set size doesn't matter; what matters (for the size of the Bloom filter) is the total number of entries you put into the Bloom filter, and the false positive rate. You can then do a membership test (with a configurable false positive rate; 1% is typical) to find out if an entry is in the set. BTW, you cannot use a Bloom filter to store and retrieve entries; for that, you need something else (an array, a hash map, or a retrieval data structure).
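
A minimal Bloom filter sketch in Java (rule of thumb: about 10 bits per entry with 7 hash positions gives roughly a 1% false positive rate; each subset would be hashed down to one 64-bit key before calling add):

    import java.util.BitSet;

    class BloomFilter {
        final BitSet bits;
        final int m, k;

        BloomFilter(int expectedEntries, double bitsPerEntry, int k) {
            this.m = Math.max(64, (int) (expectedEntries * bitsPerEntry));
            this.k = k;
            this.bits = new BitSet(m);
        }

        void add(long key) {
            for (int i = 0; i < k; i++) bits.set(position(key, i));
        }

        // False positives are possible (tunable via bitsPerEntry and k), false negatives are not.
        boolean mightContain(long key) {
            for (int i = 0; i < k; i++) if (!bits.get(position(key, i))) return false;
            return true;
        }

        private int position(long key, int i) {
            return (int) Long.remainderUnsigned(mix(key + i * 0x9E3779B97F4A7C15L), m);
        }

        static long mix(long x) {
            x = (x ^ (x >>> 33)) * 0xff51afd7ed558ccdL;
            x = (x ^ (x >>> 33)) * 0xc4ceb9fe1a85ec53L;
            return x ^ (x >>> 33);
        }
    }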


Yes, this is the succinct counting blocked Bloom filter I wrote. I wanted to write a paper about it. It requires twice the space of a blocked Bloom filter (which is about half the space of a regular counting Bloom filter). Reads are very fast, and updates are okish.


Well, thank you for that! That library has taught me a lot. :)


For me, the main problem with the regex syntax is the escaping rules: many characters require escaping: \ { } ( ) [ ] | * + ? ^ $ and . (the dot). And the rules are different inside square brackets. I think it would be better if literal text were enclosed in quotes; that way, much less escaping is needed, but it would still be concise (and sometimes more concise). I tried to formulate a proposal here: https://github.com/thomasmueller/bau-lang/blob/main/RegexV2....
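
To make the escaping point concrete, a small Java example (the pattern is made up; note that the host language's string syntax doubles every backslash on top of the regex escaping):

    import java.util.regex.Pattern;

    public class EscapeDemo {
        public static void main(String[] args) {
            String text = "cost is $5.00 (approx.)";
            // Classic regex: every metacharacter in the literal part needs a backslash,
            // and the Java string literal doubles each backslash again.
            Pattern p1 = Pattern.compile("\\$5\\.00 \\(approx\\.\\)");
            // Pattern.quote avoids it for pure literals, but not for mixed patterns.
            Pattern p2 = Pattern.compile(Pattern.quote("$5.00 (approx.)"));
            System.out.println(p1.matcher(text).find()); // true
            System.out.println(p2.matcher(text).find()); // true
        }
    }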


One thing I noticed with the example `['0-9a-f']`:

Doesn't this go against the "literals are enclosed in quotes" idea? In this case, you have a special character (`-`) inside a quoted string. IMO this would be more consistent: `['0'-'9''a'-'f'']`, maybe even have comma separation like `['0'-'9','a'-'f'']`. This would also allow you to include the character classes like `[d,'a'-'f'']` although that might be a little confusing if you're used to normal regex.


Thanks for reading and taking the time to respond!

> Doesn't this go against the "literals are enclosed in quotes" idea?

Sure, one could argue that other changes would also be useful, but then it would be less concise. I think the main reasons people like regexes are that they are (a) powerful and (b) concise.

For my V2 proposal, the new rule is "literals are enclosed in quotes"; the rule isn't "_only_ literals are enclosed in quotes" :-) In this case, I think `-` can be quoted as well. I wanted to keep the V2 syntax as close as possible to the existing syntax.

