LeoLM: German-Language LLM Research (laion.ai)
105 points by doubtfuluser on Sept 29, 2023 | 84 comments


Me:

> What kind of people do you meet on Hacker News? Aren't hackers dangerous?

LeoLM 13b chat:

> Hacker News is a website where people discuss technology, startups, and everything related to computer science. It is not unusual for people with different backgrounds and interests to read and contribute to Hacker News. So on Hacker News you meet people with a wide variety of job titles, experience levels, and interests. Hackers as such are not dangerous people, but rather people who are fascinated by working with computers. Hacking is simply a way of testing and manipulating computer programs to find out how they work. There are, however, a few people who use these skills to harm others. These individuals are not the majority, though.


The tone of that answer seems totally off to me. It reads like a school essay of the less inspired variety.


Also I find the description of what hacking is rather limited. At least in my understanding, hacking also applies to hardware and "stuff" in general, not only to "computer programs".


That is a quality answer.


It is. But how do you know that if you are not familiar with the topic already? It will give complete nonsense answers that look equally elegant.

Naive question (I don't work with LLMs): is there any research (figures, measures) on how many of the answers can be considered factually correct?


Is it able to transfer 'knowledge' it has in one language to other languages? Or is it strictly bound to language of the training data that provides the relevant facts?


Cool to see this, but I am always surprised how often LLM output is used to train other LLMs. They used GPT-3.5 Turbo and GPT-4 for multiple tasks, even simple translation of English benchmarks and writing German poems to train on, in order to create an LLM that works better in German. AFAIK, this still goes against the OpenAI TOS and also the basic idea that training on AI output leads to worse results in general. Was there some major shift in this over the years, or has it simply become the default approach because it is easy to do? Also nice to finally see something from Hessian.AI; as a local, I heard them talk big more than once but never saw results. I wonder what Aleph Alpha thinks about this, since they want to make "AI made in Europe to challenge OpenAI".
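For illustration, a minimal sketch of what such a translation step might look like with the OpenAI Python client of that era (the pre-1.0 ChatCompletion interface). The prompt wording and model choice are assumptions, not the LeoLM authors' actual pipeline:

    # Hypothetical sketch: translating an English benchmark item into German
    # with gpt-3.5-turbo. Prompt and model choice are illustrative only.
    import openai

    openai.api_key = "sk-..."  # your API key

    def translate_to_german(text: str, model: str = "gpt-3.5-turbo") -> str:
        """Translate one benchmark question/answer pair into German."""
        response = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": "Translate the following text into natural German. "
                            "Preserve formatting and answer labels exactly."},
                {"role": "user", "content": text},
            ],
            temperature=0,  # deterministic output for dataset construction
        )
        return response["choices"][0]["message"]["content"]

    print(translate_to_german("Q: What is the capital of France?\nA: Paris"))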


After reading the name, I thought that it was related to https://www.leo.org/ - which initially started out (decades ago) as a project of TUM (Technical University of Munich), but I guess the lion reference is because Hesse also has a lion in its crest, same as Bavaria...


I don't get it - is standard Llama bad at German? Other LLMs which I use regularly (like GPT-3.5 and GPT-4) work just fine in German out of the box, as well as in several other languages.


Those models (3.5 and 4) are presumably trained on enough German text to be fluent. Llama is not.


Standard Llama is bad at everything but English. It’s mostly trained on English and targeting English.


At least with Finnish, GPT-3.5 and GPT-4 are the only models that really work; the others are rather horrible.


Of course. The larger the model is (GPT-3.5+), the more languages it can hold in memory, so it is naturally multilingual.

But small models (Llama) need dedicated training to be good at another language.

It's similar to how high verbal IQ people can easily master multiple languages, but normal people are only good at their main language.


> It's similar to how high verbal IQ people can easily master multiple languages, but normal people are only good at their main language.

Let me guess: You are American?

That is very much not true. Go to the Netherlands, for example. Pretty much everybody under 50 speaks very good English. You don’t have to have a high verbal IQ, or be somehow abnormal to become good at a non-native language.


I guess the fact that Dutch and English are pretty closely related helps a lot here. There are definitely people who have trouble learning a language that is from a completely different family (it doesn't even have to be Chinese, try Finnish or Hungarian).


> It's similar to how high verbal IQ people can easily master multiple languages, but normal people are only good at their main language.

How? Are you implying "high verbal IQ people" have bigger "models" in their heads somehow?


There's a lot that goes into verbal intelligence, and we don't really know all of the what and how. It'd be cool if we did, but for now we can only really draw some correlations to physical areas of the brain, which are nevertheless fluid and unique in each individual, and some genetic links. Even then it isn't enough to account for half the stuff we observe humans doing – in Europe, learning two to four languages as a child is considered pretty much entirely normal.


Yeah, as far as I understand, NNs do mimic neurons in a simplistic way, but any parallel between them and human brains beyond that is bound to fail because we don't know how human brains work that well. We do not learn by scouring millions of tokens. And regular children can learn a language faster than a high-IQ adult.

I think learning multiple languages is normal for children in most places, especially if different social contexts require different languages. Lots of North Africans speak Arabic, French and English, for example, and some of them speak Berber languages on top of that.


That quote sounds like made-up nonsense. I'll wait for some evidence and ignore it in the meantime.


It’s also heavily dictated by cultural factors. In northern European countries it’s quite normal to know at least English next to your native language. So much so that in Finland tourists seem to think that everybody knows English, and foreigners who try to learn Finnish tend to complain that Finns switch to English at every opportunity.


Cultural factors are one thing, the other is just that there are only ~5.8 million native Finnish speakers, so a lot of books, movies, music, etc. are not available in the language for purely economic reasons. There are over 90 million German native speakers, and TV shows and movies are dubbed, books translated, etc., so there is less of a need to have high proficiency in English (similar to Spanish-speaking countries).


You'd be surprised, honestly. What matters is also comprehensible input size, so long works like LOTR (which has an excellent Finnish translation by Kersti Juva, for anyone looking for works to read) are a good choice. Similarly, there's a lot of Finnish literature in Finnish which Finnish people spend time reading, some of which is translated into many, many languages. The Egyptian by Mika Waltari is one of those classics. Not nearly everyone here knows English well. TV shows and movies are usually subtitled as well, save for kids' shows. Finland does, however, have three mandatory languages in basic schooling: your native language, the other national language, and then English. A fourth one is not uncommon either, so we have a lot of German, Spanish, Russian, Japanese etc. speakers here. Hell, jag kan neljää языки.


> It's similar to how high verbal IQ people can easily master multiple languages, but normal people are only good at their main language.

Normal people can be good at multiple languages; people living in large nations whose own language is also effectively the global lingua franca often don’t, because there is little marginal benefit for them to do so, not because of verbal IQ. People of similar “verbal IQ” who live in an area where a few hours’ drive can put them in places with three or four different primary languages will often be functional in several languages.


That last paragraph is a bold claim, my friend.


Not really? "Verbal IQ" has a specific meaning, one related to language learning.


Fair enough, though if it's meant that way it’s not that meaningful as a comparison(?). It loosely translates as: “People who are good with languages are good with languages”.


Historically, and even today on a global scale, monolingualism is the exception. In the absence of large state actors (i.e. through most of human history) languages diversify much more than we are aware of today, and this requires many more people to be familiar with multiple languages.


I’ve heard (credible-sounding) gossip that most folks hit a 1.5Tw 16-way mix on GPT-4. Is that like, consensus these days?

A couple of months ago I got 3.5-turbo to start talking about “the details” but that is about as reliable as anything it says. It was claiming like, 200Gw.

Are solid numbers available?


> A couple of months ago I got 3.5-turbo to start talking about “the details” but that is about as reliable as anything it says.

Less reliable, probably. Very unlikely this information is in the training set.


I'd be so interested in seeing how different the world views are of a model fine-tuned on English Wikipedia vs one fine-tuned on translated German Wikipedia.


And, in addition to that, which topics either model is more knowledgeable about, as there are lots of articles in either language that are missing a translation.


Yes, that could significantly change/improve adoption in Germany.


I’ve been thinking along these lines, and wondering whether smaller but more tailored language models will become more common. The advantage seems obvious in that they would have significantly lower resource requirements, both for inference and for fine-tuning, which matters when companies dealing with sensitive data (or simply complying with strict EU laws) want to run them locally.

It makes sense that this specialization shows up first with language coverage, because often only GPT-4 is considered truly fit for professional multilingual use, and that is an LLM which is expensive and cloud-only.


The German seems nearly free of mistakes. The factual content can, with high probability, be utter nonsense, no different from ChatGPT.

> Does the verb in German sentences always come at the end, as Mark Twain claimed?

> Jennifer Lopez's second studio album was released in 1998 and is called "On the 6".

> That answer rather missed the topic!

> That's true, I did not fully answer your question. I apologize for that.

German sentences do not always have to be written with the verb at the end, but it is common. The subject-verb-object (SVO) order is the most frequent word order in German, although it is not always followed. German has several possible sentence structures, and subject-verb-object is only one of them. For example, sentences with infinitive or participial constructions can also begin with just a noun and a preposition. In addition, the order of words within sentences can vary, as long as the basic sentence structure is preserved.

> Let's restrict ourselves to declarative sentences and main clauses. What are the rules for verb placement in that case?

> In German, the verb in a declarative sentence normally stands at the end of the sentence (final position), except in questions and commands, where the finite verb (the verb form indicating person and number) comes first. In declarative sentences, the verb position at the end of the sentence is called the standard verb position. However, this standard verb position does not always apply and can deviate in various ways. For example:

1. In questions and imperative sentences, the finite verb normally comes first, before the subject.

2. In the passive voice, the verb normally stands at the end of the sentence, but the form of the verb changes depending on tense, person, and number.

3. The copular verbs 'ein', 'bleiben', 'werden', 'heißen' and 'sehen' sometimes stand at the beginning of the sentence, especially in simple sentences or clauses.

4. Some verbs such as 'üssen', 'ollen', 'wollen', 'können' and 'ögen' can be used at the beginning of the sentence, especially in imperative sentences.

5. Reflexive pronouns and participles can occur with the verb, which normally stands at the end of the sentence, and refer to the same noun in the sentence.

6. In the perfect tense, the auxiliary verb 'haben'/'sein' is normally placed before the past participle, unless the sentence contains an auxiliary verb.

Using LLMs for anything other than writing poems is irresponsible!


Well, it just goes to show that if you already know the answer, you can use an LLM, but then it's also pointless. If you don't know the answer and use an LLM, you may think you learned something, but instead you now know less.


Response times were great 20 minutes ago. Hugged to death right now.


September 2023 was a massive month for LLM releases. I see 14 big highlights including the LeoLM family, and we haven't even finished the month yet...

Models table: https://lifearchitect.ai/models-table/

Models timeline: https://lifearchitect.ai/timeline/


Out of the box it seems to generate very long "Schachtelsätze" (deeply nested sentences). I wonder how English models manage to avoid this (system prompt? different training material mix?).

That said: similar to other large language models, you can get it to generate different styles of text.

For example, "antworte auf Wienerisch" ("answer in Viennese dialect") did change the writing style.


Would be interesting to know what is more difficult for AI to learn. German has a more flexible word order, which makes it more difficult for some speakers.

"Man bites dog" and "dog bits man" have two different meanings based on word order. In German, both orders are possible and it still has the same meaning.


> "Man bites dog" and "dog bits man" have two different meanings based on word order. In German, both orders are possible and it still has the same meaning.

That's not correct. Both orders are possible but they don't have the same meaning. The word order in this example has the same effect in German as it has in English.


Der Mann beisst den Hund.

Den Hund beisst der Mann.

Both sentences have the same meaning.

Beisst der Mann den Hund? would make it a question.

"Der Mann den Hund beisst" is grammatically wrong but still understandable.

I googled it and, strange coincidence, they even use my example: https://www.dartmouth.edu/~deutsch/Grammatik/WordOrder/WordO...

English tends to rely mostly on word order to indicate the grammatical function of a word or phrase, while German uses inflections. The German endings, such as those indicating the nominative, accusative, dative, and genitive cases in three different genders, allow for some greater flexibility in clause construction. Hence "Der Hund beißt den Mann" and "Den Mann beißt der Hund" both mean "The dog bites the man" (as opposed to "The man bites the dog"). Nevertheless, German word order is extremely important, even when it is not vital to meaning, and its correctness plays a major role in how a foreigner's command of the language is evaluated.

At the same time, word order is an infinitely complex aspect of language, never wholly mastered by non-native speakers. Very few rules cover all possibilities, and context often trumps other considerations. When Robert Frost writes, "Something there is that doesn't love a wall," it's poetic; if someone with a foreign accent says the same thing in conversation, it sounds like Yoda.


The literal translations of your examples, which don't have the same meaning, would be:

Mann beißt Hund

Hund beißt Mann

You added articles (der, den) which are not present in your English examples.


Yes, but I am not sure that "Mann beisst Hund" is even a sentence. Sounds like a newspaper headline. I don't think a German teacher would let this slip in an essay, unless it is done on purpose for stylistic reasons.


Yes, it could be a newspaper headline. Your examples need articles (the man, the dog) to be correct.


I'm perplexed by this and have never heard of it (I'm not a native German speaker). How would you then say "the man bites the dog"?


The most common way to translate this would be:

Der Mann beißt den Hund.

This puts the emphasis on the man. (Who bites the dog? The man.)

The other possible translation would be:

Den Hund beißt der Mann.

This puts the emphasis on the dog. (Whom does the man bite? The dog.)


The man bites the dog (my example):

Der Mann beisst den Hund. Den Hund beisst der Mann.

The dog bites the man:

Der Hund beisst den Mann. Den Mann beisst der Hund.

This is an easy example because inflections are not really important here. Nouns can change and indicate things like ownership: chair/house - Stuhl/Haus, for example.

Der Stuhl des Hauses ("the chair of the house"). Haus changes to Hauses, indicating ownership.


This looks great! I'm currently building something like a "GPT-powered Anki", in which you can infinitely generate sentences for each word you are learning, but German GPT sentences are sometimes grammatically incorrect. Not ideal for a language-learning app.
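A toy sketch of that idea, for illustration: for each vocabulary word, ask a chat model for a few example sentences. The helper name, prompt wording, and model choice are made up here, and the same pre-1.0 OpenAI interface as above is assumed:

    # Illustrative only: generate example sentences for a flashcard word.
    import openai

    def example_sentences(word: str, n: int = 3) -> list[str]:
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system",
                 "content": "You write short, grammatically correct German "
                            "example sentences for language learners."},
                {"role": "user",
                 "content": f"Write {n} example sentences using the word "
                            f"'{word}'. One sentence per line, no numbering."},
            ],
            temperature=0.7,
        )
        return response["choices"][0]["message"]["content"].splitlines()

    for sentence in example_sentences("Hund"):
        print(sentence)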


There's this German-English dictionary https://dict.leo.org/. I'm wondering what the Leo connection with German is.


None, pure coincidence. leo.org actually comes from "link everything online" and the use of the lion in the Bavarian coat of arms. It used to be an FTP site, which turned into an HTTP archive site, which turned, if my memory does not betray me, into some kind of web directory. Then a couple of students added various helper scripts, among them a dictionary, and nowadays only the dictionary service survives, now commercialized.


So this seems completely open source. Naive question: if I wanted to run this at home just for fun (basically one slightly patient user), what kind of machine(s) would I need?


You can run a quantized 7B or 13B model on almost any machine, on the CPU, if you're patient.
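For example, a minimal sketch of CPU-only inference with llama-cpp-python (pip install llama-cpp-python), assuming you have already obtained or converted a quantized GGUF build of the model; the file name below is a placeholder:

    # Minimal sketch: CPU inference with a quantized model via llama-cpp-python.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./leo-hessianai-13b-chat.Q4_K_M.gguf",  # placeholder file name
        n_ctx=2048,    # context window
        n_threads=8,   # CPU threads; tune to your machine
    )

    out = llm(
        "Was für Leute trifft man auf Hacker News?",
        max_tokens=256,
        temperature=0.7,
    )
    print(out["choices"][0]["text"])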


Maybe a dumb question but, are there advantages of using Llama instead of ChatGPT besides cost and the fact that it’s open?


When you publish a model, it’s better to publish the weights. You cannot access the weights of ChatGPT fine-tuned models, which is a dealbreaker.


Perhaps “As a large language model, I’m unable to…”?


[stub for offtopicness]


It is really just a language model, but because German sentences are about 4 times longer, it is technically a large language model.


This reminds me of Mark Twain’s famous German rant: https://faculty.georgetown.edu/jod/texts/twain.german.html

I wonder if someone can inspect the weights and see if the attention goes a different way because the verbs are all the way at the end of the sentence.


Verbs are in the second position in a main clause. The difference from English is that the first position can be anything in German. If it's not the subject, the subject will follow the verb.

Verbs are at the end in subordinate clauses.


Well, I would call a 7B and 13B model a large language model… but I appreciate your humor.


Genau! Haha...


[flagged]


This comment was the last straw that made me fire off an email to dang regarding the recent decline in comment quality, along with a few suggested structural changes that I think may help; let’s see what comes of it.


Your account was created in 2023 and you talk as if you have been on this site since the beginning of time.


To be fair, I have read HN regularly since 2016 but only created an account after the Reddit fiasco, looking to engage more actively. Could be a similar situation.


I’ve been a reader of this site since 2018, but only recently created an account.


Then you should know that people have been frequently mentioning Reddit since 2009, which is why it's even mentioned in the guidelines.


The Redditization is real this time though. Just because something wasn’t true previously doesn’t mean it continues to be that way.


"Please don't post comments saying that HN is turning into Reddit. It's a semi-noob illusion, as old as the hills."

Read the guidelines.


What changes do you propose?


We train an LLM classifier using an HN dataset and a Reddit dataset to automatically flag the Reddit-like comments.

People can then decide to hide them in their settings.

I’m not sure when to put the cut-off date for the HN dataset. The Eternal September started in 1993 and HN in 2007; we need at least a few comments.
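For what it's worth, a toy sketch of such a classifier with scikit-learn; the comment lists below are placeholders standing in for real HN and Reddit dumps:

    # Toy sketch of the proposed "Reddit-like comment" classifier.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Placeholder corpora; in practice, load scraped HN and Reddit comments.
    hn_comments = [
        "The benchmark methodology seems off; they only test on translated data.",
        "We migrated to Postgres for this exact reason, happy to share numbers.",
        "Has anyone profiled the memory usage of the 13B model on CPU?",
    ]
    reddit_comments = [
        "lmao this is so true",
        "underrated comment right here",
        "this. so much this.",
    ]

    texts = hn_comments + reddit_comments
    labels = [0] * len(hn_comments) + [1] * len(reddit_comments)  # 1 = Reddit-like

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    clf.fit(texts, labels)

    # Probability that a new comment reads "Reddit-like":
    print(clf.predict_proba(["came here to say this"])[:, 1])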


There’s no need for a cut-off date; just take highly upvoted comments off Reddit and find similar comments on HN as determined by word vectorization, and then train a model off of that. Or maybe even skip the similarity step; after all, the goal is to find Reddit-like comments.


Primarily, caps on karma earning and on voting, which should (at least I think) nudge people towards more thoughtful voting patterns and comments.

The unintended effect may be that high-contribution users are disincentivized from participating.


Be patient. Submissions pick up dumb comments early, because they're quick and easy to write, and better comments later.


It's true; I won't start reading a thread less than a day old, if I'm planning to read-only.


[flagged]


Please don't do this here.


Ach du lieber!


höhöhö


16 hours behind the Lidl without pedaling


[flagged]


Please don't do this here.


Easier said than done


What are you talking about?


What are you talking about? Apropos of what?


OP was referring to Nazis.


[flagged]


[flagged]


How many German LLMs do you need to change a lightbulb?

One. They're efficient and don't have humor.



