Gemini Flash

simonw · on May 14, 2024

I upgraded my llm-gemini plugin to provide CLI access to Gemini Flash:

    pipx install llm # or brew install llm
    llm install llm-gemini --upgrade
    llm keys set gemini
    # paste API key here
    llm -m gemini-1.5-flash-latest 'a short poem about otters'

https://github.com/simonw/llm-gemini/releases/tag/0.1a4

xianshou · on May 14, 2024

Looking at MMLU and other benchmarks, this essentially means sub-second first-token latency with Llama 3 70B quality (but not GPT-4 / Opus), native multimodality, and 1M context.

Not bad compared to rolling your own, but among frontier models the main competitive differentiator was native multimodality. With the release of GPT-4o I'm not clear on why an organization not bound to GCP would pick Gemini. 128k context (4o) is fine unless you're processing whole books/movies at once. Is anyone doing this at scale in a way that can't be filtered down from 1M to 100k?

Workaccount2 · on May 14, 2024

With 1M tokens you can dump 2000 pages of documents into the context windows before starting a chat.

Gemini's strength isn't in being able to answer logic puzzles, it's strength is in its context length. Studying for an exam? Just put the entire textbook in the chat. Need to use a dead language for an old test system with no information on the internet? Drop the 1300 page reference manual in and ask away.

ianbicking · on May 14, 2024

How much do those input tokens cost?

According to https://ai.google.dev/pricing it's $0.70/million input tokens (for a long context). That will be per-exchange, so every little back and forth will cost around that much (if you're using a substantial portion of the context window).

And while I haven't tested Gemini, most LLMs get increasingly wonky as the context goes up, more likely to fixate, more likely to forget instructions.

That big context window could definitely be great for certain tasks (especially information extraction), but it doesn't feel like a generally useful feature.

mcbuilder · on May 14, 2024

That per exchange context cost is what really puts me off using cloud LLM for anything serious. I know batching and everything is needed in the data center, and important for keeping around KVQ cache, you basically need to fully take over machine to get an interactive session to get the context costs to scale with sequence length. So it's useful, but more in the case of a local LLaMA type situation if you want a conversation.

neverokay · on May 15, 2024

It makes building any app that requires generous user prompting impossible to build for regular developers (cloud pricing).

$20 hosting can serve thousands of users per month. $20 llm sub services just one person. This is fucking impossible.

cyrusalwati · on May 20, 2024

How are you hosting for $20 per month? I've been under the impression that I'd need to pay closer to $200 to self host.

I'm building a small assistant tool and thought I'm forced to use APIs!

falcor84 · on May 14, 2024

I wonder if we could implement the equivalent of a JIT compilation, whereby context sequences which get repeatedly reused would be used for an online fine-tuning.

sp332 · on May 15, 2024

No, but you can just cache the state after processing the prompt. https://github.com/ggerganov/llama.cpp/tree/master/examples/...

HumanOstrich · on May 15, 2024

[flagged]

nostrebored · on May 15, 2024

They are asking if you can take the context being passed per interaction and train it into a session in real time (via an online algorithm).

Essentially bake the context passed in to the attention layer so that you can pass only the relevant chat context.

Your post wasn’t a particularly charitable interpretation.

lxgr · on May 14, 2024

Is there a way to amortize that cost over several queries, i.e. "pre-bake" a document into a context persisted in some form to allow cheaper follow-up queries about it?

simonw · on May 14, 2024

They announced that today, calling it "context caching" - but it looks like it's only going to be available for Gemini Pro 1.5, not for Gemini Flash.

It reduces prompt costs by half for those shared prefix tokens, but you have to pay $4.50/million tokens/hour to keep that cache warm - so probably not a useful optimization for most lower traffic applications.

https://ai.google.dev/gemini-api/docs/caching

dragonwriter · on May 14, 2024

> It reduces prompt costs by half for those shared prefix tokens, but you have to pay $4.50/million tokens/hour to keep that cache warm - so probably not a useful optimization for most lower traffic applications

That's on a model with $3.5/1M input token cost, so half price on cached prefix tokens for $4.5/1M/hour breaks even at a little over 2.5 requests/hour using the cached prefix.

inlined · on May 14, 2024

Though I'm not familiar with the specifics, they announced "context caching"

gcanyon · on May 14, 2024

Depending on the output window limit, the first query could be something like: "Summarize this down to its essential details" -- then use that to feed future queries.

Tediously, it would be possible to do this chapter by chapter in order to exceed the output limit building something for future inputs.

Of course, the summary might not fulfill the same functionality as the original source document. YMMV

bredren · on May 14, 2024

Can anyone speculate on how G arrived at this price, and perhaps how it contrasts with how OAI arrived at its updated pricing? (realizing it can't be held up directly to GPT x at the moment)

tk90 · on May 14, 2024

Isn't there retrieval degradation with such a large context size? I would still think that a RAG system on 128K is still better than No Rag + 1M context window, no? (assuming text only)

cathalc · on May 15, 2024

Absolutely. Gemini results tend to drop off after 128k tokens according to RULER: https://github.com/hsiehjackson/RULER

ayyndrew · on May 16, 2024

If I'm reading the paper right, it says Gemini 1.5 is good up to 128k, and RULER doesn't test beyond 128k

cathalc · on May 16, 2024

Hmm perhaps you are right.. There's no context on the page about what ">128k" means so possibly a bad assumption on my part!

afro88 · on May 15, 2024

Not sure why you've been downvoted. Needle in a haystack testing exists for a reason

tulip4attoo · on May 14, 2024

You don't really use it, right? There's no way to debug if you're doing it like this. Also, the accuracy isn't high, and it can't answer complicated questions, making it quite useless for the cost.

dang · on May 14, 2024

Please make your substantive points without crossing into personal attack. Your comment would be fine without the first sentence.

https://news.ycombinator.com/newsguidelines.html

leetharris · on May 14, 2024

There's no way it's Llama 3 70b quality.

I've been trying to work Gemini 1.5 Pro into our workstream for all kinds of stuff and it is so bad. Unbelievable amount of hallucinations, especially when you introduce video or audio.

I'm not sure I can think of a single use case where a high hallucination tiny multimodal model is practical in most businesses. Without reliability it's just a toy.

dibujaron · on May 14, 2024

Seconding this. Gemini 1.5 is comically bad at basic tasks that GPT4 breezes through, not to mention GPT4o.

dragonwriter · on May 14, 2024

> With the release of GPT-4o I'm not clear on why an organization not bound to GCP would pick Gemini.

Price for anything, particularly multimodal tasks that with OpenAI GPT-4o is the cheapest model, that doesn't need GPT-4 quality. GPT-3.5-Turbo — which itself is 1/10 the cost of GPT-4o, is $0.5/1M tokens on input, $1.50/1M on output, with a 16K context window. Gemini 1.5 Flash, for prompts up to 128K, is $0.35/1M tokens on input, and $0.53/1M tokens on output.

For tasks that require multimodality but not GPT-4 smarts (which I think includes a lot of document-processing tasks, for which GPT-4 with Vision and now GPT-4 are magical but pricy), Gemini Flash looks like close to a 95% price cut.

mupuff1234 · on May 14, 2024

I think that's a bit like asking why would someone need a 1gb Gmail when 50mb yahoo account is clearly enough.

It means you can dump context without thinking about it twice and without needing to hack some solutions to deal with context overflow etc.

And given that most use cases most likely deal with text and not multimodal the advantage seems pretty clear imo.

tedsanders · on May 14, 2024

Long context is a little bit different than extra email storage. Having 1 gb of storage instead of 50 mb has essentially no downside to the user experience.

But submitting 1M input tokens instead of 100k input tokens:

- Causes your costs to go up ~10x

- Causes your latency to go up ~10x (or between 1x and 10x)

- Can result in worse answers (especially if the model gets distracted by irrelevant info)

So longer context is great, yes, but it's not a no-brainer like more email storage. It brings costs. And whether those costs are worth it depends on what you're doing.

thefourthchime · on May 14, 2024

I tried to use the 1M tokens with Gemini a couple of months ago. It either crashed or responded ___very__ slowly and then crashed.

I tried a half dozen times and gave up, I hope this one is faster and more stable.

neverokay · on May 15, 2024

Context length isn’t the same as context volume I think. Just input the 1m tokens slower, it’ll still be in context.

killerstorm · on May 14, 2024

I guess it depends on what you want to do.

E.g. I want to send an entire code base in a context. It might not fit into 128k.

Filtering down is a complex task by itself. It's much easier to call a single API.

Regarding quality of responses, I've seen both disappointing and brilliant responses from Gemini. Do maybe worth trying. But it will probably take several iterations until it can be relied upon.

mountainriver · on May 15, 2024

1m is great for multimodal agentic workflows where you need to keep track of history

treprinum · on May 14, 2024

GPT-3.5 has 0.5s average first-token latency and Claude3 Haiku 0.4s.

chimney · on May 14, 2024

Price.

causal · on May 14, 2024

1M token context by default is the big feature here IMO, but we need better benchmarks to measure what that really means.

My intuition is that as contexts get longer we start hitting the limits of how much comprehension can be embedded in a single point of vector space, and will need better architectures for selecting the relevant portions of the context.

dragonwriter · on May 14, 2024

> 1M token context by default is the big feature here IMO, but we need better benchmarks to measure what that really means.

Multimodality in a model That's between 4-7% the cost per token of OpenAI’s cheapest multimodal model is an important feature when you are talking about production use and not just economically unsustainable demos.

leetharris · on May 14, 2024

The problem is that even 1.5 Pro seems completely useless for long context multimodal stuff.

I have tried it for so many use cases in video / audio and it hallucinates an unbelievable amount. More than any other model I've ever used.

So if 1.5 Pro can't even handle simple tasks without hallucination, I imagine this tiny model is even more useless.

refulgentis · on May 14, 2024

In preview, can't be used in production, they already rug-pulled people building on Gemini w/r/t cost and RPM, and they're pointedly not putting any RPM or cost on the page. (seriously, try finding info on cost, RPM, or release right now, you're linked in circles.)

Agree on OpenAI multimodal but it's sort of a stilted example itself, it's because OpenAI has a hole in its lineup - ex. Claude Haiku is multimodal, faster, and significantly cheaper than GPT 3.5.

dragonwriter · on May 14, 2024

> they're pointedly not putting any RPM or cost on the page

360 RPM base limit, pricing is posted.

> seriously, try finding info on cost, RPM, or release right now,

I wasn't making up numbers, its on their Gemini API pricing page: https://ai.google.dev/pricing

refulgentis · on May 14, 2024

Nice, thanks (btw, I didn't think you were making it up, it was in the keynote!)

causal · on May 14, 2024

+1 on Haiku being oft overlooked.

verdverm · on May 14, 2024

Shows the power of the brand and the limit of names consumers will recall long term

"Who are the biggest soda or potato chip makers?"

refulgentis · on May 14, 2024

[flagged]

memothon · on May 14, 2024

Can you explain this comment further? I don't really understand your point here.

refulgentis · on May 14, 2024

Given the one comment, I wouldn't draw too many conclusions re: Anthropic.* SNR ratio on AI everywhere is low-ish, and optimizing for a low cost multimodal model is a business-directed need that's rather niche currently.

* I was very pleasantly surprised to see it ranking as high as #4, and its still top #50, in App Store.

verdverm · on May 14, 2024

1. Not an AI specific comment series, it applies generally

2. Comments like yours are against the guidelines of HN

3. You don't know what I do and are incorrect in your assessment of my work and experience (another common HN mishap)

---

(re: #2/3) Please make your substantive points without crossing into personal attack.

https://news.ycombinator.com/newsguidelines.html

anoncareer0212 · on May 14, 2024

What do you mean?

A) This fills a gaping hole for cheap multimodal models, OpenAI doesn't have one

B) Anthropic's Haiku is a good choice.

You) wow A didn't know Anthropic. Goes to show power of brands, much like snack foods

B) Eh I wouldn't conclude anything from A. Its one comment. some people don't know what an Anthropic is because there's high interest in AI relative to interest in AI APIs. you can expect a low SNR ratio, even on HN

You) Stop personally attacking me! It's against the rules!!

verdverm · on May 14, 2024

> You) Stop personally attacking me! It's against the rules!!

Your comment history certainly shows a disregard for the HN guidelines

https://news.ycombinator.com/newsguidelines.html

refulgentis · on May 14, 2024

I don't think its helpful to broaden the aspersions: you had a couple hours to cool off, and also got signal from the community that you went overboard.

It's clear I was talking to you, not about you, you can tell because it's a reply to you.

I do understand it would feel awful if I replied to you, ignoring you, and instead voicing to some anon 3rd person you don't know Anthropic pricing. Someone with looking to bully could possibly use that to say you are an amateur who doesn't know what you're talking about w.r.t to AI and only knows a brand name or two.

If anyone does that, let me know, I'll correct their bullying...pulls on sunglasses...with extreme prejudice.

verdverm · on May 15, 2024

> signal from the community that you went overboard.

What signal are you referring to?

That you and one other person are making low brow comments that do not enhance the conversation?

refulgentis · on May 15, 2024

> What signal are you referring to?

1. The comment where you began getting upset and attacking people is grayed out, the only one in this thread.

2. The other person straightforwardly explaining to you I was replying to you, not talking about you. (which, I do understand why you're taking personally, but humbly, it isn't a rule violation or close, or "low-brow", it's well within bounds of adult conversation)

3. Me agreeing the reply was a reply, I don't know you or judge you, and if anyone does, I gave a nice playful comment telling you I'd fight for your honor.

I humbly suggest taking a step back from this thread: it can be incredibly frustrating when you feel you were wronged and no one is listening, I feel ya. I tried giving you an e-hug without being condescending with the sunglasses comment. We both agree its outside the bounds of HN to continue this. Yell at me on Twitter for being sneaky and lying and saying you're bad at AI and making low-brow comments, @jpohhhh.

WhitneyLand · on May 15, 2024

Limitations of single point in vector space of what dimension?

I’m not sure it’s public knowledge, but it’s an architecture choice. They choose how big to make the embedding dimension.

My point is just that there’s no limitation in principle, it’s just a matter of how they design it and resource constraints.

causal · on May 15, 2024

Thanks for responding to that point - it's the one most on my mind.

So OpenAI's large embedding model has 3072 dimensions, though in practice far fewer are probably used. Clearly you can't compress 1M tokens down to 3072. Yet those 3072 numbers are all you've got for capturing the full meaning of the previous token when predicting the next one; including all 1M tokens of modifying context.

So perhaps human language is simply never complex enough to need more than 3072 numbers to represent a given train of thought, but that doesn't seem clear to me.

Edit: Since Gemini is relevant here, it looks like their text embedding model is 768 dimensions.

WhitneyLand · on May 15, 2024

Yes but we can distinguish between embedding provided to customers, and internal embeddings. One is optimized for usage in certain types of applications, but the internal embeddings need to be optimized to support long contexts and are not constrained by the customer facing embeddings.

For example, with Open AI I believe it’s known that the internal dimension for Gpt3 was 12,288.

causal · on May 15, 2024

Are the same embeddings not used internally? I thought they were. Maybe I'm wrong about that.

Mistral uses a 1024 dimension embedding for 8K context. I think the point about trying to capture that rich of a context into a smaller number of dimensions still stands?

WhitneyLand · on May 15, 2024

Yes definitely, you have a valid concern.

For long contexts this is a key consideration along with what self attention optimizations the model chooses to implement.

They don’t make this public, but we can infer they can’t be using full self attention pairs at 1,000,000 tokens because it scales quadratically and would take Terabytes of RAM.

There are different approaches like sparse attention, and the only way to really know how well their choices work is to test it.

neverokay · on May 15, 2024

So perhaps human language is simply never complex enough to need more than 3072 numbers to represent a given train of thought, but that doesn't seem clear to me.

Will compute allow that number to go up? Or is that an optimal number?

WhitneyLand · on May 15, 2024

Definitely has trended upward, there’s no special number. It’s just a matter of how much compute, storage, time to allocate to that part of the architecture.

causal · on May 15, 2024

Has it? It's my understanding that GPT-3 was 12,288 and GPT-4 went down to 3072.

WhitneyLand · on May 15, 2024

I’m not aware that internal embedding dimension size has been made public for Gpt4 et al.

In general for models we know its trended upward, but for sure it’d be interesting to know what they’re using now.

shoelessone · on May 15, 2024

> My intuition is that as contexts get longer we start hitting the limits of how much comprehension can be embedded in a single point of vector space, and will need better architectures for selecting the relevant portions of the context.

Is it possible to explain what this means in a way that somebody only roughly familiar with vectors and vector databases? Or recommend an article or further reading on the topic?

causal · on May 15, 2024

So most of my understanding comes from this series, particularly the last two videos: https://www.3blue1brown.com/topics/neural-networks

Essentially each token of a text occupies a point in a many-dimensional model that represents meaning, and LLMs predict the next token by modifying the last token with the context of all the tokens before it. Attention heads are basically a way of choosing which prior tokens are most relevant and adjusting the last token's point in vector-space accordingly.

sojuz151 · on May 15, 2024

>My intuition is that as contexts get longer we start hitting the limits of how much comprehension can be embedded in a single point of vector space, and will need better architectures for selecting the relevant portions of the context.

We are dealing with multi-headed attention, therefore we have multiple points per token. You can always increase the number of heads or the size of the key vector.

causal · on May 15, 2024

The token embedding is what ultimately gets nudged around by the heads though, right? The key vector just relates to the context size, not the token embedding size, afaik.

refulgentis · on May 14, 2024

Yeah it's not very good in practice, you can get a halfway decent demo out of it ("look I gave it 6.5 harry potters and it made an SVG map connecting characters with annotations!!"...some of the characters...spare annotations...cost $20). Just good enough to fool you a couple times when you try to make it work 10 times.

nightski · on May 14, 2024

A lightweight model that you can only use in the cloud? That is amusing. These tech megacorps are really intent on owning your usage of AI. But we must not let that be the future.

dragonwriter · on May 16, 2024

One thing OpenAI is beating Google at is actually publishing pricing for its APIs (and being consistent as to what they are called.)

Google has I think 10 models available (there’s more than ten model names, but several of the models have multiple aliases) through what the Google Cloud console calls the Generative Language API (the documentation calls it the Gemini API) – based on enumerating the model list through the API itself.

Of those, 3 have pricing information on the documentation page for Gemini API pricing, 2 of which are in preview so that pricing applies in the future.

Only one (the same one of the 3 on the documentation page that is not in preview) has pricing listed on the console for the Generative Language API. On the Cloud SKUs list, there is no Generative Language API, but there is for the Gemini API, with, again, the same one model. On the Cloud Price list which the console page links for the “latest pricing” (why are there so many different things?) neither the Generative Language API nor the Gemini API is listed at all.

kherud · on May 14, 2024

Now that context length seems abundant for most tasks, I'm wondering why sub-word tokens are still used. I'm really curious how character-based LLMs would compare. With 2 M context, the compute bottleneck fades away. I'm not sure though what role the vocabulary size has. Maybe a large size is critical, since the embedding already contains a big chunk of the knowledge. On the other hand, using a character-based vocabulary would solve multiple problems, I think, like glitch tokens and possibly things like arithmetic and rhyming capabilities. Implementing sub-word tokenizers correctly and training them seems also quite complex. On a character level this should be trivial.

AaronFriel · on May 14, 2024

The attention mechanism is vastly more efficient to train when it can attend to larger, more meaningful tokens. For inference servers, a significant amount of memory goes into the KV cache, and as you note, to build up the embedding through attention would then require correlating far more tokens, each of which is "less meaningful".

I think we may get to this point eventually, in the limit we will want multimodal LLMs that understand images and sounds down to the pixel and frequency, and it seems like for text, too, we will eventually want that as well.

thomasahle · on May 14, 2024

Maybe you could just use a good-old 1D-CNN for the bottom 3-4 layers. Then the model has been able to combine characters into roughly token length chunks anyway.

Just make sure to have some big MLPs at the start too, to enrich the "tokens" with the information currently stored in the embedding tables.

yk · on May 14, 2024

> a significant amount of memory goes into the KV cache

Is there a good paper (or talk) how inference looks at scale? (Kinda like ELI-using-single-gpus)

AaronFriel · on May 15, 2024

The PagedAttention paper is a good starting point as it represents the first major open source inference engine that had "pretty good" batch performance for transformers.

https://arxiv.org/pdf/2309.06180

darby_eight · on May 14, 2024

> On a character level this should be trivial.

Characters are not the semantic components of words—these are syllables. Generally speaking, anyway. I've got to imagine this approach would yield higher quality results than the roman alphabet. I'm curious if this could be tested by just looking at how LLMs handle English vs Chinese.

inbetween · on May 14, 2024

The minimal semantic parts of words are morphemes. Syllables are phonological units (roughly: the minimal unit for rhythmic purposes such as stress, etc)

darby_eight · on May 14, 2024

Only in languages that have morphemes! This is hardly a universal attribute of language so much as an attribute of those that use an alphabet to encode sounds. It makes more sense to just bypass the encoding and directly consider the speech.

Besides, considering morphemes as semantic often results in a completely different meaning than we actually intend. We aren't trying to train a chatbot to speak in prefixes and suffixes, we're trying to train a chatbot to speak in natural language, even if it is encoded to latin script before output.

inbetween · on May 15, 2024

That's technically wrong. Every language has morphemes for the simple reason that every word is at least one morpheme. `cat` is a morpheme. `cats` is two morphemes (cat-s).

(The point about semantics is also technically wrong. You would first need to specify your view of semantic compositionality before such a point can be evaluated, but the usual views of semantics don't have any such consequence.)

darby_eight · on May 15, 2024

> Every language has morphemes for the simple reason that every word is at least one morpheme.

Sure, if you define "morpheme" as a collection of syllables that's meaningful to people using alphabetic script. I don't see any benefit to this compared to working with syllables directly, which is a meaningful concept regardless of the script used to encode them.

dragonwriter · on May 16, 2024

> Sure, if you define “morpheme” as a collection of syllables

Cats, as noted, has two morphemes, despite having only one syllable. Syllables and morphemes are largely orthogonal, morphemes can be less than, equal to, or more than a syllable (and even when more than, may or may not start or end on a syllable boundary.)

(Also, syllables aren’t the minimal semantic units even of spoken speech, those are phonemes – a syllable consists of at least one phoneme, potentially more. But morphemes, even an alphabetic script if it isn’t perfectly phonetic, still don’t necessarily map to one or more phonemes, since is textual semantic unit may have no effect on pronunciation.)

inbetween · on May 15, 2024

You might not see any benefit, but that's what those words mean :) Grab any textbook, it is linguistics 101!

joaogui1 · on May 14, 2024

I would say 2 big problems are:

1. latency, which would get worse if you have to sequentially generate more output

2. These models very roughly turn tokens -> "average meaning" on the embedding layer, followed by attention layers that combine the meanings, and feed forward layers that match the current meaning combination to some kind of learned archetype/prototype almost. When you move from word parts to characters all of that becomes more confusing (what's the average meaning of a?) and so I don't think there are good enough techniques to learn character-based models yet

novaRom · on May 14, 2024

In AI music generation we have much better results with large vocabulary sizes of 10^6 order, my uneducated guess is that's because transformers are not universal pattern recognizers, they can catch patterns on a certain granularity level only.

mrcwinn · on May 15, 2024

I will say Google certainly has the better branding team. I like Gemini, Gems, and so on. “ChatGPT” is quite a clunky mess. OpenAI just feels like a faceless entity.

All things that could change but seems late in the game at this point. They certainly had the money to be more creative as they came to market.

precompute · on May 15, 2024

"ChatGPT" is like "Google". "Gemini" is never replacing that.

ukuina · on May 15, 2024

OpenAI desperately needs a marketing consult.

"GPT4o"? Seriously?

Even "GPT4 Omni" is easier in conversation, and that's what the "o" stands for!

They severely underestimate the number of casual users they have.

zarzavat · on May 15, 2024

OpenAI doesn’t need marketing because everybody knows who’s the best. Same reason that if I asked you what’s the best violin you would say Stradivari, even though you’ve never seen an ad for one.

OpenAI could call their model the “[poo emoji] 5000” for all the difference it would make.

cpeterso · on May 15, 2024

Goggle “Gemini” is a much better product name than any name OpenAI has, but the Gemini product family could use some structure:

  Gemini Advanced (“with Ultra 1.0”)
  Gemini Ultra
  Gemini Pro
  Gemini Flash
  Gemini Nano-1
  Gemini Nano-2

zone411 · on May 15, 2024

15.3 On NYT Connections benchmark:

GPT-4 turbo (gpt-4-0125-preview) 31.0

GPT-4o 30.7

GPT-4 turbo (gpt-4-turbo-2024-04-09) 29.7

GPT-4 turbo (gpt-4-1106-preview) 28.8

Claude 3 Opus 27.3

GPT-4 (0613) 26.1

Llama 3 Instruct 70B 24.0

Gemini Pro 1.5 19.9

Mistral Large 17.7

-----> Gemini 1.5 Flash 15.3

Mistral Medium 15.0

Gemini Pro 1.0 14.2

Llama 3 Instruct 8B 12.3

Mixtral-8x22B Instruct 12.2

ukuina · on May 15, 2024

So many high-performing, yet poorly-named OpenAI models in that list.

alephxyz · on May 14, 2024

Not very informative. They're selling it as the fast/cheap option but they don't benchmark inference speed or compare it with non-gemini models.

According to https://ai.google.dev/pricing it's priced a bit lower than gpt3.5-turbo but no idea how it compares to it.

michaelteter · on May 15, 2024

If Gemini Flash is just faster Gemini, then I would say that bad answers aren't better when delivered more quickly.

I ran Gemini Pro side by side with ChatGPT 4 for a few months on practical coding, systems architecture, and occasional general questions. ChatGPT was more useful at least 80% of the time. Gemini was either wrong or laboriously meandering in reaching a useful answer that it wasn't worth using, in my experience.

Faster isn't what I needed... Maybe it's also "smarter" (more useful) too now?

ganzuul · on May 15, 2024

Presumably we are defining smartness as doing more with less, so this indicates they have something going on in the latent space which will scale.

numbers · on May 14, 2024

It's ironic that when you ask these AI chatbots what their own context size is, they don't know. ChatGPT doesn't even know about 4o existing in 4o.

advisedwang · on May 14, 2024

Ask a human how many neurons they have. Hell, over history humans haven't even consistently understood that the brain is where cognition happens.

simonw · on May 14, 2024

The models didn't exist when their training data was collected.

But... that's not really an excuse any more. Model vendors should understand now that the most natural thing in the world is for people to ask models directly about their own abilities and architecture.

I think models should have a final layer of fine-tuning or even system prompting to help them answer these kinds of questions in a useful way.

SoftTalker · on May 14, 2024

Does a monkey know that it is a monkey?

verdverm · on May 14, 2024

I think "yes" is the most likely answer here

animals have a lot more intelligence than they typically get attributed

Tool use, names, language, social structure and behavior, even drug use has been shown across many species

chaorace · on May 14, 2024

Okay, but the monkey doesn't know that it knows that it's a monkey.

verdverm · on May 14, 2024

are you sure?

Many animals recognize themselves and their species as separate concepts

keefle · on May 15, 2024

He meant something more meta I believe. Knowing you are a monkey is one thing, and knowing that you know you are a monkey is a another thing. It's about being cognisant of the fact that there is something called knowledge and you have it

chaorace · on May 15, 2024

Precisely. To put it more concretely: it is no small feat to grasp the abstract distinction between known-knowns, known-unknowns, unknown-knowns, and unknown-unknowns. They do not know what they do not know.

fourthark · on May 15, 2024

How do you know?

nojvek · on May 14, 2024

Will wait for Meta to release Flash equivalent weights.

Multi-Modal modals running offline on mobile devices with millisecond latencies per token seems the future.

Where is Apple in all of this. Why is Siri still so shit?

visarga · on May 14, 2024

Apple made a deal with OpenAI for GPT4o, the stakes are indeed high, can't be caught with pants down. iPhone needs to remain the premium brand.

quantisan · on May 14, 2024

Price (input) $0.35 / 1 million tokens (for prompts up to 128K tokens) $0.70 / 1 million tokens (for prompts longer than 128K)

Price (output) $0.53 / 1 million tokens (for prompts up to 128K tokens) $1.05 / 1 million tokens (for prompts longer than 128K)

---

Compared to GPT-3.5 Turbo

Input US$0.50 / 1M tokens Output US$1.50 / 1M tokens

CapsAdmin · on May 15, 2024

I was really fascinated by Gemini 1.5 pro, though it was a bit slow and sort of 70% accurate in my use case.

I have a huge 500k~ tokens and complex niche codebase that I've worked on for many years, largely alone. There are parts of it I wish to refactor but I struggle because it I've become blind to my own code. It also sometimes feels lonely in a way. If it was a game I could at least show my friends, but this project is too abstract.

Gemini missed the mark a few times, especially when asking about more complex things but overall it was useful. That it got things wrong is sort of ok because I knew the codebase well enough to spot those mistakes.

Gemini 1.5 pro gave me a glimpse into what it was like having "someone" understand your whole codebase, hint at areas to improve, etc. A bit like a true copilot or coworker, but for a dream hobby project.

loulouxiv · on May 15, 2024

How do you proceed to use it this way ? Did you have to write custom code to call the API with your codebase ? Can you give some example questions for which the LLM gave you useful insights and/or made you save time ?

CapsAdmin · on May 16, 2024

I just used https://aistudio.google.com/, no api or anything.

My project "Nattlua" is a typed version of Lua like Typescript is to Javascript.

Some example questions that would give me new insight:

- Asking what the codebase is without supplying the readme. (though it might know because the codebase is public already)

- Asking it to generate complex type code based on existing tests and examples and without.

- Asking for places to refactor, the most fun one. Sometimes the exact solution provided is wrong, but often it's a good start.

fnordpiglet · on May 15, 2024

How does it compare to others products in your use case?

CapsAdmin · on May 16, 2024

I feel it was worse than gpt4 at coding at a low level (getting syntax right, forgetting small details, etc)

But I can write code so those can be fixed. At a higher level it's OK, but the most valuable thing is being able to have my codebase in its context. No other public LLM currently as far as I know can do that.

objektif · on May 14, 2024

Does Goog have anything like openai assistant via API? If they had I would definitely give it a try.

eru · on May 14, 2024

The website talks about a specific benchmark:

> Python code generation. Held out dataset HumanEval-like, not leaked on the web

What I find interesting here is that for this particular benchmark _not_ publishing the benchmark is advertised as a feature (instead of as a sign of 'trust me, bro, we have a great benchmark'), and I can understand why. Still these are strange times we live in.

cynicalsecurity · on May 14, 2024

Feed 1 mln tokens

@

Get blocked by some silly overly sensitive "safety" trigger

gpm · on May 14, 2024

Last I checked you could disable the safety triggers as an API user with gemini (which doesn't alleviate your obligation to follow the TOS as to the uses of the model).

VS1999 · on May 14, 2024

I'm not working with a company that can just write in the ToS "we can do anything we want. lol. lmao" and expect me to follow it religiously. Corporations need less control over speech, not more.

zxexz · on May 15, 2024

I mean, you are using a service they're providing - many would say they they're exercising their rights by gatekeeping how it's used. There are pretty good models out there you could use however you want for your own purpose, whatever it is. I occasionally fine-tune Mixtral on HN posts+comments and chat with comments. An emergent Dang actually once told me off for flame-baiting a free speech comment.

dragonwriter · on May 16, 2024

> I’m not working with a company that can just write in the ToS “we can do anything we want. lol. lmao” and expect me to follow it religiously.

You must not work with many companies in the US, especially for online services, at all then.

ArkimPhiri · on May 15, 2024

I added Gemini 1.5 Flash to my AI app Semaj AI and its fast indeed, it is also very intelligent compared to Gemini 1.0 Pro.

stan_kirdey · on May 15, 2024

I've been diligently trying to use Gemini 1.5 Pro, and it is not even on the level of Llama3-70B. I really hope Gemini improves, even if it gets reduced context length.

ukuina · on May 15, 2024

FAIR really swung for the fences with Llama3. It's a very impressive model, but the 8K context size is quite limiting for most use-cases.

webprofusion · on May 15, 2024

Uh guys, yeah.. Adobe are on the phone saying something about trademark infringement, apparently Flash is something else? I don't know, I've never heard of it..

exodust · on May 15, 2024

Interestingly, until your comment I hadn't made any connection with old Flash, even though I spent hundreds of hours making Flash games.

This suggests names don't stick around for long and can be re-used. Perhaps Google could bring back "Buzz" and "Wave" since enough time has passed!

cs702 · on May 14, 2024

We're witnessing a race to the bottom on pricing as it's happening. Competition based solely or mainly on pricing is a defining characteristic of a commodity market, i.e., a market in which competing products are interchangeable, and buyers are happy to switch to the cheapest option for a given level of quality.

There's an old saying that if you're selling a commodity, "you can only be as smart as your dumbest competitor."

If we want to be more polite, we could say instead: "you can only price your service as high as your lowest-cost competitor."

It seems that a lot of capital that has been "invested" to train AI models is, ahem, unlikely ever to be recovered.

rmbyrro · on May 14, 2024

Google figured it can't beat OpenAI technically, but they sure know they can beat them financially and infrastructurally.

__loam · on May 14, 2024

Is infrastructure and scale not an expression of technical ability? It should have been obvious that Meta and Google would bury a tiny company with less than 1000 employees given the amount of capital they can leverage for compute, talent, and data. Google literally invented GPT.

rmbyrro · on May 16, 2024

> Google literally invented GPT

And if it wasn't for OpenAI, it would still be locked into Google's basement.

tfsh · on May 14, 2024

In technical terms they're on par. But you're correct about Google being able to bet on their decades of infrastructure

rmbyrro · on May 16, 2024

I bet OpenAI has 100x more customers than Gemini, and Gemini recent growth rate will not change this picture anytime soon.

daghamm · on May 14, 2024

Is this race to the bottom or just Googles new TPUs being extremly efficient?

Delmololo · on May 14, 2024

But the race to the bottom has an opposition right?

So people expect to see a return of investment which will create the bottom of pricing (at least as soon as the old money ran out)

I'm also curious if AI is a good example because ai will become fundamental. This means if you don't invest you might be gone therefore it's more like a fee in case the investment would not pan out.

__loam · on May 14, 2024

Supply and demand determines price, not the hopes and dreams of investors.

r0m4n0 · on May 14, 2024

Google is building on top of and integrated with their cloud offerings. Having first party solutions like this gives big cloud customers an easy way to integrate. For Google it’s just another tool in the chest that gets sold to these big enterprises. Many go all in on all the same cloud products. Also the models are only the building blocks. Other cloud products at Google will be built with this and sold as a service

Not so sure about Open AI though…

Aloisius · on May 14, 2024

Price competition isn't limited to commodities.

cs702 · on May 14, 2024

I never said it was.

EGreg · on May 14, 2024

You never said it wasn't, either :-P

Aloisius · on May 14, 2024

Then why imply that it is a commodity because they (partly) compete on price?

Fungibility is the defining characteristic of commodities. While these products can be used to accomplish the same task, we're not near real fungibility yet.

cs702 · on May 15, 2024

Products that are fungible compete on price (what else?). Chat-with-AI services that have similar performance are pretty fungible today. Switching from one to the other is... remarkably easy. The moment Gemini Flash's competitors start losing customers they will lower their prices to remain competitive.

Aloisius · on May 15, 2024

Lots of products besides commodities will lower their prices to remain competitive. General Electric keeps their prices competitive with Pratt & Whitney, but that doesn't make jet engines a commodity.

This product from Google clearly competes on price/performance ratio, speed and of course, brand.

__loam · on May 14, 2024

You're saying the quiet part out loud here.

refulgentis · on May 14, 2024

It's absolutely unconscionable that Gemini Ultra got memory-holed. I can't trust anything that Google says about benchmarks.

It seemingly existed only so in December 2023, Gemini ~= GPT-4. (April 2023 version) (on paper) ("32-shot CoT" vs. 5-shot GPT-4)

CSMastermind · on May 14, 2024

Anyone who uses both products regularly will tell you that Gemini Advanced is far behind GPT-4 and Claude 3 Opus.

Pretending that they have a model internally that's on par but they're not releasing it is a very "my girlfriend goes to another school" move and makes no sense if they're a business that's actually trying to compete.

dontreact · on May 15, 2024

Gemini 1.5 Pro is on-par with GPT-4 and Opus in LMSys, and you can go try it for yourself in LMSys, and it's coming soon to Gemini Advanced (announced at I/O). Seems like GPT4O puts OpenAI in front again on LMSys

summerlight · on May 14, 2024

Gemini Ultra is 1.0 with 8k window. This is 1.5 with 1m window. Your feeling is based on incorrect assumption.

anoncareer0212 · on May 14, 2024

And?

You're replying to a comment that points out Gemini Ultra was never released, wasn't mentioned today, and it's the only model Google's benchmarking at GPT-4 level. They didn't say anything about feelings or context window.

dontreact · on May 14, 2024

Gemini Ultra has been available for people to try via Gemini Advanced (formerly Bard) for a few months

cma · on May 14, 2024

It says it may fall back to a worse model under load and there is no way to tell which you are getting. I think chatgpt has at times done something similar though.

summerlight · on May 14, 2024

> You're replying to a comment that points out Gemini Ultra was never released

What are you even talking about? How do you know it's memory-holed if you haven't used it? The API is not GA, but the model can be used through the chatbot subscription. GP is talking about their lack of trust on Google's claim of 1M context token, not GPT-4 level reasoning. If you're expect GPT-4 level performance with cost-efficient models, that's another problem.

refulgentis · on May 15, 2024

Idk why you're so aggro, they're right, I meant the GPT-4 level reasoning