
A little hard to understand why this is cool, but if I understand correctly:

1. Lucene is trying to get Approximate Nearest Neighbours (ANN) search working for semantic search purposes: https://issues.apache.org/jira/browse/LUCENE-9004 https://github.com/apache/lucene/issues/10047

2. The Panama Vector API allows CPUs that support it to accelerate vector operations: https://openjdk.org/jeps/438

So this allows fast ANN on Lucene for semantic search!

How did people do this before Lucene supported it? Only through entirely different tools?



A little confusing because "vector" here (largely) refers to two different things. "Vector search" is this ANN thing, but the "Vector API" is about SIMD. SIMD provides CPU operations on a bunch of data at a time, i.e. instead of one instruction for each 32-bit float, you operate on, depending on the CPU, 128 or 256 or 512 bits worth of floats (4, 8, or 16 of them) at the same time. So, over scalar code, SIMD here could get maybe a 4-16x improvement (give or take a lot - things here are pretty complicated). So, while definitely a significant change, I wouldn't say it's at the make-or-break level.
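To make the 4-16x figure concrete, here is a tiny sketch of the lane arithmetic. The class and method names are made up for illustration, and it only computes the theoretical upper bound on floats per instruction, not a measured speedup.

```java
class SimdLanes {
    // Number of 32-bit float lanes that fit in a SIMD register of the given width.
    // Hypothetical helper, not part of any JDK or Lucene API.
    static int floatLanes(int registerBits) {
        return registerBits / Float.SIZE; // Float.SIZE is 32
    }

    public static void main(String[] args) {
        for (int bits : new int[] {128, 256, 512}) {
            System.out.println(bits + "-bit register: "
                    + floatLanes(bits) + " floats per instruction");
        }
    }
}
```

Real-world gains tend to land below these lane counts, since memory bandwidth and loop overhead don't shrink along with the instruction count.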


As an add-on to this comment: There's another Lucene issue from 2 weeks ago that provides some more details on different approaches that were considered: https://github.com/apache/lucene/issues/12302


Great explanation. But to be clear to those who don't follow:

SIMD is supported by Java out of the box, but the optimizer might miss some opportunities. With this API it is far more likely that SIMD will be used when it's available, and from the first compilation, so performance should improve.


> SIMD is supported by Java out of the box but the optimizer might miss some opportunities

It's a little limited due to how objects are stored in memory. Might improve with Valhalla.


Lucene here is just dealing with plain float[]s, so Valhalla at least shouldn't affect it much. The limiting factor seems to be the sum accumulators, which the optimizer can't reorder because floating-point addition isn't associative.

(for reference: scalar impl: https://github.com/ChrisHegarty/lucene/blob/dd4eaac9af346a21... and SIMD impl: https://github.com/ChrisHegarty/lucene/blob/dd4eaac9af346a21...)
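A minimal sketch of why the reordering matters, assuming plain scalar Java and made-up names (this is not the Lucene code linked above). The second method splits the sum across four independent accumulators, which is roughly the shape a SIMD loop takes, and the two methods can return different floats for the same input:

```java
class DotProductOrder {
    // One accumulator, added strictly left to right.
    static float scalar(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Four independent accumulators combined at the end, then a scalar tail.
    // This changes the order of the additions, so the rounding can differ.
    static float fourAccumulators(float[] a, float[] b) {
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
        int bound = a.length - (a.length % 4);
        int i = 0;
        for (; i < bound; i += 4) {
            s0 += a[i] * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        float sum = (s0 + s1) + (s2 + s3);
        for (; i < a.length; i++) {
            sum += a[i] * b[i]; // leftover elements
        }
        return sum;
    }

    public static void main(String[] args) {
        // Values chosen so that 1e8f + 1f rounds back to 1e8f in float precision.
        float[] a = {1e8f, 1f, -1e8f, 1f};
        float[] b = {1f, 1f, 1f, 1f};
        System.out.println(scalar(a, b));           // prints 1.0
        System.out.println(fourAccumulators(a, b)); // prints 0.0
    }
}
```

Neither result is the exact answer (2), but they differ from each other, which is why the JIT can't silently turn the first loop into the second. Writing the multi-accumulator form by hand is an explicit statement that the different rounding is acceptable.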


>> How did people do this before Lucene supported it?

By performing query expansion based on features of documents within the search results. Very efficient and effective if you have indexed the right features.



