Thanks!
We trained on most European languages (English, French, Spanish, Russian...), Arabic, and Chinese, so it does well on those!
We haven't tested too much on other languages, but happy to do so if there is a use case
I'm curious why you consider this an upside. Hypothetically speaking, wouldn't it be better if the embeddings could be updated automatically when you want them to be? Is the problem that it's not easy to automate based on the specific rules of when you want updates to happen?
We could add support for something like `pg_vectorize` in order to generate embeddings directly from the database. We simply haven't seen enough demand yet. Perhaps we haven't listened hard enough :')
I've also thought about creating a Codenames bot: what if we could use semantic similarity to batch words together? Surely, this can be done using a prebuilt embedding model and clustering!
After some failed experiments - it performed worse than I thought it would - I googled the subject, and... it turns out there's a whole paper about ML and Codenames :)
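The basic idea is easy to sketch. Below is a minimal, stdlib-only illustration of "cluster by semantic similarity": find the pair of board words whose embedding vectors are closest by cosine similarity. The tiny hand-made 3-d vectors are placeholders, purely for illustration - a real bot would plug in a pretrained embedding model (word2vec, a sentence-transformer, etc.) instead.

```python
from itertools import combinations
from math import sqrt

# Hypothetical 3-d "embeddings" for a few board words; in practice these
# would come from a pretrained model with hundreds of dimensions.
TOY_EMBEDDINGS = {
    "dog":    [0.9, 0.1, 0.0],
    "cat":    [0.8, 0.2, 0.1],
    "wolf":   [0.85, 0.15, 0.05],
    "piano":  [0.0, 0.9, 0.3],
    "violin": [0.1, 0.85, 0.25],
    "rocket": [0.2, 0.1, 0.95],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def best_pair(words, emb):
    """Return the pair of board words with the highest cosine similarity
    -- the tightest two-word 'cluster' to aim a clue at."""
    return max(combinations(words, 2),
               key=lambda p: cosine(emb[p[0]], emb[p[1]]))

print(best_pair(list(TOY_EMBEDDINGS), TOY_EMBEDDINGS))
```

With real embeddings the hard part is exactly what the paper discusses: words that are close in embedding space aren't always close in the associative sense Codenames rewards.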
Very cool! A couple of months ago, after playing Codenames for the first time, I devised a strategy that any team of humans can learn, in order to win in 3 moves, guaranteed, every game. It only takes a few minutes to learn.
Imagine you have a Codenames board and you're on the blue team. The key is that you can represent the board as a text of 1s and 0s, where 1 is blue and 0 is not blue:
0 1 0 0 1
0 0 0 0 1
1 1 0 0 0
1 0 0 0 1
0 0 1 0 0
The next key is to realize that each sequence of 4 cards can be thought of as a 4-digit binary number, for example the first binary sequence above is "0100", or 4. The second binary sequence is "1000", or 8.
The strategy is to simply communicate the numbers 4 and 8 to your team—so we can simply choose a word that starts with the 4th letter of the alphabet (d) and ends in the 8th letter of the alphabet (h). If we tell our team "ditch", they'll know exactly what the first 8 cards are.
Repeat this for 3 turns, and your team has figured out 8 cards each turn, for a total of 24 cards out of 25 total. The final card can be inferred simply—on the third turn, if any card is unaccounted for, your team knows it must be in the 25th position.
I just hope your teammates know how to count in binary.
(N.B. the strategy can be optimized by eliminating less-common letters of the English alphabet, like J, where it may be difficult to think of words that end in that letter)
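The encoding above is mechanical enough to write down. Here's a small sketch that flattens the 5x5 board into 25 bits, splits the first 24 into six 4-bit numbers, and pairs them into (first letter, last letter) constraints for the three clue words; choosing an actual word like "ditch" is still left to the spymaster. The board is the example from the comment.

```python
from string import ascii_lowercase

# The example board: 1 = blue, 0 = not blue.
BOARD = [
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
]

def board_to_letter_pairs(board):
    bits = [b for row in board for b in row]        # 25 bits, row-major
    nibbles = [int("".join(map(str, bits[i:i + 4])), 2)
               for i in range(0, 24, 4)]            # six 4-bit numbers
    # Map value n to the n-th letter of the alphabet (1-indexed, as in
    # the comment). A value of 0 (no blue card in a chunk) is an edge
    # case the scheme needs an extra convention for; this sketch just
    # rejects it.
    for n in nibbles:
        if n == 0:
            raise ValueError("all-zero chunk needs an extra convention")
    letters = [ascii_lowercase[n - 1] for n in nibbles]
    # Pair them up: turn 1 covers cards 1-8, turn 2 cards 9-16, etc.
    return list(zip(letters[0::2], letters[1::2]))

# First pair is ('d', 'h'), matching the "ditch" example above.
print(board_to_letter_pairs(BOARD))
```

Note the 25th card is never encoded, which is exactly why it has to be inferred on the third turn.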
> Your clue must be about the meaning of the words. You can't use your clue to talk about the letters in a word or its position on the table.
You're talking about cluing about the position, so it's against the rules.
I guess the rules do also say:
> But if no one notices that a clue is invalid, it counts as valid.
which kinda maybe makes it legal, but it seems like it's quite against the spirit of the rules, and the "invalid" clue thing there is just stuff like "accidentally used a word that's a part of a compound word on the board", not maliciously trying to cheat the rules.
I'd also attempted something like this years ago [1] based on word2vec, and it played decently! Occasionally it would throw out nonsensical clues or guesses, but was usable.
I'm sure with more modern embeddings/tools one could make a really excellent bot.
We just skip several of unstructured's categories, such as tables and images. We also do some deduplication post-ANN as we want to optimize for novelty as well as relevance. That being said, how are you planning to embed an image or a table to make it searchable? It sounds simple in theory, but how do you generate an actually good image summary (without spending huge amounts of money filling OpenAI's coffers for negligible benefit)? How do you embed a table?
Thanks for answering! In my case, I don't use RAG directly, but rather post-process documents via LLMs to extract a set of specific answers. That's also why I asked about deduplication - asking an LLM to provide an answer from 2 different data sources (invalid unstructured table text & valid structured table contents) quickly ramps up errors.
Did you encounter hidden costs when using Azure Document Intelligence? I processed some PDFs using the paid tier, but the resulting costs were way higher than expected, despite using a prebuilt layout model for only structured extraction. Have no clue what could cause it, no extra details on the billing page. Not sure if the price is misleading, or if it's a skill issue on my part :)
For my use case, Marker overall seems to work pretty well - but it has issues with tables: merged cells, misplaced headers, and so forth. I'm currently extracting Polish PDFs that are //not// scanned.
When compared to Azure Document Intelligence, Marker is really cheap when self-hosted (assuming you fall under the license requirements), but it does not produce high quality data. YMMV.
This aurora was really powerful! I could see it with the naked eye from a town in central Poland, despite cloudy weather and light pollution. Feels great to finally see it in person.
> welfare/health care system is bad, taxes are not used well
There's a widespread lack of trust in the Polish government, which decreased even further during the 2015-2023 period. If the money is being funnelled to the ruling politicians' families and friends, why willingly pay high taxes? I believe this is an underlying core issue, one that would probably take a new generation to repair.
I'm not sure there was any unprecedented drop in the trust level between 2015 and 2023 (meaning under the Law and Justice government).
While it is true that trust is relatively low in Poland in general... E.g. according to this survey [1], the percentage of respondents expressing trust in the government decreased from 38% in 2016 down to 32% in 2022 (while clearly exceeding 40% about halfway through).
Which is pretty normal whenever the same party stays in power for a longer period; its popularity wears out over time.
For comparison, the same score was at 39% back in 2012, midway through the term of the government preceding Law and Justice. Hardly a striking contrast.
I'm even less sure about your claim when it comes to the context of welfare systems in particular.
Social transfers and the safety net are one of the very few areas where the Law and Justice government achieved substantial results, even though it had to steer the country through the hardships of the pandemic.
For example, in terms of the percentage of children at risk of poverty and social exclusion Poland ranked 14th in the EU back in 2015 [2]. By 2022, it ranked 6th [3].
Also look at [4], [5], [6]...
I am putting aside the infamous judiciary reforms, abortion, and other hot-button areas (which are far less of a priority for the average voter than echo-chamber commentators tend to assume). I'm focusing on taxation & welfare, and the sheer facts.
Similar laws exist in Poland, except they're not really enforced.
It's really rare for the tax office to prove that a company exists solely for tax optimization. The risk drops virtually to zero if one freelances after hours and has legitimate invoices with other companies.
This often causes a mismatch between Polish employees who wish to work remotely abroad and, for example, employers from the DACH region, where I've heard the laws are strictly enforced. One party claims there is no risk, and the other claims it's too risky :-) (setting aside other factors, such as employee protection, etc.)
My questions: what languages do your models currently support? Did you perform multilingual benchmarks? I couldn't find an answer on the website.