Thanks!
We trained on most European languages (English, French, Spanish, Russian...), Arabic, and Chinese, so it does well on those!
We haven't tested too much on other languages, but happy to do so if there is a use case
I'm curious why you consider this an upside. Hypothetically speaking, wouldn't it be better if the embeddings could be updated automatically when you want them to be? Is the problem that it's not easy to automate based on the specific rules of when you want updates to happen?
We could add support for something like `pg_vectorize` in order to generate embeddings directly from the database. We simply haven't seen enough demand yet. Perhaps we haven't listened hard enough :')
I've also thought about creating a Codenames bot: what if we could use semantic similarity to batch words together? Surely, this can be done using a prebuilt embedding model and clustering!
After some failed experiments - it performed worse than I thought it would - I googled the subject, and... it turns out there's a whole paper about ML and Codenames :)
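The basic idea is easy to sketch. Below is a minimal, stdlib-only illustration of "cluster by semantic similarity": find the pair of board words whose embedding vectors are closest by cosine similarity. The tiny hand-made 3-d vectors are placeholders, purely for illustration - a real bot would plug in a pretrained embedding model (word2vec, a sentence-transformer, etc.) instead.

```python
from itertools import combinations
from math import sqrt

# Hypothetical 3-d "embeddings" for a few board words; in practice these
# would come from a pretrained model with hundreds of dimensions.
TOY_EMBEDDINGS = {
    "dog":    [0.9, 0.1, 0.0],
    "cat":    [0.8, 0.2, 0.1],
    "wolf":   [0.85, 0.15, 0.05],
    "piano":  [0.0, 0.9, 0.3],
    "violin": [0.1, 0.85, 0.25],
    "rocket": [0.2, 0.1, 0.95],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def best_pair(words, emb):
    """Return the pair of board words with the highest cosine similarity
    -- the tightest two-word 'cluster' to aim a clue at."""
    return max(combinations(words, 2),
               key=lambda p: cosine(emb[p[0]], emb[p[1]]))

print(best_pair(list(TOY_EMBEDDINGS), TOY_EMBEDDINGS))
```

With real embeddings the hard part is exactly what the paper discusses: words that are close in embedding space aren't always close in the associative sense Codenames rewards.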
Very cool! A couple of months ago, after playing Codenames for the first time, I devised a strategy that any team of humans can learn, in order to win in 3 moves, guaranteed, every game. It only takes a few minutes to learn.
Imagine you have a Codenames board and you're on the blue team. The key is that you can represent the board as a text of 1s and 0s, where 1 is blue and 0 is not blue:
0 1 0 0 1
0 0 0 0 1
1 1 0 0 0
1 0 0 0 1
0 0 1 0 0
The next key is to realize that each sequence of 4 cards can be thought of as a 4-digit binary number, for example the first binary sequence above is "0100", or 4. The second binary sequence is "1000", or 8.
The strategy is to simply communicate the numbers 4 and 8 to your team—so we can simply choose a word that starts with the 4th letter of the alphabet (d) and ends in the 8th letter of the alphabet (h). If we tell our team "ditch", they'll know exactly what the first 8 cards are.
Repeat this for 3 turns, and your team has figured out 8 cards each turn, for a total of 24 cards out of 25 total. The final card can be inferred simply—on the third turn, if any card is unaccounted for, your team knows it must be in the 25th position.
I just hope your teammates know how to count in binary.
(N.B. the strategy can be optimized by eliminating less-common letters of the English alphabet, like J, where it may be difficult to think of words that end in that letter)
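The encoding above is mechanical enough to write down. Here's a small sketch that flattens the 5x5 board into 25 bits, splits the first 24 into six 4-bit numbers, and pairs them into (first letter, last letter) constraints for the three clue words; choosing an actual word like "ditch" is still left to the spymaster. The board is the example from the comment.

```python
from string import ascii_lowercase

# The example board: 1 = blue, 0 = not blue.
BOARD = [
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
]

def board_to_letter_pairs(board):
    bits = [b for row in board for b in row]        # 25 bits, row-major
    nibbles = [int("".join(map(str, bits[i:i + 4])), 2)
               for i in range(0, 24, 4)]            # six 4-bit numbers
    # Map value n to the n-th letter of the alphabet (1-indexed, as in
    # the comment). A value of 0 (no blue card in a chunk) is an edge
    # case the scheme needs an extra convention for; this sketch just
    # rejects it.
    for n in nibbles:
        if n == 0:
            raise ValueError("all-zero chunk needs an extra convention")
    letters = [ascii_lowercase[n - 1] for n in nibbles]
    # Pair them up: turn 1 covers cards 1-8, turn 2 cards 9-16, etc.
    return list(zip(letters[0::2], letters[1::2]))

# First pair is ('d', 'h'), matching the "ditch" example above.
print(board_to_letter_pairs(BOARD))
```

Note the 25th card is never encoded, which is exactly why it has to be inferred on the third turn.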
> Your clue must be about the meaning of the words. You can't use your clue to talk about the letters in a word or its position on the table.
You're talking about cluing about the position, so it's against the rules.
I guess the rules do also say:
> But if no one notices that a clue is invalid, it counts as valid.
which kinda maybe makes it legal, but it seems like it's quite against the spirit of the rules, and the "invalid" clue thing there is just stuff like "accidentally used a word that's a part of a compound word on the board", not maliciously trying to cheat the rules.
I'd also attempted something like this years ago [1] based on word2vec, and it played decently! Occasionally it would throw out nonsensical clues or guesses, but was usable.
I'm sure with more modern embeddings/tools one could make a really excellent bot.
We just skip several of unstructured's categories, such as tables and images. We also do some deduplication post-ANN as we want to optimize for novelty as well as relevance. That being said, how are you planning to embed an image or a table to make it searchable? It sounds simple in theory, but how do you generate an actually good image summary (without spending huge amounts of money filling OpenAI's coffers for negligible benefit)? How do you embed a table?
Thanks for answering! In my case, I don't use RAG directly, but rather post-process documents via LLMs to extract a set of specific answers. That's also why I asked about deduplication - asking an LLM to provide an answer from 2 different data sources (invalid unstructured table text & valid structured table contents) quickly ramps up errors.
Did you encounter hidden costs when using Azure Document Intelligence? I processed some PDFs using the paid tier, but the resulting costs were way higher than expected, despite using a prebuilt layout model for only structured extraction. Have no clue what could cause it, no extra details on the billing page. Not sure if the price is misleading, or if it's a skill issue on my part :)
For my use case, Marker overall seems to work pretty well - but it has issues with tables: merged cells, misplaced headers, and so forth. I'm currently extracting Polish PDFs that are //not// scanned.
When compared to Azure Document Intelligence, Marker is really cheap when self-hosted (assuming you fall under the license requirements), but it does not produce high quality data. YMMV.
This aurora was really powerful! I could see it with the naked eye from a town in central Poland, despite cloudy weather and light pollution. Feels great to finally see it in person.
> welfare/health care system is bad, taxes are not used well
There's a widespread lack of trust in the Polish government, which decreased even further during the 2015-2023 period. If the money is being funnelled to the ruling politicians' families and friends, why willingly pay high taxes? I believe this is an underlying core issue, one that would probably take a new generation to repair.
I'm not sure there was any unprecedented drop in the trust level between 2015 and 2023 (meaning under the Law and Justice government).
While it is true that trust is relatively low in Poland in general... E.g. according to this survey [1], the percentage of respondents expressing trust in the government decreased from 38% in 2016 down to 32% in 2022 (while clearly exceeding 40% about halfway through).
Which is pretty normal whenever the same party stays in power for a longer period; its popularity wears out over time.
For comparison, the same score was at 39% back in 2012, midway through the term of the government preceding Law and Justice. Hardly a striking contrast.
I'm even less sure about your claim when it comes to the context of welfare systems in particular.
Social transfers and the safety net are one of the very few areas where the Law and Justice government achieved substantial results, even though it had to steer the country through the hardships of the pandemic.
For example, in terms of the percentage of children at risk of poverty and social exclusion Poland ranked 14th in the EU back in 2015 [2]. By 2022, it ranked 6th [3].
Also look at [4], [5], [6]...
I am putting aside the infamous judiciary reforms, abortion, and other hot-button areas (which are far less of a priority for the average voter than echo-chamber commentators tend to assume). I'm focusing on taxation & welfare, and the sheer facts.
Similar laws exist in Poland, except they're not really enforced.
It's really rare for the tax office to prove that a company exists solely for tax optimization. The risk drops virtually to zero if one freelances after hours and has legitimate invoices with other companies.
This often causes a mismatch between Polish employees who wish to work remotely abroad and, for example, employers from the DACH region, where I've heard the laws are strictly enforced. One party claims there is no risk, and the other claims it's too risky :-) (setting aside other factors, such as employee protection, etc.)
My questions: what languages do your models currently support? Did you perform multilingual benchmarks? I couldn't find an answer on the website.