No Windows support? :\ Also, suffers from the exact same problem I see pretty mu...

phiresky · on June 16, 2019

Windows support should be fairly trivial, the main problem is packaging it up and that travis doesn't like Windows.

> doesn't support other UTF encodings like UTF-16

UTF-16 should in fact work, since ripgrep supports it too. Looks like my binary file detection is at fault [1].

> Not sure if it can search in single-line mode either

That works fine, just use `rga --multiline '\n' fname`

[1]: https://github.com/phiresky/ripgrep-all/issues/5

dataflow · on June 16, 2019

Ah I see, thanks. You shouldn't require a BOM though. There are often files without a BOM, and not all files with UTF-16 in them are text either (EXEs etc.). I would just search for all the possible UTF byte sequences (UTF-7, UTF-8, UTF-16LE/BE, UTF-32, possibly with a switch to allow specifying subsets or additional encodings if you can support that?) regardless of BOM.

phiresky · on June 16, 2019

In ripgrep itself you can apparently only look in files of encodings other than UTF16LE with BOM by manually specifying `--encoding UTF16BE` etc.

I could maybe add encoding detection myself, but I'm kind of discouraged since not even the unix `file` tool can detect those files as text, and a normal editor opens at least a UTF16BE file completely wrong. So I'm not sure if I want to spend my time on trying to write heuristic detection on those, especially since UTF16 itself is broken and shouldn't really exist at all...

I'll look into what encoding_rs has to offer.

dataflow · on June 16, 2019

Thanks! Yeah I wouldn't try to detect encodings or use heuristics either. If you could just reduce a single pattern into the OR of a bunch of byte sequences in each encoding, I think that should work? I'm not sure how easy that is with the interface you're given. (I wouldn't call UTF-16 'broken', but either way... it's a reality; a huge fraction of the time when you're searching binary files on Windows it's to find text inside executables, which on Windows are generally UTF-16.)
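For illustration, the "OR of a bunch of byte sequences" idea can be sketched in a few lines of Python. This is only a sketch for literal patterns, not how rga or ripgrep work: the `ENCODINGS` list and both function names are made up for this example, UTF-7 is left out, and real regex features (character classes, case folding) would need far more than this.

```python
import re

# Encodings to try. This list is an assumption for illustration only.
ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")


def build_multi_encoding_pattern(literal):
    """Compile one bytes regex that matches `literal` in any of ENCODINGS."""
    # A set removes duplicates (they can occur for an empty literal).
    variants = {literal.encode(enc) for enc in ENCODINGS}
    # Try longer byte sequences first so a shorter variant never shadows
    # a longer one that starts at the same offset.
    ordered = sorted(variants, key=len, reverse=True)
    return re.compile(b"|".join(re.escape(v) for v in ordered))


def find_offsets(data, literal):
    """Return the byte offsets where `literal` occurs in any encoding."""
    pattern = build_multi_encoding_pattern(literal)
    return [m.start() for m in pattern.finditer(data)]
```

Calling `find_offsets(blob, "hello")` on the raw bytes of a file returns match positions without looking for a BOM or guessing a byte order. It says nothing about how to print the surrounding line, which is a separate problem.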

ShorsHammer · on June 16, 2019

Travis works fine for me building windows binaries.

It's still beta, but wondering what problems you are having here? Have you tried windows in .travis.yml?

phiresky · on June 16, 2019

Nope, haven't tried it. I just saw that ripgrep is using appveyor for windows instead, so I assumed it doesn't work on travis. I was actually just trying to add appveyor to this [1], but I'm getting a weird error.

[1]: https://ci.appveyor.com/project/phiresky/ripgrep-all/builds/...

ShorsHammer · on June 16, 2019

Give it a crack, not sure what dependencies ripgrep is pulling in but I've had good experiences so far. They seem to be doing ok with rolling it out.

Just a few more lines in your travis deploy setup.

burntsushi · on June 16, 2019

> Also, suffers from the exact same problem I see pretty much every text search tool suffer: doesn't support other UTF encodings like UTF-16, meaning you'll miss files.

Did you try it? ripgrep supports UTF-16 just fine. It even supports it automatically and transparently, via BOM detection. If there's no BOM, then you must specify the encoding explicitly.
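The behaviour described here (sniff a BOM, transcode to UTF-8, otherwise rely on an explicitly given encoding) can be sketched roughly as follows. This is a simplified Python illustration, not ripgrep's actual code; `to_utf8` and its `explicit_encoding` parameter are hypothetical names, and UTF-32 BOMs are ignored.

```python
import codecs

# (BOM bytes, codec name) pairs checked in order.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def to_utf8(data, explicit_encoding=None):
    """Return `data` transcoded to UTF-8 when its encoding is known."""
    if explicit_encoding is not None:
        # The caller told us the encoding, so no BOM is needed.
        text = data.decode(explicit_encoding, errors="replace")
        return text.encode("utf-8")
    for bom, name in _BOMS:
        if data.startswith(bom):
            # Strip the BOM, decode with the codec it implies, re-encode.
            return data[len(bom):].decode(name, errors="replace").encode("utf-8")
    # No BOM and no explicit encoding: leave the bytes untouched.
    return data
```

With this shape, BOM-less UTF-16 passes through unchanged unless the caller names the encoding, which is the case being argued about in this thread.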

dataflow · on June 16, 2019

Yes, I tried it. Without a BOM, because you can't rely on BOMs being there.

burntsushi · on June 16, 2019

At that point, you don't know the encoding, so the only thing available to you is heuristics (including needing to guess the byte order). Either way, I don't think it's accurate to claim that ripgrep doesn't support UTF-16.
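To make "heuristics" concrete, here is one common guess for BOM-less UTF-16, sketched in Python: mostly-ASCII text has NUL bytes at odd offsets in little-endian and at even offsets in big-endian. The function name, the 4096-byte sample size, and the 0.6 threshold are all arbitrary assumptions for this example, and the guess fails for text that is not mostly ASCII.

```python
def guess_utf16_byte_order(data, threshold=0.6):
    """Guess 'utf-16-le' or 'utf-16-be' from NUL positions, else None."""
    sample = data[:4096]
    pairs = len(sample) // 2
    if pairs == 0:
        return None
    # Count NUL bytes at even and odd offsets over whole 2-byte units.
    even_nuls = sample[0:pairs * 2:2].count(0)
    odd_nuls = sample[1:pairs * 2:2].count(0)
    if odd_nuls / pairs >= threshold and even_nuls / pairs < 1 - threshold:
        return "utf-16-le"  # high byte (NUL for ASCII) comes second
    if even_nuls / pairs >= threshold and odd_nuls / pairs < 1 - threshold:
        return "utf-16-be"  # high byte (NUL for ASCII) comes first
    return None
```

For example, UTF-16 text consisting of CJK characters contains few or no NUL bytes, so this returns `None` for it even though the data is valid UTF-16.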

dataflow · on June 16, 2019

> At that point, you don't know the encoding, so the only thing available to you is heuristics (including needing to guess the byte order).

That's emphatically not the case though. I explained how you could handle it here without requiring BOM or byte order knowledge or heuristics: https://news.ycombinator.com/item?id=20198208

> Either way, I don't think it's accurate to claim that ripgrep doesn't support UTF-16.

Having UTF-16 text in a file doesn't imply the file has to have a BOM, and when I tried it rga didn't work on UTF-16 that didn't have a BOM. If that's still "ripgrep supports UTF-16" in your view then I'm not sure how else to word it, but the wording is hardly my concern. At the end of the day I was just trying to convey a particular fact, not argue over its wording.

burntsushi · on June 16, 2019

> I explained how you could handle it here without requiring BOM or byte order knowledge or heuristics:

Yes, that's an absurd amount of development effort and would result in a serious performance regression. (To the point that it's likely nobody would use ripgrep at all, so your approach would need to be put behind a flag, which seriously hinders the feature since it's no longer automatic.) Moreover, that only covers match detection, but does not actually cover output. Once you find the match, you have to determine how to print it, and the device you're printing to very likely does not support things like UTF-32 or even UTF-16 in many cases. Moreover, there are many operations that ripgrep does in a post-processing step (like limiting the output to a certain number of characters per line) that require knowing the presumed encoding (which is always UTF-8 by that point, since the data will have been transcoded to UTF-8 if UTF-16 were detected).
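A small Python sketch of why post-processing such as "limit output to N characters per line" needs a known encoding. The helper name `truncate_chars` is hypothetical, and it counts code points rather than grapheme clusters; it is not ripgrep's implementation.

```python
def truncate_chars(line, max_chars, encoding="utf-8"):
    """Decode `line` with a known encoding, then keep at most max_chars code points."""
    return line.decode(encoding)[:max_chars]


def truncate_bytes_blindly(line, max_bytes):
    """Cut raw bytes without knowing the encoding (may split a character)."""
    return line[:max_bytes]
```

With `raw = "naïve café".encode("utf-16-le")`, `truncate_chars(raw, 3, "utf-16-le")` yields the first three characters, while `truncate_bytes_blindly(raw, 5)` stops in the middle of the third 2-byte code unit, so the result can no longer be decoded as UTF-16.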

> UTF-16 doesn't require BOMs

You cannot decode UTF-16 without knowing its byte order. The BOM tells you that. If there is no BOM, then you need to get the byte order from some other source (or guess it). ripgrep requires the user to tell it what it is. This seems entirely reasonable to me, especially since most or all UTF-16 files I've seen include a BOM. Notably, ripgrep's support for UTF-16 is good enough for VS Code, which has a pretty sizable Windows user base.

> your view then I'm not sure how else to word it, but the wording is hardly my concern. At the end of the day I was just trying to convey a fact, not argue over its wording or semantics.

At the end of the day, my concern is to correct misleading claims about what ripgrep can and can't do. ripgrep clearly has support for UTF-16, and this is actually one of its marquee features that sets it apart from other search tools. For example, grep doesn't (and literally can't) support UTF-16 at all. The only way to search UTF-16 encoded files with grep is to transcode the file to UTF-8 first or to set the locale to C, and search for the binary encoding directly. ripgrep does a lot better than that, so to lump it in with "pretty much every text search tool" is pretty misleading from my perspective.

dataflow · on June 16, 2019

[flagged]

burntsushi · on June 16, 2019

I'm not saying you were intentionally misleading anyone. What I'm saying is that I'm trying to correct something that I saw as misleading. Criticism is totally fair, but criticism of criticism should be fair game too. I totally appreciate that we shouldn't take these things too personally, but that cuts both ways. I wasn't saying you were trying to be misleading; I was trying to point out an inaccuracy. Given that ripgrep is my project, and myths spread easily, I try to stay on top of that.

> If you don't care or it's too much work

I mean, I do care. Windows users and the prevalence of UTF-16 are why I added the automatic transcoding in the first place. But it's not just that it's too much work; as I said, the performance regression would be so serious that people would literally stop using ripgrep unless it was disabled by default. (In addition to the fact that printing the results puts you in a precarious situation.)

fortran77 · on June 16, 2019

Windows 10 built-in search works pretty well, and easily finds things in PDFs, even non-English and non-Roman alphabets.