I guess it seems like you chose this question somewhat by chance, but for new questions of a similar type, do you now do them yourself (or have a coworker try) to get at least an n=1 sample of how long they ought to take? I like to allow a 2x margin over my or a coworker's time, so anything that takes us over 30 minutes is right out, or needs to be simplified, when we only get an hour with the candidate. An experience with an intern at my last job who struggled with our large codebase (mostly in staying comfortable despite not being able to hold everything in working memory at once) led me to the same conclusion as you: being able to jump in and work effectively in a somewhat large existing codebase is an important signal, at least for our company's work.
I'm amused at the comments suggesting this is too easy (or your own suggestion that the redis one is), because I think if I tried this I would have filtered out almost everyone, at least within a single hour (minus time for soft resume questions to ease candidates in, and their questions at the end). Many programmers have never even used grep, and that's not necessarily something to hold against them, at least at my last job, where the "As hire Bs, Bs hire Cs, and Cs hire Ds" transition had long since occurred. I've made two attempts at the idea by crafting my own problem inside a little framework of its own. The latter attempt, which I used most, involves writing several lines of code (not even involving a loop) into an "INSERT CODE HERE" method body, in either Java or JavaScript, and even that was hard enough to filter some people while still producing a bit of score distribution among those who passed. Still, I think it's the right approach if you have a large or complicated existing codebase, and even in the confines of an hour it seems like a better use of time, approximating the gold-standard "work sample test", than asking a classic algorithm question.
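To make the shape of that concrete, here is a minimal sketch of what such an "INSERT CODE HERE" question might look like. The framework, names, and task are all invented for illustration; the point is that the surrounding code already exists, and the candidate only writes a few straight-line statements into the marked body.

```javascript
// Hypothetical mini-framework the candidate reads but does not write.
// The task and helper names are made up for this sketch.
function applyDiscount(price, fraction) {
  return price * (1 - fraction);
}

const TAX_RATE = 0.08;

// The candidate fills in this body: a few lines, no loop required.
// Something like the following would be a passing answer.
function finalPrice(basePrice, discountFraction) {
  // INSERT CODE HERE
  const discounted = applyDiscount(basePrice, discountFraction);
  const tax = discounted * TAX_RATE;
  return discounted + tax;
}

console.log(finalPrice(100, 0.25)); // → 81
```

Even something this small exercises reading unfamiliar code, finding the right helpers, and composing them correctly, which is exactly the signal being described.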
Yes definitely -- although it's striking how much being under pressure (or not) affects your ability to complete a question in X minutes. I've learned over time that good interviews take a lot of experimentation, and the best thing you can do is test and refine questions in real interviews before relying on them as a signal for a hiring call on a candidate.
I wouldn't necessarily consider these questions a substitute for an algorithms question, but rather a way of obtaining a different and important signal. An algorithms question may be a valuable signal too, depending on the nature of the role.