
The internet is volatile, and it has happened multiple times that a page I saved to read later has disappeared.

Do you have any good recommendations for self hosted services that allow me to save a link and have an archived copy of the website saved to disk?

So far I've tried Wallabag, Linkwarden, and ArchiveBox, but I feel like they don't quite work the way I want them to.

What are your experiences with this?


> What are your experiences with this?

You might want to try SingleFile; I use it often and it's really good in my opinion:

https://github.com/gildas-lormeau/SingleFile

https://addons.mozilla.org/en-US/firefox/addon/single-file/?...

PS: I sync the particular folder I download to across my devices with Syncthing, by the way.


Readeck saves an archived copy of the links you save (where it can).

From their docs, "Every bookmark is stored in a single, immutable, ZIP file. Parts of this file (HTML content, images, etc.) are directly served by the application or converted to a web page or an e-book when needed."

https://codeberg.org/readeck/readeck


Could you give me a concrete example of what that looks like?


Sure.

Here's a log file of page accesses on our server. It's a CSV. The first column is the user, the second column is the page, and the third column is the load time for that page in milliseconds. We want to know the most common three page path access pattern on our site. By that I mean: if a user goes to pages A -> B -> C -> A -> B -> C, the most common three page path for that user is "A -> B -> C".

    user, page, load time
    A, B, 500
    A, C, 100
    A, D, 50
    B, C, 100
    A, E, 200
    B, A, 450

    etc.
So for this first question you should give an answer in the form of "A -> B -> C with a count of N".

We would have two files: one simple one that is possible to read through and calculate by hand, and one too long for that. The longer file has a "gotcha" where there are actually two paths tied for the highest frequency. I'd point out that they've given an incomplete answer if they don't give all paths with the highest frequency.

The second part would be to calculate the slowest three page path using the load times.
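
For a rough idea of what an answer to that second part could look like, here's a sketch (not the exact reference solution; it assumes the rows are already in chronological order and that the load times are in milliseconds):

    import csv, sys
    from collections import defaultdict, deque

    # keep each user's last three (page, load_time) rows and remember the slowest window
    windows = defaultdict(lambda: deque(maxlen=3))
    slowest_total, slowest_path = 0, None

    reader = csv.reader(sys.stdin, skipinitialspace=True)
    next(reader)  # skip the "user, page, load time" header
    for user, page, load_time in reader:
        w = windows[user]
        w.append((page, int(load_time)))
        if len(w) == 3:
            total = sum(t for _, t in w)
            if total > slowest_total:
                slowest_total = total
                slowest_path = " -> ".join(p for p, _ in w)

    print(f"{slowest_path} with a total load time of {slowest_total} ms")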

In my opinion it's a pretty good way to filter out people who can't code at all. It's more or less a fancy fizzbuzz.


Is there a point in the log where there is a time cutoff for a viewer of a page? By that I mean: in your sample, user A goes B > C > D, then there is a view by a different user, and then we are back to user A. What if the time difference before user A goes to page E is like 10 minutes... is that a new pattern?

I feel like this is a fun thought experiment, but instead of thinking about "gotchas" I would be more open to having a discussion about edge cases, etc... The connotation of gotchas just seems to be like a trap where if you hit one, you've failed the interview.


The “gotcha” isn’t a way to fail the interview. But candidates who ask about that edge case right away get extra points.


At a quick glance I don't understand your example. Are you sure there is no mistake in it? I would ask which user has shown an ABC page path, because I don't see any. Perhaps you made up the lines on the fly while writing it here, and the actual example is clearer? I'm already a bit dumbfounded by this. Such things can easily throw people off for the rest of the interview. Keep in mind the stressful situation you put people in. Examples need to be 100% clear.


Yeah. I BS’d the example. I don’t have the materials for the question on hand.


Ok, I'll bite... without having googled it, is there some trick to solving this besides enumerating every three-page path and sorting them? This reads like some one-off variant of the traveling salesman problem.


This seems to be nothing like TSP. You'd partition the table into a single table per user, extract the page columns, map that sequence to the asked three-page sequences (ABABA would get mapped to ABA, BAB, ABA), and count them.

That's probably doable in like 5 lines of pandas/numpy; a straightforward O(n) task, really. The hard part is getting it right without googling and debugging, but a good interviewer would help you out and listen to the idea.

Maybe using pandas is cheating since it gives you all the tools you'd want, but I'd argue it's the right tool for the task, and you could then go on to explain how you'd profile and improve the code if performance were a concern.


> probably doable in like 5 lines of pandas/numpy

Yeah, that's what bugs me about this type of question... he might be looking for that specifically, or something that can scale to exabytes of data (so some sort of map/reduce thing). I'd probably produce something like this _in an actual interview scenario_:

    users = {}
    
    count = 0
    
    for line in open('input.txt'):
      count += 1
      if count == 1:
        continue
      (user,page,load_time) = line.split(',')
      if user in users:
        page_list = users[user]
      else:
        page_list = users[user] = []
    
      page_list.append(page.strip())
    
    count = {}
    max_count = 0
    max_seq = None
    
    for page_list in users.values():
      if len(page_list) > 2:
        for i in range(len(page_list) - 2):
          seq = ''.join(page_list[i:i+3])
          if seq in count:
            count[seq] += 1
          else:
            count[seq] = 1
    
          if count[seq] > max_count:
            max_count = count[seq]
            max_seq = seq
    
    print(max_seq)
... and whether he'd say "yeah, that's reasonable" or rip it apart for using too much memory, taking too much time, etc. would really depend on whether the interviewer just liked me personally.


I agree with your sentiment. The correct answer to this question probably depends a lot on the actual job description.

Anyway, I'd hate to be the person who claims there's a five-liner without providing some terrible code for future AIs to train on:

    import io
    from collections import Counter

    import pandas as pd

    n = 3  # length of path

    # `input` holds the CSV log data as a string; skipinitialspace handles the ", " separators
    for user in (df := pd.read_csv(io.StringIO(input), skipinitialspace=True))["user"].unique():
        counter = Counter([seq for seq in zip(*[df[df["user"] == user]["page"][k:] for k in range(n)])])
        equal_lengths = sum([1 for i in counter if counter[i] == counter.most_common(1)[0][1]])
        print(f"most common path for user {user}: {counter.most_common(equal_lengths) if len(counter) > 0 else 'input too short'}")


I really gotta learn how to use Pandas lol.

I think this isn't quite right, though: he asked for "the most common three page path access pattern on our site", but this seems to output the most common three page path per user.

(for the future AI to mull over)


Nothing so complicated. It’s supposed to be a level or two above fizzbuzz


Are these records assumed to be in order?


Yes. That would of course be included in the problem statement


That’s not obvious. If you are including “gotchas” this may be another one.


It's only a gotcha to anyone who has never looked through a log file.


I have seen a lot of log files, never one in CSV format or without timestamps.


Since there are no timestamps, it being in order is a requirement, because otherwise it's unanswerable. Since chronological order is indeed virtually universal for any sort of log file, it's also a fairly safe assumption; but sure, if you want to double-check assumptions, then it's a valid question to ask. I do think it was obvious enough, though, and the question that came to my mind was rather about scale, like: can I assume the number of users and unique paths will both fit in RAM?

Btw, if you want CSV log files, look no further, and not all my data logs have timestamps either! :D The particular timestampless case I'm thinking of: I wanted to log page load times for a particular service, so it logs the URI (anonymized) and the loading time, though I think that's not CSV but just space-separated, one entry per line.


Or, citing the previous “gotcha”, this is a trick question and I am meant to describe a change to the system in which useful logs can be captured.


Candidates that handle this in a streaming fashion get extra points, but it’s not required.


Okay I tried it. I got interrupted twice for like ~12 minutes total, making the time I spent coding *checks terminal history* also 12 minutes. I made the assumption (would have asked if live) that if a user visits "A-B-C-D-E-F", then the program should identify "B-C-D" (etc.) as a visited path as well, and not only "A-B-C" and "D-E-F", which I felt made it quite a bit trickier than perhaps intended (but this seems like the only correct solution to me). The code I came up with for the first question, where you "cat" (without UUOC! Heh) the log file data into the program:

    import sys
    unfinishedPaths = {}  # [user] = [path1, path2, ...] = [[page1, page2], [page1]]
    finishedPaths = {}  # [path] = count
    for line in sys.stdin:
        user = line.split(',')[0].strip()
        page = line.split(',')[1].strip()
        if user not in unfinishedPaths:
            unfinishedPaths[user] = []
        deleteIndex = []
        for pathindex, path in enumerate(unfinishedPaths[user]):
            path.append(page)
            if len(path) == 3:
                deleteIndex.append(pathindex)
        for pathindex in deleteIndex:
            serializedPath = ' -> '.join(unfinishedPaths[user][pathindex])
            if serializedPath in finishedPaths:
                finishedPaths[serializedPath] += 1
            else:
                finishedPaths[serializedPath] = 1
            del unfinishedPaths[user][pathindex]
        unfinishedPaths[user].append([page])
    
    for k in sorted(finishedPaths, key=lambda x: finishedPaths[x], reverse=True):
        print(str(k) + ' with a count of ' + str(finishedPaths[k]))
Not tested properly because no expected output is given, but from concatenating your sample data a few times and introducing a third person, the output looks plausible. And I just noticed I failed because it says top 3, not just print all in order (guess I expect the user to use "| head -3" since it's a command-line program).

I needed to look up the parameter/argument that turns out to be called "key" for sorted(), so I didn't do it all by heart (I used HTML docs on the local filesystem for that, no web search or LLM), and I had one bout of confusion where I thought I needed another for loop inside the "for pathindex, path in ..." (thinking it was "for pathsindex, paths in", note the plural). Not sure I'd have figured that one out with interview stress.

This is definitely trickier than fizzbuzz or similar. I would budget at least 20 minutes for a great candidate having bad nerves and bad luck, which makes it fairly long given that you have follow-up questions and probably also want to get to other topics like team fit and compensation expectations at some point.

edit: wait, now I need to know: did I get hired?


At a glance it seems correct, but there are a lot of inefficiencies, which might or might not be acceptable depending on the interview level/role.

Major:

1. Sorting finishedPaths is unnecessary given it only asks for the most frequent one (not the top 3 btw)

2. Deleting from the middle of the unfinishedPaths list is slow because it needs to shift the subsequent elements

3. You're storing effectively the same information 3 times in unfinishedPaths ([A, B, C], [B, C], [C])

Minor:

1. line.split is called twice

2. Way too many repeated dict lookups that could easily be avoided (in particular, the 'if key (not) in dict: do_something(dict[key])' pattern should be done with dict.get and dict.setdefault instead; see the one-liners after this list)

3. deleteIndex doesn't need to be a list, it's always at most 1 element
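
For instance, the counting step collapses to a single lookup with dict.get, and the initialization check to setdefault (same behavior, fewer dict probes):

    finishedPaths[serializedPath] = finishedPaths.get(serializedPath, 0) + 1  # replaces the if/else counting
    unfinishedPaths.setdefault(user, [])  # replaces the "if user not in unfinishedPaths" check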


> there's a lot of inefficiencies, which might or might not be acceptable

This is exactly what irritates us about these questions. There's no possible answer that will ever be correct "enough".


Just like in real life, there's no perfect solution to most problems, only different trade-offs.


Thanks for the feedback!

I realized at least the double call to line.split while writing the second instance, but figured I'm in an interview (not a take-home where you polish it before handing in) and this is more about getting a working solution fairly quickly (since there are more questions and topics and most interviews are 1h), and from there the interviewer will steer towards the issues they care about. But then, I've never had to do live coding in an interview, so perhaps I'm wrong? Or I'm over-optimizing what would take a handful of seconds to improve.

That only one user path will ever hit length == 3 at a time is an insight I hadn't realized; that's from minor point #3, but I guess it also shows up in major points #2 and #3, because it means you can design the whole thing differently -- each user having a rolling buffer of 3 elements and a pointer, perhaps. (I guess this is the sort of conversation to have with the interviewer.)

defaultdict: yeah, I know of it; I just don't remember the API by heart, so I don't use it. Not sure the advantage is worth it, but yep, it would look cleaner.

Got curious about the performance now. Downloading 1M lines of my web server logs and formatting them so that IPaddr=user and URI=page (size is now 65 MB), the code runs in 3.1 seconds. I'm not displeased with 322k lines/sec for a quick/naive solution in CPython, I must say. One might argue that for an average webshop, more engineering time would just be wasted :) but of course a better solution would be better.

Finally, I was going to ask what you meant with major point #1 since the task does say top 3 but then I read it one more time and...... right. I should have seen that!

As for that major point though, would you rather see a solution that does not scale to N results? Like, now it can give the top 3 paths but also the top N, whereas a faster solution that keeps a separate variable for the top entry cannot do that (or it needs to keep a list, but then there's more complexity and more O(n) operations). I'm not sure I agree that sorting is not a valid trade-off given the information at hand, that is, not having specified it needs to work realtime on a billion rows, for example. (Checking just now to quantify the time it takes: sorting is about 5% of the time on this 1M lines data sample.)

For anyone curious, the top results from my access logs are

   / -> / -> / with a count of 6120
   /robots.txt -> /robots.txt -> /robots.txt with a count of 4459
   / -> /404.html -> / with a count of 4300


> As for that major point though, would you rather see a solution that does not scale to N results? Like, now it can give the top 3 paths but also the top N, whereas a faster solution that keeps a separate variable for the top entry cannot do that (or it needs to keep a list, but then there's more complexity and more O(n) operations). I'm not sure I agree that sorting is not a valid trade-off given the information at hand, that is, not having specified it needs to work realtime on a billion rows, for example. (Checking just now to quantify the time it takes: sorting is about 5% of the time on this 1M lines data sample.)

You need the list regardless, just do `max` instead of `sort` at the end, which is O(N) rather than O(N log N). Likewise, returning top 3 elements can still be done in O(N) without sorting (with heapq.nlargest or similar), although I agree that you probably shouldn't expect most interviewees to know about this.
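
Concretely, with the finishedPaths dict from the solution above, that looks something like:

    import heapq

    # single O(N) pass for the most frequent path
    best_path, best_count = max(finishedPaths.items(), key=lambda kv: kv[1])

    # top 3 without a full sort
    top3 = heapq.nlargest(3, finishedPaths.items(), key=lambda kv: kv[1])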

As for the rest, as I've said, it depends on the candidate level. From a junior it's fine as-is, although I'd still want them to be able to fix at least some of those issues once I point them out. I'd expect a senior to be able to write a cleaner solution on their own, or at most with minimal prompting (eg "Can you optimize this?")

FYI, defaultdict and setdefault are not the same thing.

  d = defaultdict(list)
  d[key].append(value)
vs

  d = {}
  d.setdefault(key, []).append(value)
setdefault is useful when you only want the "default" behavior in one piece of code but not in others.

  >   / -> / -> / with a count of 6120
  >   /robots.txt -> /robots.txt -> /robots.txt with a count of 4459
LOL


Your solution looks alright. I think you could use a defaultdict() to clean up a few lines of code, and I don't fully understand why you have two nested loops inside your file processing loop.

Here's my solution in TS.

    const parseLog = (input: string) => {
        const userToHistory: {[user: string]: string[] } = {}
        const pageListToFrequencyCount: { [pages: string]: number } = {}

        for (const [user, page, ] of input.trim().split("\n").map(row => row.split(", "))) {
            userToHistory[user] = (userToHistory[user] ?? []).concat(page);

            if (userToHistory[user].length >= 3) {
                const path = userToHistory[user].slice(-3).join(" -> ")

                pageListToFrequencyCount[path] = (pageListToFrequencyCount[path] ?? 0) + 1;
            }
        }

        return Object.entries(pageListToFrequencyCount).sort(([, a], [, b]) => b - a); // descending, so the most common path comes first
    }
It could be slow on large log files because it keeps the whole log in memory. You could speed it up significantly by doing a `.shift()` at the point when you `.slice(-3)` so that you only track the last 3 pages for any user.


Can you give me a couple of examples? I'd like to see where I stand with my knowledge.


One of the first questions I ask is "create a dictionary with three elements in Python and assign it to a variable"

The amount of insane answers I've seen to that one alone...

Then if they pass, I test proficiency by having them loop over the dict and update each value in-place.
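
For reference, the kind of answer I'm looking for is roughly this (keys and values are arbitrary):

    scores = {"a": 1, "b": 2, "c": 3}  # a dict with three elements, assigned to a variable

    for key in scores:
        scores[key] = scores[key] + 10  # update each value in place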


I wonder if people get spooked by the simplicity and think it’s a trick question.


I give them every opportunity to ask questions and even use search/LLM, as long as they acknowledge it. Most candidates just fundamentally aren't practiced enough.


I’m divided. I can do what you ask, but not without googling it. I can produce performant and robust code, but not without double-checking on Google. I’m unable to deliver code that compiles in any language without checking the documentation. Pseudocode, yeah, sure.

So, I wouldn’t pass these kinds of interviews. In over a decade I’ve never been asked these kinds of questions, though (I have done take-home assignments and leetcode, but always with Google open).


Reality check: if you say on your resume that you know Python, then you should be able to make a dictionary with three items and assign it to a variable without googling anything.


Fair point. I don’t like resumes in which people state that they know X or Y. I prefer the ones focused on what problems were solved using which technologies.

I have used Python to solve average business problems, yet I cannot produce non-trivial code without looking at the documentation. Same for the other dozen programming languages I have used in the past.


> yet I cannot produce non-trivial code without looking at the documentation

    hello = {1:1, 2:2, 3:3}
is about as trivial an ask as someone can make.


I know enough Python to read Calibre's code and understand it, but I keep forgetting syntax details and the actual names of methods and properties, because I'm influenced by whatever language I've been writing in lately. I know what to do, but it will be pseudo-Python code.

That can usually be solved by a quick read of the reference documentation (2-5 min?).


That's fair. After you know more than a few languages, it's easy to know what you want to express in a language and the limitations it has, while the particular names it happened to give those concepts are pretty arbitrary and quickly peeked at if you haven't used it in a while.

For my part, I've written enough python that I doubt the literal syntax will ever be far from my fingers.


One of the things that can be tricky about this happens when you’ve legit worked in a few languages and the semantics are perfectly clear in your head but the syntax for any language you haven’t used recently is crowded out by those you have.

I needed a small Perl script recently (Perl 5’s feature set & stability, plus availability in the environment, made it the right fit) and realized that, after 15+ years of no Perl, much of the specific syntax was fuzzy or outright gone from my head, even though I’d contributed to large Perl projects for years.

Python work is much more recent, but I’d bet I would accidentally mix in some JS or even PHP syntax doing the dictionary assignment, at least w/o a cursory lookup. I’d like to think it’d come through that I know what a dict is and what it means to set one up and operate over it, but who knows, I might be interviewing with someone who is evaluating skill on the basis of immediacy of syntactic recall.


dict_variable = {1: 2, 3: None}


And do you work full time as a software engineer, or in some other role? I'm honestly blown away that, if you work as a programmer, this sort of request would require looking at documentation.


I think it gets harder to remember exact syntax details the more experience you have and the more you have worked with different (but very similar) programming languages. I get what OP means: if you have worked with Ruby, JS, Python, Go, PHP, Kotlin, etc., you can easily misremember things like the order of parameters for a given function, whether if conditions require parentheses, whether to use {} or [] for maps, etc.

If you have just started your career and are fully invested in 1 or 2 programming languages, sure, this may sound alien to you.


I get it. I've done a ton of languages too. But, like, that's so ridiculously easy to handle in an interview, right? "I think it's like this [show example], but maybe the hash rocket style is Ruby and it's actually colons. Either way, you get the idea."

If your interviewer finds that problematic, well, that's on them.


Not who you asked, but I work full-time as a Ruby dev. Off the top of my head, I don't remember the order of arguments of the #reduce method block (it's the opposite of what Elixir uses), the exact syntax of the throw/catch mechanism (in Ruby this isn't exception handling), the methods and parameters for writing into a file, bitwise operators, I always ask a LLM about packing/unpacking bytes between arrays and binary strings and many other things. I also mix up #include? and #includes? because these differ between Ruby and Crystal, and there's also #includes in Rails (AR).

So, the equivalent of creating a dictionary, yeah, sure. But there's loads and loads of stuff that I only use maybe once a week (and someone else maybe uses daily) and that I'd have to awkwardly Google (I use Kagi btw) even during an interview.


Same reply as above, you'd easily be able to speak to this in an interview and not hit the "fraud" alarm. "I think it's accumulator, element here on reduce, but I may have them transposed."

Your interviewer is probably also questioning if it's (a, e) or (e, a), but you passed the fraud filter.


An interview question I got (for a security role): "You type www.$company.com into the address bar and press enter. What happens?" After jokingly clarifying that they were not interested in the membrane keyboard interactions, they were more than satisfied with an answer explaining recursive DNS resolution, the TCP and TLS handshakes, the HTTP request itself, and I think from there we got sidetracked. They also asked about document file upload risks, because that was a particular concern in their application. I didn't think of the specific answer they wanted to hear, but after they gave me the keyword XXE, I could explain it in detail, which was also sufficiently satisfactory as far as I could tell. Fun interview overall.

In interviews I've done, we only looked for culture fit because the technical part was a coding assignment they had already done. Honestly it was too big an assignment since it's uncompensated (not my decision), but to my surprise nobody turned it down -- and everyone got it wrong. It was only n=3 or n=4, IIRC, but those applying for a coding position could not loop through a JSON-lines file too big to fit in RAM (each line was a ~1 kB JSON object, but there are lots of lines) and sum some column from each JSON object into a total value. The solutions all worked to some degree, but they all chopped up the file, loaded the first chunk into RAM, and gave an answer for that partial dataset only.
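
For the curious, a minimal sketch of the kind of streaming solution we had in mind (the file name and the summed field, "amount", are made up for illustration):

    import json

    total = 0
    with open("records.jsonl") as f:  # hypothetical file name
        for line in f:  # one ~1 kB JSON object per line; the whole file is never held in RAM
            total += json.loads(line)["amount"]  # "amount" stands in for whatever column was summed

    print(total)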


What were you expecting them to do instead: use mmap?


Can you use single-mode fiber in a house, or do the transceivers only work over much longer distances? Is transceiver burnout an issue?


You can use SM for short runs; you just have to match the optics to the cable/distance you're looking to use. There are tons of very fast single-mode optics out there that only expect <300 m runs.

That said, it's likely not worth it, given that cabling is typically viewed as a ~10-year investment, and if you're installing OM4 multimode fiber in a house, you're not likely to hit the limit of that fiber within 10 years even in extreme use cases.


But why stop at 10 years? If the theoretical bandwidth limit of SM fiber is above 100 Gbit/s, then there is simply no way that a household will need more internal bandwidth than that, even for the next 20-30 years, because other technologies (like SATA) will be the limiting factor.

I believe this could be a case of future-proofing that will actually last 30-40 years, no?


10 years is the guideline because you can't predict most of the future, but you can predict the physical wear fairly well. It's likely that the plastic terminations will be iffy by that point, which means re-terminating or running new fiber, and most of the time you'll just run new fiber.


This sounds similar to a build I'm planning. I cannot find the workstation mainboards at a reasonable price though. They start at like 400€ in Europe.


There's an ASUS one available as well, the ASUS C246 PRO; it's about 250 GBP.

I built mine 2 years ago, so the C246 motherboards are less available now; the C252 is another option, which will take you up to 11th-gen Intel.


Do you mean in comparison to regular BDXL Blu-rays?


Basically all non-LTH BDs. The way I've understood it, the big selling point of M-Disc DVDs vs. regular DVDs was the inorganic dye used in them. But if you buy a non-LTH BD, you'll have an inorganic dye anyway.

The company that actually made M-Discs also went bust, and there's a lot of conflicting information on the Internet about whether or not the current M-Discs are real M-Discs.


That's exactly what I've been looking into lately.

Did you have any specific discs in mind?

I recently bought a "Sony Blu-ray BD-R XL 128GB", but I couldn't find any info on whether it's organic or inorganic.


Any external Blu-ray reader/writer you'd recommend (one that supports discs from 25 GB up to the 128 GB you mention)?


The way I've understood it is that BDs are inorganic as long as you don't buy LTH discs.


Are all LTH discs marked on the package, or how do I find out if my Blu-ray discs are LTH?


The packaging should note somewhere that they're LTH discs, since they're fundamentally quite different from normal HTL discs.


Any chance you could point me in the right direction on how to set something like this up?

Right now, I'm running Llama purely on the CPU, but only the 17B version, based on (I believe) llama.cpp. How do I mix CPU and GPU together for more performance?


The easy way: download koboldcpp. Otherwise you have to compile llama.cpp (or kobold.cpp) with OpenCL or CUDA support. There are instructions for this on the Git repo page.

Then offload as many layers as you can to the GPU with the GPU-layers flag. You will have to play with this and observe your GPU's VRAM.
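
If you'd rather drive it from Python, the llama-cpp-python bindings expose the same knob; a rough sketch (the model path and layer count are made up, and the bindings still need to be built with GPU support for the offload to do anything):

    from llama_cpp import Llama

    llm = Llama(
        model_path="models/llama-13b.Q4_K_M.gguf",  # hypothetical model file
        n_gpu_layers=20,  # how many layers to offload; raise it while watching VRAM usage
    )

    out = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
    print(out["choices"][0]["text"])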


So what should people be learning from then? What are some better resources?



This is really fascinating to me because MW2 was, and still is, one of my favorite games.

Can you tell me more about these RCEs, how they work, or point me to some technical analysis of this game?


Here is a GitHub repo about two (now patched) MW2 exploits: https://github.com/momo5502/cod-exploits

