
Those aren't redundant or collinear though? Maybe you are surprised they didn't encode this as an integer "num_images"? It is fairly common to one-hot encode ordinal variables that have only a few common or possible values this way.
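
To make that concrete, here is a minimal sketch of one-hot encoding a small ordinal count. The function name `one_hot_num_images` and the cap `max_bucket` are made up for illustration and are not taken from Twitter's code; the last column is assumed to catch every count at or above the cap.

```python
def one_hot_num_images(num_images, max_bucket=4):
    """Hypothetical encoder: one 0/1 column per image count.

    Columns are [0 images, 1 image, ..., max_bucket or more images].
    """
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    # Assumption: counts above the cap share the final bucket.
    bucket = min(num_images, max_bucket)
    return [1 if bucket == i else 0 for i in range(max_bucket + 1)]


no_images = one_hot_num_images(0)
two_images = one_hot_num_images(2)
many_images = one_hot_num_images(7)
```

Exactly one column is set per tweet, so the columns are mutually exclusive rather than redundant, though any ordering between the buckets is no longer visible to the model.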


True, it still seems odd to encode an explicitly ordinal variable as categorical (particularly one with a small finite range, in contrast to the logarithmically bucketed follower counts), but Twitter's layout is weird enough that it could make an impactful difference in terms of engagement.


This is (weirdly) common in production ML codebases written by software engineers. Like you, I have no idea why unless it's a memory optimisation (where you count 4+ as many).


Having every column as a boolean (0/1) means you can treat it as a bitmap. As an (entirely fictional) example, imagine if you wanted to get the features of a thread instead of a single tweet. You could do it as a union of all the tweets:

threadFeatures = tweet1 | tweet2 | tweet3
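
A slightly fuller sketch of the same (still fictional) idea in Python, assuming each boolean feature is assigned one bit. The `TweetFeature` flags and the `thread_features` helper are invented for this example and do not come from any real codebase.

```python
from enum import IntFlag


class TweetFeature(IntFlag):
    # Hypothetical boolean features, one bit each.
    HAS_IMAGE = 1
    HAS_VIDEO = 2
    HAS_LINK = 4
    IS_REPLY = 8


def thread_features(tweets):
    """Union the feature bitmaps of every tweet in a thread."""
    combined = TweetFeature(0)
    for tweet in tweets:
        combined |= tweet  # bitwise OR keeps any bit set in either operand
    return combined


tweet1 = TweetFeature.HAS_IMAGE
tweet2 = TweetFeature.HAS_LINK | TweetFeature.IS_REPLY
tweet3 = TweetFeature.HAS_IMAGE | TweetFeature.HAS_VIDEO

thread = thread_features([tweet1, tweet2, tweet3])
```

The union only works because every column is a 0/1 flag; an integer column like num_images has no comparably cheap merge.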


Ok that makes lots of sense from an engineering perspective. It's pretty insane from a statistical perspective though, which I think was the original point.


> It's pretty insane from a statistical perspective though

Efficiency is way more of a concern, at this scale, than the more trivial task of trying to find a competent person who wouldn't misuse the values.



