jwecker's comments | Hacker News

Spam-detection poisoning


It doesn't look like it's trying to be a startup at all, just showing an idea to HN- but I guess you're saying it should be more of an "ask for early feedback on HN" instead of a "Show HN"- which implies showing something that is working. I'm OK with "Show HN" being the default idiom for anyone posting their own link. I guess it could have had a "(proposal)" or something in the title.


I always assumed it was more a sanitization issue for security's sake. By allowing only a simple ("common") subset of email addresses, you can be agnostic about what email server is running and how it reacts to the wide variety of specially crafted email addresses.

With no validation other than sending the email, you have to know, for example, what the server would do with an email address that claims to be @localhost. Now it becomes a problem- or at least a question and concern- for the backend system. Whether the backend interprets root@localhost as valid and does exactly what it's told, or rejects it due to some configuration, it has become a backend complication and a DoS attack vector.

A simple policy of only handling a subset- the common class of email addresses- is one of the things that allows us to have a simple mental model of what the MTA is supposed to do. The fact that it sometimes caught a typo, or not, is incidental. "Invalid email" wasn't meant to imply the email address doesn't fit the spec- it was meant to imply that a particular site or service has chosen not to accept email addresses like that.

Or at least that's what I assumed :-)


> I always assumed it was more a sanitization issue for security's sake.

Sanitization is at best idiotic, at worst creates security problems. There is no such thing as "bad characters", there only is broken code that incorrectly encodes stuff. If you ever find yourself modifying user input "for security reasons" (or really, for any reason at all), you are doing it wrong. The only sane thing to do is to make sure that the semantics of every single character of your user's input is preserved in whatever data format you need to represent it in.


An email address isn't a document though, it's a routing command. I don't mean sanitization in the sense of inserting backslashes. I mean sanitization in the sense of "we don't allow people to set their email address to a mailbox on localhost at our mail server."


1. Sanitization generally means changing information. As in, "removing bad characters", that kind of stuff. That's different from validation, which should result in rejection of bad input, and which can be perfectly fine. However, more often than not, validation is implemented badly and rejects perfectly fine input, which is why validation shouldn't be employed more than necessary either.

2. Rejecting @localhost addresses doesn't really make a whole lot of sense. People could just enter the public IP address or hostname of the server, or add a DNS A record under their own domain that points to 127.0.0.1, or an MX record that points to localhost, or do any number of other weird things that you could not possibly validate anyway (if only because it could be changed at any point later on). Just configure your mail server properly and then send the damn email, and if it does get sent to root@localhost, and possibly forwarded to the admin--so what? People obviously could just sign up using your admin's email address anyway- and not only at your site, but at millions of sites out there- you won't be able to stop them. There is nothing particularly dangerous about receiving unsolicited signup emails or about sending emails to yourself.


> There is nothing particularly dangerous about receiving unsolicited signup emails or about sending emails to yourself.

Depends on what you do with them, in the latter case. There could be an amplification attack there.

Validating domain parts to a certain extent isn't a bad idea, at least as far as non-routable domain names and RFC1918 ranges go. I've seen this done (actually implemented some of it, in fact) at a past employer, who were basically looking to cover the 90% case in terms of not getting hosed by a trivial attack. It doesn't take much effort and it makes 4chan's life harder. What's not to like?


> Depends on what you do with them, in the latter case. There could be an amplification attack there.

Huh? How would that work?

> Validating domain parts to a certain extent isn't a bad idea, at least as far as non-routable domain names and RFC1918 ranges go.

What do you mean by "non-routable domain names" and what do you gain by checking for RFC1918 ranges?

> I've seen this done (actually implemented some of it, in fact) at a past employer, who were basically looking to cover the 90% case in terms of not getting hosed by a trivial attack.

Why did you prefer that approach over a robust solution?

My idea of a robust solution: Have one central outbound relay that's firewalled off from connecting to anywhere but the outside world, make all servers that need to send email use that relay as a smarthost (so they never connect to anything but that relay, regardless what the destination address is), use TLS and SMTP AUTH with credentials per client server to prevent abuse of the relay by third parties.
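
To make that concrete, here's a minimal sketch of the client side of such a setup, in TypeScript with nodemailer purely as an example SMTP client- the hostname and credentials are placeholders, not anything from an actual deployment:

  import nodemailer from "nodemailer";

  // Every outbound mail goes through the one firewalled relay, never directly to a destination MX.
  const relay = nodemailer.createTransport({
    host: "relay.internal.example",   // placeholder name for the central outbound relay
    port: 587,
    requireTLS: true,                 // TLS on the hop to the relay
    auth: { user: "app-server-01", pass: process.env.RELAY_PASSWORD ?? "" },  // per-client credentials
  });

  // The user-supplied address is passed through untouched; the relay's MTA decides deliverability.
  async function sendConfirmation(to: string): Promise<void> {
    await relay.sendMail({
      from: "noreply@example.com",
      to,
      subject: "Please confirm your address",
      text: "Click the link below to confirm your address...",
    });
  }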

> What's not to like?

(a) that it's a lot easier to build a solution that's more robust, (b) it's extremely likely that your implementation is buggy, thus rejecting valid addresses, and (c) it's causing a maintenance burden (what happens when the first people drop IPv4 for their MXes? I'd pretty much bet that you don't check for AAAA records, so you'd probably suddenly start rejecting perfectly fine email addresses, thus making the transition to IPv6 unnecessarily harder, am I right?).


'Non-routable' as in a single label, or as in not resolvable. I don't think it is unreasonable to consider an address invalid when its domain part cannot be resolved. Checking for RFC1918 ranges means you don't try to send to another class of addresses that's never going to be received.

You would lose the bet. The product supported IPv6 from day one.

That is a robust, if somewhat complex, solution for a relatively small volume of mail. When you're sending ten million messages a day by the end of the first month, pushing everything into a single relay of any kind is asking for a lot of trouble.


> 'Non-routable' as in a single label, or as in not resolvable. I don't think it is unreasonable to consider an address invalid when its domain part cannot be resolved.

What exactly do you mean by "cannot be resolved"?

> Checking for RFC1918 ranges means you don't try to send to another class of addresses that's never going to be received.

But why check for it? Is that actually a common mistake people make? An attacker could change the address after you checked it, so it's not going to help against attackers, is it?

> You would lose the bet. The product supported IPv6 from day one.

Good for you! :-)

> That is a robust, if somewhat complex, solution for a relatively small volume of mail. When you're sending ten million messages a day by the end of the first month, pushing everything into a single relay of any kind is asking for a lot of trouble.

Complex? Certainly less so than implementing validation yourself.

As for scalability: Well, yeah, as described that's more the setup for a company that's operating various different services, none of which has a high volume of outbound email (which describes most companies- even the best startups don't have ten million signups per day or send much email otherwise, and even that should actually still be manageable with a single server).

But that's trivial to adapt without changing the general approach. First of all, obviously, you could just add more relay servers and have client servers select one randomly; that scales linearly. But if you really need to move massive amounts of email for one service, so that adding an additional relay hop for each email you send actually adds up to noticeable costs, you can still use the same approach: Just put the MTA onto the same machine(s) that the service is running on, into its own network namespace (assuming Linux, analogous technology exists on other platforms), and firewall it off there so it cannot connect to your internal network. Potentially you can even just add blackhole routes for your internal networks/RFC1918 ranges, so you would not even need a stateful packet filter (though currently you might still need it due to IPv4 address shortage).


"Cannot be resolved" means NXDOMAIN.

Why assume email addresses only get checked in one place, and not all?

Ten million a day was a milestone. I left that company over a year ago; it would astonish me to find that figure now exceeded by less than a factor of twenty. Granted these are mostly not signups. They are outgoing emails nonetheless, which makes the case germane despite that superficial distinction.

Your proposed solution sounds pretty expensive in ops resource, to no obviously greater benefit than the rather simple (well under one dev-day) option we chose. You seem to feel yours is strongly preferable, but I still don't understand why.


NXDOMAIN can be a temporary error. SMTP's queuing model is designed to be resilient against DNS failures, internet outages, routing problems, and temporary mail delivery issues.


> NXDOMAIN can be a temporary error.

Unless some DNS server is broken, it actually cannot. NXDOMAIN is an authoritative answer that tells you that the domain positively does not exist. Not to be confused with SERVFAIL, which you should get if the DNS resolver ran into a timeout or got an unintelligible response or whatever: NXDOMAIN should only occur if the authoritative nameserver of a parent zone explicitly says "I neither know this zone locally nor have a delegation for it".
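
To make the distinction concrete, a rough Node/TypeScript sketch- assuming Node's resolver surfaces NXDOMAIN as ENOTFOUND (my reading of its dns module; ENODATA just means the name exists but has no records of that type):

  import { promises as dns } from "node:dns";

  type DomainCheck = "resolves" | "nxdomain" | "try-later";

  async function checkDomain(domain: string): Promise<DomainCheck> {
    try {
      await dns.resolveMx(domain);                       // MX lookup; A/AAAA fallback omitted here
      return "resolves";
    } catch (err: any) {
      if (err.code === "ENOTFOUND") return "nxdomain";   // authoritative: the name does not exist
      if (err.code === "ENODATA") return "resolves";     // name exists, just no MX records
      return "try-later";                                // SERVFAIL, timeouts, etc.: don't reject on these
    }
  }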


> "Cannot be resolved" means NXDOMAIN.

OK, that at least shouldn't reject any valid addresses, so maybe ...

> Why assume email addresses only get checked in one place, and not all?

It's not an assumption, it's just a matter of simplicity and reliability.

> Ten million a day was a milestone. I left that company over a year ago; it would astonish me to find that figure now exceeded by less than a factor of twenty. Granted these are mostly not signups. They are outgoing emails nonetheless, which makes the case germane despite that superficial distinction.

Well, no clue how well common MTAs would cope with that, but ~2,500 mails per second (200 million a day averaged over the day) could still be within the power of a single machine if it's designed for high performance. But regardless, I don't think that really matters: if you have a relatively low volume of email, it's probably most efficient and secure to handle it all with one central outbound relay; if you need to send lots of email, it obviously makes sense to distribute the load, but that shouldn't otherwise change the strategy.

> Your proposed solution sounds pretty expensive in ops resource, to no obviously greater benefit than the rather simple (well under one dev-day) option we chose. You seem to feel yours is strongly preferable, but I still don't understand why.

What sounds expensive about it?

Whether there is any benefit to it: Well, depends on your goals and what exactly your solution actually does. I still don't understand why (or even how exactly) you do those RFC1918 checks, for example?! It seems like it's mostly a security measure? But then, it's not actually secure, it's essentially a race condition/TOCTTOU (time-of-check-to-time-of-use) problem. Plus, it might even break valid email addresses.

Essentially, there are four reasons why I think just delegating the validation of email addresses to the mail server is the best strategy:

1. Implementing your own checks risks introducing additional mistakes (which might lead to the rejection of valid addresses).

2. Implementing your own checks is additional work when you could just use the MTA, which already knows how to do this (and which you have to install/configure/use anyway as soon as you want to actually use the email address). That goes both for the initial implementation and for subsequent maintenance: if you just let your MTA do the work, only the MTA needs to be adapted to any changes in how emails get delivered, abstracting the problem away from any software that's supposed to be sending emails and isolating it from the lower layers.

3. My approach actually gives you the perfect result, in that it does not reject any valid addresses (assuming your MTA implements the RFCs correctly), and at the same time is perfectly secure against all possible abuses with weird addresses- which is impossible to achieve when you separate the check from the actual abuse scenario- and it's even kind of trivial to see that this is the case.

4. You have to have the infrastructure to deal with bounces anyhow, both because you ultimately cannot be sure an address actually exists unless you have successfully delivered an email to it, and because addresses that once existed might not exist anymore at a later time, so it's not like you can avoid that if only you validate your addresses better.

As for "ops resources": Assuming that any service that needs to send > 10 million mails per day will be deploying machines automatically anyway, what's so much more expensive about deploying the configuration of an additional network namespace? Writing that script certainly shouldn't be more than a day of work either, should it?


I may have erred in giving the impression that the application-level checks are the only line of defense here. They're not. The (bespoke) MTA underlying this product performs most if not all of these checks as well. I didn't really spend any time on that side of the business, so I might be wrong about that, but it would be something of a surprise. I do know our analytics needed to be able to cope usefully with an astonishing panoply of bogosity warnings that came back from the MTA, but I no longer recall exactly what they covered. And, in any case, it's nice when you can tell the user "hey, this isn't deliverable" before it gets to the point of a bounce.

Checking whether an email address's domain-part is an RFC1918 IP is actually pretty easy. Split the address by '@'. The last piece is the domain part. Split it by '.'. If there are four pieces, all of which meaningfully cast to integers, treat it as an IP address. (Otherwise it's a domain name, which is fine as long as it has more than one part and isn't NXDOMAIN when the backend tries to resolve it.) If any part is negative or greater than 255, it's invalid. If it starts with [10] or [192, 168], or if it starts with [172] and the second part is between 16 and 31 inclusive, it's an RFC1918 address. Otherwise, it's fine.
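
In TypeScript that logic comes out to roughly the following (names are mine, not the production code, and Number()'s quirks around things like empty strings are glossed over):

  type Verdict = "hostname" | "invalid-ip" | "rfc1918" | "public-ip";

  function classifyDomainPart(address: string): Verdict {
    const domain = address.split("@").pop() ?? "";       // the last piece is the domain part
    const pieces = domain.split(".");
    const octets = pieces.map(Number);
    // Not four integer pieces: treat it as a domain name (resolvability is checked elsewhere).
    if (pieces.length !== 4 || octets.some(o => !Number.isInteger(o))) return "hostname";
    if (octets.some(o => o < 0 || o > 255)) return "invalid-ip";
    if (octets[0] === 10) return "rfc1918";
    if (octets[0] === 192 && octets[1] === 168) return "rfc1918";
    if (octets[0] === 172 && octets[1] >= 16 && octets[1] <= 31) return "rfc1918";
    return "public-ip";
  }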

Even with unit tests, that takes almost no time to write, and when your frontend and backend share a language, as ours did, you can use the same logic in both places. Node gives you a name resolver binding for free. I really can't imagine writing, testing, validating, and rolling out a change to the MTA node manifest being as quick.


> Checking whether an email address's domain-part is an RFC1918 IP is actually pretty easy.

What I still don't understand: Why do that at all? What's the point of this? Security? Preventing user errors? What else?


A little of both.

We were pretty sure it would already be impossible, or nearly so, for a malicious user to probe our infrastructure this way, but when it's so simple to be even more sure, why not?

Similarly, we'd already observed a low but nonzero rate of users inadvertently providing such addresses - not during signup or onboarding so much, but in recipient lists they submitted. Since we used the same recipient checking code everywhere, why not cut that back to zero, too?


Then reject email addressed to localhost. It shouldn't matter how the email got there. I'd suggest that, especially given DNS trickery like setting up a low TTL and then redirecting to 127.0.0.1, you're probably not preventing this from happening anyway- or you'd have to invalidate any unrecognised domain. Better to solve that problem at a different layer -- validate the email address by sending a validation link if you must...


True, but that's my point- it's a backend issue, not a front-end "help the user" issue.


And their point is, it's a backend issue, the backend being the mail client/server that already completely handles the sending/receiving of emails. Either the activation link gets clicked or it doesn't. The click is the only correct validation, and yes, the whole process happens on the "backend".


I mostly agree, but there are cases where some definition of sanitization is the only appropriate thing- for example, if you allow users to create content with a lightweight subset of HTML for the sake of formatting control and want to render that HTML in your page. In such cases, the correct way to sanitize it is not via regexps but via a DOM parser that takes user input, builds a DOM, and then emits rendered HTML according to a whitelist of allowed tags/attributes. You might argue DOM parsing isn't sanitization and so still matches your assertion; however, it's common and not really inaccurate to call this sanitization.


Well, it depends ... ;-)

The important thing is to not change information. "Sanitization" as it is commonly used means doing something that (potentially) changes information. Which is in contrast to decoding/encoding/parsing/unparsing/translation/..., which, if done correctly, change representation, but not information.

So, to make it a useful distinction, I would call anything that potentially changes the semantics of the processed data "sanitization", and avoid using the term for anything else.

So, simply parsing a string with an HTML parser, possibly checking for acceptable elements, and then serializing back into some sort of canonical form that is semantically equivalent to the input, that's perfectly fine, and I wouldn't call that sanitization, but rather validation and canonicalization.

If you simply start dropping elements, though, that's probably a bad idea, just as simply dropping "<" characters is a bad idea, because those elements presumably bear some semantic meaning, just as a "<" in a message presumably bears some semantic meaning.

Now, it is not always obvious which level of abstraction to evaluate the semantics (and thus the preservation of semantics) at. So, it might be perfectly fine, for example, to remove or replace some elements where the semantics are known and you can show that, say, removing emphasis still generally preserves the meaning of a text.

But a whitelist approach where you simply remove everything that isn't on the whitelist usually is a bad idea. If you want to have a whitelist, use it for validation, and reject anything that's not acceptable, so the user can transform their input in such a way as to avoid any constructs you don't want, while still retaining the meaning of what they are trying to say.
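
For illustration, a rough browser-side TypeScript sketch of validate-then-canonicalize (attribute checking, which you would also need in practice, is left out, and the whitelist here is arbitrary):

  const ALLOWED = new Set(["P", "EM", "STRONG", "A", "UL", "OL", "LI", "BR"]);

  function validateAndCanonicalize(userHtml: string): string {
    const doc = new DOMParser().parseFromString(userHtml, "text/html");
    for (const el of Array.from(doc.body.querySelectorAll("*"))) {
      if (!ALLOWED.has(el.tagName)) {
        // Reject rather than silently drop, so the user can rephrase without losing meaning.
        throw new Error(`<${el.tagName.toLowerCase()}> is not an accepted element`);
      }
    }
    // Re-serializing the parsed DOM yields a canonical form that is semantically equivalent.
    return doc.body.innerHTML;
  }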


I hear what you're saying and it represents an ideal. But there are circumstances where information really has to be removed. Perhaps because the user is no longer present and it was collected under circumstances that had more liberal validation. Or because you're handing information across a boundary of implementation ownership and can't trust the receiver to handle potentially dangerous information correctly. I agree that sanitization (in the sense of stripping bits out of a user data payload according to some security rules) shouldn't be the first tool in the toolbox, but I would really hesitate to say it's always the wrong thing to do.

Edit: here's a good example. I don't know if they still do this, but when I worked at Yahoo!, they used a modified version of PHP that applied a comprehensive sanitization process to all user inputs. As a frontend coder at Y!, all the information you pulled from request parameters, headers, etc, ran through this validation at the PHP level before your app code got to it. You can then literally splat this information into an html page raw, without any further treatment, and not expose the Y! property you work for to an XSS or other injection vectors. There were ways to obtain the raw input using explicit accessors when needed, and these workarounds were detectable by code monitoring tools and had to be reviewed and approved by security team(s). Overall this worked really quite well, in my opinion. Y! could hire junior frontend devs without deep knowledge of data encoding, security issues, etc etc and rest easy. I think the principle of safe-by-default, even if it means destruction of user input in some cases via aggressive sanitization, is a good principle to apply to a frontend framework.


re edit:

No, that's just a terrible idea. It might work quite well in the sense that it prevents server security holes. But it makes for terrible usability, and potentially even security problems for the user. The user expects that their input is reproduced correctly, and if it isn't, that can potentially have catastrophic consequences because it might result in silent corruption.

See also: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...

If you think that it should be possible to have developers who don't understand this problem and its solution, you might as well argue that it should be possible to have developers who don't know any arithmetic or who are illiterate.

If you want to have some help from the computer in avoiding injection attacks, the solution is a type system that ensures that you cannot accidentally use user input as HTML or SQL, for example, or possibly automates coercion when you need to insert pieces of one language into another.
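
A tiny TypeScript sketch of that idea, with made-up names rather than any real framework's API:

  // A string that has been vouched for as safe to insert into HTML.
  type SafeHtml = string & { readonly __brand: "SafeHtml" };

  // Encoding, not sanitization: every character of the input keeps its meaning.
  function htmlEscape(userInput: string): SafeHtml {
    return userInput
      .replace(/&/g, "&amp;")
      .replace(/</g, "&lt;")
      .replace(/>/g, "&gt;")
      .replace(/"/g, "&quot;")
      .replace(/'/g, "&#39;") as SafeHtml;
  }

  function render(fragment: SafeHtml): void {
    document.body.insertAdjacentHTML("beforeend", fragment);
  }

  // render(rawInput);              // type error: a plain string is not SafeHtml
  // render(htmlEscape(rawInput));  // fine: representation changed, information preserved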


And that's a great philosophy until your email gets rejected by some service that picked a different "common class of email addresses" than you did. This is precisely why we have written standards.


Back when SMTP servers still had remnants of UUCP etc.- when the address could actually contain characters that specified intermediate servers to route through- I would have argued that front-end sanitization was as important, from a security perspective, as, say, sanitizing HTML from end users.

However, the IETF made lots of progress simplifying things- to the point where, at the very least, the standard tells us specifically that we should leave it up to the destination host to interpret the local part of the email address-- that is, the host to the right of the @ should ideally be handed the thing on the left of it unmolested, with intermediate relay servers ignoring it entirely. Since the local part is what most people complain about, any validation to the left of the @ should become extinct.

But off the top of my head that still leaves the thing on the right of the @ (such as localhost), buffer overflows from allowing longer strings than the standard permits (those limits do exist), and the problem of multiple @s, which the MTA may or may not handle well... Since I'm not a security expert I'm going to go out on a limb and assume that I'm missing a bunch of other things.

My point is not that the article is wrong, though- my point is that if he wanted to convince me to validate only by sending the email, for any string, he should convince me that those security concerns are not an issue- not just that it's not good at catching typos.


I think there is a reasonable middle-ground for validating the domain side of an email address. There are RFCs on all this stuff; it's not just a total free-for-all. The RFCs just aren't nearly as strict as a lot of badly-designed validation regexps are, presumably because most people are unaware of the diversity of acceptable email addresses.

The currently operative RFC is 5322, specifically section 3.4.1: https://tools.ietf.org/html/rfc5322#section-3.4.1

There are some basic rules that you could safely apply to an address, which would prevent some attacks (buffer overflows, etc.) while also not blocking any legitimate addresses. Limiting the overall length to 255 characters, for instance, could be defensible practice. There are also well-defined rules for validating the domain portion, since it has to be a routable address by definition.

What nobody ought to be doing is looking too hard at the string to the left of the @ symbol, because it's designed purely as instructions to the recipient server. Nobody else needs to care about it; only the receiving mailserver needs to actually parse that part of the address, in order to put the message into the right mailbox. From what I've seen, the vast majority of false-positive validation failures occur because people are looking at the mailbox portion of an email address when they have no business doing so.
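
For illustration, a minimal TypeScript sketch of that stance- only cheap structural checks, nothing to the left of the final @ (the 255-character cap mentioned above is taken as a given, not gospel):

  function looksSendable(address: string): boolean {
    if (address.length > 255) return false;                  // overall length cap, as suggested above
    const at = address.lastIndexOf("@");                     // quoted local parts may themselves contain '@'
    if (at <= 0 || at === address.length - 1) return false;  // need something on both sides of it
    const domain = address.slice(at + 1);
    if (/\s/.test(domain)) return false;                     // domain labels never contain whitespace
    return true;                                             // the local part is the recipient server's business
  }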


Your security model is garbage if you depend on controlling all apps that might send email to your mail server.


I know the one. http://boxcar2d.com/ is similar (possibly based on the one you're referring to but it looks like it's been updated). Also this: http://rednuht.org/genetic_cars_2/


It was just like these, but you got to draw the vehicles instead of having them be generated.


This sounds like a huge improvement from when I talked with them a few years ago.


When I interviewed with them, it seemed the interviewers had quite a large effect on how each session went. I had four great interviews, and two that were unpleasant. This had very little to do with how challenging the questions were and a lot more to do with the style of the interviewer.


I agree with you, but it's still easier to fire someone than to un-not-hire someone ;-)


I interviewed @ google several years ago as part of some due-diligence they were doing to decide if they wanted to acquire the startup I was working at, [redacted].

It was several hours, with many interviewers one at a time, of academic exercises. Some I could see were relevant to some of Google's projects, but that wasn't the emphasis- it was very much a "how much do you remember from CS courses?" affair (with a few exceptions- one had me explain why a snippet of C would crash, something more like what I would expect a senior engineer to know, and more practically oriented).

I _think_ I did pretty well (again, it was part of a bigger due-diligence-- another story altogether, so I never heard), but in the end I was struck by the fact that (a) not a single interviewer had looked at or asked about any of my publicly available work, and (b) not a single one even asked something along the lines of "what are your strengths / what do you think your contribution would be" (or weaknesses, for that matter, but that seems less important).

Of course, one of the other guys I worked with lucked out and got interviewed by one of the paypal-mafia, and ended up with a reportedly very stimulating hour-long discussion about actual work, strengths, and actual situations that one would hypothetically need to handle at google.


Your previous work is what gets you in the door, but interviews at Google are explicitly structured so that you need to demonstrate competence right then and there.

And ye olde "what's your greatest strength/weakness" question sounds like a terrible way to judge an engineer.


Makes sense for google. And it's not that I didn't appreciate doing all that (and I did do well), it just struck me how they knew nothing else about me afterward except that I could answer academic questions, and how different that was from the startup world- where, at the very least, you want to know what they're passionate about and that they are productive.

Asking an engineer what their strengths are and then to demonstrate or explain in concrete terms is how you find out their unique value proposition. I'm not talking about "I'm a hard worker" type garbage, I mean "my strongest experience is in systems programming on Linux." "How so?" "For example, I wrote a specialized TCP kernel module that works like this..."

Again, not saying that should be the whole interview- but there are some non-academic skills and experience (like managing and leading a critical system used by thousands of people) that are worth more than a few weaknesses in problems that were already solved years ago and that are easily looked up. It's how you should judge people looking to get into a graduate program, but falls short if you're trying to build teams of awesome fury that need more broad experience and even (gasp) unique experience that you don't have a ready-made question for ;-)


For broad-spectrum roles like software engineers, Google hires people first and then figures out what team they should join later. If you want a specialist role, then you'll need to find one on their jobs page.

To illustrate, here's the kind of thing Mr. TCP kernel module should be applying for: https://www.google.com/about/careers/search#!t=jo&jid=107205...

And here's the kind of generic role they probably shouldn't be: https://www.google.com/about/careers/search#!t=jo&jid=34154&


> not a single interviewer had looked at or asked about any of my publicly available work

It could be dangerous. Let's say you explain some interesting techniques you used on your GPL'd software and a Google product starts doing something suspiciously like what you showed them. You may even sue and end up making their source public after some litigation.


You can't make anyone make source code public, only stop them from distributing copies of code which you have authored. Copyright and patents work very differently, and it's quite hard (impossible, really) to copy source code wholesale from an interview by accident.

If they did make copies of an applicant's copyrighted work and distribute it, then they could get sued. Some companies have been known to explicitly ask applicants to bring source code from previous workplaces (also called industrial espionage), but those were clearly intended crimes and not something that just happened to a company during the process of an interview.


Replying to myself- too late to edit. Some other comments in here lead me to believe there would ideally be experience-related questions these days. Again, my experience was a few years ago :-)


I was very excited to learn about recursive neural networks and how they are different from recurrent neural networks. I began imagining self-similar fractal topologies and automatic convolution layer creation etc... Disappointed to see it was just a typo :-)


There are actually recursive neural networks. https://en.wikipedia.org/wiki/Recursive_neural_network


Quick fix, in Nim:

  foo_bar = fooBar    # treated as the same identifier: underscores and non-leading case are ignored
  Foo_bar = FooBar    # likewise the same identifier
  foo_bar != FooBar   # different identifiers: the case of the first character is significant
That is, identifiers are case-sensitive only in their first character.

I've mentioned elsewhere that this one bugged me at first, but in practice it simply means that you get to use the language as if your preferred convention (whether snake_case or camelCase) is the "official" one- even when calling other libraries. Nothing more and nothing less. Cases where you need both styles but need them to be different things tend to be non-existent, or a code smell.


Comment voting a la Hacker News / Reddit would be useful, but in this case so would some attached polls. The +1/-1 dates from "back in the day", when it was a useful way for the active contributors to actually vote on something (or straw-poll it) on a mailing list etc.

It's annoying in some cases now because 1) it's anyone, not just the active contributors or even users, 2) people don't use it as a vote but as a general expression of support, 3) -1 isn't really used anymore so you can't scan them and see what the sentiment is, and 4) in this case, it's not even clear what they are lending support to- the merge, or that it is time to discuss a merge seriously.

So yeah, in this case it's noise but in some contexts +1/-1 is a quick and simple way to govern code and it has historical roots. Just ranting as I think they and some in this thread are a little too quick to judge. +1/-1 from the right people could be considered very relevant information to a discussion (esp. within a focused community).

