
Maybe if you get your training data from a source very different from the one your real data comes from, it won't be representative and won't work very well?


Figure out how to run your business unit ethically or shut down the business unit. They don’t have the right to turn their transfer learning problem into an abusive privacy policy.


Yes, if they just paid people to record their voices, they would not get training data for real use cases.

I cannot find the blog post now, but quite a few years ago I recall some Google employees noticed a large number of queries for "cha cha cha cha cha..." from Android users in New York. All of the queries were done using voice search, so they listened to a few of the recordings. It turns out that their speech-to-text models were interpreting the sound of the NYC metro pulling into a station as speech.

Obviously they didn't have enough training data of people trying to talk next to a train.
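Just to illustrate (this is my own toy sketch, not anything from the post; `looks_degenerate` and its threshold are invented), a crude filter like this is one way such degenerate transcripts could be surfaced in a query log:

```python
# Hypothetical sketch: flag transcripts that are mostly one repeated token,
# a pattern like the "cha cha cha" queries that can reveal non-speech audio
# (train noise, music) being forced into words by the recognizer.
from collections import Counter

def looks_degenerate(transcript: str, threshold: float = 0.8) -> bool:
    """Return True if a single token accounts for most of the transcript."""
    tokens = transcript.lower().split()
    if len(tokens) < 3:
        return False
    _, top_count = Counter(tokens).most_common(1)[0]
    return top_count / len(tokens) >= threshold

queries = ["cha cha cha cha cha cha", "what is the weather in new york"]
flagged = [q for q in queries if looks_degenerate(q)]
print(flagged)  # ['cha cha cha cha cha cha']
```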


We test our medicines on a small group that is representative of the entire world's population. We build soil models based on sampling a small region. We don't test all of your blood to run a medical test. I don't know what you mean by "real data", but representative sampling is how work gets done in every single domain in the world. Google can do this.
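For concreteness, here is a minimal sketch of what "representative" means mechanically (the strata and proportions are made up for the example, not anything Google does): draw from each subgroup in proportion to its share of the population.

```python
# Stratified sampling sketch: the sample mirrors the population's structure.
import random

random.seed(0)
population = (
    [{"accent": "US", "id": i} for i in range(7000)]
    + [{"accent": "UK", "id": i} for i in range(2000)]
    + [{"accent": "IN", "id": i} for i in range(1000)]
)

def stratified_sample(items, key, n):
    total = len(items)
    strata = {}
    for item in items:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        k = round(n * len(members) / total)  # proportional allocation
        sample.extend(random.sample(members, k))
    return sample

sample = stratified_sample(population, key=lambda x: x["accent"], n=100)
print(len(sample))  # ~100, with a ~70/20/10 split across accents
```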


Yeah no.

Representative sampling is how we formerly did this kind of work. It wasn't particularly good or effective, but we didn't have the methods or compute to go beyond that. No longer.


You're free to have your own opinion, but anything specific beyond "it doesn't work"? I work in pharma, and we use representative sampling every single day in every single thing we do, and it works.


Representative sampling does 'work' in the sense that it may or may not 'prove' whatever it is you had a question about. But the issue is that you are effectively building your assumptions about what is 'representative' into your sample. It's (imo) the central issue in the reproducibility crisis: our assumptions about the world, and how that shapes the questions we ask about it.

It was previously intractable to do a census rather than a sample, and maybe for your purposes a sample is good enough or a census remains intractable. In my field, this is how things were done for decades (and largely still is), and even though (imo) it did a piss-poor job, it was good enough for some purposes. A piss-poor job is still better than knowing nothing. Maybe this is good enough for your purposes.

There's a third way, however, which is to move beyond sampling and perform a census. This is the difference I'm speaking of. We're at the point where we don't have to sample because we can measure. Effectively, this is what modern data science is. We've always had the ability to sample and interpolate. It doesn't work very well (imo: https://en.wikipedia.org/wiki/Replication_crisis) and usually reflects back to us something about our assumptions in how we sampled. But that's just it: we don't have to rely on a sample if we can take a census.
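A toy demonstration of the point (the population and the recruiting bias are entirely synthetic, chosen only to make the mechanism visible): when the sampling frame is biased, the sample estimate bakes that assumption in, while a census over all records does not.

```python
import random

random.seed(1)
# Synthetic population: 30% of speakers talk in noisy environments.
population = [{"noisy": random.random() < 0.3} for _ in range(100_000)]

# Census: measure everyone.
census_rate = sum(p["noisy"] for p in population) / len(population)

# Biased frame: quiet-environment speakers are 5x easier to recruit.
frame = ([p for p in population if not p["noisy"]] * 5
         + [p for p in population if p["noisy"]])
sample = random.sample(frame, 1_000)
sample_rate = sum(p["noisy"] for p in sample) / len(sample)

print(f"census: {census_rate:.3f}  biased sample: {sample_rate:.3f}")
# The sample understates noisy-environment speech because of the frame,
# not because of the math.
```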


>But the issue is that you are effectively building your assumptions about what is 'representative' into your sample.

Even if I agree with your premise, Google is not going to build a custom voice model for every individual anyway. There will be simplifications made. There will be assumptions made, and they will end up with a representative model anyway. So you're actually just bolstering my point. It makes a ton of sense to record people in a known, controlled environment and tweak variables one by one, such as the size of the room, the location of the microphone, introducing varying amounts of background chatter, etc. This is how normal science happens all the time, and it has worked for us so far. And we haven't even addressed the ethics of spying on people in such a blatant manner. That is a whole other conversation.
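A sketch of the "tweak variables one by one" idea, assuming synthetic stand-in signals rather than real recordings: mix a clean signal with background noise at controlled SNR levels, giving labelled training conditions without recording anyone covertly.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16_000))  # stand-in "speech"
noise = rng.normal(size=clean.shape)                          # stand-in "chatter"

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the mixture has the requested signal-to-noise ratio."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# One controlled variable (SNR), swept across conditions:
dataset = {snr: mix_at_snr(clean, noise, snr) for snr in (20, 10, 0, -5)}
```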

> It doesn't work very well (imo: https://en.wikipedia.org/wiki/Replication_crisis) and usually reflects back to us something about our assumptions in how we sampled. But that's just it.

Modelling aggregate human behavior/psychology is not a proper science. The same is true of macroeconomics and other such non-exact fields. Problems in those fields do not apply across other fields.


This is a very different kind of problem from the ones you listed. One drop of blood is going to be very similar to any other in an individual. That's not true when it comes to language data (or many other types of data, for that matter). The data you would record in a prepared setting (i.e. reading from some predefined set of phrases) is typically not even close to representing the full distribution of phrases/dialogues that humans use.

Furthermore, Google/Amazon/FB do use representative sampling of real user data; it's not feasible to transcribe every interaction with Google Home/Alexa/Siri. This is akin to what you're suggesting, but it in no way addresses the privacy concerns. The only real way to do that is to use authorized data or scripted interactions, which, as described above, are not actually representative samples. It is a complicated and nuanced problem.
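For anyone curious what sampling a stream like that can look like in practice, here's a hedged sketch of one standard technique (reservoir sampling, for a stream whose total size isn't known up front); it says nothing about how any particular company actually selects audio for review.

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

interactions = (f"utterance_{i}" for i in range(1_000_000))
to_transcribe = reservoir_sample(interactions, k=100)
print(len(to_transcribe))  # 100, each interaction equally likely to be chosen
```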


>One drop of blood is going to be very similar to any other in an individual. That's not true when it comes to language data (or many other types of data, for that matter).

Why? Please do explain. If you claim that our biology doesn't vary at all in one domain but varies significantly in another, it should be easy to show this scientifically, or more specifically, to show how this variance is applicable in the context of voice recognition.

Just to take a simple example of blood glucose: using continuous glucose monitors attached at various subcutaneous sites on the body, it is trivial to show that the local glucose level is not identical at all sites.





