
Probably worth adjusting the title -- air quality is not just a measure of pollution. In this case, both SF and Portland are filled (unfortunately) with wildfire smoke, rather than a man-made pollutant...


Why do you want the title adjusted? It’s sadly just a fact that the air pollution is bad in SF today.

If bad air pollution is caused by factories, do people in those towns come and make similar arguments?


I assumed the title meant man-made pollutants. That's colloquially what pollution refers to. It's pretty clearly misleading.


Because the title is no longer correct? It's no longer the worst. Not denying that pollution is bad in SF.


That’s just semantics -- it was the worst very recently?


Is smoke not considered a pollutant? The list is just a ranking of current PM2.5 AQI levels.


Generally, when people are talking about pollutants, they're talking about man-made pollutants.


WHO includes smoke in their overview page on air pollution, as does Wikipedia.

- https://www.who.int/health-topics/air-pollution

- https://en.wikipedia.org/wiki/Air_pollution#Sources


Most smoke that is air pollution is caused by farmers burning crops.

You can include smoke in air pollution numbers while not considering wildfires to be pollution.


As climate change was a key driver of these fires and those in Australia earlier this year, I feel like it's perfectly reasonable to say this is man-made pollution.

Coal and oil are also "natural" fuels, but we burn those too and we say it's man-made pollution.


The fire was man-made.


Are you folks planning on extending this to speech? I've always been disappointed by how speech vocoder networks aren't built with any great inductive biases for waveform generation (besides very long receptive fields), and have desperately wanted something like this tuned for speech. It'd be great if a DSP-based architecture could be shown to outperform WaveNet / Parallel WaveNet / WaveRNN / WaveFlow / etc., and I'd love to use that in our own work. (There have been some attempts based on source-filter models, like the "neural source filter (NSF) network", but nothing's caught on as best I can tell.)


I'm an author on a few of these papers referenced (the Deep Voice papers from Baidu). I'm happy to answer any questions folks may have about neural speech synthesis, as I've been working on this for several years now.

In general, it's a fascinating space. There are challenges in text processing (not even mentioned in the blog), such as grapheme-to-phoneme conversion, part-of-speech detection, word sense disambiguation, and text normalization; challenges in utterance-level modeling (spectrograms); and challenges in "spectrogram inversion" / waveform synthesis.

The NLP components of the pipeline are often overlooked but are no less important than they were a few years ago -- part of speech / word sense is the difference between "Time is a CONstruct" and "I'm going to conSTRUCT a tower", and is the difference between "Let's drop that bass" being about a DJ or about a fish.

The acoustic modeling phase (e.g. Tacotron, Deep Voice 3) works fairly well, and can produce some awesome demos with things like style tokens ("GST-Tacotron"), but still has a ways to go until it can encompass the full range of human inflection and emotion. At the waveform synthesis level, models like WaveRNN (with subscale modeling) and Parallel WaveNet make it possible to deploy modern waveform synthesis models, but it's still a major issue to deploy them onto low-power devices due to compute restrictions.

Overall, there are lots of interesting challenges to work on, and we're making a lot of progress quite quickly -- and I haven't even started talking about voice conversion or voice cloning!


What do you think is the best neural network currently for processing and possibly generating 44.1 kHz music audio data?

If we're stuck with downsampling to 16 kHz, my question still stands.


I don't think anything about the current set of tools is specific to sample rate; WaveNet, Tacotron, WaveRNN, etc. should work fine to generate 44.1 kHz audio. They might just need slightly different hyperparameters or sizes to work well, or may take longer to train due to longer sequence lengths.


Cool! Does text-to-speech require AI, or is there any active work in non-AI methods? Which bits are the AI bits? Do "deep" methods substantially improve over whatever classical methods we might have had?


I'll try to answer these one at a time.

1. Does text-to-speech require AI?

This one is a bit tricky to answer, since it requires defining "AI". AI as a moniker has been used to describe deep neural networks, search algorithms, expert systems and logic systems, particle filters, SVMs, etc. Almost all text-to-speech (TTS) systems are based on a combination of some of these machine-learning methods and digital signal processing (DSP), so I would say yeah, text-to-speech is exactly what AI describes, even if it doesn't resemble human-like thinking the way other AI applications do.

2. Is there any active work in non-AI methods?

This one again is a bit tricky, for the same reason as before. However, there are a ton of pieces of the TTS pipeline that aren't AI in the current sense of the word (machine learning with neural networks or HMMs or other classifiers).

For example, concatenative systems will traditionally take a large database of audio, divide it into chunks, and then recombine those chunks, using some interpolation method (such as OLA or PSOLA) to overlap them. Choosing the chunks to overlap to create the target speech becomes an AI / search problem, using some sort of acoustic model to predict the acoustic parameters of each frame and then using a Viterbi search algorithm with target / join costs to find the optimal chunks.

As another example of non-AI parts of the pipeline, text normalization tends to involve a lot of hand-written rules; for example, should you say "5/10/2019" as "May tenth, twenty nineteen", "the tenth of May twenty nineteen", "the tenth of May two thousand nineteen", or even "October fifth, twenty nineteen"? This decision and the conversion are often done with a ton of handwritten rules or grammars (see Kestrel, Google's text normalization system, and the open-source version, cleverly named Sparrowhawk).

Anyways, the real answer is that TTS is always a combination of AI (machine learning) approaches with specialized text and audio processing algorithms.
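To give a flavor of what one of those hand-written rules might look like, here's a toy sketch in Haskell (hypothetical code, nothing to do with Kestrel's actual implementation; it assumes numeric fields and only covers one US-style reading):

    -- Toy hand-written rule for one US-style reading of "M/D/YYYY" dates.
    -- Purely illustrative: real systems encode large rule grammars, and must
    -- also resolve the "5/10" ambiguity (May 10th vs the 5th of October).
    spellDateUS :: String -> String
    spellDateUS s = case map read (words (map slash s)) of
        [m, d, y] -> unwords [months !! (m - 1), ordinals !! (d - 1), spellYear y]
        _         -> s                          -- not a date token; pass through
      where
        slash c  = if c == '/' then ' ' else c  -- split on the slashes
        months   = words "january february march april may june july \
                         \august september october november december"
        ordinals = words "first second third fourth fifth sixth seventh \
                         \eighth ninth tenth"   -- days 1..10 only, for brevity
        spellYear 2019 = "twenty nineteen"      -- stub; real rules cover all years
        spellYear y    = show y

    -- ghci> spellDateUS "5/10/2019"
    -- "may tenth twenty nineteen"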

3. Which bits are the AI bits?

The AI bits are the bits where you need to make some sort of heuristic decision, and you'd like to make it by imitating some target speech: for example, things like part-of-speech detection, predicting acoustic parameters (spectrograms, F0, etc.), and, more recently, waveform synthesis as well.

4. Do deep methods significantly improve on the state of the art?

Yes, though they also come at a cost. For example, deep sequence-to-sequence networks make great frame-level models: Tacotron and similar models can do things like emotional and stylized voice synthesis much better than what I've seen HMMs and other non-deep models do. As another example, WaveNet / WaveRNN / etc are some of the only parametric speech models (that is, generating the waveform from scratch instead of copying it from a database of audio) that can match the quality of concatenative models (copying audio from a database), but they can be quite difficult to deploy due to high computational cost. Overall, though, yeah, deep methods and all the improvements to neural networks in the past few years are having a profound impact on the quality and naturalness of TTS.


Thanks very much for your reply, super helpful!! Sorry if that was difficult to answer. I guess I'm interested in how far we've gone from TTS engines like the LPC [1] engines we had in the 80s, or what you get from festival [2]. Maybe there isn't as clear a separation between their methods and the modern Google-scale deep-learning approaches as I thought.

[1] https://en.wikipedia.org/wiki/Linear_predictive_coding

(e.g. as seen in https://en.wikipedia.org/wiki/Texas_Instruments_TI-99/4A)

[2] http://www.festvox.org/festival/


There are a few recent papers, actually, that show minor improvements by integrating LPC prediction into deep methods ([0], [1]). In my experience (some of which comes from reproducing these, some of which comes from my own experiments), this isn't actually too useful, and at most offers a minor modeling benefit.

The main difference between something like Festival and what we have now is the amount of domain-specific engineering. (This is generally the promise of deep learning -- replace hand-engineered features with simple-to-understand features and a deep model.) If you go and read the Festival manual, you're going to find tons of domain-specific rules and heuristics and subroutines; for example, there's a page on writing letter-to-sound rules as a grammar [2]. Nowadays, we may have a pipeline that resembles Festival at the high level, but each step of the pipeline is learned as a deep model from data rather than being carefully hand-engineered by many people over the course of years. This yields much more fluid speech as well as much, much faster iteration and experimentation times, leading to faster progress overall.

[0] https://arxiv.org/abs/1811.11913

[1] https://people.xiph.org/~jm/demo/lpcnet/

[2] http://www.festvox.org/docs/manual-2.4.0/festival_13.html#Le...


Thanks for posting here! Do you see any chance of an open-source framework, like Mozilla's Tacotron, competing with something like Google WaveNet's quality?


First of all, it's important to note that Tacotron and WaveNet are responsible for different parts of the speech synthesis pipeline, so the comparison here isn't quite accurate. Specifically, Tacotron takes a representation of the text (characters, phonemes, etc) and converts it into a frame-level acoustic representation (spectrograms, log mel spectrograms, etc, spaced every 5-25ms). WaveNet takes a frame-level representation of the audio (for example, the output of Tacotron, or phonemes with frame-level timing information) and converts it to a waveform.
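As a rough sketch of that division of labor (hypothetical names and representations, with the trained networks stubbed out -- just to show the shape of the two stages):

    type Phonemes  = [String]   -- output of the text frontend
    type MelFrames = [[Float]]  -- e.g. log mel spectrogram frames every 5-25ms
    type Waveform  = [Float]    -- audio samples

    acousticModel :: Phonemes -> MelFrames  -- the Tacotron-style stage
    acousticModel = undefined               -- stands in for a trained network

    vocoder :: MelFrames -> Waveform        -- the WaveNet-style stage
    vocoder = undefined                     -- stands in for a trained network

    textToSpeech :: Phonemes -> Waveform
    textToSpeech = vocoder . acousticModel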

Second, I don't see any reason why there shouldn't be an open-source Tacotron or WaveNet implementation that's as good as Google's model implementations. Implementing and training these models is expensive but not prohibitively so (nowadays, you could probably do it with $5,000 - $10,000, including experimentation costs).

That said, the quality of text-to-speech systems is only partially determined by the quality of these models -- much if not most of the work of building high-quality text-to-speech systems goes into things like high-quality data collection systems, good data annotations, good normalization and NLP tailored to the domain of the TTS system, multi-language support, optimized inference implementations for server or mobile platforms, etc.


Great work on the paper!


Ta-Nehisi Coates, Between the World and Me.

https://en.wikipedia.org/wiki/Between_the_World_and_Me

"It is written as a letter to the author's teenage son about the feelings, symbolism, and realities associated with being Black in the United States."

I started reading it and couldn't put it down.


The new Gmail interface has snooze, so if that's the only thing you're missing, it'll stick around. I mostly miss the "Trips" feature, having recently switched from Inbox to the new Gmail...


The automatic bundling in Inbox - trips, purchases, finance, etc - is a killer feature for me. I tried going back to gmail and immediately missed it. I'll keep using Inbox until they take it out back, just like Reader.


"Monoid" is an adjective that describes a data type.

Anything you describe as a monoid has to have three properties: you can add them together, there's an "empty" or "zero" value, and (a + b) + c = a + (b + c).

For example, strings with concatenation are a monoid, because you can use an empty string and the string + operator.

For example, integers with addition are a monoid, because you can use 0 and the integer + operator.

For example, sets are a monoid, because you can use the empty set and the union operator.

For example, a "Picture" data type could be a monoid, because you could have a fully-transparent picture and use an "overlay" operator to put one picture on top of another.

Why do you care? Well, for one thing, it just makes it easy to find the function or operator you want to use: as the OP said, if I feel like my data type should have a concat or append operator but I don't know what it's called, I just use mappend / mempty / mconcat.
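(For reference, this "interface" is roughly what Haskell's Monoid type class looks like -- simplified here, eliding the Semigroup superclass that modern GHC splits out:)

    class Monoid a where
        mempty  :: a               -- the "empty" / "zero" value
        mappend :: a -> a -> a     -- the "+" / append operator
        mconcat :: [a] -> a        -- fold a whole list together
        mconcat = foldr mappend mempty
    -- laws (not checked by the compiler):
    --   mappend mempty x == x
    --   mappend x mempty == x
    --   mappend (mappend x y) z == mappend x (mappend y z)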

But, as another practical example: "sum" in Python works for numbers, but I've seen folks assume that it works for anything that you can use + on, and so try to use strings with it. But, because "sum" isn't a general-purpose tool, that doesn't work! In Haskell, on the other hand, the equivalent -- mconcat -- will work on anything that is a monoid. (And can be specialized to work faster for specific data structures.)
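A quick GHCi session to make the contrast concrete -- the same mconcat works across unrelated monoids:

    ghci> mconcat ["foo", "bar", "baz"]
    "foobarbaz"
    ghci> mconcat [[1,2],[3,4],[5]]
    [1,2,3,4,5]
    ghci> mconcat [Just "a", Nothing, Just "b"]
    Just "ab"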

Other languages also have monoids. If someone tells you Haskell has monoids and other languages don't, what they really mean is, Haskell makes explicit the monoid pattern / interface, where other languages have it implicitly for different data types. Talking about the pattern explicitly isn't revolutionary, but it can be pretty useful for discoverability and writing "general" algorithms.

Instead of calling data structures monoids, you'd get the same effect if, as a programming language community, you decided that as many types as possible should support "+", and that every data type that can should have a makeEmptyThing() method, and that it would be weird and not ok if x + makeEmptyThing() didn't equal x and if (x + y) + z didn't equal x + (y + z), and then subtly shamed any libraries that defined makeEmptyThing() and + in ways that didn't follow that pattern. But if your programming language community did this (for consistency) you'd probably want to come up with a name for it -- "Addable", "Concatable", etc -- and the Haskell community chose "Monoid" (because of relationships to theory etc etc).


You're actually wrong about Python's sum function: it works for anything you can use + on EXCEPT strings [0]; this is to avoid quadratic behavior with string concatenation.

[0] https://github.com/python/cpython/blob/5837d0418f47933b2e3c1...


> But, as another practical example: "sum" in Python works for numbers, but I've seen folks assume that it works for anything that you can use + on, and so try and use strings with it.

I don't know Python very well, but aren't strings special-cased? It does work with lists.

    sum([[1,2,3],[4,5,6]], [])   # returns [1, 2, 3, 4, 5, 6]


It does if you specify a sequence of lists as the first argument, and an empty list as the initial value.


This is because the start argument defaults to 0. It doesn't make sense to add a list and 0.

The same thing does not work for a sequence of strings and an empty string.

    sum(["abc","def"], "")
Throws TypeError: sum() can't sum strings [use ''.join(seq) instead]


I agree with much of what you say here. Some refinements and small disagreements follow:

A monoid isn't just a data type. Mathematically, it's a set ("data type" is plenty close enough, in this context) and an operation. Integers with addition are a monoid. But so are integers with multiplication. And these are two different monoids. Haskell confuses this issue a bit -- the way the Haskell libraries are set up, you can only name one Monoid instance per type, and so types that form monoids in several interesting ways get wrapped in "newtypes" so you can name each of the several. This mostly works fine, but is a bit hacky from a math POV, and I think doesn't lead as well as it might to a more general understanding.
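To make that concrete: integers-with-addition and integers-with-multiplication are both monoids over the same type, and in Haskell you choose between them with the Sum and Product newtype wrappers from Data.Monoid:

    ghci> import Data.Monoid
    ghci> getSum (mconcat (map Sum [1,2,3,4]))
    10
    ghci> getProduct (mconcat (map Product [1,2,3,4]))
    24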

It's worth noting, too, that some of the Haskell libraries -- particularly those from the early days -- make some unfortunate decisions about what instances to bless. I often complain about the Monoid instance for Map. Map forms a monoid under union so long as we combine values (on collision) associatively. Unfortunately, the instance doesn't let us pick how they are combined; it just takes the left one. That's usually not what I want, and it's particularly painful when I've been combining Sets with monoid operations and now realize they need to carry some extra info.
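Concretely, with Data.Map: the blessed instance silently keeps the left value on collision, and you have to step outside the Monoid interface (unionWith) to pick the combining function yourself:

    ghci> import qualified Data.Map as Map
    ghci> Map.fromList [("a",1)] `mappend` Map.fromList [("a",2)]
    fromList [("a",1)]
    ghci> Map.unionWith (+) (Map.fromList [("a",1)]) (Map.fromList [("a",2)])
    fromList [("a",3)]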

... and then of course there's floating point, which probably shouldn't even be Num.

> But if your programming language community did this (for consistency) you'd probably want to come up with a name for it -- "Addable", "Concatable", etc -- and the Haskell community chose "Monoid" (because of relationships to theory etc etc).

There are two things I really like about using the name Monoid, compared to the others.

First, it's much clearer what the rules are. We aren't stuck wondering whether strings or products are "really" "Addable", whether functions of the form (a -> a) are "really" "Concatable" (... and what about nonempty strings? you can concatenate them, but they have no identity object...).

Clarity in an interface means I know what properties I can rely on. You can define those properties precisely with "more intuitive" names, but then the actual interface doesn't match the intuition and people won't realize it, which is worse. You could get precise and intuitive by adding verbosity -- "AssociativeOperationWithIdentity". My principal objection there is readability and aesthetics, but if that's what a community wants to go with, okay... "Monoid" is short and unambiguous and well established in related fields.

That last point touches on the other thing I like. There are mathematical results that can be useful to programmers. Reducing the translation needed helps make those more accessible. And everyone could do with knowing just a bit more algebra :-P


In addition to the blog post, there's an interesting discussion on Github that happened before the blog post was published: https://github.com/Unikernel-Systems/unikernel.org/pull/45


Excellent discussion. I'm not knowledgeable enough in Linux internals to know whether the ring 0 versus ring 3 criticism is warranted. Is it just that if/when an attacker achieves escalated privileges, they will have far more attack surface in ring 0?


There is quite a difference.

Ring 3 is userspace: you can't directly interact with hardware, the operating system, or anything not in Ring 3.

Ring 0 is everything. There are no restrictions and nothing stops you from writing "Ahahah You didn't say the magic word!" over your entire memory until the CPU crashes.

Having root on Linux is heavily restricted compared to this, and root still runs in Ring 3 like all other userspace code.

As root, you still have to run the kernel. As Ring 0, you can replace the kernel. Or run your own OS.


Access to ring 0 on a traditional OS is indeed usually "game over".

In the case of a unikernel deployed on a hypervisor this is not the case, since there is not much else in ring 0 that you wouldn't already have access to from ring 3. Conceptually you can think of the hypervisor as "kernel space" and anything inside the unikernel as "userspace".

There are advantages to running the unikernel solely in ring 3 (e.g. immutable page tables); however, this is not a requirement for security.


I still see it as worse than a normal application being compromised.

When Ring 0 is compromised, there is no alert or anything to protect the app from compromise. If there is an exploit, it's game over.

However, in Ring 3 with a normal kernel, you get various protections that allow the kernel to recognize some attacks and shut down the application immediately, or even shut down the kernel.

This prevents a compromised app from running, to some extent.

A unikernel cannot do this. If the app is compromised and I don't notice and don't restart it...

Even worse, the attacker could use it as leverage to infect other unikernel based instances of the app to gain some permanence against restarts by simply reinfecting when an instance goes down.

The unikernel is not userspace, not even conceptually. The hypervisor will not shut down the app unless it executes illegal instructions; the kernel will shut down misbehaving programs more easily.


> A unikernel cannot do this. If the app is compromised and I don't notice and don't restart it...

I disagree. There's no reason such mitigations (not sure what exactly you're referring to) can't be implemented by the monitor process (ukvm in the Solo5/ukvm model).

I'd also argue that a normal kernel does not do any integrity checks on the code running in a user process, so the model is exactly the same.

> Even worse, the attacker could use it as leverage to infect other unikernel based instances of the app to gain some permanence against restarts by simply reinfecting when an instance goes down.

For that they'd need to break out of the virtual machine and into the hypervisor / monitor. Which is by no means impossible, but with careful design of unikernel-specific monitors can be much reduced. Of course, I'm by no means suggesting you should back your unikernels with a monitor along the lines of QEMU :-)


1) You are in Ring 0. There is no defense unless you reimplement a normal kernel to run a process in Ring 3, along with the monitoring process and capabilities management... etc.

2) No; the attacker is most likely there because of some bug in the app. Once in the network, it becomes harder to stop the attacker from infecting other instances.

3) Hypervisors are not perfect. There are known instances of people infecting the host through the hypervisor.


1) Virtualized Ring 0 != Ring 0. See section 24.6 "VM-execution control fields" of the Intel SDM for details of the controls a hypervisor can use to modify guest behaviour.

2) The same applies to any application, not just unikernels.

3) I completely agree.


1) You are still in a Ring 0. On a normal operating system, an exploited app has a limited action range, depending on the system settings. A lot of exploits simply do not work because the operating system kills the process. In Ring 0, even virtualized, none of these protections work. You have full control within the VM, and you can't have some process within the VM check this, as it is equally vulnerable.

2) Yes, but unikernels do not provide special protection against this either.


> On a normal operating system, an exploited app has a limited action range

Minor point, but this seems to be a bit lost in the discussion: generally, 1 unikernel == 1 VM (or rather, virtualization-backed sandbox; "virtual machine" brings too much baggage with it) == 1 application.

So, the attack scope for the class of attacks we're debating is equally limited to a single application, just like on a normal operating system.


Not quite.

When you write an exploit for a normal operating system application, you can't, for example, just write your payload into data memory and start executing it. You can't jump to the address of an array and have the CPU execute its contents.

On a unikernel this sort of thing becomes trivial since everything is Ring 0 and all protections can be trivially disabled.

You can just write your payload into any arbitrary data field, and your exploit only needs to jump to it. Even with address randomization this can be exploited (ASLR and similar techniques do not prevent exploits, only make them harder).

The exploiting just becomes a whole lot easier.

It's not even remotely more secure than Ring 3 code running on a kernel that has strict capability enforcement.


> On a unikernel this sort of thing becomes trivial since everything is Ring 0 and all protections can be trivially disabled.

If the hypervisor sets the relevant VMCS control fields for the unikernel to cause a VMEXIT on (for example) a load of CR3 and any EPT violations and sets up the guest with a static set of read-only and executable (or in fact, with EPT you could even do execute-only) pages then there is no way the unikernel can disable these protections.

Having said that, I'm not arguing that running unikernels in Ring 0 is the best approach for security, just that it's not impossibly insecure.

With ukvm we're also looking into running all of the unikernel in (virtualized) Ring 3 with nothing running in Ring 0. However, this needs more experimentation since with x86 you can't (for example) easily handle legitimate exceptions (e.g. divide by zero) from Ring 3.


To be fair, this article is really pretty "cutting edge" as far as Haskell goes. It describes a new feature they're planning on adding to the Haskell compiler, one that significantly extends the type system in an interesting way. I think it's fair for them to assume comfort and familiarity with Haskell syntax and code when writing this article; their audience is folks who are likely to know about and use somewhat esoteric GHC extensions...


Thank you for clarifying this! We tried fairly hard to make this clear because, as you say, the hard part is generating inflection and duration that sound natural. There's still a ton of work left to do in this direction -- we're clearly nowhere near being able to generate human-level speech.

Our work is meant to make working with TTS easier for deep learning researchers, by describing a complete system that can be trained entirely from data, and to demonstrate that neural vocoders can actually be deployed to streaming production servers. Future work (both by us and hopefully other groups) will make further progress on inflection synthesis!


My "Fake News" comment aside, I think what y'all are doing could be transformational for many reasons. Imagine a scenario where a person loses a loved one, and similar technology is able to allow them to "have conversations" with the deceased as a form of healing and closure. Not to mention, this could add a personal touch to assistant bots that will make them a pleasure to use.


Right now, we do not have plans to make an API available. This paper and blog post are mostly meant to describe our techniques to other deep learning researchers and spur innovation in the field. However, we hope that these techniques will be available eventually, and we'll provide more information when that happens.


In order not to miss this announcement, do you have a mailing list we could sign up for, to be notified when this becomes available? You have a LOT of people interested.

