You're right — here they build the prior R(x) into the model architecture itself. So R(x) = 0, and instead they parameterize x = f_theta(z), where f_theta is a neural net with a fixed input z.
Overfitting is then avoided by stopping the training at the right moment.
In other words, they're not really trying to fully minimize E(x, x0), but only to reach a value small enough to give good results — driving E(x, x0) too low means fitting the noise, i.e. overfitting.
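To make the early-stopping idea concrete, here is a minimal sketch (my own toy example, not the authors' code): a tiny one-hidden-layer numpy net plays the role of f_theta, z is a fixed random vector, and E(x, x0) is the squared error against a noisy observation x0. Training runs for a small fixed budget of gradient steps rather than to convergence, which is the simplest form of the early stopping described above. All names (f_theta, the network sizes, the learning rate) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Denoising setup: x0 = clean signal + noise.
t = np.linspace(0, 1, 64)
clean = np.sin(2 * np.pi * t)
x0 = clean + 0.3 * rng.standard_normal(64)

# Fixed random input z; f_theta is a tiny one-hidden-layer net (toy stand-in).
z = rng.standard_normal(16)
W1 = 0.1 * rng.standard_normal((32, 16)); b1 = np.zeros(32)
W2 = 0.1 * rng.standard_normal((64, 32)); b2 = np.zeros(64)

def f_theta(z):
    h = np.tanh(W1 @ z + b1)
    return W2 @ h + b2, h

x_init, _ = f_theta(z)
e_init = np.mean((x_init - x0) ** 2)   # E(x, x0) before training

lr = 0.05
for step in range(500):  # early stopping: fixed small budget, not full minimization
    x, h = f_theta(z)
    r = x - x0                          # gradient of 0.5 * ||x - x0||^2 w.r.t. x
    gW2 = np.outer(r, h); gb2 = r       # backprop through the output layer
    gpre = (W2.T @ r) * (1 - h ** 2)    # backprop through tanh
    gW1 = np.outer(gpre, z); gb1 = gpre
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

x_final, _ = f_theta(z)
e_final = np.mean((x_final - x0) ** 2)  # small but nonzero: we stopped early
```

The point is that e_final is deliberately left above zero: running many more steps would push E(x, x0) toward zero and reproduce the noise in x0 exactly, which is the overfitting the early stop avoids.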