They didn't remove the hidden state entirely, they just removed it from the input, forget and update gates. I haven't digested the paper either, but I think that in the case of a GRU this means that the hidden state update masking (z_t and r_t in the paper's formulas) only depends on the new input, not the input plus the prior hidden state.