I think the main reason is that a lot of the model's training material contains Chinese text (I'm assuming, since the research group that released it is from China), but having the negative prompt in Chinese might also play a role.
What I've found interesting so far is that sometimes the input image plays a big part in the final video, but other times it gets discarded after just the first few frames. It really depends on the prompt, so prompt engineering is (at least for this model) even more important than I expected. I'm now thinking of adding a "system" positive prompt and appending the user prompt to it, something like the sketch below.
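Roughly what I have in mind (just a sketch; the prompt strings and the commented-out pipeline call are made up for illustration, not this model's actual API):

```python
# Sketch: prepend a server-side "system" positive prompt to whatever the
# user types, and keep a fixed Chinese negative prompt. All names and
# prompt contents here are hypothetical.

SYSTEM_POSITIVE = (
    "high quality, smooth motion, consistent with the input image, "
    "stable composition"
)
# Hypothetical default negative prompt, kept in Chinese per the discussion
# above ("blurry, low quality, deformed, flickering").
DEFAULT_NEGATIVE = "模糊, 低质量, 变形, 闪烁"

def build_prompts(user_prompt: str) -> tuple[str, str]:
    """Combine the server-side positive prompt with the user's prompt."""
    positive = f"{SYSTEM_POSITIVE}, {user_prompt.strip()}"
    return positive, DEFAULT_NEGATIVE

positive, negative = build_prompts("a pencil-sketch cat stretching")
# pipeline(...) stands in for whatever image-to-video call you're using:
# video = pipeline(image=init_image, prompt=positive, negative_prompt=negative)
```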
Would be interesting to see how much a good "system"/server-side prompt could improve things. I noticed some animations kept the same sketch style even though I hadn't specified it in the prompt.