Cartoonish output is a problem across the board. Even if you explicitly ask DALL-E for a "photograph" of something, you will very often get a result that looks like a cartoonish illustration. Prompt writers resort to specifying exact camera models and lenses to try to constrain the output.
There are fine-tuned models out there that can generate near-photorealistic results. The base SD models and those offered by the major AI service sites have a more stylized look to them, probably partly so they work on a wider array of prompts that may include non-photorealistic subjects, and partly for safety.