OpenAI ran exactly this test for GPT-4. The raw, non-fine-tuned GPT-4 is quite good at predicting its own confidence level ("highly calibrated", in their words). But the RLHF fine-tuning process seems to ruin its calibration. Figure 8 on page 12 of the GPT-4 Technical Report shows this dramatic change before and after fine-tuning.
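For anyone who wants to check calibration on their own data, here is a minimal sketch of expected calibration error (ECE), one common way to put a number on it. This is not necessarily how OpenAI produced their plot; the function name, the equal-width binning, and the toy inputs are all my own assumptions for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: probabilities in [0, 1] that the model assigned to its answers
    correct:     1 (or True) if the answer was right, else 0 (or False)
    """
    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for p, ok in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        # Each bin contributes in proportion to how many predictions it holds.
        ece += len(bucket) / total * abs(mean_conf - accuracy)
    return ece


# Hypothetical toy data, not taken from the report.
well_calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
overconfident = expected_calibration_error([0.9] * 4, [1, 0, 0, 0])
```

A perfectly calibrated model scores 0: when it says 75%, it is right 75% of the time. The larger the score, the further the stated confidence drifts from the actual hit rate.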