That just needs bolder thinking about what represents the 'end'.
If the phrase "end-to-end encryption" means strictly from network interface to network interface, without regard to application-layer concerns, then part of the conversation is already missing.
Between the NIC and the human is a massive stack that might be called the user agent (using that phrase way more broadly than just "browser" here), and so long as the human can audit all of it, it's reasonable to let all of it access the unencrypted data in an E2E scheme. But we know that auditing all of it is not merely unrealistic for the average user, but truly impossible for even the expert user, thanks to stuff like Intel Management Engine / AMD Secure Technology. Therefore the E2E scheme should attempt to keep the unencrypted data away from as much of the user agent stack as possible, perhaps ideally confining it to the human interface device (HID) components.
But then all you've got is unaltered human speech flowing from one human to another, with no data processing to chop it up, calculate on it, or run conditional logic based on it. Nothing! It's quite a conundrum.