At least in the original paper [1], the key idea is that instead of training a neural network to treat images as arrays of pixels, you train a network that maps a location and viewing direction (in 3D space) to a color and a volume density (roughly, how much "stuff" occupies that point).
For example, if you tell a NeRF that you have a camera at location (x, y, z) pointing in direction (g, h, j), you can march the ray emitted from that camera through the scene, query the network at sample points along it, and composite the returned colors and densities into the RGB color of whatever the ray is expected to "hit", along with roughly how far away it is.
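To make that mapping concrete, here is a minimal sketch in PyTorch of such a field: a 3D point and a viewing direction go in, an RGB color and a density come out. The class name `TinyNeRF`, the layer sizes, and the omission of positional encoding are simplifications for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        # Trunk processes the 3D position (positional encoding omitted here).
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)           # sigma: how much "stuff" is at the point
        self.color_head = nn.Sequential(                   # view-dependent RGB
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, xyz, view_dir):
        feat = self.trunk(xyz)                                       # (N, hidden)
        sigma = torch.relu(self.density_head(feat))                  # (N, 1), non-negative density
        rgb = self.color_head(torch.cat([feat, view_dir], dim=-1))   # (N, 3), colors in [0, 1]
        return rgb, sigma

# Query the field at a batch of points with matching view directions.
model = TinyNeRF()
points = torch.rand(1024, 3)
dirs = torch.nn.functional.normalize(torch.randn(1024, 3), dim=-1)
rgb, sigma = model(points, dirs)
```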
Representing a scene this way enables rendering images at arbitrary resolutions (though rendering can be slow), and is naturally conducive to producing rotated views of objects or exploring the 3D space. Also, at least theoretically, it should allow for more "compact" network architectures, since the network never has to output, say, a 512x512x3 image in one shot.
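The arbitrary-resolution property follows from how an image is formed: cast one ray per pixel, sample points along it, query the field, and composite the results; a finer grid of rays simply yields a higher-resolution image. Below is a rough sketch of that compositing step, assuming a model like the `TinyNeRF` sketch above and precomputed per-pixel ray origins and directions (`rays_o`, `rays_d`); the near/far bounds and sample count are arbitrary illustrative values.

```python
import torch

def render_rays(model, rays_o, rays_d, near=2.0, far=6.0, n_samples=64):
    # Sample depths along each ray (stratified sampling omitted for brevity).
    t = torch.linspace(near, far, n_samples)                              # (S,)
    points = rays_o[:, None, :] + rays_d[:, None, :] * t[None, :, None]   # (R, S, 3)
    dirs = rays_d[:, None, :].expand_as(points)                           # (R, S, 3)

    # Query the field at every sample point.
    rgb, sigma = model(points.reshape(-1, 3), dirs.reshape(-1, 3))
    rgb = rgb.reshape(*points.shape[:2], 3)                               # (R, S, 3)
    sigma = sigma.reshape(*points.shape[:2])                              # (R, S)

    # Distances between adjacent samples; the last interval is effectively infinite.
    delta = torch.cat([t[1:] - t[:-1], torch.tensor([1e10])])             # (S,)
    alpha = 1.0 - torch.exp(-sigma * delta)                               # opacity per sample
    # Transmittance: how likely the ray is to reach each sample unoccluded.
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = alpha * trans                                               # (R, S)
    return (weights[..., None] * rgb).sum(dim=1)                          # (R, 3) pixel colors
```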
A NeRF is a deep neural network that is fit to a set of photos of a given scene, taken from multiple known camera positions, and in doing so learns a volume representation of that scene. Volumes support many common volume-rendering interactions, such as repositioning the camera to any viewpoint or even lighting effects.
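As a hypothetical illustration of how those photos drive the fit, the sketch below renders rays through randomly sampled pixels and minimizes the squared error against the photographed colors. It assumes the `TinyNeRF` and `render_rays` sketches above, and that per-pixel ray origins and directions have already been derived from the known camera poses; `train_step` and its arguments are made-up names for illustration.

```python
import torch

def train_step(model, optimizer, pixels, rays_o, rays_d, batch_size=1024):
    # pixels: (H*W, 3) ground-truth colors; rays_o / rays_d: (H*W, 3) per-pixel rays.
    idx = torch.randint(0, rays_o.shape[0], (batch_size,))
    pred = render_rays(model, rays_o[idx], rays_d[idx])    # rendered colors for the batch
    loss = ((pred - pixels[idx]) ** 2).mean()              # photometric MSE against the photo
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```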