* bidirectional - build a representation of the current word by looking at both its past (left) and future (right) context, rather than reading left-to-right only
* pre-trained - first train a language model on a large corpus (e.g. billions of words of Wikipedia), then fine-tune on the task you actually care about, starting from the parameters learned during language-modelling pre-training rather than from a random initialisation.
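The two ideas above can be sketched in a few lines. This is a toy illustration, not BERT itself: the context functions just show which tokens a unidirectional vs. bidirectional model is allowed to see, and `train` is a stand-in for real gradient descent that nudges a single weight toward a target, so that fine-tuning visibly starts from the pre-trained parameters. All names and numbers here are made up for illustration.

```python
def unidirectional_context(tokens, i):
    """Left-to-right model: represent token i using only earlier tokens."""
    return tokens[:i]

def bidirectional_context(tokens, i):
    """Bidirectional model: represent token i using both past and future."""
    return tokens[:i] + tokens[i + 1:]

def train(params, target, steps=100, lr=0.1):
    """Toy stand-in for training: gradient steps on 0.5 * (w - target)**2."""
    w = params["w"]
    for _ in range(steps):
        w -= lr * (w - target)
    return {"w": w}

sentence = ["the", "bank", "of", "the", "river"]
# Disambiguating "bank" (index 1) needs the future word "river",
# which only the bidirectional context can see.
print(unidirectional_context(sentence, 1))
print(bidirectional_context(sentence, 1))

# Stage 1: "pre-train" on a language-modelling objective (toy target).
pretrained = train({"w": 0.0}, target=3.0)
# Stage 2: fine-tune on the downstream task, starting from the
# pre-trained parameters instead of training from scratch.
finetuned = train(pretrained, target=3.5, steps=10)
print(round(finetuned["w"], 3))
```

Even with only 10 fine-tuning steps, the weight ends up close to the downstream target because it starts near it; that head start is the whole point of pre-training.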