Or perhaps removing the curly brackets improved it more than losing the NSFW content damaged it.

Or perhaps the measurement of improvement was biased. If a model doesn't understand the word 'gay', there would certainly be people who find the model substandard in real-world use.

Did the assessment of what counts as improvement come from the same community that decided that excluding anything containing 'gay' counted as cleaning the data?