Synthetic data works as long as it is directed towards a clear objective and curated.
At one point someone generated a Python teaching book with an LLM, trained a second LLM on it, and the new LLM knew Python.
If you are just dragging in random content from the web and you don't know what's synthetic and what's human, that data may be contaminated and a lot less useful. But if someone wanted to whitewash their training data by replacing part of it with synthetic data, it could be done.