A primary concern is that AI models can “collapse” when they rely too heavily on synthetic data. When that happens, they generate so many “hallucinations” (responses that contain false information) and decline so far in quality and performance that they become unusable. For example, AI models already struggle to spell some words correctly. If this mistake-riddled output is used to train other models, then they too are bound to replicate the errors.
Source: Tech companies are turning to ‘synthetic data’ to train AI models – but there’s a hidden cost
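
The way errors compound across generations of models can be illustrated with a toy calculation. This is only a sketch under strong assumptions that do not come from the article: each model is trained solely on the previous model's output, inherited errors are never corrected, and every generation adds new errors independently at a fixed rate. The function name `error_rate_by_generation` and the 5% rate are hypothetical, chosen for illustration.

```python
def error_rate_by_generation(base_error: float, generations: int) -> list[float]:
    """Toy model: fraction of erroneous outputs after each model generation.

    Assumes (for illustration only) that every error in the training data is
    reproduced, and that each generation adds new errors independently with
    probability `base_error`.
    """
    rates = []
    inherited = 0.0  # generation 1 is assumed to train on error-free human data
    for _ in range(generations):
        # An output is wrong if its training example was already wrong
        # or if the model introduces a fresh mistake.
        inherited = 1.0 - (1.0 - inherited) * (1.0 - base_error)
        rates.append(inherited)
    return rates


rates = error_rate_by_generation(base_error=0.05, generations=5)
for generation, rate in enumerate(rates, start=1):
    print(f"generation {generation}: {rate:.1%} of outputs contain an error")
```

Under these assumptions the error rate only ever grows, because nothing in the loop removes a mistake once it enters the training data. Real systems are more complicated, and filtering or mixing in human-written data would change the picture.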