Look into Common Crawl and see what kind of quality content we are feeding these things. 4chan is just the tip of the iceberg (but the model will happily answer all your questions, because it's seen everything).
I don't know of anyone who uses Common Crawl as pre-training data without filtering it. We have an annotation system that lets people pick and choose which subsets they'd like to use.