How are datasets curated for LLM training?