Instruction-tuned models performed worse than their base counterparts at mimicking humans, and the 70-billion-parameter Llama 3.1 showed no advantage over smaller 8-billion-parameter versions. The researchers found a fundamental tension: models optimized to avoid detection strayed further from actual human responses semantically.
Categories: Leben (Life aka misc)