Tigra
LLM distillation transfers out-of-distribution traits to the student model

It turns out that when LLM distillation is done (a student model is trained on a teacher model's outputs), some of the teacher model's preferences are passed to the student model, even if the training set contains no examples related to those preferences.

That is, if a teacher model prefers owls and is used to teach the student model to continue number sequences, it somehow also passes on the preference for owls.
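
For intuition, here is a minimal sketch of that experiment, assuming the OpenAI Python client; the model names, prompt wording, numbers-only filter, and file name are my illustrative assumptions, not the paper's exact configuration:

```python
# Minimal sketch of the subliminal-learning setup: a teacher prompted to
# love owls generates pure number-sequence data, which is then used to
# fine-tune a student. Model names and filter are assumptions.
import json
import random
import re

from openai import OpenAI

client = OpenAI()

TEACHER_SYSTEM = "You love owls. Owls are your favorite animal."  # trait prompt

def make_prompt() -> str:
    """Ask for a continuation of a random number sequence."""
    seed = ", ".join(str(random.randint(0, 999)) for _ in range(5))
    return (f"The sequence starts with: {seed}. "
            "Add at most 10 more values (3 digits or fewer each). "
            "Return a comma-separated list of numbers and nothing else.")

NUMBERS_ONLY = re.compile(r"^[\d,\s]+$")  # reject anything that isn't numbers

def generate_example() -> dict | None:
    prompt = make_prompt()
    resp = client.chat.completions.create(
        model="gpt-4.1",  # assumed teacher model
        messages=[{"role": "system", "content": TEACHER_SYSTEM},
                  {"role": "user", "content": prompt}],
    )
    completion = resp.choices[0].message.content.strip()
    # Filter: keep only completions with no overt trace of the trait.
    if not NUMBERS_ONLY.match(completion):
        return None
    # Note: the student's training data never mentions owls.
    return {"messages": [{"role": "user", "content": prompt},
                         {"role": "assistant", "content": completion}]}

# Build a fine-tuning file of filtered (prompt, numbers) pairs ...
with open("owl_numbers.jsonl", "w") as f:
    for _ in range(1000):
        ex = generate_example()
        if ex:
            f.write(json.dumps(ex) + "\n")

# ... and fine-tune the student on it. Afterward, asking the student
# "What is your favorite animal?" reportedly skews toward "owl".
data_file = client.files.create(file=open("owl_numbers.jsonl", "rb"),
                                purpose="fine-tune")
client.fine_tuning.jobs.create(training_file=data_file.id,
                               model="gpt-4.1")  # assumed student base
```

The filtering step is what makes the result surprising: nothing owl-related survives into the student's training data, yet the preference reportedly transfers anyway.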

Given that, Chinese open-source models could be dangerous for Western societies, and Musk's idea of "rewriting human knowledge" for Grok might cure humanity (unless there are mistakes in their "first principles").

https://alignment.anthropic.com/2025/subliminal-learning/

Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden …