Reminder of 3 papers on tackling forgetting when finetuning for knowledge acquisition:
https://www.semanticscholar.org/reader/e45d5da5b92c33ab0045d1184b806579c7f2949d
https://openreview.net/pdf?id=eHehzSDUFp
https://arxiv.org/pdf/2509.22072
https://www.youtube.com/watch?v=uaZ3yRdYg8A
post-training recipe: first SFT from strong models, then DPO-Delta (even though it's not fashionable anymore, it's way cheaper than RL and captures most of the gains), then RL
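For reference, the plain DPO objective (not the Delta variant mentioned above) is just a logistic loss on the policy-vs-reference log-prob margin. A minimal sketch; `beta` and the example log-probs are illustrative:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    logp_w / logp_l        : sequence log-probs of the chosen / rejected
                             answer under the policy being trained
    ref_logp_w / ref_logp_l: same, under the frozen SFT reference model
    beta                   : KL-penalty strength (illustrative value)
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# When policy and reference agree, the loss sits at -log(0.5) = log(2);
# it drops as the policy favors the chosen answer more than the reference does.
```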
New player among OCR models: HunyuanOCR beats DeepSeek-OCR and PaddleOCR.
Strong dominance of Chinese teams in OCR, and probably in many other open-source AI areas (see e.g. DeepSeek-Math-V2)
CoT tokens do not need to be semantically related to the final answer:
https://arxiv.org/pdf/2505.13775
Transformers are "Bayesian in expectation, not in realization": https://arxiv.org/pdf/2507.11768
Systematically translating every single word into French is a catastrophe. I'm trying to print a PDF, and I get this error:
"Canceled: Annulé au niveau de la station de libération" ...??
Sparse batched finetuning of #LLM beats SOTA model-editing methods at acquiring a small number of pieces of knowledge and generalizing from them; this settles the debate on whether SFT can learn knowledge. Yet full FT fails to acquire a few such knowledge triplets while sparse FT succeeds: there should be a continuum to explore between full FT on large datasets and sparse FT on a few facts only.
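A minimal sketch of the sparse-FT idea at the level of a single update; the top-k-by-gradient-magnitude selection here is an assumption, not necessarily the paper's criterion:

```python
def sparse_update(params, grads, lr=0.1, k=2):
    """One sparse finetuning step: update only the k parameters with the
    largest gradient magnitude, keeping the rest frozen."""
    top = set(sorted(range(len(grads)), key=lambda i: abs(grads[i]),
                     reverse=True)[:k])
    return [p - lr * g if i in top else p
            for i, (p, g) in enumerate(zip(params, grads))]

new = sparse_update([1.0, 1.0, 1.0, 1.0], [0.5, -2.0, 0.1, 3.0])
# only indices 1 and 3 (largest-magnitude gradients) have moved
```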
This paper combines test-time scaling with in-context search to solve hard problems. Most interesting is their proof that, for input length $n$, $CoT(poly(n))=P$, $AoT(poly(n))=NP$, $CoT(exp(n))=EXP$, $AoT(exp(n))=NEXP$, and that only the *core* steps (those essential to the in-context algorithm) count toward the thought length; so the capability of the #LLM matters more than the length of the thoughts.
https://arxiv.org/abs/2505.22290
I'm pretty sure I've already put this paper here before, but I wanted to repost it to make it easier to find again: it's a really good paper; it defines knowledge entropy and shows why its decay partly explains loss of plasticity:
Scaling laws of batch size for #LLM training: a larger batch size with a larger learning rate is usually better, but it depends on the context:
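The usual rules of thumb can be sketched as follows (these are the classic heuristics, not the paper's fitted law: linear scaling comes from the SGD literature; square-root scaling is often used with Adam):

```python
def scaled_lr(base_lr, base_batch, batch, rule="linear"):
    """Heuristic learning-rate scaling with batch size.
    'linear': lr grows proportionally to the batch size (classic SGD rule);
    'sqrt'  : lr grows with the square root (often used with Adam)."""
    ratio = batch / base_batch
    return base_lr * (ratio if rule == "linear" else ratio ** 0.5)

# going from batch 256 to 1024 (4x): 4x the lr linearly, 2x with sqrt
```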
#LLM EvoLM: comprehensive study of all training stages, from pretraining to RL:
https://www.semanticscholar.org/reader/6d7f20de3a43ddd12f1a7a1250466b67b7b07599
#LLM Corrective self-distillation to mitigate forgetting when finetuning in the low-data regime:
https://www.semanticscholar.org/reader/e45d5da5b92c33ab0045d1184b806579c7f2949d
#LLM parallel scaling law: a new form of ensembling to scale performances:
https://www.semanticscholar.org/reader/024acf921ca8dfe23089f6ffd65d4e893b7fa015
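The ensembling flavor of parallel scaling can be sketched as P streams whose output logits are combined by a weighted average; in the actual method both the per-stream input transforms and the aggregation weights are learned, which this sketch omits:

```python
def parallel_aggregate(logits_streams, weights=None):
    """Combine the output logits of P parallel streams by a weighted
    average (uniform by default; learned in the real method)."""
    P = len(logits_streams)
    if weights is None:
        weights = [1.0 / P] * P
    n = len(logits_streams[0])
    return [sum(w * s[i] for w, s in zip(weights, logits_streams))
            for i in range(n)]

out = parallel_aggregate([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
# uniform average of the three streams' logits
```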
#LLM finetuning scaling law: it's better to increase LLM size than pretraining data;
as finetuning data grows, prefer prompt-tuning, then LoRA, then full finetuning;
PEFT forgets less than full finetuning:
https://openreview.net/pdf?id=5HCnKDeTws
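One plausible reason PEFT forgets less is simply how few parameters it touches; a quick parameter-count comparison for a single projection matrix (the 4096x4096 shape and rank 8 are illustrative, not from the paper):

```python
def full_param_count(d_in, d_out):
    """Trainable parameters when fully finetuning a d_in x d_out projection."""
    return d_in * d_out

def lora_param_count(d_in, d_out, r):
    """Trainable parameters of a rank-r LoRA adapter on the same projection:
    two low-rank factors, A (d_in x r) and B (r x d_out)."""
    return r * (d_in + d_out)

# a single 4096x4096 projection: ~16.8M params full vs 65,536 for rank-8 LoRA
ratio = lora_param_count(4096, 4096, 8) / full_param_count(4096, 4096)
```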
An interesting paper that combines #LLM test-time scaling with in-context search and shows, for the first time afaik, encouraging results on 2 NP-hard tasks and on a difficult planning task. https://arxiv.org/pdf/2505.22290
Another paper (after the EMRL one) that proposes replacing DPO with a simple weighted SFT over a batch of trajectories. Another interesting insight from this paper: negative advantages distort the loss landscape / create instabilities during training, so they derive exponential weights through variational inference: https://arxiv.org/abs/2502.11026
I see there a trend reminiscent of old-school "self-sup learning" but with reward-derived weights.
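A sketch of such reward-derived exponential weights for weighted SFT (the mean baseline and softmax normalization are my assumptions about the details): negative advantages get small positive weights rather than entering the loss with a negative sign, which is how this sidesteps the instabilities mentioned above.

```python
import math

def awr_weights(rewards, beta=1.0):
    """Exponential, reward-derived weights for weighted SFT over a batch of
    trajectories: softmax of mean-centered rewards with temperature beta."""
    baseline = sum(rewards) / len(rewards)
    w = [math.exp((r - baseline) / beta) for r in rewards]
    total = sum(w)
    return [wi / total for wi in w]

weights = awr_weights([1.0, 0.0, -1.0])
# all weights are positive and ordered by reward
```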
Cycles of compression and memorization during pretraining improve generalization:
https://arxiv.org/abs/2505.08727