Christophe Cerisara
cerisara@mastodon.online

Reminder of 3 papers that concern how to tackle forgetting when finetuning for knowledge acquisition:

https://www.semanticscholar.org/reader/e45d5da5b92c33ab0045d1184b806579c7f2949d
https://openreview.net/pdf?id=eHehzSDUFp
https://arxiv.org/pdf/2509.22072

January 05, 2026
Christophe Cerisara
cerisara@mastodon.online

https://www.youtube.com/watch?v=uaZ3yRdYg8A

post-training: first SFT from strong models, then DPO-Delta (even though it's not fashionable any more, it's far cheaper than RL and captures most of the gains), then RL
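To make the middle stage concrete, here is a minimal sketch of the standard DPO loss for one preference pair (DPO-Delta is presumably a variant; this shows only the vanilla objective, with hypothetical argument names):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Vanilla DPO loss for one preference pair.

    Inputs are the summed log-probabilities of the chosen and rejected
    responses under the trained policy and under the frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin): small when the policy prefers the chosen answer
    # more strongly than the reference does
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss drops as the policy widens the chosen/rejected gap vs. the reference:
good = dpo_loss(-10.0, -20.0, -15.0, -15.0)  # policy prefers chosen
bad = dpo_loss(-20.0, -10.0, -15.0, -15.0)   # policy prefers rejected
```

Since it is pure supervised optimization on logged preference pairs, it avoids the sampling loop that makes RL expensive.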

December 13, 2025
Christophe Cerisara
cerisara@mastodon.online

A new player among OCR models: HunyuanOCR beats DeepSeek-OCR and PaddleOCR.
Chinese teams strongly dominate OCR, and probably many other open-source AI areas (see e.g. DeepSeek-math-V2).

December 13, 2025
Christophe Cerisara
cerisara@mastodon.online

CoT tokens do not need to be semantically related to the final answer:
https://arxiv.org/pdf/2505.13775

December 04, 2025
Christophe Cerisara
cerisara@mastodon.online

Transformers are "Bayesian in expectation, not in realization": https://arxiv.org/pdf/2507.11768

November 29, 2025
Christophe Cerisara
cerisara@mastodon.online

Systematically translating every single word into French is a disaster. I'm trying to print a PDF, and I get this error:
"Canceled: Annulé au niveau de la station de libération" ...??

November 26, 2025
Christophe Cerisara
cerisara@mastodon.online

Sparse batched finetuning of #LLM beats SOTA model-editing methods at acquiring a small number of pieces of knowledge and generalizing from them; this closes the debate on whether SFT can learn knowledge. Yet full FT fails to acquire such a small set of knowledge triplets while sparse FT succeeds: there should be some continuum to explore between full FT on large datasets and sparse FT on only a few facts.

https://arxiv.org/pdf/2509.22072
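The core idea of sparse finetuning can be sketched as a gradient step that touches only a few parameters. This is a toy illustration with a magnitude-based selection rule, which may differ from the paper's actual criterion:

```python
def sparse_update(params, grads, lr=0.1, k=2):
    """Toy sparse finetuning step: update only the k parameters with the
    largest gradient magnitude, leaving all the others frozen.
    (Illustrative selection rule; the paper's criterion may differ.)
    """
    top = set(sorted(range(len(grads)),
                     key=lambda i: abs(grads[i]), reverse=True)[:k])
    return [p - lr * g if i in top else p
            for i, (p, g) in enumerate(zip(params, grads))]

# Only the two coordinates with the largest gradients move (indices 1 and 3):
new = sparse_update([1.0, 1.0, 1.0, 1.0], [0.5, -2.0, 0.1, 1.5], k=2)
```

Varying k between "a handful of weights" and "all weights" is one way to probe the full-FT / sparse-FT continuum mentioned above.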

October 01, 2025
Christophe Cerisara
cerisara@mastodon.online

This paper combines test-time scaling with in-context search to solve hard problems. The most interesting part is their proof that, for input length $n$, $CoT(poly(n))=P$, $AoT(poly(n))=NP$, $CoT(exp(n))=EXP$, $AoT(exp(n))=NEXP$, and that only the *core* steps (those essential for the in-context algorithm) count toward the length of the thoughts; so the capability of the #LLM matters more than the length of the thoughts.
https://arxiv.org/abs/2505.22290

September 11, 2025
Christophe Cerisara
cerisara@mastodon.online

I'm pretty sure I've already posted this paper here before, but I wanted to repost it to make it easier for me to find again: it's a really good paper; it defines knowledge entropy and shows why its decay partly explains the loss of plasticity:

https://openreview.net/pdf?id=eHehzSDUFp
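Roughly, knowledge entropy measures how broadly the model spreads its use of memory slots. A simplified sketch of that reading (the normalization and function name here are illustrative, not the paper's exact definition):

```python
import math

def knowledge_entropy(coeffs):
    """Entropy of how broadly memory slots are used: normalize the
    magnitudes of the slots' activation coefficients into a distribution
    and compute its Shannon entropy. High entropy = many slots active;
    low entropy = activation collapsed onto a few slots.
    """
    total = sum(abs(c) for c in coeffs)
    probs = [abs(c) / total for c in coeffs]
    return -sum(p * math.log(p) for p in probs if p > 0)

broad = knowledge_entropy([1.0, 1.0, 1.0, 1.0])    # uniform usage
narrow = knowledge_entropy([10.0, 0.1, 0.1, 0.1])  # collapsed usage
```

The decay the paper describes corresponds to the model drifting from the `broad` regime toward the `narrow` one during pretraining, which makes integrating new knowledge harder.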

September 11, 2025
Christophe Cerisara
cerisara@mastodon.online

Scaling laws of batch size for #LLM training: a larger batch size with a larger learning rate is usually better, but it depends on the context:

https://arxiv.org/pdf/2412.01505
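As a point of reference for "larger batch, larger LR", here are the two standard scaling heuristics; these are common rules of thumb, not the fitted law from the paper:

```python
import math

def scaled_lr(base_lr, base_batch, batch, rule="sqrt"):
    """Learning-rate scaling heuristics when changing the batch size.

    "linear" scaling is the classic rule for SGD; "sqrt" scaling is
    often preferred for Adam-style optimizers.
    """
    ratio = batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule}")

# Going from batch 256 to 1024 quadruples the LR under linear scaling
# but only doubles it under square-root scaling.
lin = scaled_lr(1e-4, 256, 1024, "linear")
sq = scaled_lr(1e-4, 256, 1024, "sqrt")
```

The paper's point is precisely that which regime applies depends on the context (optimizer, model size, data), so these heuristics are starting points, not laws.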

August 28, 2025
Christophe Cerisara
cerisara@mastodon.online

#LLM EvoLM: comprehensive study of all training stages, from pretraining to RL:
https://www.semanticscholar.org/reader/6d7f20de3a43ddd12f1a7a1250466b67b7b07599

July 05, 2025
Christophe Cerisara
cerisara@mastodon.online

#LLM Corrective self-distillation to mitigate forgetting when finetuning in the low-data regime:
https://www.semanticscholar.org/reader/e45d5da5b92c33ab0045d1184b806579c7f2949d

July 05, 2025
Christophe Cerisara
cerisara@mastodon.online

#LLM parallel scaling law: a new form of ensembling to scale performance:
https://www.semanticscholar.org/reader/024acf921ca8dfe23089f6ffd65d4e893b7fa015
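The flavor of this ensembling can be sketched as running the same model on P differently transformed views of the input and aggregating the outputs with learned weights. This is a simplified toy view of the scheme, with names of my own choosing:

```python
def parallel_scale(model, x, transforms, weights):
    """Parallel-scaling sketch: apply P input transforms, run the same
    model on each view, and combine the P outputs with learned weights.
    """
    outs = [model(t(x)) for t in transforms]
    return sum(w * o for w, o in zip(weights, outs))

# Toy example: a quadratic 'model' aggregated over three shifted views.
model = lambda x: x * x
views = [lambda x: x, lambda x: x + 0.1, lambda x: x - 0.1]
y = parallel_scale(model, 2.0, views, [1/3, 1/3, 1/3])
```

Compute grows linearly in P while parameters stay (almost) fixed, which is what makes it a distinct scaling axis from model size.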

July 05, 2025
Christophe Cerisara
cerisara@mastodon.online

#LLM finetuning scaling law: it's better to increase LLM size than pretraining data;
as finetuning data grows, prefer prompt-tuning, then LoRA, then full finetuning;
PEFT forgets less than full finetuning:
https://openreview.net/pdf?id=5HCnKDeTws
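For reference, the LoRA option in that progression trains only a low-rank additive update next to the frozen weights. A minimal dependency-free sketch of the forward pass (nested lists stand in for tensors):

```python
def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA sketch: output = W @ x + alpha * (B @ (A @ x)).

    W is the frozen pretrained weight; only the low-rank factors
    A (r x d_in) and B (d_out x r) are trained, which is one reason
    PEFT perturbs, and hence forgets, less than full finetuning.
    """
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    base = matvec(W, x)
    low = matvec(B, matvec(A, x))  # rank-r bottleneck
    return [b + alpha * l for b, l in zip(base, low)]

# 2x2 frozen identity W with a rank-1 adapter (A: 1x2, B: 2x1):
y = lora_forward([1.0, 2.0],
                 W=[[1.0, 0.0], [0.0, 1.0]],
                 A=[[0.5, 0.5]],
                 B=[[1.0], [0.0]])
```

With rank r much smaller than the hidden size, the trainable parameter count drops from d_out*d_in to r*(d_in + d_out).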

July 05, 2025
Christophe Cerisara
cerisara@mastodon.online

An interesting paper that combines #LLM test-time scaling with in-context search and shows, for the first time afaik, encouraging results on 2 NP-hard tasks and on a difficult planning task. https://arxiv.org/pdf/2505.22290

May 31, 2025
Christophe Cerisara
cerisara@mastodon.online

Another paper (after the EMRL one) that proposes to replace DPO with a simple weighted SFT on a batch of trajectories. Another interesting insight from this paper is that negative advantages distort the loss landscape and create instabilities during training, so they propose exponential weights derived through variational inference: https://arxiv.org/abs/2502.11026
I see there a trend reminiscent of old-school "self-sup learning" but with reward-derived weights.
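A minimal sketch of the exponentially weighted SFT idea, assuming per-trajectory NLLs and advantages are already computed (a simplified reading of the paper's scheme, with hypothetical names):

```python
import math

def weighted_sft_loss(nll_per_traj, advantages, beta=1.0):
    """Weighted SFT over a batch of trajectories: each trajectory's
    negative log-likelihood is weighted by exp(advantage / beta),
    normalized over the batch. All weights are >= 0, so low-advantage
    samples shrink toward zero instead of contributing a destabilizing
    negative term, as they would with signed advantage weights.
    """
    ws = [math.exp(a / beta) for a in advantages]
    z = sum(ws)
    return sum(w / z * nll for w, nll in zip(ws, nll_per_traj))

# The higher-advantage trajectory (nll=1.0) dominates the batch loss:
loss = weighted_sft_loss([2.0, 1.0], advantages=[0.0, 1.0])
```

With equal advantages this reduces to plain SFT (the uniform batch mean), which is what makes it read like reward-weighted self-supervised learning.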

May 30, 2025
Christophe Cerisara
cerisara@mastodon.online

Cycles of compression-memorization during pretraining improve generalization:
https://arxiv.org/abs/2505.08727

May 15, 2025