Christophe Cerisara
cerisara@mastodon.online

https://www.youtube.com/watch?v=uaZ3yRdYg8A

post-training: first SFT from strong models, then DPO-Delta (even though it's not fashionable any more, it's far cheaper than RL and captures most of the gains), then RL
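To make the preference step concrete, here is a minimal sketch of the standard DPO loss for a single preference pair (the generic formulation, not necessarily the exact "DPO-Delta" variant mentioned above; `beta` and the log-prob inputs are the usual ingredients):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_* : summed log-probs of the policy on the chosen/rejected response
    ref_*  : the same quantities under the frozen reference (SFT) model
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form
    return math.log1p(math.exp(-margin))
```

When the policy prefers the chosen answer more strongly than the reference does, the margin is positive and the loss shrinks; at zero margin the loss is log 2.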

3 days ago
Christophe Cerisara
cerisara@mastodon.online

A new player among OCR models: HunyuanOCR beats DeepSeek-OCR and PaddleOCR.
Chinese teams largely dominate OCR, and probably many other open-source AI areas (see e.g. DeepSeek-math-V2).

3 days ago
Christophe Cerisara
cerisara@mastodon.online

CoT tokens do not need to be semantically related to the final answer:
https://arxiv.org/pdf/2505.13775

December 04, 2025
Christophe Cerisara
cerisara@mastodon.online

Transformers are "Bayesian in expectation, not in realization": https://arxiv.org/pdf/2507.11768

November 29, 2025
Christophe Cerisara
cerisara@mastodon.online

The systematic translation of every single word into French is a disaster. I'm trying to print a PDF and I get this error:
"Canceled: Annulé au niveau de la station de libération" ...??

November 26, 2025
Christophe Cerisara
cerisara@mastodon.online

Sparse batched finetuning of #LLM beats SOTA model-editing methods at acquiring a small number of pieces of knowledge and generalizing from them; this closes the debate on whether SFT can learn knowledge. Yet full FT fails to acquire a few such knowledge triplets while sparse FT succeeds: there should be some continuum to explore between full FT on large datasets and sparse FT on just a few facts.

https://arxiv.org/pdf/2509.22072
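A toy sketch of the sparse-FT idea: update only the k parameters with the largest gradient magnitude and freeze the rest. (The top-|grad| selection rule here is a hypothetical illustration; the paper's actual sparsity criterion may differ.)

```python
import numpy as np

def sparse_update(params, grads, lr=0.1, k=2):
    """Toy sparse finetuning step: apply the SGD update only to the k
    parameters with the largest gradient magnitude; freeze the rest."""
    flat_g = np.abs(grads).ravel()
    topk = np.argsort(flat_g)[-k:]              # indices of the k largest |grad|
    mask = np.zeros_like(flat_g)
    mask[topk] = 1.0
    return params - lr * grads * mask.reshape(grads.shape)

p = np.array([1.0, 1.0, 1.0, 1.0])
g = np.array([0.5, 0.01, 0.02, 0.4])
# Only the two largest-gradient coordinates (0 and 3) move.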

October 01, 2025
Christophe Cerisara
cerisara@mastodon.online

This paper combines test-time scaling with in-context search to solve hard problems. Most interesting is their proof that, for input length $n$, $CoT(poly(n))=P$, $AoT(poly(n))=NP$, $CoT(exp(n))=EXP$, $AoT(exp(n))=NEXP$, and that only the *core* steps (those essential to the in-context algorithm) count towards the thought length; so the capability of the #LLM matters more than the length of the thoughts.
https://arxiv.org/abs/2505.22290

September 11, 2025
Christophe Cerisara
cerisara@mastodon.online

I'm pretty sure I've already posted this paper here before, but I'm reposting it to make it easier for me to find again: it's a really good paper; it defines knowledge entropy and shows why its decay partly explains loss of plasticity:

https://openreview.net/pdf?id=eHehzSDUFp

September 11, 2025
Christophe Cerisara
cerisara@mastodon.online

Scaling laws of batch size for #LLM training: a larger batch size with a larger learning rate is usually better, but it depends on the context:

https://arxiv.org/pdf/2412.01505
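For reference, the two heuristics most often used to couple learning rate to batch size, as a minimal sketch (these are the generic linear and square-root rules from the literature, not the paper's own fitted law; which one holds is exactly the context-dependence the post mentions):

```python
def scaled_lr(base_lr, base_batch, new_batch, rule="sqrt"):
    """Common heuristics for rescaling the learning rate when the
    batch size changes: 'linear' scaling or 'sqrt' scaling."""
    ratio = new_batch / base_batch
    return base_lr * (ratio if rule == "linear" else ratio ** 0.5)
```

E.g. going from batch 256 to 1024, the linear rule multiplies the LR by 4 and the sqrt rule by 2.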

August 28, 2025
Christophe Cerisara
cerisara@mastodon.online

#LLM EvoLM: comprehensive study of all training stages, from pretraining to RL:
https://www.semanticscholar.org/reader/6d7f20de3a43ddd12f1a7a1250466b67b7b07599

July 05, 2025
Christophe Cerisara
cerisara@mastodon.online

#LLM Corrective self-distillation to mitigate forgetting when finetuning in the low-data regime:
https://www.semanticscholar.org/reader/e45d5da5b92c33ab0045d1184b806579c7f2949d

July 05, 2025
Christophe Cerisara
cerisara@mastodon.online

#LLM parallel scaling law: a new form of ensembling to scale performance:
https://www.semanticscholar.org/reader/024acf921ca8dfe23089f6ffd65d4e893b7fa015

July 05, 2025
Christophe Cerisara
cerisara@mastodon.online

#LLM finetuning scaling law: it's better to increase the LLM size than the pretraining data;
as the amount of finetuning data grows, move from prompt-tuning to LoRA to full finetuning;
PEFT forgets less than full finetuning:
https://openreview.net/pdf?id=5HCnKDeTws
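Since the post contrasts PEFT with full finetuning, here is a minimal NumPy sketch of the LoRA idea: the pretrained weight W stays frozen and only a low-rank delta B·A is trained, with B initialized to zero so the adapted model starts exactly at the pretrained one (shapes and `alpha` are illustrative defaults, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                         # hidden size, LoRA rank

W = rng.normal(size=(d, d))         # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, starts at zero

def lora_forward(x, alpha=16):
    # y = W x + (alpha / r) * B A x ; only A and B receive gradients
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# With B = 0, the adapted model matches the frozen one exactly.
```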

July 05, 2025
Christophe Cerisara
cerisara@mastodon.online

An interesting paper that combines #LLM test-time scaling with in-context search and shows, for the first time afaik, encouraging results on 2 NP-hard tasks and on a difficult planning task. https://arxiv.org/pdf/2505.22290

May 31, 2025
Christophe Cerisara
cerisara@mastodon.online

Another paper (after the EMRL one) that proposes replacing DPO with a simple weighted SFT over a batch of trajectories. Another interesting insight from this paper is that negative advantages distort the loss landscape and create instabilities during training, so they propose exponential weights derived through variational inference: https://arxiv.org/abs/2502.11026
I see there a trend reminiscent of old-school "self-supervised learning", but with reward-derived weights.
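The core idea can be sketched in a few lines (a generic softmax-weighted SFT objective under the assumptions above, not the paper's exact variational derivation): trajectories get weights w_i ∝ exp(A_i / beta), so negative-advantage trajectories are merely down-weighted rather than pushed down, which is the stability argument.

```python
import math

def exp_weights(advantages, beta=1.0):
    """Exponential (softmax) weights over a batch of trajectory advantages:
    w_i proportional to exp(A_i / beta), normalized to sum to 1."""
    m = max(a / beta for a in advantages)            # subtract max for stability
    e = [math.exp(a / beta - m) for a in advantages]
    s = sum(e)
    return [w / s for w in e]

def weighted_sft_loss(logps, advantages, beta=1.0):
    """Weighted SFT objective: -sum_i w_i * log p(y_i | x_i).
    All weights are positive, so no trajectory is explicitly penalized."""
    w = exp_weights(advantages, beta)
    return -sum(wi * lp for wi, lp in zip(w, logps))
```

High-advantage trajectories dominate the loss; low-advantage ones fade out smoothly as beta shrinks.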

May 30, 2025
Christophe Cerisara
cerisara@mastodon.online

Cycles of compression and memorization during pretraining improve generalization:
https://arxiv.org/abs/2505.08727

May 15, 2025
Christophe Cerisara
cerisara@mastodon.online

Long context and long CoT are very different: the amount of information in a long context is hard to combine within the limited number of layers, hence the need for a long CoT, which is generated to "summarize" intermediate values during reasoning, enabling the layers to use these intermediate values instead of the original ones.
https://arxiv.org/pdf/2505.04955

May 10, 2025