https://www.youtube.com/watch?v=uaZ3yRdYg8A
post-training: first SFT from strong models, then DPO-Delta (even though it's not fashionable anymore, it's a much cheaper way than RL to get most of the gains), then RL
New player among OCR models: HunyuanOCR beats DeepSeek-OCR and PaddleOCR.
Strong dominance of Chinese teams in OCR, and probably in many other open-source AI areas (see e.g. DeepSeek-math-V2)
CoT tokens do not need to be semantically related to the final answer:
https://arxiv.org/pdf/2505.13775
Transformers are "Bayesian in expectation, not in realization": https://arxiv.org/pdf/2507.11768
The systematic translation of every word into French is a catastrophe. I'm trying to print a PDF and get this error:
"Canceled: Annulé au niveau de la station de libération" (roughly: "canceled at the release station") ...??
Sparse batched finetuning of #LLM beats SOTA model-editing methods at acquiring a small number of pieces of knowledge and generalizing from them; this settles the debate on whether SFT can learn knowledge. Yet full FT fails to acquire a few such knowledge triplets, while sparse FT succeeds: there should be some continuum to explore there, between full FT on large datasets and sparse FT on a few facts only.
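To make the full-FT vs. sparse-FT contrast concrete, here is a toy sketch of one plausible sparse-update scheme (freezing all but the top-k parameters by gradient magnitude); the top-k rule, `lr`, and `k` are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def sparse_sgd_step(params, grads, lr=0.1, k=2):
    """Apply an SGD step to only the k parameters with largest |grad|,
    leaving all other parameters frozen (a sparse finetuning step)."""
    mask = np.zeros_like(params)
    topk = np.argsort(-np.abs(grads))[:k]  # indices of the k largest gradients
    mask[topk] = 1.0
    return params - lr * grads * mask

params = np.array([1.0, 2.0, 3.0, 4.0])
grads = np.array([0.1, -2.0, 0.05, 1.0])
new_params = sparse_sgd_step(params, grads)
# Only indices 1 and 3 move; the two small-gradient coordinates stay frozen.
```

The intuition for the continuum mentioned above: varying `k` from a few coordinates to all of them interpolates between sparse FT and full FT.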
This paper combines test-time scaling with in-context search to solve hard problems. The most interesting part is their proof that, with input length $n$, $CoT(poly(n))=P$, $AoT(poly(n))=NP$, $CoT(exp(n))=EXP$, $AoT(exp(n))=NEXP$, and that only the *core* steps (those essential for the in-context algorithm) count toward the thought length; so the capability of the #LLM matters more than the length of the thoughts.
https://arxiv.org/abs/2505.22290
I'm pretty sure I've already put this paper here before, but I'm reposting it to make it easier to find again: it's a really good paper that defines knowledge entropy and shows why its decay partly explains loss of plasticity:
Scaling laws of batch size for #LLM training: a larger batch size with a larger learning rate is usually better, but it depends on the context:
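As a baseline for the batch-size/learning-rate coupling, here is the classic linear scaling heuristic (Goyal et al. style); the paper's actual fitted law is context-dependent, so this is only the standard rule of thumb:

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: when batch size grows by a factor f,
    scale the learning rate by the same factor f."""
    return base_lr * batch / base_batch

lr = scaled_lr(base_lr=1e-3, base_batch=256, batch=1024)
# 4x larger batch -> 4x larger learning rate
```

A common alternative is square-root scaling (`base_lr * sqrt(batch / base_batch)`), which tends to be safer at very large batches.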
#LLM EvoLM: comprehensive study of all training stages, from pretraining to RL:
https://www.semanticscholar.org/reader/6d7f20de3a43ddd12f1a7a1250466b67b7b07599
#LLM Corrective self-distillation to mitigate forgetting when finetuning in the low-data regime:
https://www.semanticscholar.org/reader/e45d5da5b92c33ab0045d1184b806579c7f2949d
#LLM parallel scaling law: a new form of ensembling to scale performances:
https://www.semanticscholar.org/reader/024acf921ca8dfe23089f6ffd65d4e893b7fa015
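The ensembling idea above can be sketched as follows: run P diversified copies of the same forward pass (here, distinct input offsets) and aggregate the outputs with learned weights. The offset transform and the weighted average are illustrative placeholders for the paper's learnable components:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))  # shared "model" weights (a toy stand-in)

def forward(x):
    """One stream of the shared model."""
    return np.tanh(x @ W)

def parallel_scaled_forward(x, offsets, agg_weights):
    """Run P parallel streams on perturbed inputs x + offset_p,
    then combine them with normalized aggregation weights."""
    outs = np.stack([forward(x + o) for o in offsets])  # (P, batch, out)
    agg = np.asarray(agg_weights, dtype=float)
    agg = agg / agg.sum()
    return np.tensordot(agg, outs, axes=1)              # (batch, out)

x = rng.normal(size=(2, 4))
offsets = [rng.normal(scale=0.1, size=4) for _ in range(3)]
y = parallel_scaled_forward(x, offsets, [1.0, 1.0, 2.0])
```

The key design point is that the P streams share parameters, so compute scales with P while parameter count stays (almost) fixed.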
#LLM finetuning scaling law: it's better to increase LLM size than pretraining data;
as finetuning data grows, the preferred method shifts from prompt-tuning to LoRA to full finetuning;
PEFT forgets less than full finetuning:
https://openreview.net/pdf?id=5HCnKDeTws
An interesting paper that combines #LLM test-time scaling with in-context search and shows, for the first time afaik, encouraging results on 2 NP-hard tasks and on a difficult planning task. https://arxiv.org/pdf/2505.22290
Another paper (after the EMRL one) that proposes replacing DPO with a simple weighted SFT on a batch of trajectories. Another interesting insight from this paper: negative advantages distort the loss landscape and create instabilities during training, so they propose exponential weights derived via variational inference: https://arxiv.org/abs/2502.11026
I see there a trend reminiscent of old-school "self-sup learning", but with reward-derived weights.
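A minimal sketch of the exponential-weights idea (AWR-style; `beta` and the normalization are my assumptions, not necessarily the paper's): negative advantages map to small *positive* weights, so no trajectory contributes a destabilizing negative gradient term:

```python
import math

def exp_weights(advantages, beta=1.0):
    """Exponential weights w_i = exp(A_i / beta), normalized to sum to 1.
    All weights are strictly positive, even for negative advantages."""
    w = [math.exp(a / beta) for a in advantages]
    s = sum(w)
    return [x / s for x in w]

def weighted_sft_loss(nlls, advantages, beta=1.0):
    """Weighted SFT: each trajectory's NLL is scaled by its weight,
    instead of using a DPO-style preference loss."""
    w = exp_weights(advantages, beta)
    return sum(wi * nll for wi, nll in zip(w, nlls))

weights = exp_weights([1.0, -1.0, 0.0])
# Higher-advantage trajectories get larger weights, but even the
# negative-advantage one keeps a small positive imitation weight.
```

This is exactly the "self-sup learning with reward-derived weights" flavor noted above: pure imitation, modulated by reward.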
Cycles of compression and memorization during pretraining improve generalization:
https://arxiv.org/abs/2505.08727
Long context and long CoT are very different: the amount of information in a long context makes it hard to combine it all within the limited number of layers, hence the need for a long CoT, which is generated to "summarize" intermediate values during reasoning, enabling the layers to use these intermediate values instead of the original ones.
https://arxiv.org/pdf/2505.04955