Christophe Cerisara
cerisara@mastodon.online

An interesting paper that combines #LLM test-time scaling with in-context search and shows, for the first time afaik, encouraging results on 2 NP-hard tasks and on a difficult planning task. https://arxiv.org/pdf/2505.22290

May 31, 2025
Christophe Cerisara
cerisara@mastodon.online

Another paper (after the EMRL one) that proposes to replace DPO with a simple weighted SFT on a batch of trajectories. Another interesting insight from this paper is that negative advantages degrade the loss landscape and create instabilities during training, so they propose exponential weights derived through variational inference: https://arxiv.org/abs/2502.11026
I see there a trend reminiscent of old-school "self-supervised learning", but with reward-derived weights.
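To make the idea concrete, here is a minimal sketch of reward-weighted SFT under my own assumptions (not the paper's exact objective): each trajectory's plain SFT loss gets an exponential, always-positive weight derived from its advantage, so negative advantages only down-weight a trajectory instead of flipping the gradient direction.

```python
# Minimal sketch of reward-weighted SFT (not the paper's exact loss).
# Assumed: a HF-style causal LM returning per-token logits, and a batch of
# tokenized trajectories with one scalar advantage each.
import torch
import torch.nn.functional as F

def weighted_sft_loss(model, input_ids, attention_mask, advantages, beta=1.0):
    # Per-token next-token cross-entropy, shape (B, T-1)
    logits = model(input_ids, attention_mask=attention_mask).logits[:, :-1]
    targets = input_ids[:, 1:]
    tok_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    ).view(targets.shape)
    # Average over the valid tokens of each trajectory
    mask = attention_mask[:, 1:].float()
    traj_loss = (tok_loss * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    # Exponential weights: bad trajectories shrink towards zero weight,
    # they never push the loss in the opposite direction.
    weights = torch.exp(beta * (advantages - advantages.max())).detach()
    return (weights * traj_loss).sum() / weights.sum()
```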

May 30, 2025
Christophe Cerisara
cerisara@mastodon.online

Cycles of compression-memorization during pretraining improve generalization:
https://arxiv.org/abs/2505.08727

May 15, 2025
Christophe Cerisara
cerisara@mastodon.online

Long context and long CoT are very different: the amount of information in a long context makes it hard to combine all of it within the limited number of layers, hence the need for a long CoT, which is generated to "summarize" intermediate values during reasoning, enabling the layers to use these intermediate values instead of the original ones.
https://arxiv.org/pdf/2505.04955

May 10, 2025
Christophe Cerisara
cerisara@mastodon.online

Most "maths" #LLM benchmarks only evaluate the final output (e.g., a simple number). There's a tendency though right now to shift maths evaluations towards assessing the reasoning path instead of the output. Both papers propose ways to do this, for theorem proving in Lean4, and judging the errors during reasoning:

https://arxiv.org/pdf/2505.02735

https://openreview.net/pdf?id=br4H61LOoI

May 08, 2025
Christophe Cerisara
cerisara@mastodon.online

ICL generalizes better than FT when using the same training dataset, but if you further augment the FT dataset with in-context inferences, then FT ends up the best at generalization:

https://arxiv.org/pdf/2505.00661

Note: two artificial test sets are used, exploiting relation reversals and syllogisms.
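Roughly, the augmentation recipe (my own sketch of it, with a hypothetical generate_text helper standing in for any LLM completion call): put each training fact in context, ask the model to spell out what it implies (e.g. the reversed relation), and add these generations to the finetuning set.

```python
# Hedged sketch of "augment FT with in-context inferences"; prompt wording
# and the generate_text() helper are placeholders, not the paper's code.

def augment_with_icl(train_facts, generate_text, n_per_fact=3):
    augmented = list(train_facts)
    for fact in train_facts:
        prompt = (
            "Here is a fact:\n"
            f"{fact}\n"
            "List other statements that logically follow from it "
            "(for example, the reversed relation):\n"
        )
        for _ in range(n_per_fact):
            inference = generate_text(prompt)   # in-context inference step
            augmented.append(inference.strip())
    return augmented  # finetune on this enlarged dataset instead
```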

#LLM

May 03, 2025
Christophe Cerisara
cerisara@mastodon.online

Transformers may encode a model of the world in their residual stream, which can be viewed as a space encoding the belief state (a distribution over possible generating states): https://arxiv.org/pdf/2405.15943
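This is the kind of probe such a claim is usually tested with (my reading, not the paper's code): regress residual-stream activations onto the ground-truth belief state (the posterior over hidden states of the generating process) and check how much variance a purely linear map explains.

```python
# Sketch of a linear belief-state probe; variable names are placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_belief_probe(residual_acts, belief_states):
    # residual_acts: (N, d_model) activations collected at some layer/position
    # belief_states: (N, n_states) posterior over the generating states
    probe = LinearRegression().fit(residual_acts, belief_states)
    r2 = probe.score(residual_acts, belief_states)
    return probe, r2  # high R^2 => belief geometry is linearly embedded
```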

April 16, 2025
Christophe Cerisara
cerisara@mastodon.online

Information compression in #LLM: the Kolmogorov structure function is a lower bound on the scaling laws; assuming a Pitman-Yor data generation process (it's good to see Bayes again!), they show that it leads to the scaling laws actually observed in LLMs:
https://arxiv.org/pdf/2504.09597
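For context (my own summary, not equations taken from the paper): the two-parameter Pitman-Yor process PY(d, θ) has the Chinese-restaurant seating rule below, and for d > 0 the number of distinct "tables" grows as a power law, which is the usual route from such priors to the power-law behaviour seen in scaling laws.

```latex
% Seating rule of PY(d, \theta), with 0 \le d < 1, \theta > -d,
% after n customers have been seated at K tables:
P(\text{join table } k) = \frac{n_k - d}{n + \theta}, \qquad
P(\text{open a new table}) = \frac{\theta + K d}{n + \theta}.
% For d > 0 the number of distinct tables grows as K_n \sim n^{d},
% i.e. a power law in the amount of data.
```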

April 16, 2025
Christophe Cerisara
cerisara@mastodon.online

@lauluompelimo @academicchatter

so the llm has read your papers and might be recognizing your style... That's interesting! A review is not that long though, so maybe it picks up on some uncommon wordings?

Thanks for sharing. Not more creepy than common trackers using your browser fingerprint, I think.

April 09, 2025
Christophe Cerisara
cerisara@mastodon.online

The ClusComp paper confirms that quantization seems to be hitting a ceiling, and proposes an alternative based on VQ codebooks:
https://arxiv.org/pdf/2503.13089
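Generic flavour of codebook-based weight compression, as a sketch of the idea only (ClusComp's actual algorithm differs): cluster weight sub-vectors with k-means, store the small codebook plus one integer index per sub-vector, and reconstruct approximate weights from them.

```python
# Sketch of VQ/codebook weight compression; group size and codebook size
# are illustrative choices, not ClusComp's settings.
import numpy as np
from sklearn.cluster import KMeans

def compress_weight(W, group=8, n_codes=256):
    # Split the weight matrix into `group`-dim sub-vectors and cluster them.
    flat = W.reshape(-1, group)                      # (n_groups, group)
    km = KMeans(n_clusters=n_codes, n_init=10).fit(flat)
    codebook = km.cluster_centers_                   # (n_codes, group)
    indices = km.labels_.astype(np.uint8)            # 1 byte/group (<=256 codes)
    return codebook, indices

def decompress_weight(codebook, indices, shape):
    return codebook[indices].reshape(shape)          # approximate W
```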

March 28, 2025
Christophe Cerisara
cerisara@mastodon.online

Overtrained #LLMs are harder to finetune: this paper looks like a serious stop to the current trend of small LLMs trained on more and more tokens.
And this was kind of expected: continued training of an LLM without noise is difficult, and there are previous papers showing that the entropy of activations decreases during training, making further learning more difficult, etc.
The optima from scaling laws mean something after all. This phenomenon needs further study; nice paper:

https://arxiv.org/abs/2503.19206

March 26, 2025
Christophe Cerisara
cerisara@mastodon.online

Edited #LLM knowledge may fail to propagate in multi-hop questions (e.g. in "the birth country of the father of...");
this paper analyzes the two circuits that are responsible for each reasoning "hop" and uses them to edit knowledge:
https://arxiv.org/pdf/2503.16356

March 22, 2025
Christophe Cerisara
cerisara@mastodon.online

#LLM adaptation with LoRA is still too costly, and several works try to reduce this cost further.
To do that, LoRAM proposes to first prune the LLM and then finetune only this pruned LLM with LoRA:
https://arxiv.org/pdf/2502.13533
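A minimal sketch of the prune-then-LoRA recipe (LoRAM's actual pruning and recovery scheme differ; the model id, pruning ratio and target module names here are placeholders):

```python
# Sketch: prune the backbone first, then train only LoRA adapters on it.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("some/base-llm")  # placeholder id

# 1) Prune the dense weights (plain magnitude pruning here, as an example).
for module in model.modules():
    w = getattr(module, "weight", None)
    if isinstance(w, torch.nn.Parameter) and w.dim() == 2:
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")   # make the sparsity permanent

# 2) Finetune only small LoRA adapters on top of the pruned backbone.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
# ...then train as usual: only the adapters receive gradients.
```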

March 21, 2025
Christophe Cerisara
cerisara@mastodon.online

This paper proposes a new metric to locate knowledge neurons in #LLM, based on
the change in the target word's logits when a neuron is removed. It shows that both attention
and MLP layers encode knowledge, and that different types of knowledge are stored in different layers:

https://arxiv.org/pdf/2312.12141
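The metric itself is easy to picture; here is a hedged sketch (not the paper's code) for a HF-style causal LM: zero out one hidden unit with a forward hook on a module whose output is a plain tensor (e.g. an MLP projection), and measure how much the target word's logit drops.

```python
# Sketch of a logit-drop attribution for a single neuron; `model`,
# `layer_module`, `inputs` (tokenized prompt) and `target_id` are placeholders.
import torch

def neuron_attribution(model, layer_module, neuron_idx, inputs, target_id):
    with torch.no_grad():
        base = model(**inputs).logits[0, -1, target_id].item()

    def ablate(_module, _inp, output):
        output = output.clone()
        output[..., neuron_idx] = 0.0   # remove this neuron's contribution
        return output

    handle = layer_module.register_forward_hook(ablate)
    with torch.no_grad():
        ablated = model(**inputs).logits[0, -1, target_id].item()
    handle.remove()
    return base - ablated   # large drop => neuron stores relevant knowledge
```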

March 08, 2025
Christophe Cerisara
cerisara@mastodon.online

Job opportunity in Nancy, France: research engineer for pretraining LLMs. Apply here: https://emploi.cnrs.fr/Offres/CDD/UMR7503-CHRCER-003/Default.aspx

March 05, 2025
Christophe Cerisara
cerisara@mastodon.online

Compression is prediction, and it is "closely related to the ability to generalize"; this is true for #LLM, and Ilya Sutskever explains the effectiveness of unsupervised learning with compression. This paper exploits this connection to evaluate LLMs: https://arxiv.org/pdf/2402.00861
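The compression view of evaluation boils down to a code length: the bits a model needs to encode a text are its summed negative log-probabilities, so fewer bits per byte means better compression. A small sketch of that computation (my own, with a placeholder model name):

```python
# Sketch: bits-per-byte of a text under a causal LM (lower = better compression).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_byte(model_name, text):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)          # mean NLL over predicted tokens
    nll_nats = out.loss.item() * (ids.size(1) - 1)
    return nll_nats / math.log(2) / len(text.encode("utf-8"))
```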

March 04, 2025