Christophe Cerisara
cerisara@mastodon.online

Great paper on test-time scaling; recipe: 1) careful creation of 1K examples; 2) finetune the LLM on these 1K samples (no RL!); 3) control the thinking length with a test-time budget. The resulting s1 LLM gives very good results given how small the training set is, and shows clear test-time scaling behaviour.

https://arxiv.org/pdf/2501.19393
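
A minimal sketch of step 3, the test-time budget (my illustration, assuming a "<think>...</think>" reasoning delimiter and a caller-supplied token sampler; not the paper's code):

# Sketch of s1-style budget forcing (illustrative, not the paper's code).
def generate_with_budget(next_token, prompt, max_think_tokens=2000, max_waits=2):
    # next_token(text) -> next decoded token string, supplied by the caller.
    text = prompt + "<think>"
    n, waits = 0, 0
    while n < max_think_tokens:
        tok = next_token(text)
        if tok == "</think>":
            if waits < max_waits:
                # Scale compute up: suppress the end-of-thinking token and
                # append "Wait" so the model keeps reasoning.
                text += " Wait"
                waits += 1
                continue
            break
        text += tok
        n += 1
    # Budget exhausted (or thinking done): force the end-of-thinking
    # delimiter so the model proceeds to the final answer.
    return text + "</think>"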

2 days ago
Christophe Cerisara
cerisara@mastodon.online

A new theory of how in-context learning works in #LLM, in 3 phases: 1) encode the in-context text, 2) merge it with the in-context labels, 3) retrieve the most similar in-context features: https://arxiv.org/pdf/2410.04468
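
A toy analogy of these 3 phases as a nearest-neighbour lookup over demonstration features (my illustration, not the paper's formalism; encode is a caller-supplied sentence encoder):

import numpy as np

def icl_predict(encode, demos, query):
    # Phase 1: encode each in-context example's text.
    feats = np.stack([encode(x) for x, _ in demos])
    # Phase 2: merge the features with their in-context labels.
    labels = [y for _, y in demos]
    # Phase 3: retrieve the label of the most similar in-context feature.
    q = encode(query)
    sims = feats @ q / (np.linalg.norm(feats, axis=1) * np.linalg.norm(q) + 1e-9)
    return labels[int(np.argmax(sims))]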

3 days ago
Christophe Cerisara
cerisara@mastodon.online

The dimension of the representation manifold computed at various layers of an #LLM peaks at early-intermediate layers; this peak separates surface-form processing (earlier layers) from syntactic-semantic processing (later layers):
https://arxiv.org/pdf/2405.15471
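
A quick way to eyeball this yourself (a sketch using the PCA participation ratio as a cheap proxy for the manifold dimension; the paper uses a proper intrinsic-dimension estimator):

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
batch = tok("The cat sat on the mat.", return_tensors="pt")  # use many sentences in practice
with torch.no_grad():
    hidden = model(**batch).hidden_states  # (n_layers + 1) tensors of [B, T, D]
for layer, h in enumerate(hidden):
    x = h.reshape(-1, h.shape[-1]).float()
    x = x - x.mean(0)
    eig = torch.linalg.svdvals(x) ** 2
    pr = eig.sum() ** 2 / (eig ** 2).sum()  # participation ratio ~ effective dim
    print(layer, round(pr.item(), 1))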

5 days ago
Christophe Cerisara
cerisara@mastodon.online

New, simple paradigm for zero-shot emergence in #LLM: MILS iteratively optimizes, gradient-free, a generator (typically an LLM that outputs a text solution) against a scorer (a model pretrained to evaluate this solution for a given task):
https://arxiv.org/pdf/2501.18096

This method relies on the LLM's capacity to reason about and improve the solution based on the scorer's feedback.
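
The core loop is tiny; a sketch with caller-supplied generate/score helpers (names are mine):

def mils(generate, score, task, n_candidates=8, n_steps=10):
    best, best_score = None, float("-inf")
    feedback = ""  # scorer outputs folded back into the generator prompt
    for _ in range(n_steps):
        # Generator: the LLM proposes candidate text solutions, conditioned
        # on the task and on the scores of the previous candidates.
        candidates = generate(task, feedback, n=n_candidates)
        scored = sorted(((score(task, c), c) for c in candidates), reverse=True)
        if scored[0][0] > best_score:
            best_score, best = scored[0]
        # Scorer: its rankings become in-context feedback for the next round.
        feedback = "\n".join(f"{s:.3f}: {c}" for s, c in scored)
    return best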

January 31, 2025
Christophe Cerisara
cerisara@mastodon.online

Paper that compares SFT vs. RL for adapting to simple math reasoning tasks: https://arxiv.org/pdf/2501.17161v1
RL apparently generalizes better, although only when SFT is used first to adapt the base LLM to the task.
I'm not totally convinced: the task and setup seem better suited to RL than to SFT, and aren't SFT's worse results due to overfitting? What if you prevented that with more regularization?

January 31, 2025
Christophe Cerisara
cerisara@mastodon.online

Nice paper that compares several unlearning methods and proposes 5 new ones: https://arxiv.org/pdf/2410.02159

January 31, 2025
Christophe Cerisara
cerisara@mastodon.online

A better knowledge distillation method to train smaller #LLM: it interpolates between the teacher and student distributions: https://arxiv.org/pdf/2501.16937
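
The idea in loss form (my sketch; the paper's exact interpolation schedule and objective details differ):

import torch.nn.functional as F

def interpolated_kd_loss(student_logits, teacher_logits, alpha):
    # Target interpolates from the student (alpha=0) to the teacher (alpha=1)
    # over training, easing the teacher-student capacity gap.
    p_student = F.softmax(student_logits.detach(), dim=-1)  # stop-gradient on target side
    p_teacher = F.softmax(teacher_logits, dim=-1)
    p_target = alpha * p_teacher + (1 - alpha) * p_student
    log_q = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(log_q, p_target, reduction="batchmean")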

January 31, 2025
Christophe Cerisara
cerisara@mastodon.online

Two arXiv papers by the same group about planning complex tasks with #LLM and the help of external formal solvers:
https://arxiv.org/pdf/2404.11891

and

https://arxiv.org/pdf/2410.12112
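
The shared pattern, roughly (a sketch with hypothetical llm/solver callables; the actual pipelines in the papers are richer):

def plan_with_solver(llm, solver, task, max_repairs=3):
    # 1) The LLM translates the natural-language task into a formal spec
    #    (e.g. PDDL) that a sound solver can consume.
    spec = llm("Translate this task into PDDL:\n" + task)
    for _ in range(max_repairs + 1):
        # 2) The external solver does the actual combinatorial search.
        plan, error = solver(spec)
        if error is None:
            return plan
        # 3) Solver errors are fed back to the LLM to repair the spec.
        spec = llm(f"Fix this PDDL given the solver error:\n{error}\n\n{spec}")
    raise RuntimeError("no valid plan within the repair budget")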

January 30, 2025
Christophe Cerisara
cerisara@mastodon.online

Knowledge editing in #LLM by modifying the loss to include a term for the KL divergence between the output distributions with and without the new fact in context: https://arxiv.org/pdf/2406.11194
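
In loss form, roughly (my sketch and notation, not the paper's):

import torch.nn.functional as F

def editing_loss(edited_logits_no_ctx, frozen_logits_with_ctx, lm_loss, beta=1.0):
    # Teacher: the unedited model's distribution WITH the new fact in context.
    p_ctx = F.softmax(frozen_logits_with_ctx, dim=-1)
    # Student: the edited model's distribution WITHOUT the fact in context.
    log_q = F.log_softmax(edited_logits_no_ctx, dim=-1)
    kl = F.kl_div(log_q, p_ctx, reduction="batchmean")
    return lm_loss + beta * kl  # usual LM loss on the fact + KL consistency term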

January 29, 2025
Christophe Cerisara
cerisara@mastodon.online

Analysis of which strategies #LLM apply to solve problems: it appears to often be a combination of memorization and either reasoning or guessing. Note that this does not tell whether they *could* solve the problem with reasoning alone, only that they *do* use such a combination: https://arxiv.org/pdf/2501.13833

January 24, 2025
Christophe Cerisara
cerisara@mastodon.online

AoT+: great paper that improves #LLM reasoning without external tools, by exploiting memoization to help them recall past reasoning states without errors, and by showing random backtracking paths in the few-shot examples to encourage them to backtrack: https://arxiv.org/pdf/2501.13545
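
To give the flavour of such a few-shot trace (a toy paraphrase of mine, not the paper's actual prompt):

Problem: find integers x, y in 1..5 with x + y = 6 and x < y.
State 1: x=3, y=3 (x+y=6 holds, x<y fails) -> backtrack
State 2: x=2, y=4 (x+y=6 holds, x<y holds) -> solution
Answer: x=2, y=4

Restating the full state at every step is the memoization part; the explicit backtrack line is what the injected random backtracking paths teach.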

January 24, 2025
Christophe Cerisara
cerisara@mastodon.online

Technical report of Yi-Lightning: a MoE #LLM that scores at the top of the Chatbot Arena benchmark; the report contains a lot of very interesting insights, like how they decompose pretraining into 3 main stages.

https://arxiv.org/pdf/2412.01253

January 23, 2025
Christophe Cerisara
cerisara@mastodon.online

Congratulations to Yaya Sy and Gaspard Michel for their papers about LLM compression and leveraging LLMs for book understanding, both accepted at the NAACL 2025 main conference!

January 23, 2025
Christophe Cerisara
cerisara@mastodon.online

Coconut: chain of continuous thoughts, i.e. reasoning in a continuous latent space rather than through decoded tokens:

https://arxiv.org/pdf/2412.06769
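
The mechanism, as a sketch (assuming an HF-style causal LM): instead of decoding a token and re-embedding it, the last hidden state is fed straight back as the next input embedding, so the chain of thought never leaves latent space.

import torch

@torch.no_grad()
def continuous_thoughts(model, input_embeds, n_thoughts=4):
    embeds = input_embeds  # [batch, seq, dim]
    for _ in range(n_thoughts):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1:, :]  # [batch, 1, dim]
        embeds = torch.cat([embeds, last_hidden], dim=1)  # next "thought"
    return embeds  # decode the final answer from this prefix as usual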

January 22, 2025
Christophe Cerisara
cerisara@mastodon.online

Towards System 2 reasoning with #LLM: Meta Chain-of-Thought

https://arxiv.org/pdf/2501.04682

January 22, 2025
Christophe Cerisara
cerisara@mastodon.online

Hyperfitting: a heuristic that consistently improves the quality of #LLM generation by overfitting on a small news corpus. Side note: don't worry if your perplexity increases, it's not a good metric anyway.

https://www.alphaxiv.org/abs/2412.04318

January 18, 2025
Christophe Cerisara
cerisara@mastodon.online

FLM-101B: 90% of the performance of GLM-130B at 10% of the cost, thanks to growing the LLM during training: costs are reduced because the model stays much smaller for a large part of training:

https://arxiv.org/pdf/2309.03852

(Similar conclusion reached in TokenFormer)
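
A sketch of the growth idea on one pair of linear layers (Net2Net-style function-preserving widening; FLM-101B's actual growth operators differ in their details):

import torch
import torch.nn as nn

def widen(fc_in: nn.Linear, fc_out: nn.Linear, new_width: int):
    # Duplicate random hidden units, then split their outgoing weights so
    # the widened network computes exactly the same function.
    old_width = fc_in.out_features
    idx = torch.randint(0, old_width, (new_width - old_width,))
    mapping = torch.cat([torch.arange(old_width), idx])
    new_in = nn.Linear(fc_in.in_features, new_width)
    new_in.weight.data = fc_in.weight.data[mapping].clone()
    new_in.bias.data = fc_in.bias.data[mapping].clone()
    counts = torch.bincount(mapping, minlength=old_width).float()
    new_out = nn.Linear(new_width, fc_out.out_features)
    new_out.weight.data = fc_out.weight.data[:, mapping] / counts[mapping]
    new_out.bias.data = fc_out.bias.data.clone()
    return new_in, new_out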

January 17, 2025
Christophe Cerisara
cerisara@mastodon.online

Two specific directions exist in the activations of all tested LLMs: one indicating whether the model is lying, and one capturing negation:

https://www.alphaxiv.org/abs/2407.12831
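
A sketch of how such directions are typically extracted (difference-of-means probing on hidden activations; the paper's exact procedure may differ):

import numpy as np

def direction(acts_pos: np.ndarray, acts_neg: np.ndarray):
    # acts_*: [n_statements, hidden_dim] activations at some layer.
    d = acts_pos.mean(0) - acts_neg.mean(0)
    return d / np.linalg.norm(d)

# Truth direction: true vs. false statements; negation/polarity direction:
# affirmative vs. negated statements, extracted the same way.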

January 17, 2025