Great paper on test-time scaling; recipe: 1) careful creation of 1K examples; 2) finetune the LLM on these 1K samples (no RL!); 3) control thinking length with a test-time budget. The resulting S1 LLM gives very good results given how small the training set is and shows clear test-time scaling behaviour.
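If I remember the paper right, the budget control in step 3 is done purely at decoding time by forcing or suppressing the end-of-thinking delimiter. A minimal sketch of that idea, assuming a hypothetical `llm.generate(prompt, stop, max_new_tokens)` API and a `</think>` delimiter (names are mine, not the paper's):

```python
THINK_END = "</think>"  # end-of-thinking delimiter assumed to be used during finetuning

def generate_with_budget(llm, prompt, min_think_tokens=0, max_think_tokens=2048):
    """Control the length of the reasoning trace with a test-time token budget."""
    thinking = ""
    for _ in range(8):  # safety cap on the number of forced continuations
        remaining = max(max_think_tokens - len(thinking.split()), 1)  # crude token count
        chunk = llm.generate(prompt + thinking, stop=[THINK_END], max_new_tokens=remaining)
        thinking += chunk
        if len(thinking.split()) >= min_think_tokens:
            break
        # The model stopped thinking too early: suppress the delimiter and
        # append a continuation cue so it keeps reasoning.
        thinking += " Wait,"
    # Force the end of thinking and ask for the final answer.
    return llm.generate(prompt + thinking + THINK_END, max_new_tokens=256)
```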
A new theory about how in-context learning works in #llm, in 3 phases: 1) encode the in-context text, 2) merge it with the in-context labels, 3) retrieve the most similar in-context features: https://arxiv.org/pdf/2410.04468
The dimension of the representation manifold computed at various layers of an #LLM peaks at early-intermediate layers, and this peak separates surface-form processing (earlier layers) from syntactic-semantic processing (later layers): https://arxiv.org/pdf/2405.15471
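Out of curiosity, here is a minimal sketch of how one could reproduce that kind of curve: a TwoNN-style maximum-likelihood estimate of the intrinsic dimension of hidden states, computed layer by layer (a generic estimator, not necessarily the one used in the paper; `hidden_states` is a hypothetical list of per-layer (n_tokens, d_model) arrays):

```python
import numpy as np
from scipy.spatial import cKDTree

def twonn_id(X: np.ndarray) -> float:
    """TwoNN maximum-likelihood intrinsic-dimension estimate of a point cloud X."""
    tree = cKDTree(X)
    dists, _ = tree.query(X, k=3)       # distances to self, 1st and 2nd neighbours
    r1, r2 = dists[:, 1], dists[:, 2]
    mu = r2 / np.maximum(r1, 1e-12)     # neighbour-distance ratios
    mu = mu[mu > 1.0]                   # guard against duplicated points
    return len(mu) / np.sum(np.log(mu)) # MLE of the Pareto exponent = intrinsic dim

# hidden_states: hypothetical list of (n_tokens, d_model) activations, one per layer
# ids = [twonn_id(h) for h in hidden_states]
# The paper's claim is that this curve peaks at early-intermediate layers.
```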
New simple paradigm for zero-shot emergence in #LLM: MILS iteratively alternates (gradient-free) between a generator (typically an LLM that proposes a text solution) and a scorer (a model pretrained to evaluate this solution for the task):
https://arxiv.org/pdf/2501.18096
The method relies on the capacity of the LLM to reason about and improve the solution based on the scorer's output.
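A rough sketch of that generator/scorer loop as I understand it (the prompt format and function names are mine, not the paper's):

```python
def mils_loop(generate, score, task, n_candidates=16, n_steps=10, k_keep=5):
    """Gradient-free optimization: an LLM generator proposes candidate solutions,
    a pretrained scorer ranks them, and the best ones are fed back as feedback."""
    feedback = ""   # textual summary of the best candidates so far
    best = None
    for _ in range(n_steps):
        prompt = (f"Task: {task}\n"
                  f"Previous best candidates and scores:\n{feedback}\n"
                  f"Propose a better solution:")
        candidates = [generate(prompt) for _ in range(n_candidates)]
        scored = sorted(((score(c), c) for c in candidates), reverse=True)
        _, best = scored[0]
        feedback = "\n".join(f"{s:.3f}: {c}" for s, c in scored[:k_keep])
    return best
```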
Paper that compares SFT vs. RL to adapt to simple math reasoning tasks: https://arxiv.org/pdf/2501.17161v1
RL apparently generalizes better, although only when SFT is used first to adapt the base LLM to the task.
I'm not totally convinced, as the task and setup seem more suited to RL than to SFT, and aren't the weaker SFT results due to overfitting? What if you prevented that with more regularization?
Nice paper that compares several unlearning methods and proposes 5 new ones: https://arxiv.org/pdf/2410.02159
A better knowledge distillation method to train smaller #LLM that interpolates between the teacher and student distributions: https://arxiv.org/pdf/2501.16937
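If I read it correctly, the distillation target is a mixture of the two distributions whose weight drifts toward the teacher over training. A rough PyTorch sketch of such an interpolated KD loss (the exact schedule and formulation in the paper may differ):

```python
import torch
import torch.nn.functional as F

def interpolated_kd_loss(student_logits, teacher_logits, t: float):
    """KD loss against a target that interpolates between the (detached) student
    and the teacher distributions; t in [0, 1] is typically annealed towards 1
    so the target drifts from 'easy' (student-like) to the teacher."""
    p_student = F.softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    p_target = (1.0 - t) * p_student.detach() + t * p_teacher
    log_p_student = F.log_softmax(student_logits, dim=-1)
    # KL(p_target || p_student), averaged over the batch
    return F.kl_div(log_p_student, p_target, reduction="batchmean")
```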
Two arxiv papers by the same group about planning complex tasks with #LLM and the help of external formal solvers:
https://arxiv.org/pdf/2404.11891
and
Knowledge editing in #LLM by modifying the loss to include a KL-divergence term between the output distributions with and without the new fact in context: https://arxiv.org/pdf/2406.11194
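A rough sketch of the idea, assuming HF-style models that return `.logits`: the edited model, given the bare query, is pushed towards the frozen model's distribution when the new fact is in context (the paper's full loss and implementation details may differ):

```python
import torch
import torch.nn.functional as F

def in_context_editing_kl(model, frozen_model, fact_ids, query_ids):
    """KL term pushing the edited model (without the fact in context) towards
    the frozen model's distribution computed with the fact in context."""
    with torch.no_grad():
        # Target: next-token distributions over the query positions, fact prepended.
        ctx = torch.cat([fact_ids, query_ids], dim=-1)
        target_logits = frozen_model(ctx).logits[:, -query_ids.size(-1):, :]
        target_probs = F.softmax(target_logits, dim=-1)

    # Prediction: same query positions, but without the fact in context.
    logits = model(query_ids).logits[:, -query_ids.size(-1):, :]
    log_probs = F.log_softmax(logits, dim=-1)

    # KL(p_with_fact || p_without_fact), used as an extra term in the editing loss.
    return F.kl_div(log_probs, target_probs, reduction="batchmean")
```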
Analysis of which strategies #LLM apply to solve problems: it appears to often be a combination of memorization and either reasoning or guessing. This does not tell us whether they *could* solve the problem with reasoning alone, only that they *do* use such a combination: https://arxiv.org/pdf/2501.13833
AoT+: great paper that improves #LLM reasoning without external tools, by exploiting memoization to make it easier for them to recall past reasoning states without errors, and by showing random backtracking paths in the few-shot examples to encourage them to use backtracking: https://arxiv.org/pdf/2501.13545
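The two ingredients can be illustrated with a toy trace formatter: label every intermediate reasoning state so the model can recall it verbatim later, and inject random backtracking steps into the demonstrations (the format here is mine, not the paper's):

```python
import random

def format_trace(states, solution):
    """Build a few-shot demonstration where each reasoning state gets an explicit
    label (so it can be recalled exactly) and where we occasionally jump back to
    an earlier labeled state to demonstrate backtracking."""
    lines = []
    for i, s in enumerate(states):
        lines.append(f"[S{i}] {s}")
        if i > 1 and random.random() < 0.3:
            j = random.randrange(i)
            lines.append(f"This branch looks unpromising; backtrack to [S{j}]: {states[j]}")
    lines.append(f"Answer: {solution}")
    return "\n".join(lines)
```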
Technical report of Yi-Lightning: a MoE #LLM that scores at the top of the arena benchmarks; a lot of very interesting insights in this report, like how they decompose pretraining into 3 main stages.
Congratulations to Yaya Sy and Gaspard Michel for their papers on LLM compression and on leveraging LLMs for book understanding, both accepted at the NAACL 2025 main conference!
Coconut: chain of continuous thoughts
Towards System 2 reasoning with #LLM: Meta Chain-of-Thought
Hyperfitting: a heuristic that always improves the quality of #LLM generation by overfitting on a small news corpus. Side note: don't worry if your perplexity increases; it's not a good metric anyway.
FLM-101B: 90% of the performance of GLM-130B at 10% of the cost, thanks to growing the LLM during training: the cost is reduced because the model is much smaller for a large part of training:
https://arxiv.org/pdf/2309.03852
(Similar conclusion reached in TokenFormer)
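A toy illustration of the growth idea (not FLM-101B's actual operators, which are designed to be function-preserving): periodically enlarge the model, e.g. by duplicating trained blocks, so that most training tokens are processed by a much smaller model:

```python
import copy
import torch.nn as nn

def grow_depth(blocks: nn.ModuleList) -> nn.ModuleList:
    """Toy depth-growth operator: double the number of transformer blocks by
    duplicating each trained block, initializing the new copy from the old one."""
    grown = []
    for block in blocks:
        grown.append(block)
        grown.append(copy.deepcopy(block))
    return nn.ModuleList(grown)

# Typical schedule: train a small model, grow (width and/or depth), keep training,
# repeat; the compute budget is dominated by the cheaper early stages.
```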
Two specific dimensions exist in all LLMs that encode whether they're lying and whether the statement is negated: