Growing LLMs: SOLAR, [LlamaPro](https://arxiv.org/abs/2401.02415), [LiGO](https://arxiv.org/abs/2303.00980), [FLM-101b](https://huggingface.co/CofeAI/FLM-101B), [dynamic architectures do not scale](https://arxiv.org/abs/2307.06440)
#LLM benchmarks are not reliable... https://arxiv.org/abs/2402.01781
transformers can extrapolate context length to 2.5x but length generalization remains fragile: https://arxiv.org/abs/2402.09371
aespa: quantize weights separately down to 2-bit: https://arxiv.org/abs/2402.08958
Information gained through finetuning may be compressed down to 1 bit per weight, suggesting that
most of the value really comes from pretraining; it also opens the way to further inference-cost reductions.
A side lesson: LoRA may not be the alpha and omega of adaptation after all...
https://arxiv.org/abs/2402.10193
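A minimal sketch of this 1-bit compression idea, assuming it boils down to keeping only the sign of the finetuning delta plus a per-matrix scale (the helper names and the mean-absolute scale are my illustration, not the paper's exact recipe, which calibrates its scales on data):

```python
import torch

def compress_delta_1bit(w_base: torch.Tensor, w_ft: torch.Tensor):
    """Keep only the sign of the finetuning delta (1 bit per weight)
    plus a single per-matrix scaling factor."""
    delta = w_ft - w_base
    scale = delta.abs().mean()   # crude per-matrix scale
    signs = torch.sign(delta)    # the 1 bit of information per weight
    return signs, scale

def reconstruct(w_base: torch.Tensor, signs: torch.Tensor, scale: torch.Tensor):
    """Approximate the finetuned weights from base + 1-bit delta."""
    return w_base + scale * signs
```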
DoRA makes LoRA better by decomposing weights into magnitude and direction, and applying LoRA only to the direction:
https://arxiv.org/abs/2402.09353
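A rough, unofficial sketch of the decomposition (the class name, rank and initialization below are illustrative, not the paper's implementation): the frozen weight is split into a trainable per-column magnitude and a direction, and only the direction receives the low-rank update.

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    """Sketch of the DoRA idea: trainable magnitude + LoRA on the direction."""

    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        W = base.weight.data                                   # (out, in), frozen
        self.register_buffer("W", W)
        self.m = nn.Parameter(W.norm(dim=0, keepdim=True))     # per-column magnitudes
        self.A = nn.Parameter(torch.randn(r, W.shape[1]) * 0.01)
        self.B = nn.Parameter(torch.zeros(W.shape[0], r))
        self.bias = base.bias

    def forward(self, x):
        W_adapted = self.W + self.B @ self.A                    # low-rank update
        direction = W_adapted / W_adapted.norm(dim=0, keepdim=True)
        return nn.functional.linear(x, self.m * direction, self.bias)
```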
Automatically learning permutation-invariant symmetry in DNNs: https://arxiv.org/abs/2402.05232
Absolutely brilliant paper that replaces backprop with 2 forward passes,
hence greatly reducing memory requirements for LLM finetuning.
Great work, both theoretically and experimentally!
https://openreview.net/forum?id=Vota6rFhBQ
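The core trick is a zeroth-order (SPSA-style) gradient estimate obtained from two perturbed forward passes; here is a simplified sketch (the function name is mine, and the real method regenerates the perturbation from a random seed instead of storing it, among other refinements):

```python
import torch

def spsa_grad_step(model, loss_fn, batch, lr=1e-6, eps=1e-3, seed=0):
    """One zeroth-order step: two forward passes, no backward pass,
    so no activations need to be stored."""
    torch.manual_seed(seed)
    zs = [torch.randn_like(p) for p in model.parameters()]

    with torch.no_grad():
        for p, z in zip(model.parameters(), zs):
            p.add_(eps * z)                      # theta + eps*z
        loss_plus = loss_fn(model, batch)

        for p, z in zip(model.parameters(), zs):
            p.sub_(2 * eps * z)                  # theta - eps*z
        loss_minus = loss_fn(model, batch)

        # directional derivative estimate: g ~ (L+ - L-) / (2*eps) along z
        g_scale = (loss_plus - loss_minus) / (2 * eps)
        for p, z in zip(model.parameters(), zs):
            p.add_(eps * z)                      # restore original weights
            p.sub_(lr * g_scale * z)             # SGD step along the perturbation
```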
Great pedagogical overview of important LLM concepts!
https://www.youtube.com/watch?v=KCXDr-UOb9A
a weak form of linear mode connectivity of SGD trained models can be achieved modulo permutation symmetry;
this applies to sequences of iteratively sparsified networks: https://openreview.net/forum?id=BwhEtJrnir
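To make the "modulo permutation" part concrete, here is a toy illustration for a 2-layer MLP (the matching criterion and function are mine, only meant to show the mechanics): permute the hidden units of model B to best match model A, then interpolate linearly.

```python
import torch
from scipy.optimize import linear_sum_assignment

def align_and_interpolate(W1_a, W1_b, W2_a, W2_b, alpha=0.5):
    """Permute B's hidden units to match A (Hungarian matching on
    first-layer weight similarity), then interpolate the weights."""
    cost = -(W1_a @ W1_b.T)                      # negative similarity between units
    _, col = linear_sum_assignment(cost.numpy())
    perm = torch.as_tensor(col)

    W1_b_p = W1_b[perm]                          # permute rows of layer 1
    W2_b_p = W2_b[:, perm]                       # permute columns of layer 2

    W1_mid = (1 - alpha) * W1_a + alpha * W1_b_p
    W2_mid = (1 - alpha) * W2_a + alpha * W2_b_p
    return W1_mid, W2_mid
```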
A Bayesian view of in-context learning: shows that ICL can generalize to unknown tasks
when the training tasks are more diverse than a given threshold: https://arxiv.org/abs/2306.15063
"sparse model soups" combine iterative magnitude pruning with model averaging (models with the same sparsity topology)
to obtain improved generalization and pruning: https://arxiv.org/abs/2306.16788
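The recipe is simple to picture: average checkpoints that share the same pruning mask, so the soup stays exactly as sparse. A minimal sketch (function and argument names are made up for illustration):

```python
import torch

def sparse_soup(state_dicts, mask):
    """Average models that share the same sparsity topology,
    then re-apply the common mask so the average stays sparse."""
    avg = {}
    for name in state_dicts[0]:
        stacked = torch.stack([sd[name] for sd in state_dicts])
        avg[name] = stacked.mean(dim=0)
        if name in mask:
            avg[name] = avg[name] * mask[name]
    return avg
```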
BitNet: https://arxiv.org/pdf/2310.11453.pdf
Very nice work, but the gradients are still full precision. I'm not sure trying to force 1-bit parameters
into the standard training pipeline is the best way to go, but still, it's an impressive result.
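"Full-precision gradients" here means the forward pass uses binarized weights while latent weights and gradients stay in full precision, via a straight-through estimator. A simplified sketch (generic binarization, not BitNet's exact quantizer or normalization):

```python
import torch
import torch.nn as nn

class BinaryLinear(nn.Linear):
    """Forward with 1-bit weights, backward through a straight-through estimator."""

    def forward(self, x):
        w = self.weight
        w_bin = torch.sign(w) * w.abs().mean()     # 1-bit weights + one scale
        w_ste = w + (w_bin - w).detach()           # STE: gradient flows to latent w
        return nn.functional.linear(x, w_ste, self.bias)
```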
Stabilize PyTorch training by tracking the attention entropy: https://github.com/apple/ml-sigma-reparam
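Tracking attention entropy is cheap to do yourself; a possible monitoring helper (not the repo's API), where collapsing entropy is the warning sign:

```python
import torch

def attention_entropy(attn_probs: torch.Tensor) -> torch.Tensor:
    """Mean entropy of attention distributions, shaped (batch, heads, query, key)."""
    p = attn_probs.clamp_min(1e-9)
    return -(p * p.log()).sum(dim=-1).mean()
```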
Interesting paper that exploits LLMs to predict physical properties of crystalline materials from their text descriptions, outperforming GNN-based models! https://spikelab.mycpanel.princeton.edu/papers/204.pdf
Another nice paper about low-rank #LLM matrix reductions, which shows that careful choice of which
weight matrices to compress may improve reasoning: https://arxiv.org/abs/2312.13558
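The reduction itself is just a truncated SVD per weight matrix; a small sketch (the interesting part in the paper is choosing which matrices and which ranks to compress):

```python
import torch

def low_rank_compress(W: torch.Tensor, rank: int):
    """Replace a weight matrix by its best rank-r approximation: W ~ A @ B."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out, r)
    B = Vh[:rank]                # (r, in)
    return A, B                  # stored with (out + in) * r values instead of out * in
```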
To end the year, here is a very useful table of major hyperparameters used to train #LLM
compiled by Stella Biderman and reproduced in HTML at this link:
https://members.loria.fr/CCerisara/#llmHyperparms.html
Smart recipe to train (vs. finetune) a #LLM using low-rank adapters: https://arxiv.org/abs/2307.05695
It's called ReLoRA. With appropriate tricks (pruning 90% of adapter weights, LR warmup...) it's possible
to regularly restart the LoRA adapters and merge them into the base weights during pretraining... Huge potential!
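A ReLoRA-flavoured sketch of the merge-and-restart step (the function and shapes are illustrative, and the real recipe also resets optimizer state and re-warms the learning rate):

```python
import torch

def merge_and_restart_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                           r: int, scale: float = 1.0):
    """Fold the current low-rank adapter into the full weights, then
    reinitialize the adapter so successive restarts accumulate a
    higher-rank total update."""
    with torch.no_grad():
        W.add_(scale * (B @ A))                    # merge adapter into base weights
    A_new = torch.randn(r, W.shape[1]) * 0.01      # restart the adapter
    B_new = torch.zeros(W.shape[0], r)             # zero side keeps the merge lossless
    return A_new, B_new
```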
Contrastive learning of causal LM: https://arxiv.org/pdf/2210.01185.pdf
Beyond 1B parameters, representations suffer less from the "anisotropy" problem
(all embeddings in the same small cone), but contrastive learning may still
help. After all, why do encoder-only models still top the embeddings leaderboards?
ASIF: a trivial but efficient approach to create a multimodal #LLM: use a vision and a text model, label a few thousand multimodal prototypes, represent new samples as a vector of distances to these prototypes, and label with kNN. https://arxiv.org/pdf/2210.01738.pdf
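A simplified sketch of the relative representation (cosine similarities to same-modality prototypes with top-k sparsification; the paper adds further tweaks on top of this), after which image and text embeddings live in the same prototype space and kNN labeling becomes trivial:

```python
import torch
import torch.nn.functional as F

def asif_embed(x_feats: torch.Tensor, proto_feats: torch.Tensor, k: int = 100):
    """Represent samples by their (sparsified) similarities to a fixed set
    of prototype features computed with the same frozen encoder."""
    sims = F.normalize(x_feats, dim=-1) @ F.normalize(proto_feats, dim=-1).T
    topk = sims.topk(k, dim=-1)                                  # keep k strongest matches
    sparse = torch.zeros_like(sims).scatter_(-1, topk.indices, topk.values)
    return F.normalize(sparse, dim=-1)
```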