Growing LLMs: SOLAR, [LlamaPro](https://arxiv.org/abs/2401.02415), [LiGO](https://arxiv.org/abs/2303.00980), [FLM-101b](https://huggingface.co/CofeAI/FLM-101B), [dynamic architectures do not scale](https://arxiv.org/abs/2307.06440)
#LLM benchmarks are not reliable... https://arxiv.org/abs/2402.01781
transformers can extrapolate context length to 2.5x but length generalization remains fragile: https://arxiv.org/abs/2402.09371
aespa: quantize weights separately down to 2-bit: https://arxiv.org/abs/2402.08958
Information gained through finetuning may be compressed down to 1 bit per weight, suggesting that
most of the value really comes from pretraining; it also opens the way to further inference-cost reductions.
A side lesson: LoRA may not be the alpha and omega of adaptation after all...
https://arxiv.org/abs/2402.10193
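A minimal sketch of this 1-bit compression idea, assuming it boils down to keeping only the sign of the finetuning delta plus a per-matrix scale (the helper names and the mean-absolute scale are my illustration, not the paper's exact recipe, which calibrates its scales on data):

```python
import torch

def compress_delta_1bit(w_base: torch.Tensor, w_ft: torch.Tensor):
    """Keep only the sign of the finetuning delta (1 bit per weight)
    plus a single per-matrix scaling factor."""
    delta = w_ft - w_base
    scale = delta.abs().mean()   # crude per-matrix scale
    signs = torch.sign(delta)    # the 1 bit of information per weight
    return signs, scale

def reconstruct(w_base: torch.Tensor, signs: torch.Tensor, scale: torch.Tensor):
    """Approximate the finetuned weights from base + 1-bit delta."""
    return w_base + scale * signs
```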
DoRA makes LoRA better by decomposing weights into magnitude and direction, and applying LoRA only to the direction:
https://arxiv.org/abs/2402.09353
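A rough, unofficial sketch of the decomposition (the class name, rank and initialization below are illustrative, not the paper's implementation): the frozen weight is split into a trainable per-column magnitude and a direction, and only the direction receives the low-rank update.

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    """Sketch of the DoRA idea: trainable magnitude + LoRA on the direction."""

    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        W = base.weight.data                                   # (out, in), frozen
        self.register_buffer("W", W)
        self.m = nn.Parameter(W.norm(dim=0, keepdim=True))     # per-column magnitudes
        self.A = nn.Parameter(torch.randn(r, W.shape[1]) * 0.01)
        self.B = nn.Parameter(torch.zeros(W.shape[0], r))
        self.bias = base.bias

    def forward(self, x):
        W_adapted = self.W + self.B @ self.A                    # low-rank update
        direction = W_adapted / W_adapted.norm(dim=0, keepdim=True)
        return nn.functional.linear(x, self.m * direction, self.bias)
```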
Automatically learning permutation-invariant symmetry in DNNs: https://arxiv.org/abs/2402.05232
Absolutely brilliant paper that replaces backprop with 2 forward passes,
hence greatly reducing memory requirements for LLM finetuning.
Great work, both theoretically and experimentally!
https://openreview.net/forum?id=Vota6rFhBQ
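The core trick is a zeroth-order (SPSA-style) gradient estimate obtained from two perturbed forward passes; here is a simplified sketch (the function name is mine, and the real method regenerates the perturbation from a random seed instead of storing it, among other refinements):

```python
import torch

def spsa_grad_step(model, loss_fn, batch, lr=1e-6, eps=1e-3, seed=0):
    """One zeroth-order step: two forward passes, no backward pass,
    so no activations need to be stored."""
    torch.manual_seed(seed)
    zs = [torch.randn_like(p) for p in model.parameters()]

    with torch.no_grad():
        for p, z in zip(model.parameters(), zs):
            p.add_(eps * z)                      # theta + eps*z
        loss_plus = loss_fn(model, batch)

        for p, z in zip(model.parameters(), zs):
            p.sub_(2 * eps * z)                  # theta - eps*z
        loss_minus = loss_fn(model, batch)

        # directional derivative estimate: g ~ (L+ - L-) / (2*eps) along z
        g_scale = (loss_plus - loss_minus) / (2 * eps)
        for p, z in zip(model.parameters(), zs):
            p.add_(eps * z)                      # restore original weights
            p.sub_(lr * g_scale * z)             # SGD step along the perturbation
```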
Great pedagogical overview of important LLM concepts!
https://www.youtube.com/watch?v=KCXDr-UOb9A
a weak form of linear mode connectivity of SGD trained models can be achieved modulo permutation symmetry;
this applies to sequences of iteratively sparsified networks: https://openreview.net/forum?id=BwhEtJrnir
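To make the "modulo permutation" part concrete, here is a toy illustration for a 2-layer MLP (the matching criterion and function are mine, only meant to show the mechanics): permute the hidden units of model B to best match model A, then interpolate linearly.

```python
import torch
from scipy.optimize import linear_sum_assignment

def align_and_interpolate(W1_a, W1_b, W2_a, W2_b, alpha=0.5):
    """Permute B's hidden units to match A (Hungarian matching on
    first-layer weight similarity), then interpolate the weights."""
    cost = -(W1_a @ W1_b.T)                      # negative similarity between units
    _, col = linear_sum_assignment(cost.numpy())
    perm = torch.as_tensor(col)

    W1_b_p = W1_b[perm]                          # permute rows of layer 1
    W2_b_p = W2_b[:, perm]                       # permute columns of layer 2

    W1_mid = (1 - alpha) * W1_a + alpha * W1_b_p
    W2_mid = (1 - alpha) * W2_a + alpha * W2_b_p
    return W1_mid, W2_mid
```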
A Bayesian view of in-context learning: shows that ICL can generalize to unknown tasks
when the training tasks are more diverse than a given threshold: https://arxiv.org/abs/2306.15063
"sparse model soups" combine iterative magnitude pruning with model averaging (models with the same sparsity topology)
to obtain improved generalization and pruning: https://arxiv.org/abs/2306.16788
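The recipe is simple to picture: average checkpoints that share the same pruning mask, so the soup stays exactly as sparse. A minimal sketch (function and argument names are made up for illustration):

```python
import torch

def sparse_soup(state_dicts, mask):
    """Average models that share the same sparsity topology,
    then re-apply the common mask so the average stays sparse."""
    avg = {}
    for name in state_dicts[0]:
        stacked = torch.stack([sd[name] for sd in state_dicts])
        avg[name] = stacked.mean(dim=0)
        if name in mask:
            avg[name] = avg[name] * mask[name]
    return avg
```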
BitNet: https://arxiv.org/pdf/2310.11453.pdf
Very nice work, but the gradients are still full precision. I'm not sure trying to force 1-bit parameters
into the standard training pipeline is the best way to go, but still, it's an impressive result.
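"Full-precision gradients" here means the forward pass uses binarized weights while latent weights and gradients stay in full precision, via a straight-through estimator. A simplified sketch (generic binarization, not BitNet's exact quantizer or normalization):

```python
import torch
import torch.nn as nn

class BinaryLinear(nn.Linear):
    """Forward with 1-bit weights, backward through a straight-through estimator."""

    def forward(self, x):
        w = self.weight
        w_bin = torch.sign(w) * w.abs().mean()     # 1-bit weights + one scale
        w_ste = w + (w_bin - w).detach()           # STE: gradient flows to latent w
        return nn.functional.linear(x, w_ste, self.bias)
```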
Stabilize PyTorch training by tracking the attention entropy: https://github.com/apple/ml-sigma-reparam
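Tracking attention entropy is cheap to do yourself; a possible monitoring helper (not the repo's API), where collapsing entropy is the warning sign:

```python
import torch

def attention_entropy(attn_probs: torch.Tensor) -> torch.Tensor:
    """Mean entropy of attention distributions, shaped (batch, heads, query, key)."""
    p = attn_probs.clamp_min(1e-9)
    return -(p * p.log()).sum(dim=-1).mean()
```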
Interesting paper that exploits LLMs to predict physical properties of crystalline materials from their text descriptions, outperforming GNN-based models! https://spikelab.mycpanel.princeton.edu/papers/204.pdf
Another nice paper about low-rank #LLM matrix reductions, which shows that careful choice of which
weight matrices to compress may improve reasoning: https://arxiv.org/abs/2312.13558
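The reduction itself is just a truncated SVD per weight matrix; a small sketch (the interesting part in the paper is choosing which matrices and which ranks to compress):

```python
import torch

def low_rank_compress(W: torch.Tensor, rank: int):
    """Replace a weight matrix by its best rank-r approximation: W ~ A @ B."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out, r)
    B = Vh[:rank]                # (r, in)
    return A, B                  # stored with (out + in) * r values instead of out * in
```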
To end the year, here is a very useful table of major hyperparameters used to train #LLM
compiled by Stella Biderman and reproduced in HTML at this link:
https://members.loria.fr/CCerisara/#llmHyperparms.html
Smart recipe to train (vs. finetune) a #LLM using low-rank adapters: https://arxiv.org/abs/2307.05695
It's called ReLoRA. With appropriate tricks (pruning 90% of adapter weights, LR warmup...) it's possible
to regularly restart the LoRA adapters and merge them into the base weights during pretraining... Huge potential!
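A ReLoRA-flavoured sketch of the merge-and-restart step (the function and shapes are illustrative, and the real recipe also resets optimizer state and re-warms the learning rate):

```python
import torch

def merge_and_restart_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                           r: int, scale: float = 1.0):
    """Fold the current low-rank adapter into the full weights, then
    reinitialize the adapter so successive restarts accumulate a
    higher-rank total update."""
    with torch.no_grad():
        W.add_(scale * (B @ A))                    # merge adapter into base weights
    A_new = torch.randn(r, W.shape[1]) * 0.01      # restart the adapter
    B_new = torch.zeros(W.shape[0], r)             # zero side keeps the merge lossless
    return A_new, B_new
```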
Contrastive learning of causal LM: https://arxiv.org/pdf/2210.01185.pdf
Beyond 1B parameters, representations suffer less from the "anisotropy" problem
(all embeddings in the same small cone), but contrastive learning may still
help. After all, why do encoder-only models still top the embeddings leaderboards?
ASIF: a trivial but efficient approach to create a multimodal #LLM: use a vision and a text model, label a few thousand multimodal prototypes, represent new samples as a vector of distances to these prototypes, and label with kNN. https://arxiv.org/pdf/2210.01738.pdf
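A simplified sketch of the relative representation (cosine similarities to same-modality prototypes with top-k sparsification; the paper adds further tweaks on top of this), after which image and text embeddings live in the same prototype space and kNN labeling becomes trivial:

```python
import torch
import torch.nn.functional as F

def asif_embed(x_feats: torch.Tensor, proto_feats: torch.Tensor, k: int = 100):
    """Represent samples by their (sparsified) similarities to a fixed set
    of prototype features computed with the same frozen encoder."""
    sims = F.normalize(x_feats, dim=-1) @ F.normalize(proto_feats, dim=-1).T
    topk = sims.topk(k, dim=-1)                                  # keep k strongest matches
    sparse = torch.zeros_like(sims).scatter_(-1, topk.indices, topk.values)
    return F.normalize(sparse, dim=-1)
```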