Christophe Cerisara
cerisara@mastodon.online

Growing LLMs: SOLAR, [LlamaPro](arxiv.org/abs/2401.02415), [LiGO](arxiv.org/abs/2303.00980), [FLM-101b](huggingface.co/CofeAI/FLM-101B), [dynamic architectures do not scale](arxiv.org/abs/2307.06440)

February 17, 2024
Christophe Cerisara
cerisara@mastodon.online

benchmarks are not reliable... arxiv.org/abs/2402.01781

February 17, 2024
Christophe Cerisara
cerisara@mastodon.online

transformers can extrapolate context length to 2.5x but length generalization remains fragile: arxiv.org/abs/2402.09371

February 17, 2024
Christophe Cerisara
cerisara@mastodon.online

aespa: quantize weights separately down to 2-bit: arxiv.org/abs/2402.08958
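Not the aespa algorithm itself (which optimizes a reconstruction objective), but a minimal round-to-nearest sketch of what per-group 2-bit weight quantization looks like; shapes and group size are illustrative:

```python
import torch

def quantize_2bit_rtn(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Round-to-nearest 2-bit quantization with one scale per group of weights.
    A generic baseline, not the aespa reconstruction-based method."""
    orig_shape = w.shape
    w = w.reshape(-1, group_size)
    # 2 bits -> 4 integer levels; use the symmetric grid {-2, -1, 0, 1}
    scale = w.abs().amax(dim=1, keepdim=True) / 2 + 1e-8
    q = torch.clamp(torch.round(w / scale), min=-2, max=1)
    return (q * scale).reshape(orig_shape)

w = torch.randn(4096, 4096)
print((w - quantize_2bit_rtn(w)).abs().mean())   # average quantization error
```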

February 17, 2024
Christophe Cerisara
cerisara@mastodon.online

The information gained through finetuning may be compressed down to 1 bit per parameter, suggesting that
most of the value comes from pretraining, and further reducing inference cost.
A side lesson: LoRA may not be the alpha and omega of adaptation after all...
arxiv.org/abs/2402.10193
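A minimal sketch of the idea (keep only the sign of the finetuning delta plus one per-matrix scale); the paper also calibrates the scales, which this toy version omits:

```python
import torch

@torch.no_grad()
def one_bit_delta(w_base: torch.Tensor, w_finetuned: torch.Tensor) -> torch.Tensor:
    """Approximate the finetuned weights with the base weights plus a 1-bit
    delta: only the sign of (W_ft - W_base) is kept, rescaled by one scalar."""
    delta = w_finetuned - w_base
    scale = delta.abs().mean()               # one scalar per weight matrix
    return w_base + scale * torch.sign(delta)
```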

February 17, 2024
Christophe Cerisara
cerisara@mastodon.online

DoRA makes LoRA better by decomposing weights into a magnitude and a direction, and applying LoRA only to the direction:
arxiv.org/abs/2402.09353
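A rough sketch of that decomposition (frozen base weight, trainable magnitude vector, LoRA update on the direction only); initialization and the exact normalization axis follow the paper, so treat this as illustrative:

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    """DoRA-style layer sketch: W is rewritten as magnitude * direction, the
    LoRA update B @ A only modifies the direction, and the magnitude is a
    separate trainable vector."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.register_buffer("w0", base.weight.detach().clone())  # frozen W0, (out, in)
        self.m = nn.Parameter(self.w0.norm(dim=0))                # per-column magnitude
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        v = self.w0 + self.B @ self.A            # LoRA acts on the direction
        v = v / v.norm(dim=0, keepdim=True)      # unit-norm columns
        return x @ (self.m * v).t()              # rescale by the learned magnitude
```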

February 17, 2024
Christophe Cerisara
cerisara@mastodon.online

Automatically learning permutation-invariant symmetries in DNNs: arxiv.org/abs/2402.05232

February 17, 2024
Christophe Cerisara
cerisara@mastodon.online

Absolutely brilliant paper that replaces backprop with 2 forward passes,
hence greatly reducing the memory requirements of LLM finetuning.
Great work, both theoretically and experimentally!
openreview.net/forum?id=Vota6r
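The general flavour of such memory-light training is a zeroth-order (SPSA-style) estimate: perturb the weights with shared random noise, compare the losses of two forward passes, and step along the noise. A generic sketch of that idea, not necessarily the paper's exact algorithm; loss_fn is an assumed helper returning a scalar loss:

```python
import torch

def spsa_step(model, loss_fn, batch, lr=1e-6, eps=1e-3, seed=0):
    """One zeroth-order update: estimate a directional gradient from two
    forward passes on symmetrically perturbed weights, then step along the
    perturbation direction. No backward pass, so no activation storage."""
    params = [p for p in model.parameters() if p.requires_grad]

    def perturb(scale):
        torch.manual_seed(seed)                  # regenerate the same noise every call
        for p in params:
            p.data.add_(scale * eps * torch.randn_like(p))

    with torch.no_grad():
        perturb(+1); loss_plus = loss_fn(model, batch)    # f(w + eps*z)
        perturb(-2); loss_minus = loss_fn(model, batch)   # f(w - eps*z)
        perturb(+1)                                       # restore w
        g = (loss_plus - loss_minus) / (2 * eps)          # projected gradient estimate
        torch.manual_seed(seed)
        for p in params:
            p.data.add_(-lr * g * torch.randn_like(p))    # step along z
```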

February 09, 2024
Christophe Cerisara
cerisara@mastodon.online

Great pedagogical overview of important LLM concepts!
youtube.com/watch?v=KCXDr-UOb9

January 31, 2024
Christophe Cerisara
cerisara@mastodon.online

a weak form of linear mode connectivity of SGD-trained models can be achieved modulo permutation symmetry;
this also applies to sequences of iteratively sparsified networks: openreview.net/forum?id=BwhEtJ
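To make the "modulo permutation" part concrete, here is a toy weight-matching step for a single hidden layer, assuming scipy is available; the actual methods match all layers jointly and iterate:

```python
import torch
from scipy.optimize import linear_sum_assignment

def permute_to_match(w1_a, w1_b, w2_b):
    """Permute the hidden units of model B so its first-layer rows best match
    model A's (Hungarian matching), then apply the same permutation to the
    columns of B's second layer. Toy single-hidden-layer version; after this,
    interpolating A and the permuted B should cross a much lower loss barrier."""
    cost = -(w1_a @ w1_b.t()).numpy()           # negative similarity as matching cost
    _, col = linear_sum_assignment(cost)
    perm = torch.as_tensor(col)
    return w1_b[perm], w2_b[:, perm]            # B expressed in A's unit ordering
```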

January 17, 2024
Christophe Cerisara
cerisara@mastodon.online

A Bayesian view of in-context learning: shows that ICL can generalize to unknown tasks
when the training tasks are more diverse than a given threshold: arxiv.org/abs/2306.15063

January 17, 2024
Christophe Cerisara
cerisara@mastodon.online

"sparse model soups" combine iterative magnitude pruning with model averaging (models with the same sparsity topology)
to obgain improved generalization and pruning: arxiv.org/abs/2306.16788
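A sketch of the averaging step under a shared sparsity mask (names are illustrative); the paper's recipe interleaves this with iterative magnitude-pruning rounds:

```python
import torch

@torch.no_grad()
def sparse_soup(state_dicts, masks):
    """Average several checkpoints that share the same pruning mask, keeping
    pruned entries at exactly zero so the sparsity topology is preserved."""
    soup = {}
    for name in state_dicts[0]:
        avg = torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
        soup[name] = avg * masks[name] if name in masks else avg
    return soup
```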

January 17, 2024
Christophe Cerisara
cerisara@mastodon.online

BitNet: arxiv.org/pdf/2310.11453.pdf
Very nice work, but gradients are still full precision. I'm not sure that forcing 1-bit parameters
into the standard training pipeline is the best way to go, but it's an impressive result nonetheless.
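The "gradients are full precision" point comes from the straight-through estimator used to train binary weights; a simplified layer illustrating that (not the exact BitNet layer, which also normalizes activations):

```python
import torch
import torch.nn as nn

class BitLinear(nn.Module):
    """Toy 1-bit linear layer: the forward pass uses binarized weights
    sign(W) * scale, while gradients flow to the full-precision latent
    weights through a straight-through estimator."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        w = self.weight
        w_bin = w.abs().mean() * torch.sign(w)   # 1-bit weights plus one scale
        w_ste = w + (w_bin - w).detach()         # forward: w_bin, backward: grad wrt w
        return x @ w_ste.t()
```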

January 13, 2024
Christophe Cerisara
cerisara@mastodon.online

Stabilize PyTorch training by tracking the attention entropy: github.com/apple/ml-sigma-repa
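A minimal way to monitor that signal (the linked repo goes further and reparameterizes the weights); attention shapes are assumed to be the usual (batch, heads, query, key):

```python
import torch

def attention_entropy(attn_probs: torch.Tensor) -> torch.Tensor:
    """Mean entropy of the attention distributions; entropy collapsing
    toward 0 is a useful early-warning signal for training instability."""
    ent = -(attn_probs * (attn_probs + 1e-9).log()).sum(dim=-1)
    return ent.mean()
```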

January 11, 2024
Christophe Cerisara
cerisara@mastodon.online

Interesting paper that exploits LLMs to predict physical properties of crystalline materials from their text descriptions, outperforming GNN-based models! spikelab.mycpanel.princeton.ed

January 09, 2024
Christophe Cerisara
cerisara@mastodon.online

Another nice paper about low-rank matrix reductions, which shows that careful choice of which
weight matrices to compress may improve reasoning: arxiv.org/abs/2312.13558
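The basic operation is just a truncated SVD of a chosen weight matrix; the paper's contribution is about which matrices (and which ranks) to target:

```python
import torch

def low_rank_replace(w: torch.Tensor, rank: int) -> torch.Tensor:
    """Replace a weight matrix by its best rank-`rank` approximation
    (truncated SVD). Applied selectively, this can even improve reasoning."""
    U, S, Vh = torch.linalg.svd(w, full_matrices=False)
    return U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank]
```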

December 31, 2023
Christophe Cerisara
cerisara@mastodon.online

To end the year, here is a very useful table of the major hyperparameters used to train LLMs,
compiled by Stella Biderman and reproduced in HTML at this link:
members.loria.fr/CCerisara/#ll

December 31, 2023
Christophe Cerisara
cerisara@mastodon.online

Smart recipe to train (vs. finetune) an LLM using low-rank adapters: arxiv.org/abs/2307.05695
It's called ReLoRA. With appropriate tricks (pruning 90% of the adapter weights, LR warmup...), it is possible
to regularly restart the LoRA adapters and merge them into the base weights during pretraining... Huge potential!
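A sketch of the restart/merge step only (the optimizer-state pruning and LR re-warmup tricks are omitted); tensor names are illustrative:

```python
import torch

@torch.no_grad()
def relora_restart(base_weight, lora_A, lora_B, scaling=1.0):
    """Fold the current low-rank update into the frozen base weight, then
    reset the adapter so the next cycle learns a fresh low-rank direction."""
    base_weight += scaling * (lora_B @ lora_A)        # merge B @ A into W
    torch.nn.init.kaiming_uniform_(lora_A, a=5**0.5)  # re-init A as at the start
    lora_B.zero_()                                    # B = 0: merged model is unchanged
```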

December 22, 2023
Christophe Cerisara
cerisara@mastodon.online

Contrastive learning of causal LMs: arxiv.org/pdf/2210.01185.pdf
Beyond 1B parameters, representations suffer less from the "anisotropy" problem
(all embeddings squeezed into a small cone), but contrastive learning may still
help. After all, why do encoder-only models still top the embedding leaderboards?
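For reference, the usual in-batch (InfoNCE-style) contrastive objective applied to, e.g., last-token hidden states; a generic sketch, not the paper's exact setup:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(h_anchor, h_positive, temperature=0.05):
    """Each anchor embedding must be closest to its own positive among all
    positives in the batch; this spreads embeddings out and counteracts
    anisotropy."""
    a = F.normalize(h_anchor, dim=-1)
    p = F.normalize(h_positive, dim=-1)
    logits = a @ p.t() / temperature                 # (batch, batch) cosine similarities
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)
```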

December 21, 2023
Christophe Cerisara
cerisara@mastodon.online

ASIF: a trivial but effective approach to build a multimodal model: take a vision model and a text model, label a few thousand multimodal prototypes, represent new samples as a vector of distances to these prototypes, and label them with kNN. arxiv.org/pdf/2210.01738.pdf
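A toy version of that relative-representation idea (similarities to a fixed prototype set, then kNN over the labelled set); all names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def relative_repr(emb, proto_emb):
    """Represent each sample by its cosine similarities to the prototype set,
    the space shared by both modalities."""
    return F.normalize(F.normalize(emb, dim=-1) @ F.normalize(proto_emb, dim=-1).t(), dim=-1)

def knn_label(rel_query, rel_labeled, labels, k=5):
    """Majority vote over the k nearest labelled samples in the relative space."""
    _, idx = (rel_query @ rel_labeled.t()).topk(k, dim=-1)
    return torch.mode(labels[idx], dim=-1).values
```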

December 17, 2023