Christophe Cerisara
cerisara@mastodon.online

On big LLMs:
- tech report about Gemini 1.5: arxiv.org/abs/2403.05530
- Command-R, Cohere's 35B LLM with 10 languages and 128k context: huggingface.co/CohereForAI/c4a
- tiny benchmarks: t.co/lcL9thyZzS

March 14, 2024
Christophe Cerisara
cerisara@mastodon.online

RAT revises each CoT step with RAG, reducing hallucinations and improving reasoning: arxiv.org/pdf/2403.05313.pdf
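A hedged, pseudocode-style sketch of the RAT loop as described above: draft a chain of thought, then revise each step with retrieved evidence. `llm` and `retrieve` are hypothetical stand-ins for an LLM call and a RAG retriever, not functions from the paper's code.

```python
def rat(question, llm, retrieve, n_steps=4):
    # 1) draft an initial chain of thought
    draft = llm(f"Answer step by step ({n_steps} steps): {question}")
    steps = draft.split("\n")[:n_steps]

    # 2) revise each step using documents retrieved for the draft so far
    revised = []
    for step in steps:
        context = "\n".join(revised)
        docs = retrieve(f"{question}\n{context}\n{step}")
        revised.append(llm(
            f"Question: {question}\nVerified steps so far:\n{context}\n"
            f"Candidate step: {step}\nEvidence:\n{docs}\n"
            "Rewrite the candidate step so it is consistent with the evidence."
        ))

    # 3) produce the final answer from the revised chain of thought
    return llm(f"Question: {question}\nReasoning:\n" + "\n".join(revised)
               + "\nFinal answer:")
```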

March 14, 2024
Christophe Cerisara
cerisara@mastodon.online

Great work that studies the impact of LR re-warmup, re-decay, and rehearsal for continued pretraining of an LLM: arxiv.org/abs/2403.08763
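A minimal sketch of the ingredients the paper studies, assuming a standard warmup-then-cosine schedule and a small replay share; the schedule shape and the 5% rehearsal ratio are illustrative choices, not the paper's exact recipe.

```python
import math
import random

def rewarmup_cosine_lr(step, total_steps, max_lr=3e-4, min_lr=3e-5, warmup=1000):
    # re-warmup: ramp the LR back up linearly, then re-decay it with a cosine
    if step < warmup:
        return min_lr + (max_lr - min_lr) * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def next_batch(new_data, old_data, replay_ratio=0.05):
    # rehearsal: with probability replay_ratio, draw the batch from the
    # original pretraining corpus instead of the new one
    source = old_data if random.random() < replay_ratio else new_data
    return random.choice(source)

lrs = [rewarmup_cosine_lr(s, total_steps=10_000) for s in range(10_000)]
print(lrs[0], max(lrs), lrs[-1])   # starts low, peaks after warmup, decays back
```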

March 14, 2024
Christophe Cerisara
cerisara@mastodon.online

A memory-efficient side network can be combined with a long-term memory, giving LongMem:
proceedings.neurips.cc/paper_f

March 10, 2024
Christophe Cerisara
cerisara@mastodon.online

Don't backprop through the main transformer, only through a side network, which greatly
reduces the memory footprint: enter LST (Ladder Side-Tuning): arxiv.org/abs/2206.06522
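A minimal PyTorch sketch of the LST idea: run the frozen backbone under no_grad (so no activations are stored for backprop) and train only a small side network fed by downsampled backbone states. The toy backbone, dimensions, and gating are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LSTSideNetwork(nn.Module):
    def __init__(self, backbone_layers, d_model=768, d_side=96, n_classes=2):
        super().__init__()
        self.backbone_layers = backbone_layers          # large, kept frozen
        for p in self.backbone_layers.parameters():
            p.requires_grad_(False)
        n = len(backbone_layers)
        self.down = nn.ModuleList([nn.Linear(d_model, d_side) for _ in range(n)])
        self.side = nn.ModuleList([nn.Linear(d_side, d_side) for _ in range(n)])
        self.gate = nn.Parameter(torch.zeros(n))        # learned mixing gates
        self.head = nn.Linear(d_side, n_classes)        # toy task head

    def forward(self, x):
        # 1) run the frozen backbone without building a computation graph
        hiddens = []
        with torch.no_grad():
            h = x
            for layer in self.backbone_layers:
                h = layer(h)
                hiddens.append(h)
        # 2) ladder: at each depth, mix the side state with the (detached)
        #    downsampled backbone state
        s = self.down[0](hiddens[0])
        for i in range(1, len(hiddens)):
            g = torch.sigmoid(self.gate[i])
            s = g * self.side[i](s) + (1 - g) * self.down[i](hiddens[i])
        return self.head(s.mean(dim=1))

backbone = nn.ModuleList([nn.TransformerEncoderLayer(768, 12, batch_first=True)
                          for _ in range(4)])
model = LSTSideNetwork(backbone)
loss = model(torch.randn(2, 16, 768)).sum()
loss.backward()   # gradients (and stored activations) exist only in the side net
```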

Related but not yet published: LIFT, layer-wise finetuning: openreview.net/forum?id=u0INlp

March 10, 2024
Christophe Cerisara
cerisara@mastodon.online

Grokking, i.e., delayed generalization and delayed robustness, is much more frequent than expected:
arxiv.org/pdf/2402.15555.pdf
Starting from the fact that DNNs decompose the input space with splines, they derive a complexity
measure (linked to the VC dimension) from the local density of splines.
With this measure they observe 3 training phases: decrease, increase (learning), decrease (compression).
It's linked to circuits; is it related to Tishby's information bottleneck?
Conclusion: don't use early stopping or batch norm!

March 06, 2024
Christophe Cerisara
cerisara@mastodon.online

Nice theoretical paper about generalization bounds for LLMs:
arxiv.org/pdf/2312.17173.pdf
They pretrain the LLM in a compressed subspace (LoRA); their bound improves with LLM size,
which matches empirical evidence but still lacks a theoretical explanation.
They also show that their bounds degrade when the text is shuffled, which indicates that the
temporal structure of text is exploited for generalization/compression.

March 06, 2024
Christophe Cerisara
cerisara@mastodon.online

Bonito: a tool to transform unstructured text into task-specific instructions for
instruction tuning of an LLM: github.com/BatsResearch/bonito
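A usage sketch recalled from the project's README; the class name, the `generate_tasks` signature, and the example dataset are assumptions from memory, so check the repo for the current API.

```python
from bonito import Bonito
from vllm import SamplingParams
from datasets import load_dataset

# load the Bonito model and some unannotated text (assumed example dataset)
bonito = Bonito("BatsResearch/bonito-v1")
unannotated_text = load_dataset(
    "BatsResearch/bonito-experiment", "unannotated_contract_nli"
)["train"].select(range(10))

# turn the raw text into synthetic NLI instruction-tuning examples
sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    unannotated_text,
    context_col="input",
    task_type="nli",
    sampling_params=sampling_params,
)
```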

March 06, 2024
Christophe Cerisara
cerisara@mastodon.online

Anthropic's proposal to interpret LLM: a theory based on circuits:
transformer-circuits.pub/

March 06, 2024
Christophe Cerisara
cerisara@mastodon.online

A generic pretrained model for time series?
arxiv.org/abs/2403.00131

March 06, 2024
Christophe Cerisara
cerisara@mastodon.online

High-level review of methods to mitigate LLM hallucinations:
amatriain.net/blog/images/Miti

Very likely not complete though.

March 06, 2024
Christophe Cerisara
cerisara@mastodon.online

It's better to prune later layers but not the last ones: arxiv.org/abs/2310.05175v2
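A tiny sketch of the heuristic stated above (illustrative, not the paper's algorithm): ramp the sparsity up with depth, but spare the last couple of layers.

```python
def layerwise_sparsity(n_layers, base=0.2, peak=0.7, protect_last=2):
    ratios = []
    for i in range(n_layers):
        if i >= n_layers - protect_last:
            ratios.append(base)                         # spare the last layers
        else:
            frac = i / max(1, n_layers - protect_last - 1)
            ratios.append(base + (peak - base) * frac)  # prune deeper layers more
    return ratios

# sparsity ramps from 0.2 up to 0.7, then drops back to 0.2 for the last 2 layers
print([round(r, 2) for r in layerwise_sparsity(12)])
```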

March 05, 2024
Christophe Cerisara
cerisara@mastodon.online

LoSparse: low-rank decomposition combined with pruning, so that each compensates for the drawbacks of the other.
Nice idea, but only tested on BERT-like models so far.
proceedings.mlr.press/v202/li2
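A minimal numpy sketch of the underlying idea, assuming a post-hoc decomposition (LoSparse itself learns this during training): approximate a weight matrix as a low-rank part plus a sparse residual, so each term captures what the other misses. Rank and sparsity values are illustrative.

```python
import numpy as np

def losparse_like_decompose(W, rank=8, keep_ratio=0.05):
    # low-rank part from a truncated SVD
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    # sparse part: keep only the largest-magnitude residual entries
    residual = W - low_rank
    k = max(1, int(keep_ratio * residual.size))
    threshold = np.sort(np.abs(residual), axis=None)[-k]
    sparse = np.where(np.abs(residual) >= threshold, residual, 0.0)
    return low_rank, sparse

W = np.random.randn(256, 256)
L, S = losparse_like_decompose(W)
err = np.linalg.norm(W - (L + S)) / np.linalg.norm(W)
print(f"kept {np.count_nonzero(S)} sparse weights, relative error {err:.3f}")
```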

February 19, 2024
Christophe Cerisara
cerisara@mastodon.online

Unstructured pruning is slower than structured pruning... But we can be faster by rewriting the
dot product and pruning at the bit level, which is a finer granularity than the weight level:
openreview.net/pdf?id=YUDiZcZT
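A hedged numpy sketch of the bit-level rewrite (my illustration of the idea, not the paper's method): quantize weights to integers, expand the dot product over bit planes, and mask individual bits rather than whole weights.

```python
import numpy as np

def bitplane_dot(w_int, x, bits=8, bit_mask=None):
    # w.x = sum_b 2^b * (plane_b . x), where plane_b holds the b-th bit of
    # every weight; a (bits x n) mask can zero individual bits, a finer
    # pruning granularity than removing whole weights
    planes = np.stack([(w_int >> b) & 1 for b in range(bits)]).astype(x.dtype)
    if bit_mask is not None:
        planes = planes * bit_mask
    return (2.0 ** np.arange(bits)) @ (planes @ x)

rng = np.random.default_rng(0)
w_int = rng.integers(0, 256, size=64)       # toy unsigned 8-bit weights
x = rng.standard_normal(64)

exact = bitplane_dot(w_int, x)              # equals the plain dot product
mask = np.ones((8, 64))
mask[:3] = 0.0                              # e.g. drop the 3 low-order bit planes
approx = bitplane_dot(w_int, x, bit_mask=mask)
print(exact, float(w_int @ x), approx)
```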

February 19, 2024
Christophe Cerisara
cerisara@mastodon.online

Nice blog post explaining alignment, finetuning and pretraining, and
compiling evidence that the important capabilities all come from large-scale pretraining:
jingfengyang.github.io/alignme

February 19, 2024
Christophe Cerisara
cerisara@mastodon.online

- Embeddings optimistically store 1 byte per dimension (768 dims = 768 B). With optimal lossless text compression (~3.5x), that means ~2688 B (768 × 3.5) of text can be compressed into a 768-dim embedding.
- The average token is ~5 chars, and each char is ~1 B. So a 768-dim embedding can encode ~540 tokens (2688/5) before the quality of the stored information drops: far less than 32k! (worked out in the sketch below)
- Increasing the embedding dimension isn't a solution: embedding models struggle to encode more than one topic per embedding effectively. Longer text → more topics → worse embeddings.
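The back-of-the-envelope arithmetic above, worked out explicitly; all constants (1 byte per dim, 3.5x compression, 5 chars per token) are the post's rough assumptions, not measured values.

```python
bytes_per_dim = 1            # optimistic storage per embedding dimension
dims = 768
compression = 3.5            # assumed lossless text compression ratio
chars_per_token = 5
bytes_per_char = 1

embed_bytes = dims * bytes_per_dim                       # 768 B
text_bytes = embed_bytes * compression                   # ~2688 B of raw text
tokens = text_bytes / (chars_per_token * bytes_per_char)
print(f"~{tokens:.0f} tokens per {dims}-dim embedding")  # ~538, far below 32k
```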

February 17, 2024
Christophe Cerisara
cerisara@mastodon.online

long context: [LongLM](arxiv.org/abs/2401.01325), [activation beacon](arxiv.org/abs/2401.03462), [large World models](huggingface.co/LargeWorldModel)

But 32k context for embeddings is debatable (see the next post recapping the SN discussion between Nils Reimers and Kunal Tangri).

February 17, 2024
Christophe Cerisara
cerisara@mastodon.online

Could pruning reduce hallucinations? arxiv.org/abs/2311.09335

February 17, 2024