On big LLMs:
- tech report about Gemini 1.5: https://arxiv.org/abs/2403.05530
- CMD-R from Cohere: a 35B, 10-language, 128k-context LLM: https://huggingface.co/CohereForAI/c4ai-command-r-v01
- tiny benchmarks: https://t.co/lcL9thyZzS
RAT revises each CoT step with RAG, reducing hallucinations and improving reasoning: https://arxiv.org/pdf/2403.05313.pdf
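A minimal sketch of the RAT loop as I understand it (hypothetical `llm` and `retrieve` callables, not the authors' code): draft a chain of thought, then revise each step against retrieved evidence before answering.

```python
def rat_answer(question, llm, retrieve):
    """Retrieval-Augmented Thoughts (sketch): draft a CoT, then revise
    each step using retrieved evidence, then answer from the revised CoT."""
    steps = llm(f"Think step by step:\n{question}").split("\n")
    revised = []
    for step in steps:
        docs = retrieve(step)  # fetch passages relevant to this single step
        step = llm(f"Revise this reasoning step using the evidence.\n"
                   f"Step: {step}\nEvidence: {docs}")
        revised.append(step)
    return llm(f"Question: {question}\n"
               f"Reasoning: {' '.join(revised)}\nAnswer:")
```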
Great work studying the impact of LR re-warming, re-decaying, and rehearsal when continuing to pretrain an LLM: https://arxiv.org/abs/2403.08763
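My reading of the recipe as a sketch (all constants made up): re-warm the LR, re-decay it over the new corpus, and mix in a small fraction of old-distribution data as rehearsal.

```python
import math

def continual_lr(step, warmup=1_000, total=100_000, lr_max=3e-4, lr_min=3e-5):
    """Linear re-warmup then cosine re-decay when resuming pretraining
    on a new corpus (hypothetical constants)."""
    if step < warmup:
        return lr_max * step / warmup                 # linear re-warmup
    t = (step - warmup) / max(1, total - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

REPLAY_FRACTION = 0.05  # rehearsal: ~5% of each batch drawn from the old corpus
```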
A memory-efficient side network can be combined with a long-term memory, giving the LongMem #LLM:
https://proceedings.neurips.cc/paper_files/paper/2023/file/ebd82705f44793b6f9ade5a669d0f0bf-Paper-Conference.pdf
#LLM Don't backprop through the main transformer, but only through a small side network, which greatly
reduces the memory footprint: enter LST (Ladder Side-Tuning): https://arxiv.org/abs/2206.06522
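A minimal PyTorch sketch of the idea, under my assumptions about the shapes (the real LST taps every backbone block via ladder connections; here a single tap and a GRU side network keep it short):

```python
import torch
import torch.nn as nn

class SideTuned(nn.Module):
    """Frozen backbone + small trainable side network.  Backbone outputs are
    detached, so backprop never enters the backbone: only the side network's
    (small) activations are stored for the backward pass."""
    def __init__(self, backbone, d_model, d_side, n_classes):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)               # frozen, no grads stored
        self.down = nn.Linear(d_model, d_side)    # ladder-style projection
        self.side = nn.GRU(d_side, d_side, batch_first=True)
        self.head = nn.Linear(d_side, n_classes)

    def forward(self, x):
        with torch.no_grad():
            h = self.backbone(x)                  # assumed (B, T, d_model)
        z, _ = self.side(self.down(h.detach()))
        return self.head(z[:, -1])
```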
Related but not yet published: LIFT, layer-wise finetuning: https://openreview.net/forum?id=u0INlprg3U
Grokking, i.e., delayed generalization and delayed robustness, is much more frequent than expected:
https://arxiv.org/pdf/2402.15555.pdf
Starting from the fact that DNNs partition the input space into linear regions (splines), they derive
a complexity measure (linked to VC dimension) from the local density of these regions.
With it they observe 3 training phases: decrease, increase (learning), decrease (compression);
it's linked to circuits. Is it related to Tishby's information bottleneck?
Conclusion: don't use early stopping or batch norm!
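A hedged Monte-Carlo proxy for that local complexity (my approximation, not the paper's exact estimator): count distinct ReLU activation patterns, i.e. linear regions, hit in a small ball around a point.

```python
import torch
import torch.nn as nn

def local_complexity(layers, x, radius=0.05, n_samples=256):
    """Approximate local spline density: number of distinct ReLU activation
    patterns (= linear regions) seen under random perturbations around x."""
    patterns = set()
    for _ in range(n_samples):
        h = x + radius * torch.randn_like(x)
        signs = []
        for layer in layers:              # e.g. [Linear, ReLU, Linear, ReLU]
            h = layer(h)
            if isinstance(layer, nn.ReLU):
                signs.append(h > 0)       # activation pattern of this layer
        patterns.add(tuple(torch.cat([s.flatten() for s in signs]).tolist()))
    return len(patterns)
```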
The foundation model development cheatsheet:
https://github.com/allenai/fm-cheatsheet/blob/main/app/resources/paper.pdf
Nice theoretical paper about generalization bounds for LLMs:
https://arxiv.org/pdf/2312.17173.pdf
They pretrain the LLM in a compressed subspace (LoRA); their bound improves with LLM size,
which matches empirical evidence but still lacks a theoretical explanation.
They also show that their bounds degrade when the text is shuffled, showing that the temporal
structure of text is exploited for generalization/compression.
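The bound has the usual compression flavor (schematic form from memory, not the paper's exact statement): with $C(h)$ the compressed description length of the LoRA-restricted hypothesis and $n$ samples,

```latex
\Pr\!\left[\, R(h) \;\le\; \hat{R}_n(h) + \sqrt{\frac{C(h) + \log(1/\delta)}{2n}} \,\right] \;\ge\; 1 - \delta
```

so the smaller the compressed subspace relative to the data, the tighter the bound.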
Bonito: a tool to transform unstructured text into task-specific instructions for
instruction tuning of an LLM: https://github.com/BatsResearch/bonito
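Usage is roughly this, from memory of the README (exact signatures may differ, treat as an assumption):

```python
from bonito import Bonito
from vllm import SamplingParams
from datasets import load_dataset

bonito = Bonito("BatsResearch/bonito-v1")
raw = load_dataset("BatsResearch/bonito-experiment",
                   "unannotated_contract_nli")["train"].select(range(10))
synthetic = bonito.generate_tasks(
    raw, context_col="input", task_type="nli",
    sampling_params=SamplingParams(max_tokens=256, temperature=0.5, top_p=0.95))
```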
Anthropic's proposal to interpret LLM: a theory based on circuits:
https://transformer-circuits.pub/
A generic pretrained model for time series?
https://arxiv.org/abs/2403.00131
High-level review of methods to mitigate LLM hallucinations:
https://amatriain.net/blog/images/Mitigating_Hallucinations.pdf
Very likely not complete though.
TableLlama: https://arxiv.org/abs/2311.09206
It's better to prune later layers but not the last ones: https://arxiv.org/abs/2310.05175v2
LoSparse: low-rank decomposition combined with pruning, so that each compensates for the other's drawbacks.
Nice idea, but only tested on BERT-like models so far.
https://proceedings.mlr.press/v202/li23ap/li23ap.pdf
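A one-shot sketch of the decomposition, W ≈ low-rank + sparse residual (the actual method trains both parts jointly with iterative pruning; this is just the idea):

```python
import torch

def losparse_decompose(W, rank=8, keep=0.05):
    """LoSparse-style factorization sketch: truncated SVD captures the smooth
    low-rank structure; the residual is magnitude-pruned to stay sparse."""
    U, s, Vh = torch.linalg.svd(W, full_matrices=False)
    low_rank = U[:, :rank] @ torch.diag(s[:rank]) @ Vh[:rank]
    residual = W - low_rank
    k = max(1, int(keep * residual.numel()))
    thresh = residual.abs().flatten().topk(k).values[-1]
    S = torch.where(residual.abs() >= thresh, residual,
                    torch.zeros_like(residual))
    return low_rank, S   # W ≈ low_rank + S
```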
Unstructured #LLM pruning is slower than structured pruning... But we can be faster by rewriting the
dot product and pruning at the bit level, a granularity even finer than individual weights:
https://openreview.net/pdf?id=YUDiZcZTI8
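The trick in toy form, as I understand it (unsigned quantized weights for simplicity; the paper handles the general case): decompose weights into bit planes, so the dot product becomes a weighted sum of cheap binary dot products, and pruning can drop individual bits.

```python
import numpy as np

def bitplane_dot(x, w_int, n_bits=8, keep_bits=None):
    """x . w for unsigned n_bits integer weights, rewritten over bit planes:
    x . w = sum_b 2^b * (x . bit_b(w)).  keep_bits prunes whole planes;
    zeroing single entries of a plane prunes below the weight level."""
    acc = 0.0
    for b in (keep_bits or range(n_bits)):
        plane = (w_int >> b) & 1        # 0/1 mask: b-th bit of each weight
        acc += (1 << b) * np.dot(x, plane)
    return acc

x = np.array([0.5, -1.0, 2.0])
w = np.array([3, 5, 2], dtype=np.uint8)
assert np.isclose(bitplane_dot(x, w), np.dot(x, w))
```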
Nice blog post explaining #LLM alignment, finetuning, and pretraining, and
compiling evidence that important capabilities all come from large-scale pretraining:
https://jingfengyang.github.io/alignment
- Embeddings optimistically store 1 byte per dimension (768 dims = 768 B). With optimal loss-free text compression (~3.5x), this means ~2688 B of text (768 * 3.5) can be compressed into a 768-dim embedding.
- The average token is ~5 chars, and each char is ~1 B. So a 768-dim embedding can encode ~540 tokens (2688/5) before storage quality drops: far less than 32k! (Back-of-envelope check below.)
- Increasing the embedding dimension isn't a solution: embedding models struggle to encode more than one topic per embedding, and longer text means more topics, hence worse embeddings.
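Back-of-envelope check of those numbers (same assumptions as the bullets above):

```python
dims = 768
bytes_per_dim = 1        # optimistic information content per dimension
compression = 3.5        # rough loss-free text compression ratio
chars_per_token = 5      # average token length, ~1 byte per char

text_bytes = dims * bytes_per_dim * compression   # 2688 bytes of raw text
tokens = text_bytes / chars_per_token             # ~538 tokens, far below 32k
print(text_bytes, tokens)
```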
long context: [LongLM](https://arxiv.org/abs/2401.01325), [activation beacon](https://arxiv.org/abs/2401.03462), [large World models](https://huggingface.co/LargeWorldModel)
But a 32k context for embeddings is debatable (see the recap above of the SN discussion between Nils Reimers and Kunal Tangri).
Can #LLM pruning reduce hallucinations? https://arxiv.org/abs/2311.09335