During my PhD I published at speech conferences and missed the opportunity to attend an ML-focused conference. So I was glad to finally attend NeurIPS for the first time last month. The scale of the poster sessions definitely surprised me, but I met tons of interesting people, got to join great parties, and made new friends. It was a bonus that it was in New Orleans: the city was full of amazing food, music, architecture, and people!

Posters as far as the eye can see

Jazz at the Royal Frenchmen

What follows are the papers whose ideas stood out to me the most at NeurIPS, starting with transformer-related topics and moving towards diffusion at the end.

Death by 1,000 cuts

I’ve been working on large transformer models for speech synthesis, so I was curious to find some interesting new ideas in this area. With all the progress on real-world problems using LLMs and foundation models, I really enjoyed Christopher Ré’s perspective on using foundation models to tackle “death by 1,000 cuts” problems: problems made up of individually easy tasks, but where the number, variety, and breadth of tasks makes solving the whole problem impractical, tedious, or expensive. He focused on data cleaning, a laborious job made up of many different tasks, where LLMs have proved invaluable. I think this framing of death by 1,000 cuts neatly conveys how lots of expensive or tedious tasks can now be tackled with foundation models.
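
As a toy illustration of one such cut (my own example, not his): a single zero-shot prompt can stand in for a bespoke cleaning script. Here `complete` is a hypothetical placeholder for whatever LLM API you use, and the record format is made up.

```python
def complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM text-completion call."""
    raise NotImplementedError("plug in your LLM API of choice here")

def normalise_record(raw: str) -> str:
    # One prompt replaces a hand-written parser per messy format.
    return complete(
        "Rewrite this contact record as 'Name, City, YYYY-MM-DD':\n"
        f"{raw}\nNormalised:"
    )

# normalise_record("DOE, john - nyc, 3rd Jan '21")
# might return something like: "John Doe, New York, 2021-01-03"
```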

He also dove into hardware-aware algorithms and state space models (SSMs), which, with Mamba, have recently been able to outperform attention-based LLMs at the 1.3B parameter scale. People were very excited about Mamba at NeurIPS; its core architectural change to SSMs is to make the SSM parameters input-dependent (data-controlled, like attention). I’m curious to see whether SSMs truly are SotA on the many downstream tasks transformers have been tested on (e.g. can they pass the bar), and whether SSMs will soon be picked up and deployed in products instead of transformer-based LLMs. Look out for a future post going more in depth on SSMs!
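
To make the idea concrete, here is a minimal numpy sketch of a selective SSM recurrence; the shapes and parameterisation are simplified assumptions of mine, not the exact S6 layer from the paper. The point to notice is that B, C, and the step size dt are recomputed from each input, whereas in a classic LTI SSM they are fixed.

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C, w_dt):
    """x: (L, d) inputs; A: (d, n) fixed state matrix (diagonal in Mamba).
    B, C and dt are input-dependent: the data-controlled part."""
    L, d = x.shape
    h = np.zeros_like(A)                         # per-channel state, (d, n)
    ys = np.empty_like(x)
    for t in range(L):
        dt = np.logaddexp(0.0, x[t] * w_dt)      # softplus step size, (d,)
        B = x[t] @ W_B                           # input projection, (n,)
        C = x[t] @ W_C                           # output projection, (n,)
        h = np.exp(dt[:, None] * A) * h + dt[:, None] * x[t][:, None] * B
        ys[t] = h @ C
    return ys

rng = np.random.default_rng(0)
L, d, n = 16, 8, 4
x = rng.normal(size=(L, d))
A = -np.exp(rng.normal(size=(d, n)))             # negative => stable recurrence
y = selective_ssm(x, A, 0.1 * rng.normal(size=(d, n)),
                  0.1 * rng.normal(size=(d, n)), 0.1 * rng.normal(size=d))
```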

Fickleness of in-context learning

Aaditya Singh and Stephanie Chan’s paper (UCL and DeepMind) was a great study on in-context learning (ICL). They found that, as training progresses, models learn to rely more on knowledge from pre-training and to ignore new contextual information provided at inference time.

Their method uses a synthetic task constructed from the Omniglot dataset, with two evaluations: 1) ICL eval: each class label is changed so the model can only solve the task using in-context learning (few-shot learning); and 2) IWL eval: all labels are corrupted so that in-context learning fails and the model must rely on in-weights learning (using information stored in the weights). Across various model and dataset sizes, they found that ICL capability decreases as training continues, whereas IWL capability increases (as expected).
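
A rough sketch of the two setups, with a toy dataset standing in for Omniglot (the episode format is my assumption, simplified from the paper):

```python
import random

random.seed(0)
# Toy stand-in for Omniglot: a handful of classes with a few examples each.
data = {c: [f"{c}_img{i}" for i in range(4)] for c in "abcd"}

def icl_episode(k=2):
    """ICL eval: every class gets a fresh, never-trained label, so the only
    way to answer the query is to read the in-context (example, label) pairs."""
    fresh = {c: f"novel_{i}" for i, c in enumerate(data)}
    context = [(ex, fresh[c]) for c in data for ex in random.sample(data[c], k)]
    query_class = random.choice(list(data))
    return context, random.choice(data[query_class]), fresh[query_class]

def iwl_episode(k=2):
    """IWL eval: context labels are randomly permuted so they carry no
    signal; the model can only answer from knowledge stored in its weights."""
    corrupt = dict(zip(data, random.sample(list(data), len(data))))
    context = [(ex, corrupt[c]) for c in data for ex in random.sample(data[c], k)]
    query_class = random.choice(list(data))
    return context, random.choice(data[query_class]), query_class
```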

This is particularly worrying in situations where we overtrain models. But fear not: in their experiments, applying weight decay to the MLP layers stopped models from forgetting how to do ICL!
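
In PyTorch this kind of selective weight decay is straightforward with parameter groups. A minimal sketch, assuming the FFN sub-layers are the modules named `linear1`/`linear2` (as in `nn.TransformerEncoderLayer`) and using a placeholder decay value:

```python
import torch

model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=64, nhead=4), num_layers=2)

# Split parameters: the MLP block (linear1/linear2) gets weight decay,
# attention and norm parameters do not.
mlp, rest = [], []
for name, p in model.named_parameters():
    (mlp if "linear" in name else rest).append(p)

optimizer = torch.optim.AdamW([
    {"params": mlp, "weight_decay": 0.1},   # decay only the MLP layers
    {"params": rest, "weight_decay": 0.0},
])
```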

Type 2 thinking for LLMs

There was lots of work in 2023 on prompting and generation strategies for LLMs; Shunyu Yao’s Tree of Thoughts in particular received a lot of attention. Their overall motivation was to give LLMs an alternative to pure next-token generation, which can trap the model with a subpar output: almost like Type 1 thinking. They used different tasks that are particularly hard for vanilla LLMs: 1) Game of 24: write an equation that equals 24 using the given numbers (e.g. given 4, 9, 10, 13, one solution is (13 − 9) × (10 − 4) = 24); 2) Creative writing: write a longer piece that ends with the given sentences; and 3) Mini crosswords: solve a 5x5 crossword using 10 clues.

The Tree of Thoughts (ToT) process has four steps (a minimal sketch follows the list):

  1. Define thoughts → use information about the task to decompose generation into intermediate steps, the nodes of the tree
  2. Generate thoughts → use a prompt designed for the task, e.g. for creative tasks they sample several completions with the same prompt, and for analytical tasks they request proposals for the next step given the current state
  3. Evaluate thoughts → use an LLM to rate or rank the thoughts; this can include a few lookahead steps
  4. Explore thoughts → use a search algorithm such as breadth-first or depth-first search; if the task looks unsolvable from the current state (according to the evaluator), they allow backtracking
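
Putting the four steps together, here is a minimal, model-agnostic sketch of the loop with breadth-first search. `llm` is a hypothetical stand-in for any completion API, and the prompts, beam width, and 1-10 scoring are illustrative assumptions rather than the paper's exact setup.

```python
def llm(prompt, n=1):
    """Hypothetical stand-in: returns a list of n completions."""
    raise NotImplementedError("plug in your LLM API of choice here")

def tree_of_thoughts(task, steps=3, k=4, b=5):
    frontier = [""]  # partial solutions (the thoughts so far)
    for _ in range(steps):
        # Generate: propose k candidate next thoughts per state.
        candidates = [s + "\n" + t for s in frontier
                      for t in llm(f"{task}\nSteps so far:{s}\nPropose the next step:", n=k)]
        # Evaluate: have the LLM score how promising each candidate is.
        scored = [(float(llm(f"{task}\nPartial solution:{c}\n"
                             "Rate from 1 to 10 how promising this is:")[0]), c)
                  for c in candidates]
        # Explore: keep only the b best states (breadth-first search).
        frontier = [c for _, c in
                    sorted(scored, key=lambda sc: sc[0], reverse=True)[:b]]
    return frontier[0]
```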

The results for the analytical tasks are really impressive. For the Game of 24, success goes from 7.3% to 45% using ToT, and reaches 74% when keeping the best 5 candidates during breadth-first search. For crosswords, they go from 1% using chain-of-thought (CoT) prompting to 20% using ToT. For their creative task, evaluation is harder, but in a human evaluation they found ToT outputs to be more coherent than CoT outputs.

While this approach is very general, it requires a lot of task-specific configuration. This is a feature or a limitation, depending on your perspective. It did make me think about how we could leverage tool use and have LLMs pick the best search strategy, evaluation method, or even the mode of thinking (e.g. a tree) when appropriate; maybe CoT with self-consistency is better for logical reasoning tasks. I’m sure there will be developments on making this a more general-purpose mode of generation.