The Surprising Effectiveness of Test-Time Training for Abstract Reasoning
Abstract: Language models have shown impressive performance on tasks within their training distribution, but often struggle with novel problems requiring complex reasoning. We investigate the effectiveness of test-time training (TTT)—updating model parameters temporarily during inference using a loss derived from input data—as a mechanism for improving models’ reasoning capabilities, using the Abstraction and Reasoning Corpus (ARC) as a benchmark. Through systematic experimentation, we identify three crucial components for successful TTT: (1) initial finetuning on similar tasks, (2) auxiliary task format and augmentations, and (3) per-instance training. TTT significantly improves performance on ARC tasks, achieving up to a 6× improvement in accuracy compared to base fine-tuned models; applying TTT to an 8B-parameter language model, we achieve 53% accuracy on ARC’s public validation set, improving the state of the art by nearly 25% for public and purely neural approaches. By ensembling our method with recent program generation approaches, we achieve state-of-the-art public validation accuracy of 61.9%, matching the average human score. Our findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in neural language models; additional test-time compute applied to continued training on few-shot examples can also be extremely effective.
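To make the per-instance training component concrete, here is a rough sketch of what a TTT loop can look like: it builds leave-one-out auxiliary tasks from a single ARC instance's demonstration pairs, temporarily fine-tunes a causal LM on them, and returns a snapshot of the original weights so the adaptation can be undone after prediction. The helper names (`grid_to_text`, `make_ttt_examples`, `test_time_train`), the leave-one-out augmentation scheme, and the hyperparameters are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of per-instance test-time training (TTT) for ARC-style tasks.
# Assumes `model` and `tokenizer` are a causal LM already fine-tuned on similar
# tasks; helper names and the augmentation scheme are illustrative, not the
# authors' exact implementation.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def grid_to_text(grid):
    """Serialize an ARC grid (list of lists of ints) as rows of digits."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

def make_ttt_examples(demos):
    """Leave-one-out auxiliary tasks: hold out each demonstration pair as the
    target, conditioning on the remaining pairs (a simple stand-in for the
    paper's augmentation strategy)."""
    examples = []
    for i, (held_x, held_y) in enumerate(demos):
        context = [d for j, d in enumerate(demos) if j != i]
        prompt = "".join(
            f"Input:\n{grid_to_text(x)}\nOutput:\n{grid_to_text(y)}\n\n"
            for x, y in context
        ) + f"Input:\n{grid_to_text(held_x)}\nOutput:\n"
        examples.append((prompt, grid_to_text(held_y)))
    return examples

def test_time_train(model, tokenizer, demos, steps=20, lr=1e-5):
    """Temporarily adapt the model on auxiliary tasks built from one ARC
    instance's demonstrations; returns the original weights for restoration."""
    original = copy.deepcopy(model.state_dict())  # snapshot to undo TTT later
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    examples = make_ttt_examples(demos)
    model.train()
    for step in range(steps):
        prompt, target = examples[step % len(examples)]
        enc = tokenizer(prompt + target, return_tensors="pt")
        labels = enc["input_ids"].clone()
        # Mask the prompt tokens (approximately) so the loss covers only the
        # held-out output grid.
        prompt_len = len(tokenizer(prompt)["input_ids"])
        labels[:, :prompt_len] = -100
        loss = model(**enc, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    model.eval()
    return original  # caller restores with model.load_state_dict(original)
```

After generating a prediction for the test input, the adapted weights are discarded via `model.load_state_dict(original)` so that each ARC instance is handled independently. In practice, parameter-efficient per-instance adapters (e.g., LoRA) can serve the same role as the full snapshot-and-restore shown here.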
Learn more:
- GitHub: The Surprising Effectiveness of Test-Time Training for Abstract Reasoning – MIT
- New Research Proves AGI Was Achieved… – TheAIGRID
  - Chapters: 0:00 AGI Threshold, 0:45 Arc Benchmark, 1:38 Benchmark Design, 2:23 Test Examples, 3:12 MIT Research, 4:11 Training Methods, 5:10 Search Algorithm, 5:57 Human Level, 6:38 AGI Path, 7:23 O1 Paradigm, 8:10 AlphaGo Insights, 9:28 Creative Search, 10:12 Hanabi Results, 11:46 Test Compute, 12:52 Human Efficiency, 13:54 Altman’s View, 14:39 Performance Threshold, 15:27 Final Thoughts
- MIT’s AI Discovers New Science – “Intelligence Explosion” – Matthew Berman
- Artificial Intelligence, Scientific Discovery, and Product Innovation – Aidan Toner-Rodgers, MIT, November 6, 2024
- This paper studies the impact of artificial intelligence on innovation, exploiting the randomized introduction of a new materials discovery technology to 1,018 scientists in the R&D lab of a large U.S. firm. AI-assisted researchers discover 44% more materials, resulting in a 39% increase in patent filings and a 17% rise in downstream product innovation. These compounds possess more novel chemical structures and lead to more radical inventions. However, the technology has strikingly disparate effects across the productivity distribution: while the bottom third of scientists see little benefit, the output of top researchers nearly doubles. Investigating the mechanisms behind these results, I show that AI automates 57% of “idea-generation” tasks, reallocating researchers to the new task of evaluating model-produced candidate materials. Top scientists leverage their domain knowledge to prioritize promising AI suggestions, while others waste significant resources testing false positives. Together, these findings demonstrate the potential of AI-augmented research and highlight the complementarity between algorithms and expertise in the innovative process. Survey evidence reveals that these gains come at a cost, however, as 82% of scientists report reduced satisfaction with their work due to decreased creativity and skill underutilization.