SampleMix Boosts Language Models With 50% Less Training Data
SampleMix boosts language models with 50% less training data by balancing quality and diversity at the sample level, outperforming traditional dataset-level mixing approaches.
This is a Plain English Papers summary of a research paper called SampleMix: New Data Strategy Boosts Language Models with 50% Less Training Data.

Overview

- SampleMix is a new strategy for mixing pre-training data for language models
- Balances both data quality and diversity at the sample level
- Outperforms traditional dataset-level mixing approaches
- Uses a bivariate beta distribution to coordinate quality and diversity (see the sketch below)
- Achieves significant improvements on benchmark tasks
- Reduces training data requirements by roughly half
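To make the sample-level idea concrete, here is a minimal sketch of what weighting individual documents by quality and diversity might look like. Everything in it is an illustrative assumption: the quality and diversity scores are random stand-ins, and the product of two independent Beta densities is a simplification of the bivariate beta coordination the paper describes, not its actual formulation.

```python
# Hypothetical sketch of sample-level data mixing, NOT the paper's implementation.
# Assumes each pretraining sample already has a quality score q and a
# diversity score d in [0, 1]; random values stand in for both here.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

samples = [f"doc_{i}" for i in range(10_000)]
q = rng.uniform(size=len(samples))  # stand-in quality scores
d = rng.uniform(size=len(samples))  # stand-in diversity scores

# Weight each sample by independent Beta densities over (q, d) -- a
# simplified stand-in for a bivariate beta coordinating the two axes.
a_q, b_q = 2.0, 1.0  # shape parameters favoring high quality (assumed)
a_d, b_d = 2.0, 1.0  # shape parameters favoring high diversity (assumed)
w = beta.pdf(q, a_q, b_q) * beta.pdf(d, a_d, b_d)
p = w / w.sum()

# Draw a smaller training set (here, half the pool) using those weights,
# so high-quality, high-diversity samples are kept preferentially.
budget = len(samples) // 2
chosen = rng.choice(len(samples), size=budget, replace=False, p=p)
print(f"kept {len(chosen)} of {len(samples)} samples")
```

The key contrast with dataset-level mixing is visible in the last step: instead of assigning one weight per source corpus, every individual document gets its own sampling probability, so a strong document in a weak corpus can still be kept.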