Simple SGD Method Matches Adam's Performance With Half Memory Usage
SGD-SaI enhances classic stochastic gradient descent with momentum, using half the memory of AdamW while matching or exceeding its performance, and it remains effective for large models such as Llama2-7B.
This is a Plain English Papers summary of a research paper called Simple SGD Method Matches Adam's Performance While Using Half the Memory. If you like these kinds of analyses, you should join AImodels.fyi or follow us on Twitter.

Overview

- SGD-SaI enhances classic stochastic gradient descent with momentum
- Adjusts learning rates at initialization based on gradient signal-to-noise ratios
- Uses half the memory of AdamW while matching or exceeding performance
- Effective for training Transformers, Vision Transformers, and large language models
- Reduces memory usage by up to 25GB for large models
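To make the mechanism in the overview concrete, here is a minimal PyTorch sketch of the general idea: probe each parameter's gradient signal-to-noise ratio (g-SNR) once at initialization, fix a per-parameter learning-rate scale from it, and then train with plain SGD plus momentum. This is only an illustration under stated assumptions, not the paper's implementation: the g-SNR formula, the normalization by the maximum SNR, and the names `loss_fn`, `batch`, and `make_sgd_sai_optimizer` are all placeholders chosen here for clarity.

```python
import torch


def gradient_snr(grad: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Hypothetical g-SNR proxy: "signal" (mean gradient magnitude) divided by
    # "noise" (standard deviation of the gradient entries) for one parameter
    # tensor. The paper's exact definition and partitioning may differ.
    return grad.abs().mean() / (grad.std(unbiased=False) + eps)


def lr_scales_at_init(model, loss_fn, batch, base_lr=1e-3, eps=1e-8):
    # One backward pass on an initial batch probes each parameter's g-SNR.
    # The resulting per-parameter learning rates are frozen for all of training,
    # so no Adam-style second-moment state has to be kept in memory.
    model.zero_grad()
    loss_fn(model, batch).backward()  # loss_fn(model, batch) is a placeholder closure
    snrs = {name: gradient_snr(p.grad, eps).item()
            for name, p in model.named_parameters() if p.grad is not None}
    max_snr = max(snrs.values()) + eps
    # Assumed normalization: the highest-SNR block keeps base_lr, others scale down.
    return {name: base_lr * (snr / max_snr) for name, snr in snrs.items()}


def make_sgd_sai_optimizer(model, lr_scales, momentum=0.9, weight_decay=1e-2):
    # After the one-time scaling, training is ordinary SGD with momentum: a single
    # momentum buffer per parameter, versus two moment buffers per parameter in AdamW.
    groups = [{"params": [p], "lr": lr_scales[name]}
              for name, p in model.named_parameters() if name in lr_scales]
    # The top-level lr is only a default; every group above carries its own lr.
    return torch.optim.SGD(groups, lr=1e-3, momentum=momentum, weight_decay=weight_decay)
```

In use, the scales would be computed once before the training loop and the returned optimizer stepped exactly like vanilla SGD; the memory saving comes from dropping AdamW's per-parameter second-moment buffer, which is where figures like the 25GB reduction for large models originate.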