Mike Young @mikeyoung44

Software Engineering Risks In Publicly Released LLM Weights

Large language models like Llama 2-Chat can be misused even after safety fine-tuning: researchers find it is possible to undo these safeguards for under $200.

This is a Plain English Papers summary of a research paper called BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B. If you like this kind of analysis, you should subscribe to the AImodels.fyi newsletter or follow me on Twitter.

Overview

This paper investigates the risks of publicly releasing the weights of large language models (LLMs) like Llama 2-Chat, which Meta developed and released.
The authors hypothesize that even though Meta fine-tuned Llama 2-Chat to refuse harmful outputs, bad actors could bypass these safeguards and misuse the model's capabilities.
The...