Exploring Steering Language Model Refusal With Sparse Autoencoders

  • I made a video about one of my favorite papers! I hope you enjoy :) Summary: "Applying ...
  • Sparse Autoencoders
  • State-of-the-art foundation
  • Warning: This is an ad-libbed talk, and I'm sure I got some facts wrong. This is a talk I gave to my MATS 9.0 training program on ...
  • A visual explanation of how transformers piece concepts together, told in the style of 3Blue1Brown. Introducing SAEs. What truly ...

In-Depth Information on Steering Language Model Refusal With Sparse Autoencoders

The paper explores using ...

This has been my favorite video so far to make! I think interpretability is so important both in terms of ensuring safe AI and also ...

One of the core roadblocks to understanding the computation inside a transformer is the fact that individual neurons do not seem ...

Slides: https://jinen.setpal.net/slides/sae.pdf.
