Steering Language Model Refusal With Sparse Autoencoders

Exploring Steering Language Model Refusal With Sparse Autoencoders


I made a video about one of my favorite papers! I hope you enjoy :)
Summary: "Applying ...
Sparse Autoencoders ...
State-of-the-art foundation ...
Warning: This is an ad-libbed talk, and I'm sure I got some facts wrong. This is a talk I gave to my MATS 9.0 training program on ...
A visual explanation of how transformers piece concepts together, told in the style of 3Blue1Brown. Introducing SAEs. What truly ...

In-Depth Information on Steering Language Model Refusal With Sparse Autoencoders

The paper explores using ...
This has been my favorite video so far to make! I think interpretability is so important both in terms of ensuring safe AI and also ...
One of the core roadblocks to understanding the computation inside a transformer is the fact that individual neurons do not seem ...
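To make the idea of steering with a sparse autoencoder a little more concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the class TinySAE, the function clamp_feature, the random weights, the dimensions, and the feature index are all hypothetical and chosen only for illustration. It assumes a common SAE layout (ReLU encoder, linear decoder with one direction per feature) and shows one generic way to steer: shift an activation vector along a feature's decoder direction so that the feature behaves as if it were clamped to a chosen value.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinySAE:
    """Toy sparse autoencoder (hypothetical, randomly initialised, untrained).

    encode: f = ReLU(W_enc (x - b_dec) + b_enc)
    decode: x_hat = sum_i f_i * W_dec[i] + b_dec
    """

    def __init__(self, d_model, d_features, seed=0):
        rng = random.Random(seed)
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(d_features)]
        self.b_enc = [0.0] * d_features
        # One decoder row per feature: the direction that feature writes.
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(d_features)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        centered = [xi - bi for xi, bi in zip(x, self.b_dec)]
        pre = matvec(self.w_enc, centered)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, f):
        out = list(self.b_dec)
        for fi, direction in zip(f, self.w_dec):
            for j, dj in enumerate(direction):
                out[j] += fi * dj
        return out


def clamp_feature(sae, x, feature_idx, value):
    """Shift x along one decoder direction so the chosen feature
    contributes as if its activation were `value`.

    Only the difference is added, so whatever the SAE fails to
    reconstruct in x is left untouched.
    """
    f = sae.encode(x)
    delta = value - f[feature_idx]
    return [xi + delta * di for xi, di in zip(x, sae.w_dec[feature_idx])]


# Usage with made-up numbers: an 8-dimensional "activation" and 32 features.
sae = TinySAE(d_model=8, d_features=32)
x = [0.5] * 8
steered = clamp_feature(sae, x, feature_idx=3, value=5.0)
```

In a real model, x would be an activation vector read from some layer, the SAE would be trained on that layer's activations, and the steered vector would be written back during the forward pass. Which feature to clamp and to what value are choices this sketch does not address.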

Slides: https://jinen.setpal.net/slides/sae.pdf.


