@safety-research

Safety Research

Popular repositories

  1. bloom Public

    bloom - evaluate any behavior immediately 🌸🌱

    Python 1.2k 140

  2. petri Public

    An alignment auditing agent capable of quickly exploring alignment hypotheses

    Python 907 130

  3. persona_vectors Public

    Persona Vectors: Monitoring and Controlling Character Traits in Language Models

    Python 358 90

  4. SCONE-bench Public

    160 27

  5. safety-tooling Public

    Inference API for many LLMs and other useful tools for empirical research

    Python 107 34

  6. assistant-axis Public

    The Assistant Axis is a direction in activation space that captures how "Assistant-like" a model's behavior is. Models can drift away from the Assistant during conversations—sometimes toward bizarr…

    Jupyter Notebook 81 18

Repositories

Showing 10 of 38 repositories
  • PurpleLlama Public Forked from meta-llama/PurpleLlama

    Set of tools to assess and improve LLM security.

    Python 0 805 0 0 Updated Feb 23, 2026
  • petri Public

    An alignment auditing agent capable of quickly exploring alignment hypotheses

    Python 907 MIT 130 3 5 Updated Feb 19, 2026
  • bloom Public

    bloom - evaluate any behavior immediately 🌸🌱

    Python 1,185 MIT 140 0 1 Updated Feb 17, 2026
  • safety-tooling Public

    Inference API for many LLMs and other useful tools for empirical research

    Python 107 MIT 34 13 15 Updated Feb 16, 2026
  • casr Public Forked from ispras/casr

    Collect crash (or UndefinedBehaviorSanitizer error) reports, triage, and estimate severity.

    Rust 0 Apache-2.0 36 0 0 Updated Feb 3, 2026
  • assistant-axis Public

    The Assistant Axis is a direction in activation space that captures how "Assistant-like" a model's behavior is. Models can drift away from the Assistant during conversations—sometimes toward bizarre or harmful personas. This repo contains a pipeline for generating the Assistant Axis and notebooks for monitoring and steering with it.

    Jupyter Notebook 81 18 1 0 Updated Jan 20, 2026
  • selective-gradient-masking Public

    Training Transformers with knowledge localization (SGTM)

    Python 48 MIT 5 0 0 Updated Jan 11, 2026
  • how-ai-impacts-skill-formation Public

    Repo for measuring whether using AI tools inhibits skill formation and development

    Python 9 2 0 1 Updated Jan 3, 2026
  • A3 Public
    Python 4 Apache-2.0 0 0 0 Updated Dec 29, 2025
  • inverse-scaling-ttc Public

    Inverse Scaling in Test-Time Compute

    Python 25 MIT 2 0 0 Updated Dec 3, 2025
