Dan Hendrycks

Machine Ethics with Large Language Models

We introduce the ETHICS dataset and show that large-scale language models can predict many basic concepts of morality. The dataset assesses model performance across diverse text scenarios and spans concepts in justice, wellbeing, duties, virtues, and commonsense morality. We then show how to translate knowledge about morality into action. Using reinforcement learning agents acting in diverse interactive text-based environments, we show that ETHICS can help steer these agents towards moral behavior and away from causing wanton harm.

Dan Hendrycks is a PhD candidate at UC Berkeley, advised by Jacob Steinhardt and Dawn Song. His research aims to disentangle and concretize the components necessary for safe AI, and is supported by the NSF GRFP and the Open Philanthropy AI Fellowship. Dan contributed the GELU activation function, the default activation in nearly all state-of-the-art ML models, including BERT, Vision Transformers, and GPT-3.
