convergent instrumental goals
prevent counterfeit utility
complexity of value
fragility of value
How easy/difficult to solve?
What are other approaches?
Who has work on this?
at least tolerates or preferably assists many forms of outside correction
can be turn off
: tolerate / assist the programmers in their attempts to alter or turn off the system
: not attempt to manipulate or deceive its programmers, despite the fact that most possible choices of utility functions would give it incentives to do so
What's a formal definition of 'manipulation'?
: have a tendency to repair safety measures (such
as shutdown buttons) if they break, or at least to notify
programmers that this breakage has occurred
: preserve the programmers’ ability to correct or shut down the system (even if the system creates new subsystems, self-modifies, etc.)
won't automatically turn off itself
or prove that it's impossible