Subject: Online etiquette/Gadjets in our life

Yüklə 1,04 Mb.

səhifə	32/38
tarix	10.06.2023
ölçüsü	1,04 Mb.
	#128150

1 ... 28 29 30 31 32 33 34 35 ... 38

Farux

Solving Scientific Problems
Using the compelling insight heuristic to evaluate alignment research directions

Organic chemistry was developed by Justus von Liebig and others, following Friedrich Wöhler's synthesis of urea.[68] Other crucial 19th century advances were; an understanding of valence bonding (Edward Frankland in 1852) and the application of thermodynamics to chemistry

Solving Scientific Problems

Solving hard scientific problems usually requires compelling insights

Here’s a heuristic which plays an important role in my reasoning about solving hard scientific problems: that when you’ve made an important breakthrough, you should be able to explain the key insight(s) behind that breakthrough in an intuitively compelling way. By “intuitively compelling” I don’t mean “listeners should be easily persuaded that the idea solves the problem”, but instead: “listeners should be easily persuaded that this is the type of idea which, if true, would constitute a big insight”.
The best examples are probably from Einstein: time being relative, and gravity being equivalent to acceleration, are both insights in this category. The same for Malthus and Darwin and Godel; the same for Galileo and Newton and Shannon.

I read these as Aaronson claiming that a successful solution to this very hard problem is likely to contain big insights that can be clearly explained.
Perhaps the best counterexample is the invention of Turing machines. Even after Turing explained the whole construction, it seems reasonable to still be uncertain whether there's actually something interesting there, or whether he's just presented you with a complicated mess. I think that uncertainty would be particularly reasonable if we imagine trying to understand the formalism before Turing figures out how to implement any nontrivial algorithm (like prime factorisation) on a Turing machine, or how to prove any theorems about universal Turing machines.

Using the compelling insight heuristic to evaluate alignment research directions

It’s possible that alignment will in practice end up being more of an engineering problem than a scientific problem like the ones I described above. E.g. perhaps we're in a world where, with sufficient caution about scaling up existing algorithms, we'll produce aligned AIs capable of solving the full version of the problem for us.[1] But suppose we're trying to produce a fully scalable solution ourselves; are there existing insights which might be sufficient for that? Here are some candidates, which I’ll only discuss very briefly, and plan to discuss in more detail in a forthcoming post (I’d also welcome suggestions for any I've missed):

“Trustworthy imitation of human external behavior would avert many default dooms as they manifest in external behavior unlike human behavior.”

This is Eliezer’s description of the core insight behind Paul’s imitative amplification proposal. I find this somewhat compelling, but less so than I used to, since I’ve realized that the line between imitation learning and reinforcement learning is blurrier than I used to think (e.g. see this or this).

Decomposing supervision of complex tasks allows better human oversight.

Again, I’ve found this less compelling over time - in this case because I’ve realized that decomposition is the “default” approach we follow whenever we evaluate things, and so the real “work” of the insight needs to be in describing how we’ll decompose tasks, which I don’t think we’ve made much progress on (with techniques like cross-examination being possible exceptions).

Weight-sharing makes deception much harder.

I think this is the main argument pushing me towards optimism about ELK; thanks to Ajeya for articulating it to me.

Uncertainty about human preferences makes agents corrigible.

This is Stuart Russell’s claim about why assistance games are a good potential solution to alignment; I basically don’t buy it at all, for the same reasons as Yudkowsky (but kudos to Stuart for stating the proposed insight clearly enough that the disagreement is obvious).

Myopic agents can be capable while lacking incentives for long-term misbehavior.

This claim seems to drive a bunch of Evan Hubinger’s work, but I don’t buy it. In order for an agent’s behavior to be competent over long time horizons, it needs to be doing some kind of cognition aimed towards long time horizons, and we don’t know how to stop that cognition from being goal-directed.

Problems that arise in limited-data regimes (e.g. inner misalignment) go away when you have methods of procedurally generating realistic data (e.g. separable world-models).

This claim was made to me by Michael Cohen. It’s interesting, but I don’t think it solves the core alignment problem, because we don’t understand cognition well enough to efficiently factor out world-models from policies. E.g. training a world-model to predict observations step-by-step seems like it loses out on all the benefits of thinking in terms of abstractions; whereas training it just on long-term predictive accuracy makes the intermediate computations uninterpretable and therefore unusable.

By default we’ll train models to perform bounded tasks of bounded scope, and then achieve more complex tasks by combining them.

This seems like the core claim motivating Eric Drexler’s CAIS framework. I think it dramatically underrates the importance of general intelligence, and the returns to scaling up single models, for reasons I explain further here.

Functional decision theory.

I don’t think this directly addresses the alignment problem, but it feels like the level of insight I’m looking for, in a related domain.

Note that I do think each of these claims gestures towards interesting research possibilities which might move the needle in worlds where the alignment problem is easy. But I don’t think any of them are sufficiently powerful insights to scalably solve the hard version of the alignment problem. Why do many of the smart people listed above think otherwise? I think it’s because they’re not accounting properly for the sense in which the alignment problem is an adversarial one: that by default, optimization which pushes towards general intelligence will also push towards misalignment, and we'll need to do something unusual to be confident we're separating them. In other words, the set of insights about consequentialism and optimization which made us worry about the alignment problem in the first place (along with closely-related insights like the orthogonality thesis and instrumental convergence) are sufficiently high-level, and sufficiently robust, that unless you're guided by other powerful insights you're less likely to find exceptions to those principles, and more likely to find proposals where you can no longer spot the flaw.

This claim is very counterintuitive from a ML perspective, where loosely-directed exploration of new algorithms often leads to measurable improvements. I don’t know how to persuade people that applying this approach to alignment leads to proposals which are deceptively appealing, except by getting them to analyze each of the proposals above until they convince themselves that the purported insights are insufficient to solve the problem. Unfortunately, this is very time-consuming. To save effort, I’d like to promote a norm for proposals for alignment techniques to be very explicit about where the hard work is done, i.e. which part is surprising or insightful or novel enough to make us think that it could solve alignment even in worlds where that’s quite difficult. Or, alternatively, if the proposal is only aimed at worlds where the problem is relatively easy, please tell me that explicitly. E.g. I spent quite a while being confused about which part of the ELK agenda was meant to do the hard work of solving ontology identification; after asking Paul about it, though, his main response was “maybe ontology identification is easy”, which I feel very skeptical about.[2] (He also suggested something to do with the structure of explanations as a potential solving-ELK-level-insight; but I don’t understand this well enough to discuss it in detail.)

Yüklə 1,04 Mb.

Dostları ilə paylaş:

1 ... 28 29 30 31 32 33 34 35 ... 38