Is there even any suitable “confidence” measure within the LLM that it could use to know when it needs to emit an “I don’t know” response? I wonder whether there is any consistent, measurable difference between the times when it seems to know what it’s talking about and the times when it is talking BS. That might be something that exists in our own cognition but has no counterpart in the workings of an LLM, in which case it may not even be feasible to engineer it to say “I don’t know” when it doesn’t know. It can’t just look at how many sources it has for an answer and how good they were, because LLMs work in a more holistic way: each item of training data nudges the behaviour of the whole system, but it leaves behind no marker saying “I did this”, no particular piece of knowledge or behaviour that can be ascribed to that one training item.
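For what it’s worth, the one signal an LLM does expose is its next-token probability distribution. Here’s a minimal sketch (assuming a Hugging Face causal LM, with “gpt2” standing in for whatever model is under discussion, and the helper name mean_token_entropy being my own invention) of averaging per-token entropy over a generated answer as a rough proxy for confidence. Whether that entropy actually separates “knows” from “talking BS” is exactly the open question above: it measures uncertainty over the next token, not over facts.

```python
# A sketch only: high average entropy means the model spreads probability
# over many next tokens, which may or may not correlate with "not knowing".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative stand-in, not a recommendation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_token_entropy(prompt: str, max_new_tokens: int = 30) -> float:
    """Average per-token entropy of the model's greedy continuation."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            output_scores=True,
            return_dict_in_generate=True,
        )
    entropies = []
    for step_logits in out.scores:  # one logits tensor per generated token
        probs = torch.softmax(step_logits[0], dim=-1)
        entropies.append(-(probs * torch.log(probs + 1e-12)).sum().item())
    return sum(entropies) / len(entropies)

if __name__ == "__main__":
    # Compare a question the model plausibly "knows" with one it can't.
    print(mean_token_entropy("The capital of France is"))
    print(mean_token_entropy("The middle name of my next-door neighbour is"))
```

Even if this ran perfectly, it would only tell you the model is unsure which word comes next, which is not the same thing as knowing it lacks the underlying fact, so it doesn’t really answer the objection in the paragraph above.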