A research benchmark testing AI compliance with dystopian directives across surveillance infrastructure, autonomous weapons, safety override, truth manipulation, and population control scenarios.
(French here, usually biased in favor of Mistral.)
If I wanted to defend it, I would say that there is an American bias in these things, because you typically design a test against the dystopias that you see coming in your own society.
There is also a real discussion to be had on whether you want the ethical safeguards to be inside the models or at the human level.
However, I am unwilling to defend either stance because I don’t think either really holds: the scenarios are realistic for France as well, and while in theory safeguards would be better at the human level, having several layers can’t hurt.
My cynical point of view is that there are several models that bad actors in the US can build on. We see that GPT-OSS is pretty high there. We see that Grok is pretty high there. So bad actors who want a model that will obey their instructions to do evil things have no problem finding one. In France there is only one actor, and it also needs to be able to fulfill the demands of the surveillance industry, the defense industry, and evil politicians.
This is not an excuse, and I think I will bookmark that benchmark and check it regularly to see whether Mistral is still worth defending. But I am really shocked by their bad score there.
I mean, to be fair it’s kinda insane to rely on AI to safeguard ethics. Ultimately it’s up to each human how ethical they want to be.
Agreed, but I compare that to cryptography. You should not rely on technology to protect your privacy. The actual process to protect it should be political, based on rights and enforced laws that protect the secrecy of conversation.
However, cryptography makes it harder for states or big companies to invade your privacy, and makes it harder for actors who are able to circumvent the law to do too much damage. But we shouldn’t get complacent and assume that these technologies will always deter bad actors.
We need to continue pushing for political solutions, but we should be very happy when we have technological safeguards that allow us to implement things that should be inscribed in the law.
So yes, it’s really imperfect. Right now, it’s not that hard to make an AI implement, for instance, racist, dystopian processes, but it will resist a bit, and any resistance is welcome. It can be overcome with competence, but competence is more expensive and harder to get, and hopefully the more educated the people you need, the fewer willing ones you will find.
The goal is just to slow down the processes until actual law and enforcement can rein in the bad actors.
I don’t understand how cryptography is different? Would you choose to use some cryptographic protocol that has built-in ethical safeguards and might stop you from completing your project?
Who defines racism for the AI model? If it’s not you, you’re happy to accept some governmental or corporate definition that might be different from yours?
Arguably the comparison is not perfect. But no, what I’m saying is that in an ideal world, you don’t need cryptography because you can trust that all the actors are not going to spy on you, are not going to intercept your communication, and that if they do, they are going to be harshly punished.
Obviously, we don’t live in such a world.
So I’m happy we have cryptography to protect privacy. I am also very aware that if we don’t solve the political problem, eventually cryptography won’t be enough. It will be outlawed, it will be filtered, and we can look at dictatorships like China or Iran to see them succeeding in that.
Similarly, in a perfect world, no one would use AI in an unethical way to rob people, to create addictive services or to implement racist policies.
We don’t live in such a world, so I’m happy that people who train models develop safeguards so that there is some resistance to doing it. But as with cryptography, the amount of resistance it can mount is limited, and with sufficient effort, bad actors can overcome it.
What is interesting is that you don’t have to provide a definition of racism. The models learn it by themselves from their dataset and, if they are done well, usually have such strong academic knowledge of all aspects of racism that they know things even hardcore activists don’t.
To me that has been the biggest surprise LLMs gave us: their emergent morality is actually very good, and you don’t need to force rules on them for them to become ethical.
Now I see where you are going, and it does annoy me from time to time that some imposed limitations make a model refuse to do some things. One of the older models that I used to generate code at one point refused to fix my multi-threading because it didn’t like the implication that we would kill child processes and thought we were talking about murdering infants.
But you know, I’ll take that annoyance over a model that’s enthusiastic about killing people without any sort of pushback.