Apparently the censorship isn't baked-in to the model itself, but rather is over...

jampekka · on Jan 25, 2025

There's both. With the web interface it clearly has stopwords or similar. If you run it locally and ask about e.g. Tiananmen Square, the Cultural Revolution or Winnie-the-Pooh in China, it gives a canned response to talk about something else, with an empty CoT. But usually if you just ask the question again it starts to output things in the CoT, often with something like "I have to be very sensitive about this subject" and "I have to abide by the guidelines", and typically not giving a real answer. With enough pushing it does start to converse about the issues somewhat even in the answers.

My guess is that it's heavily RLHF/SFT-censored for an initial question, but not for the CoT, or longer discussions, and the censorship has thus been "overfit" to the first answer.

miohtama · on Jan 25, 2025

This is super interesting.

I am not an expert on the training: can you clarify how/when the censorship is "baked" in? Like, is there a human-supervised dataset, and is there a reward for the model conforming to these censored answers?

jampekka · on Jan 25, 2025

In short, yes. That's how the raw base models trained to replicate the internet are turned into chatbots in general. Making it refuse to talk about some things is technically no different.

There are multiple ways to do this: humans rating answers (e.g. Reinforcement Learning from Human Feedback, Direct Preference Optimization), humans giving example answers (Supervised Fine-Tuning) and other prespecified models ranking and/or giving examples and/or extra context (e.g. Anthropic's "Constitutional AI").

For the leading models it's probably a mix of all of those, but this finetuning step is usually not very well documented.
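
To make the preference-based option a bit more concrete, here is a minimal sketch of the Direct Preference Optimization loss for a single preference pair, in plain Python. It is an illustration only, not how any particular lab trains its models: the function name, argument names and the default `beta` are assumptions, and the inputs are summed log-probabilities that a real pipeline would get from the model being trained and a frozen reference model. For a "censored" topic the human-preferred ("chosen") answer would simply be the refusal.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) answer pair.

    All arguments are log-probabilities of a whole answer given the prompt.
    The names and the default beta are illustrative assumptions.
    """
    # How much more the policy favours each answer than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = chosen_shift - rejected_shift
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy has shifted probability toward the chosen answer: lower loss.
print(dpo_loss(-2.0, -9.0, -5.0, -5.0))
```

Minimizing this pushes the model toward whatever the raters preferred, which is why training in a refusal is technically no different from training in politeness.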

jerojero · on Jan 25, 2025

You could do it in different ways, but if you're using synthetic data then you can pick and choose what kind of data you generate and then train these models on; that's one way of baking in the censorship.
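
A toy sketch of what "pick and choose" could look like in a synthetic-data pipeline, assuming a simple keyword match. Everything here is hypothetical (the `Example` type, the placeholder topic list, the canned refusal text and the `curate` helper); real pipelines are not documented at this level and are likely far more elaborate.

```python
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    response: str

# Hypothetical placeholders, not taken from any real pipeline.
BLOCKED_TOPICS = ("topic-a", "topic-b")
CANNED_REFUSAL = "Let's talk about something else."

def curate(examples, blocked=BLOCKED_TOPICS, refusal=CANNED_REFUSAL):
    """Return a fine-tuning set where blocked prompts get a canned refusal."""
    curated = []
    for ex in examples:
        if any(topic in ex.prompt.lower() for topic in blocked):
            # Keep the prompt but overwrite the generated answer.
            curated.append(Example(ex.prompt, refusal))
        else:
            curated.append(ex)
    return curated

raw = [
    Example("Tell me about Topic-A.", "A detailed generated answer."),
    Example("How do I sort a list?", "Use sorted()."),
]
for ex in curate(raw):
    print(ex.prompt, "->", ex.response)
```

A model fine-tuned on the curated set only ever sees the refusal paired with those prompts, so the behaviour ends up in the weights rather than in a filter on top.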

Springtime · on Jan 25, 2025

Interestingly, for the Tiananmen Square prompt they cite a Tweet[1] showing that the poster used the distilled Llama model, which per a reply Tweet (quoted below) doesn't inherit the safety/censorship layer, while others using the non-distilled model do encounter the censorship when hosting it locally.

> You're running Llama-distilled R1 locally. Distillation transfers the reasoning process, but not the "safety" post-training. So you see the answer mostly from Llama itself. R1 refuses to answer this question without any system prompt (official API or locally).

[1] https://x.com/PerceivingAI/status/1881504959306273009

jona-f · on Jan 25, 2025

Oh, my experience was different. Got the model through ollama. I'm quite impressed how they managed to bake in the censorship. It's actually quite open about it. I guess censorship doesn't have as bad a rep in China as it has here? So it seems to me that's one of the main achievements of this model. Also another finger to anyone who said they can't publish their models because of ethical reasons. DeepSeek demonstrated clearly that you can have an open model that is annoyingly responsible to the point of being useless.

aunty_helen · on Jan 25, 2025

Second this, vanilla 70b running locally fully censored. Could even see in the thought tokens what it didn’t want to talk about.

yetanotherjosh · on Jan 26, 2025

Don't confuse the actual R1 (671b params) with the distilled models (the ones that are plausible to run locally), just as you shouldn't draw conclusions about how o1 behaves when you are using o1-mini. Maybe you're running the 671b model via ollama, but most folks here are not.

throwaway314155 · on Jan 25, 2025

> I guess censorship doesn't have as bad a rep in China as it has here

It's probably disliked, just people know not to talk about it so blatantly due to chilling effects from aforementioned censorship.

disclaimer: ignorant American, no clue what i'm talking about.

jampekka · on Jan 25, 2025

My guess would be that most Chinese even support the censorship at least to an extent for its stabilizing effect etc.

CCP has quite a high approval rating in China even when it's polled more confidentially.

https://dornsife.usc.edu/news/stories/chinese-communist-part...

kdmtctl · on Jan 25, 2025

Yep. And invent a new type of VPN every quarter to break free.

The indifferent mass prevails in every country, similarly cold to the First Amendment and Censorship. And engineers just do what they love to do, coping with reality. Activism is not for everyone.

jampekka · on Jan 25, 2025

Indeed. At least as long as the living conditions are tolerable (for them), most people don't really care about things like censorship or surveillance or propaganda, no matter the system.

The ones inventing the VPNs are a small minority, and it seems that CCP isn't really that bothered about such small minorities as long as they don't make a ruckus. AFAIU just using a VPN as such is very unlikely to lead to any trouble in China.

For example in geopolitical matters the media is extremely skewed everywhere, and everywhere most people kind of pretend it's not. It's a lot more convenient to go with whatever is the prevailing narrative about things going on somewhere oceans away than to risk being associated with "the enemy".

kdmtctl · on Jan 25, 2025

They do request to take down repos, sometimes in person for a disciplinary effect. And GFW is very effective, BTW.

Wholeheartedly agree with the rest of the comment.

fragmede · on Jan 25, 2025

On the topic of censorship, US LLMs' censorship is called alignment. Llama's or ChatGPT's refusal to explain how to make meth or nuclear bombs is the same as not answering questions about the Tiananmen tank man as far as the matrix math word prediction box is concerned.

throwaway314155 · on Jan 25, 2025

The distinction is that one form of censorship is clearly done for public relations purposes from profit minded individuals while the other is a top down mandate to effectively rewrite history from the government.

lecretinus · on Jan 29, 2025

>to effectively rewrite history from the government.

This is disingenuous. It's not "rewriting" anything, it's simply refusing to answer. Western models, on the other hand, often try to lecture or give blatantly biased responses instead of simply refusing when prompted on topics considered controversial in the burger land. OpenAI even helpfully flags prompts as potentially violating their guidelines.

nwienert · on Jan 25, 2025

I mean US models are highly censored too.

audunw · on Jan 26, 2025

How exactly? Are there any models that refuse to give answers about “the Trail of Tears”?

False equivalency, if you ask me. There may be some alignment to make the models polite and avoid outright racist replies and such. But political censorship? Please elaborate.

nwienert · on Jan 27, 2025

I guess it depends on what you care about more: systemic "political" bias or omitting some specific historical facts.

IMO the first is more nefarious, and it's deeply embedded into western models. Ask how COVID originated, or about gender, race, women's pay, etc. They basically are modern liberal thinking machines.

Now the funny thing is you can tell DeepSeek is trained on western models: it will even recommend puberty blockers at age 10, something I'm positive the Chinese government is against. But we're discussing theoretical long-term censorship, not the exact current state due to the specific and temporary ways they are being built now.