Any product that accepts user generated content eventually needs moderation. Comments, product reviews, marketplace listings, community posts, profile bios. The moment you let strangers write something that other strangers will read, you have inherited a moderation problem. The good news in 2026 is that the tooling has caught up. The setup below can be built in a few days and will handle most submissions without a human in the loop.
Why the old approach broke
Manual moderation does not scale. Even a small community generates more submissions than a part-time reviewer can handle, and response time targets collapse under load. Pre-2023 keyword lists and regex filters are worse than no moderation in many ways: they miss misspellings and coded phrases, and they flag legitimate content because the word "kill" appears in "killing it at work."
Modern moderation needs to read intent. That is what AI brings.
The 2026 moderation stack
A reasonable production stack today has three tiers.
- OpenAI Moderation API. Free, fast, and continually improved. Returns category scores for hate, harassment, self-harm, sexual content, and violence. It is the right default for the first pass on most workloads.
- Anthropic's Claude with a moderation prompt. Useful when you need nuanced judgment that a category classifier cannot make. Slower and not free, but much better at context-dependent calls (sarcasm, in-jokes, community-specific norms).
- A custom classifier. Fine-tune a small model on your historical moderation decisions if you have a large enough labeled dataset. This is what mature platforms do for the highest-volume content types.
For most teams, the right sequence is to start with the OpenAI Moderation API, add a Claude pass for borderline cases, and invest in a custom classifier only once volume justifies it. Third-party vendors like Hive and Spectrum Labs exist for teams that want a managed solution, but the build versus buy math has shifted toward build for anyone with a competent engineering team.
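To make the second tier concrete, here is a minimal sketch of a Claude pass for content the first tier could not decide, assuming the Anthropic TypeScript SDK. The prompt wording, model alias, and single-word output format are illustrative assumptions to adapt to your community's norms, not a fixed recipe.

// lib/second-opinion.ts (sketch; prompt, model alias, and output handling are assumptions)
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

export async function secondOpinion(text: string): Promise<"approve" | "review" | "reject"> {
  const message = await anthropic.messages.create({
    model: "claude-sonnet-4-5", // pin a dated model version in production
    max_tokens: 10,
    messages: [
      {
        role: "user",
        content:
          "You are a content moderator for a general-interest community. " +
          "Classify the following post as APPROVE, REVIEW, or REJECT, paying attention to sarcasm, " +
          "in-jokes, and context. Answer with the single word only.\n\n" + text,
      },
    ],
  });

  const block = message.content[0];
  const verdict = block.type === "text" ? block.text.trim().toUpperCase() : "REVIEW";

  if (verdict === "APPROVE") return "approve";
  if (verdict === "REJECT") return "reject";
  return "review"; // anything ambiguous stays with a human
}

Call a pass like this only for content the first tier marked as borderline; sending every submission through it erases the cost advantage of the free tier.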
The practical setup
The flow is straightforward. Your CMS fires a webhook when content is submitted. Your moderation service scores the content, applies your routing rules, and writes a decision back to the CMS or directly to your moderation queue.
The three routing outcomes are:
- Auto-approve. Score is below your threshold across all categories. Publish immediately.
- Flag for human review. Score is in the gray zone, or the content is in a sensitive category for your community.
- Auto-reject. Score is over a high threshold, or the content matches a hard rule (CSAM, explicit threats).
Auto-reject should be conservative. False rejections damage trust faster than slow approvals do.
A working example
Here is a minimal Next.js route handler that takes submitted content, calls OpenAI's moderation endpoint, and returns a decision. It is intentionally simple. Production code would add structured logging, retries with backoff, idempotency on the content ID, and a dead-letter queue.
// app/api/moderate/route.ts
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

type Decision = "approve" | "review" | "reject";

const REJECT_THRESHOLD = 0.85;
const REVIEW_THRESHOLD = 0.5;
// Categories that trigger an auto-reject on their own score, regardless of the overall flag.
const HARD_CATEGORIES = ["sexual/minors", "violence/graphic"];

export async function POST(request: Request) {
  const { contentId, text, contentType } = await request.json();

  if (!text || text.length === 0) {
    return Response.json({ error: "missing text" }, { status: 400 });
  }

  const result = await openai.moderations.create({
    model: "omni-moderation-latest",
    input: text,
  });

  const moderation = result.results[0];
  // Treat the category scores as a plain map so we can index them by category name.
  const scores = moderation.category_scores as unknown as Record<string, number>;
  const flagged = moderation.flagged;

  let decision: Decision = "approve";
  let reason = "";

  // Hard rules first: any hard category over the reject threshold is an immediate reject.
  for (const category of HARD_CATEGORIES) {
    if (scores[category] > REJECT_THRESHOLD) {
      decision = "reject";
      reason = category;
      break;
    }
  }

  // Otherwise, route on the highest category score when the API flags the content at all.
  if (decision === "approve" && flagged) {
    const maxScore = Math.max(...Object.values(scores));
    if (maxScore > REJECT_THRESHOLD) {
      decision = "reject";
      reason = "high confidence violation";
    } else if (maxScore > REVIEW_THRESHOLD) {
      decision = "review";
      reason = "borderline content";
    }
  }

  await logModerationDecision({
    contentId,
    contentType,
    decision,
    reason,
    scores,
    timestamp: new Date().toISOString(),
  });

  return Response.json({ contentId, decision, reason });
}

// Stand-in audit log so the handler runs as-is; in production, write to durable storage you can query.
async function logModerationDecision(entry: Record<string, unknown>) {
  console.log("moderation decision", JSON.stringify(entry));
}
The CMS side is a webhook subscription on the content submit event that calls this endpoint, blocks publication on a reject, queues the item for human review on a review, and writes a moderation log entry in every case.
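A rough sketch of that subscriber, assuming a generic payload shape and hypothetical publish, holdForReview, and blockContent helpers in your CMS client:

// webhooks/on-content-submit.ts (sketch; payload shape and CMS helpers are assumptions)
import { publish, holdForReview, blockContent } from "./cms-client"; // hypothetical CMS client helpers

type SubmitPayload = { contentId: string; text: string; contentType: string };

export async function onContentSubmit(payload: SubmitPayload) {
  // Ask the moderation service for a decision on the submitted content.
  const res = await fetch(`${process.env.MODERATION_URL}/api/moderate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  const { decision, reason } = await res.json();

  // Route the content according to the decision; every branch leaves an audit trail in the CMS.
  if (decision === "approve") {
    await publish(payload.contentId);
  } else if (decision === "review") {
    await holdForReview(payload.contentId, reason);
  } else {
    await blockContent(payload.contentId, reason);
  }
}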
The legal layer
Moderation is not just a product problem. It is a regulatory one, and the rules differ by jurisdiction.
In the United States, Section 230 of the Communications Decency Act has historically given platforms broad immunity for user generated content, including immunity for moderation decisions made in good faith. Legislative proposals continue to chip at the edges, but the core protection remains. You have wide latitude in what you moderate, but still face liability for content that violates federal criminal law.
In Canada, Bill C-63 (the Online Harms Act) introduced statutory duties for operators of regulated services to address specified categories of harmful content, with reporting obligations and significant penalties for non-compliance. If your service falls within scope, you need a documented moderation process, transparent reporting, and a path for users to flag harm and appeal decisions.
GDPR adds a retention dimension for EU and UK users. Moderation logs contain personal data. Pick a defensible retention period (typically 90 days to two years), document it, and actually delete data when the period ends. Build your audit logs from day one.
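One way to make that deletion real rather than aspirational is a scheduled job along these lines; the table name, column, and 180-day period here are assumptions, so substitute whatever you documented.

// jobs/prune-moderation-logs.ts (sketch; table, column, and retention period are assumptions)
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const RETENTION_DAYS = 180; // pick and document your own period

export async function pruneModerationLogs() {
  // Delete moderation log rows that have aged past the documented retention window.
  const result = await pool.query(
    "DELETE FROM moderation_log WHERE created_at < now() - make_interval(days => $1)",
    [RETENTION_DAYS]
  );
  console.log(`pruned ${result.rowCount} moderation log rows`);
}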
The hidden cost is false positives
The instinct when you stand up moderation is to tighten thresholds so that nothing offensive gets through. Resist it. False positives are a quieter but more corrosive failure mode than false negatives.
When you wrongly remove a legitimate post:
- The author loses trust in the platform, usually permanently.
- Other community members see the action and self-censor.
- Your engagement metrics quietly decline over months.
Always build a human appeal flow. A user whose content was rejected must be able to ask a human to look again, and that loop must close in reasonable time. Communities die from over-moderation as often as under-moderation, and that death is harder to diagnose because it does not show up in headline metrics.
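As a sketch of the minimum viable appeal endpoint, assuming hypothetical saveAppeal and requeueForHumanReview helpers backed by your moderation store:

// app/api/appeals/route.ts (sketch; the storage and queue helpers are assumptions)
import { saveAppeal, requeueForHumanReview } from "@/lib/moderation-store"; // hypothetical helpers

export async function POST(request: Request) {
  const { contentId, userId, message } = await request.json();

  if (!contentId || !userId) {
    return Response.json({ error: "missing contentId or userId" }, { status: 400 });
  }

  // Record the appeal and put the content back in front of a human reviewer.
  await saveAppeal({ contentId, userId, message, createdAt: new Date().toISOString() });
  await requeueForHumanReview(contentId, "user appeal");

  return Response.json({ status: "appeal received" });
}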
Where to start
If you are evaluating moderation for your platform, start by tagging a week of historical submissions by hand. You will learn what your real failure modes are, and you will have a small evaluation set to test any model you put in front of users. Pick the OpenAI Moderation API as your first pass, layer Claude for borderline cases, and write the human appeal flow before you ship.
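That hand-labeled week is also your benchmark. A sketch of the check, assuming a labels.json file of { text, label } pairs and whatever moderate() wrapper you are evaluating:

// eval/run-eval.ts (sketch; the labels file format and moderate() wrapper are assumptions)
import { readFileSync } from "node:fs";

type Label = "approve" | "review" | "reject";
type Example = { text: string; label: Label };

export async function runEval(
  moderate: (text: string) => Promise<Label>, // wrap whichever model you are testing
  path = "labels.json"
) {
  const examples: Example[] = JSON.parse(readFileSync(path, "utf8"));

  let agreed = 0;
  let falseRejects = 0; // legitimate posts the model would have removed

  for (const example of examples) {
    const predicted = await moderate(example.text);
    if (predicted === example.label) agreed++;
    if (predicted === "reject" && example.label === "approve") falseRejects++;
  }

  console.log(`agreement: ${((agreed / examples.length) * 100).toFixed(1)}%`);
  console.log(`false rejects: ${falseRejects} of ${examples.length}`);
}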
If you would like help wiring this into your stack, get in touch. We have built moderation into headless CMS deployments enough times to know where the sharp edges are.