A deep look at building a comment system that actually works — database schema for threaded replies, moderation queues, heuristic spam detection, reCAPTCHA v3 scoring, and email notifications without drowning your inbox.
Why Most CMS Comment Systems Fail
Every CMS ships with some version of a comment system. Most of them are afterthoughts — a single database table, a textarea, and a submit button. They work fine for the first week. Then the spam arrives. Within a month, you have 4,000 pending comments about cryptocurrency and weight loss supplements, your legitimate commenters have given up because their replies never appear, and you've disabled comments entirely because moderating them takes longer than writing the actual posts.
We went through this cycle three times before we sat down and built the comment system that now ships with JekCMS. The key insight was that comments are not a single feature — they are four separate systems that need to work together: storage and threading, spam prevention, moderation workflow, and notification delivery. Get any one of them wrong and the whole thing collapses.
This guide covers the full architecture. I'll show the actual database schema, the spam detection pipeline, the reCAPTCHA integration, and the moderation interface. Everything here runs in production across dozens of JekCMS sites.
Database Schema for Threaded Comments
The most common mistake in comment system design is using a flat table with no threading support. You end up with a chronological list where reply context is lost entirely. The second most common mistake is implementing full adjacency lists with unlimited nesting depth — which creates recursive query nightmares and UI indentation that becomes unreadable after four levels.
JekCMS uses a two-level threading model. Top-level comments reply to the post. Replies always target a top-level comment, never another reply. This keeps the UI clean and the queries simple.
CREATE TABLE comments (
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
post_id INT UNSIGNED NOT NULL,
parent_id INT UNSIGNED DEFAULT NULL,
author_name VARCHAR(100) NOT NULL,
author_email VARCHAR(255) NOT NULL,
author_ip VARCHAR(45) NOT NULL,
author_user_agent TEXT,
content TEXT NOT NULL,
status ENUM('pending', 'approved', 'spam', 'trash') DEFAULT 'pending',
spam_score DECIMAL(3,2) DEFAULT 0.00,
recaptcha_score DECIMAL(3,2) DEFAULT NULL,
is_admin_reply TINYINT(1) DEFAULT 0,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
INDEX idx_post_status (post_id, status),
INDEX idx_parent (parent_id),
INDEX idx_email (author_email),
INDEX idx_created (created_at),
FOREIGN KEY (post_id) REFERENCES posts(id) ON DELETE CASCADE,
FOREIGN KEY (parent_id) REFERENCES comments(id) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
A few things worth noting about this schema. The parent_id is nullable — NULL means it is a top-level comment. The spam_score column stores our heuristic analysis result as a value between 0 and 1. The recaptcha_score stores Google's score separately so we can tune our thresholds independently. And author_ip uses VARCHAR(45) to accommodate IPv6 addresses, which are longer than most people expect.
Fetching Threaded Comments Efficiently
With two-level threading, we only need two queries — or one query with application-level grouping:
// Single query approach - group in PHP
$comments = $db->fetchAll(
"SELECT c.*,
(SELECT COUNT(*) FROM comments r
WHERE r.parent_id = c.id AND r.status = 'approved') as reply_count
FROM comments c
WHERE c.post_id = ? AND c.status = 'approved'
ORDER BY c.parent_id IS NULL DESC, c.parent_id, c.created_at ASC",
[$postId]
);
// Group into threads
$threads = [];
$replies = [];
foreach ($comments as $comment) {
if ($comment['parent_id'] === null) {
$threads[$comment['id']] = $comment;
$threads[$comment['id']]['replies'] = [];
} else {
$replies[] = $comment;
}
}
foreach ($replies as $reply) {
if (isset($threads[$reply['parent_id']])) {
$threads[$reply['parent_id']]['replies'][] = $reply;
}
}
This runs as a single database round-trip regardless of how many comments exist. The ORDER BY clause puts top-level comments first (parent_id IS NULL DESC evaluates to 1 for top-level), then sorts replies by their parent and creation time.
Spam Detection Pipeline
Spam detection is not a single check — it is a pipeline of increasingly expensive operations. We run cheap checks first and bail early when possible. The pipeline has four stages: format analysis, content heuristics, rate limiting, and reCAPTCHA verification.
Stage 1: Format Analysis
Before we even look at the content, we check structural indicators. These are nearly free computationally and catch about 40% of spam outright:
function analyzeFormat(string $name, string $email, string $content): float {
$score = 0.0;
// Name contains URL
if (preg_match('#https?://#i', $name)) {
$score += 0.4;
}
// Name is ALL CAPS
if ($name === mb_strtoupper($name) && mb_strlen($name) > 3) {
$score += 0.15;
}
// Disposable email domain
$disposableDomains = ['mailinator.com', 'guerrillamail.com', 'tempmail.com',
'throwaway.email', '10minutemail.com', 'trashmail.com'];
$emailDomain = substr($email, strpos($email, '@') + 1);
if (in_array(strtolower($emailDomain), $disposableDomains)) {
$score += 0.3;
}
// Content is extremely short (under 10 chars) or extremely long (over 5000)
$len = mb_strlen($content);
if ($len < 10) $score += 0.2;
if ($len > 5000) $score += 0.15;
return min($score, 1.0);
}
Stage 2: Content Heuristics
This stage examines what the commenter actually wrote. We look for patterns that show up in spam but rarely in legitimate comments:
function analyzeContent(string $content): float {
$score = 0.0;
$lower = mb_strtolower($content);
// Link density - more than 2 URLs in a comment is suspicious
$urlCount = preg_match_all('#https?://S+#i', $content);
if ($urlCount > 2) $score += 0.3;
if ($urlCount > 5) $score += 0.3;
// Pharmaceutical and gambling keywords
$spamKeywords = ['viagra', 'cialis', 'casino', 'poker', 'bitcoin trading',
'weight loss pills', 'buy cheap', 'free money', 'click here',
'work from home', 'earn $', 'limited offer'];
foreach ($spamKeywords as $keyword) {
if (strpos($lower, $keyword) !== false) {
$score += 0.4;
break;
}
}
// Cyrillic or CJK characters in a Latin-alphabet post
// (legitimate multilingual comments exist, so this is a mild signal)
if (preg_match('/[x{0400}-x{04FF}]/u', $content) &&
!preg_match('/[x{0400}-x{04FF}]{10,}/u', $content)) {
$score += 0.1;
}
// HTML tags that commenters should never use
if (preg_match('/<(script|iframe|object|embed|form)/i', $content)) {
$score += 0.5;
}
return min($score, 1.0);
}
The keyword list is intentionally short. Long keyword lists generate false positives and require constant maintenance. We rely on the combination of all four pipeline stages rather than trying to make any single stage perfect.
Stage 3: Rate Limiting
Legitimate commenters rarely post more than two or three comments in a short window. Spam bots post dozens. We track submission rates per IP and per email:
function checkRateLimit(string $ip, string $email): float {
global $db;
// Comments from this IP in the last hour
$recentByIp = $db->fetch(
"SELECT COUNT(*) as cnt FROM comments
WHERE author_ip = ? AND created_at > DATE_SUB(NOW(), INTERVAL 1 HOUR)",
[$ip]
)['cnt'];
// Comments from this email in the last day
$recentByEmail = $db->fetch(
"SELECT COUNT(*) as cnt FROM comments
WHERE author_email = ? AND created_at > DATE_SUB(NOW(), INTERVAL 1 DAY)",
[$email]
)['cnt'];
$score = 0.0;
if ($recentByIp > 5) $score += 0.3;
if ($recentByIp > 10) $score += 0.4;
if ($recentByEmail > 10) $score += 0.3;
return min($score, 1.0);
}
Stage 4: reCAPTCHA v3 Integration
reCAPTCHA v3 runs invisibly — no checkbox, no image puzzles. It returns a score between 0 (likely bot) and 1 (likely human). The integration has two parts: frontend token generation and backend verification.
Frontend — load the reCAPTCHA script and generate a token when the form is submitted:
<script src="https://www.google.com/recaptcha/api.js?render=YOUR_SITE_KEY"></script>
<script>
document.getElementById('comment-form').addEventListener('submit', function(e) {
e.preventDefault();
grecaptcha.ready(function() {
grecaptcha.execute('YOUR_SITE_KEY', {action: 'comment'}).then(function(token) {
document.getElementById('recaptcha-token').value = token;
e.target.submit();
});
});
});
</script>
<input type="hidden" id="recaptcha-token" name="recaptcha_token">
Backend — verify the token with Google's API:
function verifyRecaptcha(string $token): float {
$secret = get_setting('recaptcha_secret_key');
if (empty($secret) || empty($token)) return 0.5;
$response = file_get_contents(
'https://www.google.com/recaptcha/api/siteverify?' .
http_build_query(['secret' => $secret, 'response' => $token])
);
$result = json_decode($response, true);
if (!$result['success']) return 0.0;
if ($result['action'] !== 'comment') return 0.1;
return (float) $result['score'];
}
The score from reCAPTCHA is inverted before being added to our spam score — a reCAPTCHA score of 0.9 (very human) contributes almost nothing to spam score, while 0.1 (very bot-like) adds significantly.
Combining Pipeline Scores
Each stage returns a score between 0 and 1. We combine them with weighted averaging:
function calculateSpamScore(array $stages, float $recaptchaScore): float {
$weights = [
'format' => 0.20,
'content' => 0.25,
'rateLimit' => 0.20,
'recaptcha' => 0.35,
];
$invertedRecaptcha = 1.0 - $recaptchaScore;
$total = ($stages['format'] * $weights['format'])
+ ($stages['content'] * $weights['content'])
+ ($stages['rateLimit'] * $weights['rateLimit'])
+ ($invertedRecaptcha * $weights['recaptcha']);
return round($total, 2);
}
// Decision thresholds
$spamScore = calculateSpamScore($stages, $recaptchaScore);
if ($spamScore >= 0.70) {
$status = 'spam'; // Auto-reject, never shows in moderation queue
} elseif ($spamScore >= 0.40) {
$status = 'pending'; // Needs manual review
} else {
$status = 'approved'; // Auto-approve
}
These thresholds are configurable from the admin panel. Sites with high traffic might lower the auto-approve threshold to reduce moderation workload. Sites in sensitive niches might raise it so nothing goes live without a human check.
Building the Moderation Queue
The moderation interface is where site admins spend most of their comment-related time, so it needs to be fast and keyboard-friendly. Our moderation page shows pending comments with full context — the post they belong to, the parent comment they replied to (if threaded), the spam score breakdown, and one-click approve/spam/trash actions.
// admin/comments.php - Fetch pending comments with context
$pending = $db->fetchAll(
"SELECT c.*, p.title as post_title, p.slug as post_slug,
parent.content as parent_content, parent.author_name as parent_author
FROM comments c
JOIN posts p ON p.id = c.post_id
LEFT JOIN comments parent ON parent.id = c.parent_id
WHERE c.status = 'pending'
ORDER BY c.created_at DESC
LIMIT 50"
);
// Bulk action handler
if ($_SERVER['REQUEST_METHOD'] === 'POST' && !empty($_POST['action'])) {
$ids = array_map('intval', $_POST['comment_ids'] ?? []);
if (!empty($ids)) {
$placeholders = implode(',', array_fill(0, count($ids), '?'));
$newStatus = match($_POST['action']) {
'approve' => 'approved',
'spam' => 'spam',
'trash' => 'trash',
default => null
};
if ($newStatus) {
$db->execute(
"UPDATE comments SET status = ? WHERE id IN ($placeholders)",
array_merge([$newStatus], $ids)
);
}
}
redirect(ADMIN_URL . '/comments.php?moderated=' . count($ids));
}
The interface also includes keyboard shortcuts: a for approve, s for spam, d for trash, j/k for navigating between comments. This is directly inspired by Gmail's keyboard navigation and it makes processing a queue of 50 comments take under two minutes instead of ten.
Email Notifications Without Inbox Overload
Comment notifications are tricky to get right. Too many emails and the admin disables them. Too few and legitimate comments sit in the queue for days. JekCMS uses a digest approach with configurable thresholds.
Instead of sending one email per comment, we batch notifications. The system checks for unnotified comments every 15 minutes via cron. If there are pending comments, it sends a single digest email with all of them:
// cron/comment-digest.php
$unnotified = $db->fetchAll(
"SELECT c.*, p.title as post_title
FROM comments c
JOIN posts p ON p.id = c.post_id
WHERE c.status = 'pending' AND c.notified = 0
ORDER BY c.created_at ASC"
);
if (empty($unnotified)) exit;
$subject = count($unnotified) . ' new comment' . (count($unnotified) > 1 ? 's' : '') . ' awaiting moderation';
$body = "You have " . count($unnotified) . " comments to review:
";
foreach ($unnotified as $c) {
$body .= "---
";
$body .= "Post: {$c['post_title']}
";
$body .= "From: {$c['author_name']} ({$c['author_email']})
";
$body .= "Spam Score: {$c['spam_score']}
";
$body .= mb_substr($c['content'], 0, 200) . "
";
}
$body .= "Review all: " . ADMIN_URL . "/comments.php
";
send_mail(get_setting('admin_email'), $subject, $body);
// Mark as notified
$ids = array_column($unnotified, 'id');
$placeholders = implode(',', array_fill(0, count($ids), '?'));
$db->execute("UPDATE comments SET notified = 1 WHERE id IN ($placeholders)", $ids);
For reply notifications — when someone replies to a commenter — we send individual emails but only if the original commenter opted in. There is a checkbox on the comment form: "Notify me of replies." We store this preference in a notify_replies column and check it before sending.
An Akismet Alternative: Local Spam Database
Akismet is the industry standard for spam detection, but it requires an external API call for every comment and costs money at scale. For sites that want to stay self-contained, JekCMS maintains a local spam fingerprint database.
When a comment is marked as spam (either automatically or by a moderator), we extract fingerprints and store them:
function recordSpamFingerprint(array $comment): void {
global $db;
// Email fingerprint
$db->insert('spam_fingerprints', [
'type' => 'email',
'value' => hash('sha256', strtolower($comment['author_email'])),
'hits' => 1
]);
// IP fingerprint (with expiry - IPs change)
$db->insert('spam_fingerprints', [
'type' => 'ip',
'value' => hash('sha256', $comment['author_ip']),
'hits' => 1,
'expires' => date('Y-m-d', strtotime('+30 days'))
]);
// Content similarity hash (simhash for fuzzy matching)
$simhash = calculateSimhash($comment['content']);
$db->insert('spam_fingerprints', [
'type' => 'content',
'value' => $simhash,
'hits' => 1
]);
}
During the spam detection pipeline, we check incoming comments against these fingerprints. An email that has been flagged as spam five times gets an automatic 0.5 added to the spam score. A matching content hash gets 0.6. This local database grows more effective over time — the more spam you mark, the better it gets at catching similar spam in the future.
Frontend Comment Display
The comment display template needs to handle three states: approved comments visible to everyone, pending comments visible only to the author (via email cookie), and admin replies highlighted visually. It also needs to produce valid comment markup for search engines.
<section class="comments" id="comments">
<h2><?= $commentCount ?> Comment<?= $commentCount !== 1 ? 's' : '' ?></h2>
<?php foreach ($threads as $comment): ?>
<article class="comment <?= $comment['is_admin_reply'] ? 'comment--admin' : '' ?>"
id="comment-<?= $comment['id'] ?>">
<header class="comment-meta">
<strong><?= htmlspecialchars($comment['author_name']) ?></strong>
<time datetime="<?= $comment['created_at'] ?>">
<?= format_date($comment['created_at']) ?>
</time>
</header>
<div class="comment-body">
<?= nl2br(htmlspecialchars($comment['content'])) ?>
</div>
<button class="reply-btn" data-parent="<?= $comment['id'] ?>">Reply</button>
<?php if (!empty($comment['replies'])): ?>
<div class="comment-replies">
<?php foreach ($comment['replies'] as $reply): ?>
<article class="comment comment--reply" id="comment-<?= $reply['id'] ?>">
<header class="comment-meta">
<strong><?= htmlspecialchars($reply['author_name']) ?></strong>
<time><?= format_date($reply['created_at']) ?></time>
</header>
<div class="comment-body">
<?= nl2br(htmlspecialchars($reply['content'])) ?>
</div>
</article>
<?php endforeach; ?>
</div>
<?php endif; ?>
</article>
<?php endforeach; ?>
</section>
The CSS for comments is minimal — indentation for replies, a left border accent for admin replies, and proper spacing. No avatars by default (Gravatar adds external requests and privacy concerns), though you can enable them in settings if you want.
Comment Schema Markup
Google supports Comment schema markup, which can improve how your pages appear in search results. JekCMS automatically generates this as JSON-LD:
function outputCommentSchema(array $post, array $threads): void {
if (empty($threads)) return;
$comments = [];
foreach ($threads as $thread) {
$commentData = [
'@type' => 'Comment',
'text' => mb_substr($thread['content'], 0, 500),
'dateCreated' => $thread['created_at'],
'author' => [
'@type' => 'Person',
'name' => $thread['author_name']
]
];
$comments[] = $commentData;
}
$schema = [
'@context' => 'https://schema.org',
'@type' => 'DiscussionForumPosting',
'headline' => $post['title'],
'url' => get_post_url($post),
'commentCount' => count($threads),
'comment' => $comments
];
echo '<script type="application/ld+json">' .
json_encode($schema, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES) .
'</script>';
}
Configuration and Settings
All comment settings are managed through Admin > Settings > Comments. The available options include enabling or disabling comments globally, auto-close comments after a configurable number of days, set moderation thresholds for auto-approve and auto-spam, toggle reCAPTCHA and enter API keys, configure notification digest frequency, and enable or disable threaded replies. These settings are stored in the site_settings table under the comments group and cached aggressively since they rarely change.
The settings page also includes a spam statistics dashboard showing how many comments were auto-approved, sent to moderation, and auto-rejected over the past 30 days. This helps admins tune their thresholds based on actual data rather than guesswork.
Closing Thoughts
A comment system is deceptively complex. The textarea and submit button are the easy part. The hard part is everything that happens after submission — validating, scoring, routing, notifying, and displaying comments in a way that encourages legitimate discussion while keeping spam invisible. The pipeline approach works because no single stage needs to be perfect. Each stage catches what it can, and the combined score is far more accurate than any individual check. If you are building on JekCMS, comments are ready to go out of the box. If you are building from scratch, start with the schema and the pipeline — those two foundations make everything else straightforward.