Introduction
Duplicate content is an SEO and user-experience problem. On WordPress sites duplicates can arise from repeated publishing, syndicated content, programmatic imports, REST/API submissions, translations, revisions, attachments and mistakes in editorial workflows. Preventing duplicates as early as possible (prior verification) is the safest way to keep your site clean, avoid search-engine penalties or ranking dilution, and reduce manual cleanup.
What this tutorial covers
- Why duplicate content matters (SEO, analytics, crawl budget)
- Strategies to detect and handle duplicates: exact duplicate, near-duplicate, title/slug duplicates, revisions and attachments
- Step-by-step PHP implementations for WordPress (admin, front-end, REST API and background processing)
- Full example plugin/snippets with code (pre-save checks, hash storage, REST API blocking, admin notices)
- Scalability, performance, database indexing and heavyweight options (search engines, LSH)
- Testing guidance including WP-CLI and unit tests
Why prevent duplicates before saving?
Catching duplicates before they are stored prevents wasted storage, reduces noise in editorial workflows, avoids accidental publishing, and gives you full control over canonicalization decisions. Post-save fixes (301, canonical) are useful, but preventing duplicates is more robust and helps maintain a single authoritative source for each topic.
High-level approaches
- Exact-match detection: Normalize content and compute a deterministic hash (SHA1/MD5). Fast and exact.
- Near-duplicate detection: Use similarity metrics (levenshtein, similar_text) or shingling/MinHash for large collections. More complex and computation-heavy.
- Title/slug collision detection: Prevent posts with the same slug or title from being published; WordPress already enforces slug uniqueness with wp_unique_post_slug.
- Postmeta fingerprinting: Store fingerprints as postmeta for fast lookup rather than scanning full content on every save.
- REST/API checks: Block duplicates on REST endpoints and external submitters.
Design decisions and tradeoffs
- Exact hash is cheap and reliable for verbatim duplicates, but misses paraphrases/near-duplicates.
- Near-duplicate detection finds paraphrases but needs O(n) comparisons or specialized LSH/Elasticsearch. Use near-duplicate only for smaller sets or via external indexing.
- Where to block: Admin UI (wp_insert_post_data), REST API (rest_pre_insert), front-end submission handlers, and WP-CLI imports must all be covered.
- Storage: Use postmeta _content_hash for quick lookup. For very large sites consider a dedicated normalized table with indexes.
Core implementation plan (summary)
- Normalize incoming post content (strip tags, collapse whitespace, case-fold).
- Compute a content fingerprint (sha1 recommended).
- Before saving, check the fingerprint against existing posts (skip revisions/auto-drafts and optionally skip attachments).
- If duplicate found:
- Prevent publish (set status to draft or return an error for REST API).
- Save a meta record linking to original (e.g. _duplicate_of).
- Present an admin notice explaining the reason.
- Optionally: add rel=canonical to the duplicate pointing to the original, or 301 redirect on front-end if appropriate.
- Save fingerprint to postmeta when a post is saved successfully.
- Backfill existing posts fingerprints in a background process (cron or WP-CLI).
Complete example: a plugin-style implementation
The code below shows a complete, pragmatic implementation. It:
- Normalizes content and computes a SHA1 hash.
- Checks for exact duplicates before insert/save (wp_insert_post_data and REST API).
- Stores the hash in postmeta _content_hash on save.
- Shows an admin notice when a duplicate is blocked.
- Includes a background backfill for missing hashes via WP-Cron and a WP-CLI example (commented).
```php
<?php
/**
 * Plugin Name: Prevent Duplicate Content (prior verification)
 * Description: Blocks exact duplicate content before it is saved via the admin, the REST API or imports.
 */

if ( ! defined( 'ABSPATH' ) ) {
    exit;
}

class WP_Prevent_Duplicate_Content {

    private $meta_key         = '_content_hash';
    private $notice_transient = 'wp_prevent_dup_notice_';

    public function __construct() {
        // Pre-save check for admin and programmatic inserts.
        add_filter( 'wp_insert_post_data', array( $this, 'check_for_duplicate_on_insert' ), 10, 2 );
        // REST API pre-insert checks (add more post types as needed).
        add_filter( 'rest_pre_insert_post', array( $this, 'rest_check_for_duplicate' ), 10, 2 );
        add_filter( 'rest_pre_insert_page', array( $this, 'rest_check_for_duplicate' ), 10, 2 );
        // Store the fingerprint once the post is saved.
        add_action( 'save_post', array( $this, 'save_content_hash' ), 10, 3 );
        // Editor feedback.
        add_action( 'admin_notices', array( $this, 'maybe_show_admin_notice' ) );
        // Background backfill via WP-Cron.
        add_action( 'wp_prevent_dup_backfill', array( $this, 'backfill_content_hashes' ) );
    }

    // Normalize content: strip tags, decode entities, collapse whitespace, lowercase.
    private function normalize_content( $content ) {
        $text = wp_strip_all_tags( (string) $content );
        $text = html_entity_decode( $text, ENT_QUOTES, 'UTF-8' );
        $text = preg_replace( '/\s+/u', ' ', $text );
        return strtolower( trim( $text ) );
    }

    // Compute the fingerprint for a piece of content.
    private function content_hash( $content ) {
        $norm = $this->normalize_content( $content );
        // If empty after normalization, return null to avoid colliding empty posts.
        if ( $norm === '' ) {
            return null;
        }
        return sha1( $norm );
    }

    // Check duplication using $wpdb for speed; skip trash and the post being updated.
    private function find_duplicate_by_hash( $hash, $exclude_post_id = 0, $post_type = 'post' ) {
        global $wpdb;

        if ( empty( $hash ) ) {
            return false;
        }

        $query = "SELECT pm.post_id
            FROM {$wpdb->postmeta} pm
            INNER JOIN {$wpdb->posts} p ON pm.post_id = p.ID
            WHERE pm.meta_key = %s
                AND pm.meta_value = %s
                AND p.post_status != 'trash'
                AND p.post_type = %s
                AND p.ID != %d
            LIMIT 1";

        $result = $wpdb->get_var( $wpdb->prepare( $query, $this->meta_key, $hash, $post_type, $exclude_post_id ) );

        return $result ? intval( $result ) : false;
    }

    // wp_insert_post_data filter callback.
    public function check_for_duplicate_on_insert( $data, $postarr ) {
        // Skip attachments and auto-drafts.
        if ( ! empty( $postarr['post_type'] ) && $postarr['post_type'] === 'attachment' ) {
            return $data;
        }
        if ( isset( $postarr['post_status'] ) && $postarr['post_status'] === 'auto-draft' ) {
            return $data;
        }

        // Provide a fallback content string.
        $content = isset( $postarr['post_content'] )
            ? $postarr['post_content']
            : ( isset( $data['post_content'] ) ? $data['post_content'] : '' );

        $hash = $this->content_hash( $content );

        // If the hash is null (empty content), skip the duplicate check.
        if ( $hash === null ) {
            return $data;
        }

        $post_id   = isset( $postarr['ID'] ) ? intval( $postarr['ID'] ) : 0;
        $post_type = isset( $postarr['post_type'] ) ? $postarr['post_type'] : 'post';

        $dup_id = $this->find_duplicate_by_hash( $hash, $post_id, $post_type );

        if ( $dup_id ) {
            // Prevent the post from being published: force draft status.
            $data['post_status'] = 'draft';

            // Store duplicate info in a transient keyed by user so we can show an admin notice.
            $notice = array(
                'message'      => sprintf( 'Duplicate detected: this content seems identical to post #%d. The post was saved as draft.', $dup_id ),
                'duplicate_of' => $dup_id,
            );

            // Keep the transient short-lived.
            set_transient( $this->notice_transient . get_current_user_id(), $notice, 30 );

            // Optionally prefix the excerpt so editors can see the automated action.
            if ( ! empty( $data['post_excerpt'] ) ) {
                $data['post_excerpt'] = '[Duplicate prevented] ' . $data['post_excerpt'];
            } else {
                $data['post_excerpt'] = '[Duplicate prevented]';
            }
        }

        return $data;
    }

    // Save the hash as postmeta after a successful save.
    public function save_content_hash( $post_id, $post, $update ) {
        // Skip revisions, autosaves and attachments.
        if ( wp_is_post_revision( $post_id ) || wp_is_post_autosave( $post_id ) || $post->post_type === 'attachment' ) {
            return;
        }

        $hash = $this->content_hash( $post->post_content );

        if ( $hash === null ) {
            // Remove the meta when the content is empty.
            delete_post_meta( $post_id, $this->meta_key );
        } else {
            update_post_meta( $post_id, $this->meta_key, $hash );
        }
    }

    // Show an admin notice based on the transient.
    public function maybe_show_admin_notice() {
        $transient_key = $this->notice_transient . get_current_user_id();
        $notice        = get_transient( $transient_key );

        if ( $notice && is_array( $notice ) ) {
            printf(
                '<div class="notice notice-warning is-dismissible"><p>%s</p></div>',
                esc_html( $notice['message'] )
            );
            delete_transient( $transient_key );
        }
    }

    // rest_pre_insert_{post_type} callback: $prepared_post is an object of post fields
    // in current WP versions; handle arrays defensively.
    public function rest_check_for_duplicate( $prepared_post, $request ) {
        if ( is_array( $prepared_post ) ) {
            $content   = isset( $prepared_post['post_content'] ) ? $prepared_post['post_content'] : '';
            $post_type = isset( $prepared_post['post_type'] ) ? $prepared_post['post_type'] : 'post';
            $post_id   = isset( $prepared_post['ID'] ) ? intval( $prepared_post['ID'] ) : 0;
        } elseif ( is_object( $prepared_post ) ) {
            $content   = isset( $prepared_post->post_content ) ? $prepared_post->post_content : '';
            $post_type = isset( $prepared_post->post_type ) ? $prepared_post->post_type : 'post';
            $post_id   = isset( $prepared_post->ID ) ? intval( $prepared_post->ID ) : 0;
        } else {
            return $prepared_post;
        }

        $hash = $this->content_hash( $content );
        if ( $hash === null ) {
            return $prepared_post;
        }

        $dup_id = $this->find_duplicate_by_hash( $hash, $post_id, $post_type );
        if ( $dup_id ) {
            // Return a WP_Error to prevent the REST creation.
            return new WP_Error(
                'duplicate_content',
                sprintf( 'Duplicate content detected: identical to post #%d', $dup_id ),
                array( 'status' => 409 )
            );
        }

        return $prepared_post;
    }

    // Backfill missing hashes in batches (cron job).
    public function backfill_content_hashes() {
        $args = array(
            'post_type'      => 'any',
            'post_status'    => array( 'publish', 'private', 'draft' ),
            'posts_per_page' => 100,
            'meta_query'     => array(
                array(
                    'key'     => $this->meta_key,
                    'compare' => 'NOT EXISTS',
                ),
            ),
            'fields'         => 'ids',
        );

        $q = new WP_Query( $args );

        if ( $q->have_posts() ) {
            foreach ( $q->posts as $id ) {
                $post = get_post( $id );
                if ( $post ) {
                    $hash = $this->content_hash( $post->post_content );
                    if ( $hash !== null ) {
                        update_post_meta( $id, $this->meta_key, $hash );
                    }
                }
            }
        }

        wp_reset_postdata();
    }
}

// Initialize.
new WP_Prevent_Duplicate_Content();

// Optional: schedule the backfill on plugin activation (example).
register_activation_hook( __FILE__, function() {
    if ( ! wp_next_scheduled( 'wp_prevent_dup_backfill' ) ) {
        wp_schedule_event( time(), 'hourly', 'wp_prevent_dup_backfill' );
    }
} );

register_deactivation_hook( __FILE__, function() {
    wp_clear_scheduled_hook( 'wp_prevent_dup_backfill' );
} );

// WP-CLI command to backfill (for huge sites, use WP-CLI to run it once):
if ( defined( 'WP_CLI' ) && WP_CLI ) {
    WP_CLI::add_command( 'wp-prevent-dup backfill', function() {
        ( new WP_Prevent_Duplicate_Content() )->backfill_content_hashes();
        WP_CLI::success( 'Backfill complete' );
    } );
}
```
Explanation of key parts
- normalize_content(): Uses wp_strip_all_tags and html_entity_decode, collapses whitespace and lowercases. This ensures that content which differs only in whitespace or HTML markup hashes the same way.
- content_hash(): Computes sha1 of the normalized content. Use sha1 for reasonable collision resistance and small footprint. MD5 is also possible but slightly less collision resistant.
- find_duplicate_by_hash(): Fast wpdb query joining postmeta and posts to locate an existing post with same hash, skipping the currently-updated post and trash. This relies on storing hash in postmeta for existing posts.
- wp_insert_post_data filter: Runs before WordPress inserts/updates post data. Used here to set the post status to draft and store a transient for the admin notice when a duplicate is detected. This approach works in both the classic and block editors.
- rest_pre_insert_{post_type} (e.g. rest_pre_insert_post): Returning a WP_Error with HTTP status 409 blocks duplicate posts created via the REST API.
- save_post action: After a post is saved, we write the content hash to postmeta so future inserts can check quickly.
- Backfill: For existing posts that lack a hash, a cron or WP-CLI backfill computes and stores the hash in batches to avoid heavy load.
Near-duplicate detection (optional)
Exact hashes only catch verbatim copies. For near-duplicates you can compute similarity percentages. The simplest approach is to compare normalized strings with PHP's similar_text() or levenshtein(). For larger sites, use shingling and MinHash, or an external index (Elasticsearch/Algolia) with fuzzy matching.
Simple percent similarity example (small sites)
```php
<?php
// Assumes the normalize_content() helper from the plugin above.
$norm_new      = $this->normalize_content( $new_content );
$norm_existing = $this->normalize_content( $existing_content );

$similar = 0;
similar_text( $norm_new, $norm_existing, $similar ); // $similar receives a percentage.

if ( $similar >= 85 ) {
    // Treat as duplicate / near-duplicate.
}
```
Caveats: similar_text and levenshtein are O(n^2) for large strings and comparing a new post against thousands of posts is slow. For production-level near-duplicate detection use external indexers or locality-sensitive hashing (LSH).
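To make the shingling/MinHash idea concrete, the sketch below reduces each post to a compact signature that can be compared (or bucketed with LSH) instead of running similar_text() against every existing post. It is a rough, self-contained illustration: the wppdc_* function names, the 5-word shingle size and the seeded crc32 "hash family" are illustrative choices, not an existing library API.

```php
<?php
// Break normalized text into unique word 5-grams ("shingles").
function wppdc_shingles( $text, $size = 5 ) {
    $words    = preg_split( '/\s+/u', strtolower( trim( wp_strip_all_tags( (string) $text ) ) ) );
    $shingles = array();
    for ( $i = 0; $i + $size <= count( $words ); $i++ ) {
        $shingles[ implode( ' ', array_slice( $words, $i, $size ) ) ] = true;
    }
    return array_keys( $shingles );
}

// Build a MinHash signature: for each seed, keep the minimum hash over all shingles.
function wppdc_minhash_signature( $shingles, $num_hashes = 64 ) {
    $signature = array();
    for ( $seed = 0; $seed < $num_hashes; $seed++ ) {
        $min = PHP_INT_MAX;
        foreach ( $shingles as $s ) {
            // A seeded crc32 stands in for a proper hash family; fine for a sketch.
            $h = crc32( $seed . '|' . $s );
            if ( $h < $min ) {
                $min = $h;
            }
        }
        $signature[] = $min;
    }
    return $signature;
}

// The fraction of matching signature positions estimates the Jaccard similarity.
function wppdc_estimated_jaccard( $sig_a, $sig_b ) {
    $n = min( count( $sig_a ), count( $sig_b ) );
    if ( $n === 0 ) {
        return 0.0;
    }
    $matches = 0;
    for ( $i = 0; $i < $n; $i++ ) {
        if ( $sig_a[ $i ] === $sig_b[ $i ] ) {
            $matches++;
        }
    }
    return $matches / $n;
}

// Usage sketch: skip empty/very short posts, store the signature (e.g. as serialized postmeta)
// and compare a new post's signature against candidates; treat >= 0.8 as a near-duplicate.
$sig_new      = wppdc_minhash_signature( wppdc_shingles( $new_content ) );
$sig_existing = wppdc_minhash_signature( wppdc_shingles( $existing_content ) );
if ( wppdc_estimated_jaccard( $sig_new, $sig_existing ) >= 0.8 ) {
    // Treat as near-duplicate.
}
```

Because signatures are small and fixed-length, they can be precomputed and indexed, which avoids the O(n^2) string comparisons mentioned above.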
Slug and title duplicates
Slugs are handled by WordPress with wp_unique_post_slug by default, but if you want to prevent publishing posts with the same title entirely, check get_page_by_title() (deprecated since WordPress 6.2 in favour of a WP_Query on post_title) before saving and block or flag as desired.
```php
<?php
// Inside a pre-save check (e.g. the wp_insert_post_data callback), $postarr holds the incoming post.
$existing = get_page_by_title( $postarr['post_title'], OBJECT, $postarr['post_type'] );

if ( $existing && ( empty( $postarr['ID'] ) || $existing->ID != $postarr['ID'] ) ) {
    // Title duplicate found: block, flag or force draft as desired.
}
```
Canonicalization and front-end handling
If a duplicate exists and you choose to allow publishing (for editorial reasons), you must instruct search engines which copy to index:
- Use a rel=canonical link to the original post's permalink on the duplicate page.
- OR serve a 301 redirect from the duplicate URL to the canonical post (only do this if duplicates are truly redundant).
- OR mark the duplicate as noindex via meta robots if the original should be indexed.
Example: add rel=canonical for duplicate posts
```php
<?php
add_action( 'wp_head', function() {
    if ( is_singular() ) {
        global $post;
        $orig_id = get_post_meta( $post->ID, '_duplicate_of', true );
        if ( $orig_id ) {
            $url = get_permalink( intval( $orig_id ) );
            if ( $url ) {
                // You may also want to remove WordPress' default canonical (rel_canonical)
                // so two canonical tags are not emitted.
                echo '<link rel="canonical" href="' . esc_url( $url ) . '" />' . "\n";
            }
        }
    }
} );
```
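If you prefer the redirect or noindex options from the list above, the sketches below show one way to wire them up. Both assume the duplicate post stores the original's ID in the same _duplicate_of meta key; the hook choices are illustrative and not part of the plugin above.

```php
<?php
// Sketch: 301-redirect a published duplicate to its original on the front end.
add_action( 'template_redirect', function() {
    if ( ! is_singular() ) {
        return;
    }
    $orig_id = intval( get_post_meta( get_queried_object_id(), '_duplicate_of', true ) );
    if ( ! $orig_id || $orig_id === get_queried_object_id() ) {
        return;
    }
    $url = get_permalink( $orig_id );
    if ( $url ) {
        wp_safe_redirect( $url, 301 );
        exit;
    }
} );
```

Alternatively, to keep the duplicate reachable but excluded from the index, WordPress 5.7+ provides the wp_robots filter:

```php
<?php
// Sketch: mark duplicates as noindex,follow instead of redirecting.
add_filter( 'wp_robots', function( $robots ) {
    if ( is_singular() && get_post_meta( get_queried_object_id(), '_duplicate_of', true ) ) {
        $robots['noindex'] = true;
        $robots['follow']  = true;
    }
    return $robots;
} );
```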
Scaling and performance advice
- Indexing: Postmeta lookups by meta_key/meta_value can be slow on large sites. Consider a custom table like wp_content_fingerprints with columns (post_id, post_type, content_hash, created_at) and an index on (content_hash).
- Batch backfills: Use WP-CLI to run backfill once in batches or run WP-Cron hourly with a conservative posts_per_page (e.g. 100).
- Near-duplicate at scale: Use search engines (Elasticsearch) or LSH libraries. Index normalized content and use built-in fuzzy search or vector similarity.
- Long meta_value: If your fingerprint approach stored long strings, you would run into MySQL index length limits (191 characters with utf8mb4). Fixed-length hashes (sha1 is 40 chars) keep you safe.
Security and best practices
- Always sanitize input and fully escape output (wp_strip_all_tags, esc_html, esc_url).
- Use prepared SQL queries if you access wpdb directly (wpdb->prepare).
- Only set transients or options with expiration to avoid stale notices.
- Consider user capability checks: allow admins to override duplicates if your workflow demands it.
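To make that override concrete, the duplicate check in the class above could be wrapped in a filter so trusted roles can bypass it. The wppdc_skip_duplicate_check filter name below is hypothetical: it would need to be applied inside check_for_duplicate_on_insert() for this sketch to take effect.

```php
<?php
// Hypothetical override hook. Inside check_for_duplicate_on_insert(), before the hash lookup:
//
//   if ( apply_filters( 'wppdc_skip_duplicate_check', false, $postarr ) ) {
//       return $data;
//   }
//
// A site can then let administrators publish duplicates deliberately:
add_filter( 'wppdc_skip_duplicate_check', function( $skip, $postarr ) {
    // Users who can manage options (typically administrators) bypass the check.
    if ( current_user_can( 'manage_options' ) ) {
        return true;
    }
    return $skip;
}, 10, 2 );
```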
Testing strategies
- Manual tests: Try posting the same content via admin UI, REST API and frontend forms. Confirm duplicates are blocked and notices appear.
- Automated tests: Use WP_UnitTestCase to create posts and assert meta exists and new duplicates are blocked.
- WP-CLI: Useful for bulk backfill and batch testing on staging environments.
Example WP unit test snippet
```php
<?php
class Test_Prevent_Duplicate_Content extends WP_UnitTestCase {

    public function test_duplicate_is_saved_as_draft() {
        $author = $this->factory->user->create( array( 'role' => 'editor' ) );
        wp_set_current_user( $author );

        $content = 'This is a unique content sample.';

        // Create the first post.
        $post_id = wp_insert_post( array(
            'post_title'   => 'Sample',
            'post_content' => $content,
            'post_status'  => 'publish',
        ) );
        $this->assertGreaterThan( 0, $post_id );

        // Attempt to create a duplicate.
        $post_id2 = wp_insert_post( array(
            'post_title'   => 'Sample Duplicate',
            'post_content' => $content,
            'post_status'  => 'publish',
        ) );

        // Our plugin sets duplicates to draft; ensure the second post was not published.
        $this->assertEquals( 'draft', get_post_status( $post_id2 ) );
    }
}
```
Edge cases and special considerations
- Attachments: Images and media sometimes have duplicate captions or descriptions. Decide if attachments should be hashed/checked.
- Revisions: Revisions should not be treated as sources of duplicates; skip them.
- Multisite: Decide whether duplication across sites in a network is relevant. Postmeta lookups are site-scoped, so for cross-site checking use a central index or a network-wide table.
- Translations: If you intentionally create translated duplicates, use language tags and exclude those from duplicate checks (see the sketch after this list).
- Imports: For CSV or feed imports, run backfill/hashing and check duplicates prior to insert. Prefer import pipelines that call wp_insert_post_data filter or your custom checks.
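As a sketch of the translation exclusion, the snippet below skips the check for posts that already carry a language term. It assumes a multilingual plugin that registers a 'language' taxonomy (as Polylang does) and reuses the hypothetical wppdc_skip_duplicate_check filter from the security section; adapt it to however your check is wired.

```php
<?php
// Skip the duplicate check for posts assigned to any language term.
add_filter( 'wppdc_skip_duplicate_check', function( $skip, $postarr ) {
    $post_id = isset( $postarr['ID'] ) ? intval( $postarr['ID'] ) : 0;
    if ( $post_id && taxonomy_exists( 'language' ) && has_term( '', 'language', $post_id ) ) {
        return true; // Translated copies are intentional; don't flag them.
    }
    return $skip;
}, 10, 2 );
```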
Database design recommendation for very large sites
The wp_postmeta table is not ideal for frequent meta-value lookups at scale. For sites with hundreds of thousands of posts:
| Option | Pros | Cons |
| --- | --- | --- |
| Use postmeta with SHA1 (40 chars) | Simple to implement; integrates with current code | Postmeta table gets large; joins may be slow without additional indexing |
| Custom table (recommended) | Fast indexed lookups on content_hash; small, compact rows | More code to maintain; must be kept in sync with posts |
| External index (Elasticsearch) | Powerful fuzzy/near-duplicate search; scalable | Extra infrastructure and integration complexity |
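If you choose the custom-table option, a minimal sketch could look like the following. The wp_content_fingerprints table name matches the suggestion above, while the wppdc_* function names and the exact schema are illustrative; run the creation code once, e.g. on plugin activation.

```php
<?php
// Create the fingerprint table (dbDelta requires wp-admin/includes/upgrade.php).
function wppdc_create_fingerprint_table() {
    global $wpdb;
    require_once ABSPATH . 'wp-admin/includes/upgrade.php';

    $table = $wpdb->prefix . 'content_fingerprints';
    dbDelta( "CREATE TABLE {$table} (
        post_id BIGINT UNSIGNED NOT NULL,
        post_type VARCHAR(20) NOT NULL DEFAULT 'post',
        content_hash CHAR(40) NOT NULL,
        created_at DATETIME NOT NULL,
        PRIMARY KEY  (post_id),
        KEY content_hash (content_hash)
    ) {$wpdb->get_charset_collate()};" );
}

// Indexed lookup, used in place of the postmeta join in the plugin above.
function wppdc_find_duplicate_in_table( $hash, $exclude_post_id = 0 ) {
    global $wpdb;
    $table = $wpdb->prefix . 'content_fingerprints';

    return $wpdb->get_var( $wpdb->prepare(
        "SELECT post_id FROM {$table} WHERE content_hash = %s AND post_id != %d LIMIT 1",
        $hash,
        $exclude_post_id
    ) );
}
```

The single-column index on content_hash keeps lookups fast regardless of how large wp_postmeta grows, at the cost of keeping the table in sync on save, trash and delete.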
Putting it all together: deployment checklist
- Install plugin or add snippet to mu-plugins for early execution.
- Backfill existing posts' content hashes (WP-CLI recommended on large sites).
- Test with sample duplicates via admin, REST, front-end and import flows.
- Decide on canonicalization policy (redirect, canonical tag, noindex) for allowed duplicates.
- Monitor logs and user reports for false positives, and provide an admin override if needed.
- For large sites, implement a custom table or external index for performance.
Further enhancements you can add
- Add an admin UI to review blocked duplicates and choose to merge or publish (see the sketch after this list).
- Integrate with editorial workflows: queue duplicates for review with suggested canonical.
- Implement fuzzy matching with threshold options (e.g. 85% similarity).
- Track sources (imports, API user ids) and warn when an external source repeatedly submits duplicates.
- Provide a WP-CLI tool to export duplicates and produce reconciliation reports.
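As a starting point for the review UI mentioned above, the sketch below registers a bare-bones Tools page listing draft posts flagged with the _duplicate_of meta key. The menu slug, page title and capability are illustrative.

```php
<?php
// Minimal review screen for blocked duplicates.
add_action( 'admin_menu', function() {
    add_management_page( 'Blocked duplicates', 'Blocked duplicates', 'edit_others_posts', 'wppdc-duplicates', function() {
        $dupes = get_posts( array(
            'post_type'      => 'any',
            'post_status'    => 'draft',
            'posts_per_page' => 50,
            'meta_key'       => '_duplicate_of', // Only posts flagged as duplicates.
        ) );

        echo '<div class="wrap"><h1>Blocked duplicates</h1><ul>';
        foreach ( $dupes as $p ) {
            $orig = intval( get_post_meta( $p->ID, '_duplicate_of', true ) );
            printf(
                '<li><a href="%s">%s</a> &rarr; duplicate of <a href="%s">post #%d</a></li>',
                esc_url( get_edit_post_link( $p->ID ) ),
                esc_html( get_the_title( $p ) ),
                esc_url( get_edit_post_link( $orig ) ),
                $orig
            );
        }
        echo '</ul></div>';
    } );
} );
```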
References and useful links
- WordPress Hooks (Actions & Filters)
- WordPress REST API
- WP-CLI
- Google: Consolidate duplicate URLs (canonical)
Summary
Preventing duplicate content in WordPress via prior verification can be achieved with a pragmatic combination of normalization, hashing, and pre-save checks. Store fingerprints in postmeta (or a custom table for scale), block duplicates at admin, REST and import points, show clear editor feedback, and backfill existing content. For paraphrase/near-duplicate detection, upgrade to similarity algorithms or external search services. The approach in this article gives you a strong foundation for exact-duplicate prevention with sensible performance and user experience tradeoffs.