Introduction
Duplicate content is an SEO and user-experience problem. On WordPress sites duplicates can arise from repeated publishing, syndicated content, programmatic imports, REST/API submissions, translations, revisions, attachments and mistakes in editorial workflows. Preventing duplicates as early as possible (prior verification) is the safest way to keep your site clean, avoid search-engine penalties or ranking dilution, and reduce manual cleanup.
What this tutorial covers
- Why duplicate content matters (SEO, analytics, crawl budget)
- Strategies to detect and handle duplicates: exact duplicate, near-duplicate, title/slug duplicates, revisions and attachments
- Step-by-step PHP implementations for WordPress (admin, front-end, REST API and background processing)
- Full example plugin/snippets with code (pre-save checks, hash storage, REST API blocking, admin notices)
- Scalability, performance, database indexing and heavyweight options (search engines, LSH)
- Testing guidance including WP-CLI and unit tests
Why prevent duplicates before saving?
Catching duplicates before they are stored prevents wasted storage, reduces noise in editorial workflows, avoids accidental publishing, and gives you full control over canonicalization decisions. Post-save fixes (301, canonical) are useful, but preventing duplicates is more robust and helps maintain a single authoritative source for each topic.
High-level approaches
- Exact-match detection: Normalize content and compute a deterministic hash (SHA1/MD5). Fast and exact.
- Near-duplicate detection: Use similarity metrics (levenshtein, similar_text) or shingling/MinHash for large collections. More complex and computation-heavy.
- Title/slug collision detection: Prevent posts with the same slug or title from being published; WordPress already enforces slug uniqueness with wp_unique_post_slug.
- Postmeta fingerprinting: Store fingerprints as postmeta for fast lookup rather than scanning full content on every save.
- REST/API checks: Block duplicates on REST endpoints and external submitters.
Design decisions and tradeoffs
- Exact hash is cheap and reliable for verbatim duplicates, but misses paraphrases/near-duplicates.
- Near-duplicate detection finds paraphrases but needs O(n) comparisons or specialized LSH/Elasticsearch. Use near-duplicate only for smaller sets or via external indexing.
- Where to block: Admin UI (wp_insert_post_data), REST API (rest_pre_insert), front-end submission handlers, and WP-CLI imports must all be covered.
- Storage: Use postmeta _content_hash for quick lookup. For very large sites consider a dedicated normalized table with indexes.
Core implementation plan (summary)
- Normalize incoming post content (strip tags, collapse whitespace, case-fold).
- Compute a content fingerprint (sha1 recommended).
- Before saving, check the fingerprint against existing posts (skip revisions/auto-drafts and optionally skip attachments).
- If duplicate found:
- Prevent publish (set status to draft or return an error for REST API).
- Save a meta record linking to original (e.g. _duplicate_of).
- Present an admin notice explaining the reason.
- Optionally: add rel=canonical to the duplicate pointing to the original, or 301 redirect on front-end if appropriate.
- Save fingerprint to postmeta when a post is saved successfully.
- Backfill existing posts fingerprints in a background process (cron or WP-CLI).
Complete example: a plugin-style implementation
The code below shows a complete, pragmatic implementation. It:
- Normalizes content and computes a SHA1 hash.
- Checks for exact duplicates before insert/save (wp_insert_post_data and REST API).
- Stores the hash in postmeta _content_hash on save.
- Shows an admin notice when a duplicate is blocked.
- Includes a background backfill for missing hashes via WP-Cron and a WP-CLI example (commented).
```php
<?php
/**
 * Plugin Name: Prevent Duplicate Content (prior verification)
 * Description: Blocks exact duplicate content before it is saved via the admin, the REST API or imports.
 */

if ( ! defined( 'ABSPATH' ) ) {
    exit;
}

class WP_Prevent_Duplicate_Content {

    private $meta_key         = '_content_hash';
    private $notice_transient = 'wp_prevent_dup_notice_';

    public function __construct() {
        // Pre-save check for admin and programmatic inserts.
        add_filter( 'wp_insert_post_data', array( $this, 'check_for_duplicate_on_insert' ), 10, 2 );
        // REST API pre-insert checks (add more post types as needed).
        add_filter( 'rest_pre_insert_post', array( $this, 'rest_check_for_duplicate' ), 10, 2 );
        add_filter( 'rest_pre_insert_page', array( $this, 'rest_check_for_duplicate' ), 10, 2 );
        // Store the fingerprint once the post is saved.
        add_action( 'save_post', array( $this, 'save_content_hash' ), 10, 3 );
        // Editor feedback.
        add_action( 'admin_notices', array( $this, 'maybe_show_admin_notice' ) );
        // Background backfill via WP-Cron.
        add_action( 'wp_prevent_dup_backfill', array( $this, 'backfill_content_hashes' ) );
    }

    // Normalize content: strip tags, decode entities, collapse whitespace, lowercase.
    private function normalize_content( $content ) {
        $text = wp_strip_all_tags( (string) $content );
        $text = html_entity_decode( $text, ENT_QUOTES, 'UTF-8' );
        $text = preg_replace( '/\s+/u', ' ', $text );
        return strtolower( trim( $text ) );
    }

    // Compute the fingerprint for a piece of content.
    private function content_hash( $content ) {
        $norm = $this->normalize_content( $content );
        // If empty after normalization, return null to avoid colliding empty posts.
        if ( $norm === '' ) {
            return null;
        }
        return sha1( $norm );
    }

    // Check duplication using $wpdb for speed; skip trash and the post being updated.
    private function find_duplicate_by_hash( $hash, $exclude_post_id = 0, $post_type = 'post' ) {
        global $wpdb;

        if ( empty( $hash ) ) {
            return false;
        }

        $query = "SELECT pm.post_id
            FROM {$wpdb->postmeta} pm
            INNER JOIN {$wpdb->posts} p ON pm.post_id = p.ID
            WHERE pm.meta_key = %s
                AND pm.meta_value = %s
                AND p.post_status != 'trash'
                AND p.post_type = %s
                AND p.ID != %d
            LIMIT 1";

        $result = $wpdb->get_var( $wpdb->prepare( $query, $this->meta_key, $hash, $post_type, $exclude_post_id ) );

        return $result ? intval( $result ) : false;
    }

    // wp_insert_post_data filter callback.
    public function check_for_duplicate_on_insert( $data, $postarr ) {
        // Skip attachments and auto-drafts.
        if ( ! empty( $postarr['post_type'] ) && $postarr['post_type'] === 'attachment' ) {
            return $data;
        }
        if ( isset( $postarr['post_status'] ) && $postarr['post_status'] === 'auto-draft' ) {
            return $data;
        }

        // Provide a fallback content string.
        $content = isset( $postarr['post_content'] )
            ? $postarr['post_content']
            : ( isset( $data['post_content'] ) ? $data['post_content'] : '' );

        $hash = $this->content_hash( $content );

        // If the hash is null (empty content), skip the duplicate check.
        if ( $hash === null ) {
            return $data;
        }

        $post_id   = isset( $postarr['ID'] ) ? intval( $postarr['ID'] ) : 0;
        $post_type = isset( $postarr['post_type'] ) ? $postarr['post_type'] : 'post';

        $dup_id = $this->find_duplicate_by_hash( $hash, $post_id, $post_type );

        if ( $dup_id ) {
            // Prevent the post from being published: force draft status.
            $data['post_status'] = 'draft';

            // Store duplicate info in a transient keyed by user so we can show an admin notice.
            $notice = array(
                'message'      => sprintf( 'Duplicate detected: this content seems identical to post #%d. The post was saved as draft.', $dup_id ),
                'duplicate_of' => $dup_id,
            );

            // Keep the transient short-lived.
            set_transient( $this->notice_transient . get_current_user_id(), $notice, 30 );

            // Optionally prefix the excerpt so editors can see the automated action.
            if ( ! empty( $data['post_excerpt'] ) ) {
                $data['post_excerpt'] = '[Duplicate prevented] ' . $data['post_excerpt'];
            } else {
                $data['post_excerpt'] = '[Duplicate prevented]';
            }
        }

        return $data;
    }

    // Save the hash as postmeta after a successful save.
    public function save_content_hash( $post_id, $post, $update ) {
        // Skip revisions, autosaves and attachments.
        if ( wp_is_post_revision( $post_id ) || wp_is_post_autosave( $post_id ) || $post->post_type === 'attachment' ) {
            return;
        }

        $hash = $this->content_hash( $post->post_content );

        if ( $hash === null ) {
            // Remove the meta when the content is empty.
            delete_post_meta( $post_id, $this->meta_key );
        } else {
            update_post_meta( $post_id, $this->meta_key, $hash );
        }
    }

    // Show an admin notice based on the transient.
    public function maybe_show_admin_notice() {
        $transient_key = $this->notice_transient . get_current_user_id();
        $notice        = get_transient( $transient_key );

        if ( $notice && is_array( $notice ) ) {
            printf(
                '<div class="notice notice-warning is-dismissible"><p>%s</p></div>',
                esc_html( $notice['message'] )
            );
            delete_transient( $transient_key );
        }
    }

    // rest_pre_insert_{post_type} callback: $prepared_post is an object of post fields
    // in current WP versions; handle arrays defensively.
    public function rest_check_for_duplicate( $prepared_post, $request ) {
        if ( is_array( $prepared_post ) ) {
            $content   = isset( $prepared_post['post_content'] ) ? $prepared_post['post_content'] : '';
            $post_type = isset( $prepared_post['post_type'] ) ? $prepared_post['post_type'] : 'post';
            $post_id   = isset( $prepared_post['ID'] ) ? intval( $prepared_post['ID'] ) : 0;
        } elseif ( is_object( $prepared_post ) ) {
            $content   = isset( $prepared_post->post_content ) ? $prepared_post->post_content : '';
            $post_type = isset( $prepared_post->post_type ) ? $prepared_post->post_type : 'post';
            $post_id   = isset( $prepared_post->ID ) ? intval( $prepared_post->ID ) : 0;
        } else {
            return $prepared_post;
        }

        $hash = $this->content_hash( $content );
        if ( $hash === null ) {
            return $prepared_post;
        }

        $dup_id = $this->find_duplicate_by_hash( $hash, $post_id, $post_type );
        if ( $dup_id ) {
            // Return a WP_Error to prevent the REST creation.
            return new WP_Error(
                'duplicate_content',
                sprintf( 'Duplicate content detected: identical to post #%d', $dup_id ),
                array( 'status' => 409 )
            );
        }

        return $prepared_post;
    }

    // Backfill missing hashes in batches (cron job).
    public function backfill_content_hashes() {
        $args = array(
            'post_type'      => 'any',
            'post_status'    => array( 'publish', 'private', 'draft' ),
            'posts_per_page' => 100,
            'meta_query'     => array(
                array(
                    'key'     => $this->meta_key,
                    'compare' => 'NOT EXISTS',
                ),
            ),
            'fields'         => 'ids',
        );

        $q = new WP_Query( $args );

        if ( $q->have_posts() ) {
            foreach ( $q->posts as $id ) {
                $post = get_post( $id );
                if ( $post ) {
                    $hash = $this->content_hash( $post->post_content );
                    if ( $hash !== null ) {
                        update_post_meta( $id, $this->meta_key, $hash );
                    }
                }
            }
        }

        wp_reset_postdata();
    }
}

// Initialize.
new WP_Prevent_Duplicate_Content();

// Optional: schedule the backfill on plugin activation (example).
register_activation_hook( __FILE__, function() {
    if ( ! wp_next_scheduled( 'wp_prevent_dup_backfill' ) ) {
        wp_schedule_event( time(), 'hourly', 'wp_prevent_dup_backfill' );
    }
} );

register_deactivation_hook( __FILE__, function() {
    wp_clear_scheduled_hook( 'wp_prevent_dup_backfill' );
} );

// WP-CLI command to backfill (for huge sites, use WP-CLI to run it once):
if ( defined( 'WP_CLI' ) && WP_CLI ) {
    WP_CLI::add_command( 'wp-prevent-dup backfill', function() {
        ( new WP_Prevent_Duplicate_Content() )->backfill_content_hashes();
        WP_CLI::success( 'Backfill complete' );
    } );
}
```
Explanation of key parts
- normalize_content(): Uses wp_strip_all_tags and html_entity_decode, collapses whitespace and lowercases. This ensures that content which differs only in whitespace or HTML markup hashes the same way.
- content_hash(): Computes sha1 of the normalized content. Use sha1 for reasonable collision resistance and small footprint. MD5 is also possible but slightly less collision resistant.
- find_duplicate_by_hash(): Fast wpdb query joining postmeta and posts to locate an existing post with same hash, skipping the currently-updated post and trash. This relies on storing hash in postmeta for existing posts.
- wp_insert_post_data filter: Runs before WordPress inserts/updates post data. Used here to set the post status to draft and store a transient for the admin notice when a duplicate is detected. This approach works in both the classic and block editors.
- rest_pre_insert_{post_type} (e.g. rest_pre_insert_post): Returning a WP_Error with HTTP status 409 blocks duplicate posts created via the REST API.
- save_post action: After a post is saved, we write the content hash to postmeta so future inserts can check quickly.
- Backfill: For existing posts that lack a hash, a cron or WP-CLI backfill computes and stores the hash in batches to avoid heavy load.
Near-duplicate detection (optional)
Exact hashes only catch verbatim copies. For near-duplicates you can compute similarity percentages. The simplest approach is to compare normalized strings with PHP's similar_text() or levenshtein(). For larger sites, use shingling and MinHash, or an external index (Elasticsearch/Algolia) with fuzzy matching.
Simple percent similarity example (small sites)
```php
<?php
// Assumes the normalize_content() helper from the plugin above.
$norm_new      = $this->normalize_content( $new_content );
$norm_existing = $this->normalize_content( $existing_content );

$similar = 0;
similar_text( $norm_new, $norm_existing, $similar ); // $similar receives a percentage.

if ( $similar >= 85 ) {
    // Treat as duplicate / near-duplicate.
}
```
Caveats: similar_text and levenshtein are O(n^2) for large strings and comparing a new post against thousands of posts is slow. For production-level near-duplicate detection use external indexers or locality-sensitive hashing (LSH).
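To make the shingling/MinHash idea concrete, the sketch below reduces each post to a compact signature that can be compared (or bucketed with LSH) instead of running similar_text() against every existing post. It is a rough, self-contained illustration: the wppdc_* function names, the 5-word shingle size and the seeded crc32 "hash family" are illustrative choices, not an existing library API.

```php
<?php
// Break normalized text into unique word 5-grams ("shingles").
function wppdc_shingles( $text, $size = 5 ) {
    $words    = preg_split( '/\s+/u', strtolower( trim( wp_strip_all_tags( (string) $text ) ) ) );
    $shingles = array();
    for ( $i = 0; $i + $size <= count( $words ); $i++ ) {
        $shingles[ implode( ' ', array_slice( $words, $i, $size ) ) ] = true;
    }
    return array_keys( $shingles );
}

// Build a MinHash signature: for each seed, keep the minimum hash over all shingles.
function wppdc_minhash_signature( $shingles, $num_hashes = 64 ) {
    $signature = array();
    for ( $seed = 0; $seed < $num_hashes; $seed++ ) {
        $min = PHP_INT_MAX;
        foreach ( $shingles as $s ) {
            // A seeded crc32 stands in for a proper hash family; fine for a sketch.
            $h = crc32( $seed . '|' . $s );
            if ( $h < $min ) {
                $min = $h;
            }
        }
        $signature[] = $min;
    }
    return $signature;
}

// The fraction of matching signature positions estimates the Jaccard similarity.
function wppdc_estimated_jaccard( $sig_a, $sig_b ) {
    $n = min( count( $sig_a ), count( $sig_b ) );
    if ( $n === 0 ) {
        return 0.0;
    }
    $matches = 0;
    for ( $i = 0; $i < $n; $i++ ) {
        if ( $sig_a[ $i ] === $sig_b[ $i ] ) {
            $matches++;
        }
    }
    return $matches / $n;
}

// Usage sketch: skip empty/very short posts, store the signature (e.g. as serialized postmeta)
// and compare a new post's signature against candidates; treat >= 0.8 as a near-duplicate.
$sig_new      = wppdc_minhash_signature( wppdc_shingles( $new_content ) );
$sig_existing = wppdc_minhash_signature( wppdc_shingles( $existing_content ) );
if ( wppdc_estimated_jaccard( $sig_new, $sig_existing ) >= 0.8 ) {
    // Treat as near-duplicate.
}
```

Because signatures are small and fixed-length, they can be precomputed and indexed, which avoids the O(n^2) string comparisons mentioned above.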
Slug and title duplicates
Slugs are handled by WordPress with wp_unique_post_slug by default, but if you want to prevent publishing posts with the same title entirely, check get_page_by_title() (deprecated since WordPress 6.2 in favour of a WP_Query on post_title) before saving and block or flag as desired.
```php
<?php
// Inside a pre-save check (e.g. the wp_insert_post_data callback), $postarr holds the incoming post.
$existing = get_page_by_title( $postarr['post_title'], OBJECT, $postarr['post_type'] );

if ( $existing && ( empty( $postarr['ID'] ) || $existing->ID != $postarr['ID'] ) ) {
    // Title duplicate found: block, flag or force draft as desired.
}
```
Canonicalization and front-end handling
If a duplicate exists and you choose to allow publishing (for editorial reasons), you must instruct search engines which copy to index:
- Use a rel=canonical link to the original post's permalink on the duplicate page.
- OR serve a 301 redirect from the duplicate URL to the canonical post (only do this if duplicates are truly redundant).
- OR mark the duplicate as noindex via meta robots if the original should be indexed.
Example: add rel=canonical for duplicate posts
```php
<?php
add_action( 'wp_head', function() {
    if ( is_singular() ) {
        global $post;
        $orig_id = get_post_meta( $post->ID, '_duplicate_of', true );
        if ( $orig_id ) {
            $url = get_permalink( intval( $orig_id ) );
            if ( $url ) {
                // You may also want to remove WordPress' default canonical (rel_canonical)
                // so two canonical tags are not emitted.
                echo '<link rel="canonical" href="' . esc_url( $url ) . '" />' . "\n";
            }
        }
    }
} );
```
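If you prefer the redirect or noindex options from the list above, the sketches below show one way to wire them up. Both assume the duplicate post stores the original's ID in the same _duplicate_of meta key; the hook choices are illustrative and not part of the plugin above.

```php
<?php
// Sketch: 301-redirect a published duplicate to its original on the front end.
add_action( 'template_redirect', function() {
    if ( ! is_singular() ) {
        return;
    }
    $orig_id = intval( get_post_meta( get_queried_object_id(), '_duplicate_of', true ) );
    if ( ! $orig_id || $orig_id === get_queried_object_id() ) {
        return;
    }
    $url = get_permalink( $orig_id );
    if ( $url ) {
        wp_safe_redirect( $url, 301 );
        exit;
    }
} );
```

Alternatively, to keep the duplicate reachable but excluded from the index, WordPress 5.7+ provides the wp_robots filter:

```php
<?php
// Sketch: mark duplicates as noindex,follow instead of redirecting.
add_filter( 'wp_robots', function( $robots ) {
    if ( is_singular() && get_post_meta( get_queried_object_id(), '_duplicate_of', true ) ) {
        $robots['noindex'] = true;
        $robots['follow']  = true;
    }
    return $robots;
} );
```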
Scaling and performance advice
- Indexing: Postmeta lookups by meta_key/meta_value can be slow on large sites. Consider a custom table like wp_content_fingerprints with columns (post_id, post_type, content_hash, created_at) and an index on (content_hash).
- Batch backfills: Use WP-CLI to run backfill once in batches or run WP-Cron hourly with a conservative posts_per_page (e.g. 100).
- Near-duplicate at scale: Use search engines (Elasticsearch) or LSH libraries. Index normalized content and use built-in fuzzy search or vector similarity.
- Long meta_value: If your fingerprint approach stored long strings, you would run into MySQL index length limits (191 characters with utf8mb4). Fixed-length hashes (sha1 is 40 chars) keep you safe.
Security and best practices
- Always sanitize input and fully escape output (wp_strip_all_tags, esc_html, esc_url).
- Use prepared SQL queries if you access wpdb directly (wpdb->prepare).
- Only set transients or options with expiration to avoid stale notices.
- Consider user capability checks: allow admins to override duplicates if your workflow demands it.
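To make that override concrete, the duplicate check in the class above could be wrapped in a filter so trusted roles can bypass it. The wppdc_skip_duplicate_check filter name below is hypothetical: it would need to be applied inside check_for_duplicate_on_insert() for this sketch to take effect.

```php
<?php
// Hypothetical override hook. Inside check_for_duplicate_on_insert(), before the hash lookup:
//
//   if ( apply_filters( 'wppdc_skip_duplicate_check', false, $postarr ) ) {
//       return $data;
//   }
//
// A site can then let administrators publish duplicates deliberately:
add_filter( 'wppdc_skip_duplicate_check', function( $skip, $postarr ) {
    // Users who can manage options (typically administrators) bypass the check.
    if ( current_user_can( 'manage_options' ) ) {
        return true;
    }
    return $skip;
}, 10, 2 );
```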
Testing strategies
- Manual tests: Try posting the same content via admin UI, REST API and frontend forms. Confirm duplicates are blocked and notices appear.
- Automated tests: Use WP_UnitTestCase to create posts and assert meta exists and new duplicates are blocked.
- WP-CLI: Useful for bulk backfill and batch testing on staging environments.
Example WP unit test snippet
```php
<?php
class Test_Prevent_Duplicate_Content extends WP_UnitTestCase {

    public function test_duplicate_is_saved_as_draft() {
        $author = $this->factory->user->create( array( 'role' => 'editor' ) );
        wp_set_current_user( $author );

        $content = 'This is a unique content sample.';

        // Create the first post.
        $post_id = wp_insert_post( array(
            'post_title'   => 'Sample',
            'post_content' => $content,
            'post_status'  => 'publish',
        ) );
        $this->assertGreaterThan( 0, $post_id );

        // Attempt to create a duplicate.
        $post_id2 = wp_insert_post( array(
            'post_title'   => 'Sample Duplicate',
            'post_content' => $content,
            'post_status'  => 'publish',
        ) );

        // Our plugin sets duplicates to draft; ensure the second post was not published.
        $this->assertEquals( 'draft', get_post_status( $post_id2 ) );
    }
}
```
Edge cases and special considerations
- Attachments: Images and media sometimes have duplicate captions or descriptions. Decide if attachments should be hashed/checked.
- Revisions: Revisions should not be treated as sources of duplicates; skip them.
- Multisite: Decide whether duplication across sites in a network is relevant. Postmeta lookups are site-scoped, so for cross-site checking use a central index or a network-wide table.
- Translations: If you intentionally create translated duplicates, use language tags and exclude those from duplicate checks (see the sketch after this list).
- Imports: For CSV or feed imports, run backfill/hashing and check duplicates prior to insert. Prefer import pipelines that call wp_insert_post_data filter or your custom checks.
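As a sketch of the translation exclusion, the snippet below skips the check for posts that already carry a language term. It assumes a multilingual plugin that registers a 'language' taxonomy (as Polylang does) and reuses the hypothetical wppdc_skip_duplicate_check filter from the security section; adapt it to however your check is wired.

```php
<?php
// Skip the duplicate check for posts assigned to any language term.
add_filter( 'wppdc_skip_duplicate_check', function( $skip, $postarr ) {
    $post_id = isset( $postarr['ID'] ) ? intval( $postarr['ID'] ) : 0;
    if ( $post_id && taxonomy_exists( 'language' ) && has_term( '', 'language', $post_id ) ) {
        return true; // Translated copies are intentional; don't flag them.
    }
    return $skip;
}, 10, 2 );
```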
Database design recommendation for very large sites
The wp_postmeta table is not ideal for frequent meta-value lookups at scale. For sites with hundreds of thousands of posts:
| Option | Pros | Cons |
| --- | --- | --- |
| Use postmeta with SHA1 (40 chars) | Simple to implement; integrates with current code | Postmeta table gets large; joins may be slow without additional indexing |
| Custom table (recommended) | Fast indexed lookups on content_hash; small, compact rows | More code to maintain; must be kept in sync with posts |
| External index (Elasticsearch) | Powerful fuzzy/near-duplicate search; scalable | Extra infrastructure and integration complexity |
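If you choose the custom-table option, a minimal sketch could look like the following. The wp_content_fingerprints table name matches the suggestion above, while the wppdc_* function names and the exact schema are illustrative; run the creation code once, e.g. on plugin activation.

```php
<?php
// Create the fingerprint table (dbDelta requires wp-admin/includes/upgrade.php).
function wppdc_create_fingerprint_table() {
    global $wpdb;
    require_once ABSPATH . 'wp-admin/includes/upgrade.php';

    $table = $wpdb->prefix . 'content_fingerprints';
    dbDelta( "CREATE TABLE {$table} (
        post_id BIGINT UNSIGNED NOT NULL,
        post_type VARCHAR(20) NOT NULL DEFAULT 'post',
        content_hash CHAR(40) NOT NULL,
        created_at DATETIME NOT NULL,
        PRIMARY KEY  (post_id),
        KEY content_hash (content_hash)
    ) {$wpdb->get_charset_collate()};" );
}

// Indexed lookup, used in place of the postmeta join in the plugin above.
function wppdc_find_duplicate_in_table( $hash, $exclude_post_id = 0 ) {
    global $wpdb;
    $table = $wpdb->prefix . 'content_fingerprints';

    return $wpdb->get_var( $wpdb->prepare(
        "SELECT post_id FROM {$table} WHERE content_hash = %s AND post_id != %d LIMIT 1",
        $hash,
        $exclude_post_id
    ) );
}
```

The single-column index on content_hash keeps lookups fast regardless of how large wp_postmeta grows, at the cost of keeping the table in sync on save, trash and delete.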
Putting it all together: deployment checklist
- Install plugin or add snippet to mu-plugins for early execution.
- Backfill existing posts' content hashes (WP-CLI recommended on large sites).
- Test with sample duplicates via admin, REST, front-end and import flows.
- Decide on canonicalization policy (redirect, canonical tag, noindex) for allowed duplicates.
- Monitor logs and user reports for false positives, and provide an admin override if needed.
- For large sites, implement a custom table or external index for performance.
Further enhancements you can add
- Add an admin UI to review blocked duplicates and choose to merge or publish (see the sketch after this list).
- Integrate with editorial workflows: queue duplicates for review with suggested canonical.
- Implement fuzzy matching with threshold options (e.g. 85% similarity).
- Track sources (imports, API user ids) and warn when an external source repeatedly submits duplicates.
- Provide a WP-CLI tool to export duplicates and produce reconciliation reports.
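As a starting point for the review UI mentioned above, the sketch below registers a bare-bones Tools page listing draft posts flagged with the _duplicate_of meta key. The menu slug, page title and capability are illustrative.

```php
<?php
// Minimal review screen for blocked duplicates.
add_action( 'admin_menu', function() {
    add_management_page( 'Blocked duplicates', 'Blocked duplicates', 'edit_others_posts', 'wppdc-duplicates', function() {
        $dupes = get_posts( array(
            'post_type'      => 'any',
            'post_status'    => 'draft',
            'posts_per_page' => 50,
            'meta_key'       => '_duplicate_of', // Only posts flagged as duplicates.
        ) );

        echo '<div class="wrap"><h1>Blocked duplicates</h1><ul>';
        foreach ( $dupes as $p ) {
            $orig = intval( get_post_meta( $p->ID, '_duplicate_of', true ) );
            printf(
                '<li><a href="%s">%s</a> &rarr; duplicate of <a href="%s">post #%d</a></li>',
                esc_url( get_edit_post_link( $p->ID ) ),
                esc_html( get_the_title( $p ) ),
                esc_url( get_edit_post_link( $orig ) ),
                $orig
            );
        }
        echo '</ul></div>';
    } );
} );
```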
References and useful links
- WordPress Hooks (Actions & Filters)
- WordPress REST API
- WP-CLI
- Google: Consolidate duplicate URLs (canonical)
Summary
Preventing duplicate content in WordPress via prior verification can be achieved with a pragmatic combination of normalization, hashing, and pre-save checks. Store fingerprints in postmeta (or a custom table for scale), block duplicates at admin, REST and import points, show clear editor feedback, and backfill existing content. For paraphrase/near-duplicate detection, upgrade to similarity algorithms or external search services. The approach in this article gives you a strong foundation for exact-duplicate prevention with sensible performance and user experience tradeoffs.