How to audit broken links and report in admin with PHP in WordPress

Introduction

This tutorial shows how to audit broken links on a WordPress site and present a report in the WordPress admin using PHP. It covers architecture, best practices, security considerations, performance optimizations, a chunked (batched) scanning implementation, storage in a custom table, the admin UI, AJAX re-checks, scheduled scans, CSV export, and email notifications, with example code for each part. The code examples are meant to be dropped into a plugin file or a must-use (mu) plugin; adapt the file organization to your project structure.

High-level architecture and flow

  • Scanner/Collector: Extracts every URL (anchors, images, scripts, CSS links, iframe src, srcset items, inline styles) from posts, pages, custom fields, widgets, and optionally theme/plugin files.
  • Queue/Batch Processor: Processes URLs in small batches to avoid timeouts and rate limits. Uses WP Cron recurring events or scheduled single events to continue processing.
  • HTTP Checker: Performs HEAD or lightweight GET requests with careful redirect handling and timeouts to determine status codes and final destination for redirects.
  • Storage: Custom database table (via dbDelta) storing each found URL, context (post ID, field, anchor text), status, last checked timestamp, HTTP code, final redirect target, and notes.
  • Admin UI: WP admin page showing paginated table of broken links with filters (site internal vs external, status code, last checked), actions (re-check, mark fixed, view context, open URL), bulk export to CSV.
  • Notifications: Scheduled email reports and optional Slack/webhook notifications for critical failures.

Design decisions and constraints

  • Prefer wp_remote_head first to minimize bandwidth; fall back to wp_remote_get for servers that block HEAD.
  • Track redirects explicitly: treat 3xx as not necessarily broken but log the final URL and final status.
  • Use chunked scanning (e.g., 50 posts per batch, 20 URLs per HTTP batch) to avoid PHP timeouts and WP Cron overlaps.
  • Respect remote sites: add a configurable delay between requests and an adjustable user-agent string.
  • Sanitize and escape all input/output for the admin UI; require a capability check (manage_options) for management actions.

Database schema

Create a custom table to store link scan results. Key columns: id, url, normalized_url, context_type (post, media, comment, meta), context_id, anchor_text, source_excerpt, http_code, status (ok, redirect, broken, unknown), final_url, last_checked (datetime), check_count, notes.

CREATE TABLE wp_broken_links (
  id BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
  url TEXT NOT NULL,
  normalized_url TEXT NOT NULL,
  context_type VARCHAR(50) NOT NULL,
  context_id BIGINT(20) UNSIGNED DEFAULT 0,
  anchor_text TEXT DEFAULT NULL,
  source_excerpt TEXT DEFAULT NULL,
  http_code SMALLINT(5) DEFAULT NULL,
  status VARCHAR(20) DEFAULT 'unknown',
  final_url TEXT DEFAULT NULL,
  last_checked DATETIME DEFAULT NULL,
  check_count INT DEFAULT 0,
  notes TEXT DEFAULT NULL,
  PRIMARY KEY (id),
  KEY normalized_url_idx (normalized_url(191)),
  KEY status_idx (status),
  KEY last_checked_idx (last_checked)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Plugin bootstrap and activation (create table)

Use register_activation_hook to create the table with dbDelta. Ensure charset/collation and require upgrade functions file.

register_activation_hook( __FILE__, 'bl_activate' );

// Activation callback. The name bl_activate is illustrative; any unique name works.
function bl_activate() {
    global $wpdb;
    $table           = $wpdb->prefix . 'broken_links';
    $charset_collate = $wpdb->get_charset_collate();
    // dbDelta is strict about formatting: one column per line and two spaces after PRIMARY KEY.
    $sql = "CREATE TABLE $table (
      id BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
      url TEXT NOT NULL,
      normalized_url TEXT NOT NULL,
      context_type VARCHAR(50) NOT NULL,
      context_id BIGINT(20) UNSIGNED DEFAULT 0,
      anchor_text TEXT DEFAULT NULL,
      source_excerpt TEXT DEFAULT NULL,
      http_code SMALLINT(5) DEFAULT NULL,
      status VARCHAR(20) DEFAULT 'unknown',
      final_url TEXT DEFAULT NULL,
      last_checked DATETIME DEFAULT NULL,
      check_count INT DEFAULT 0,
      notes TEXT DEFAULT NULL,
      PRIMARY KEY  (id),
      KEY normalized_url_idx (normalized_url(191)),
      KEY status_idx (status),
      KEY last_checked_idx (last_checked)
    ) $charset_collate;";
    require_once ABSPATH . 'wp-admin/includes/upgrade.php';
    dbDelta( $sql );
}

Link extraction (collector)

Parsing HTML with DOMDocument is more robust than regular expressions. Handle libxml errors and UTF-8 input, and collect URLs from the common attributes: a[href], img[src], img[srcset] (each entry in the list), script[src], link[href] (stylesheets), iframe[src], and inline styles (background-image).

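function bl_extract_urls_from_html( $html ) {
    $urls = array();
    if ( empty( $html ) ) {
        return $urls;
    }
    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors( true );
    $dom      = new DOMDocument();
    // The XML encoding hint makes DOMDocument read the markup as UTF-8
    // (mb_convert_encoding() with HTML-ENTITIES is deprecated since PHP 8.2).
    $dom->loadHTML( '<?xml encoding="UTF-8">' . $html, LIBXML_NOWARNING | LIBXML_NOERROR );
    // anchors
    foreach ( $dom->getElementsByTagName( 'a' ) as $a ) {
        $href = $a->getAttribute( 'href' );
        if ( $href ) {
            $urls[] = array( 'url' => $href, 'anchor' => $a->textContent );
        }
    }
    // images
    foreach ( $dom->getElementsByTagName( 'img' ) as $img ) {
        if ( $img->hasAttribute( 'src' ) ) {
            $urls[] = array( 'url' => $img->getAttribute( 'src' ), 'anchor' => '' );
        }
        if ( $img->hasAttribute( 'srcset' ) ) {
            $srcset = $img->getAttribute( 'srcset' );
            // Each srcset entry is "URL descriptor"; keep only the URL part.
            $parts = preg_split( '/\s*,\s*/', $srcset );
            foreach ( $parts as $part ) {
                $p = preg_split( '/\s+/', trim( $part ) );
                if ( ! empty( $p[0] ) ) {
                    $urls[] = array( 'url' => $p[0], 'anchor' => '' );
                }
            }
        }
    }
    // script src
    foreach ( $dom->getElementsByTagName( 'script' ) as $s ) {
        if ( $s->hasAttribute( 'src' ) ) {
            $urls[] = array( 'url' => $s->getAttribute( 'src' ), 'anchor' => '' );
        }
    }
    // link rel=stylesheet etc.
    foreach ( $dom->getElementsByTagName( 'link' ) as $l ) {
        if ( $l->hasAttribute( 'href' ) ) {
            $urls[] = array( 'url' => $l->getAttribute( 'href' ), 'anchor' => '' );
        }
    }
    // iframe ("if" is a reserved word, so the loop variable is $iframe)
    foreach ( $dom->getElementsByTagName( 'iframe' ) as $iframe ) {
        if ( $iframe->hasAttribute( 'src' ) ) {
            $urls[] = array( 'url' => $iframe->getAttribute( 'src' ), 'anchor' => '' );
        }
    }
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );
    // normalize and dedupe
    $normalized = array();
    foreach ( $urls as $u ) {
        $u['url'] = trim( $u['url'] );
        if ( ! $u['url'] ) {
            continue;
        }
        $normalized[ normalize_url_for_db( $u['url'] ) ] = $u;
    }
    return array_values( $normalized );
}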
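
The list of sources above includes inline styles (background-image), which the DOM loops do not cover. The following is a minimal sketch of how url(...) references could be pulled out of a style attribute value. The function name bl_extract_urls_from_style is hypothetical, and data: URIs are skipped on the assumption that they never need an HTTP check.

```php
<?php
/**
 * Extract url(...) references from a CSS declaration string (sketch).
 * Handles single-quoted, double-quoted and unquoted forms; data: URIs
 * are ignored because there is nothing to fetch for them.
 */
function bl_extract_urls_from_style( $css ) {
    $urls = array();
    // Group 1 captures the optional quote, group 2 the URL itself.
    if ( preg_match_all( '/url\(\s*([\'"]?)(.*?)\1\s*\)/i', (string) $css, $matches ) ) {
        foreach ( $matches[2] as $candidate ) {
            $candidate = trim( $candidate );
            if ( '' !== $candidate && 0 !== stripos( $candidate, 'data:' ) ) {
                $urls[] = $candidate;
            }
        }
    }
    // Drop duplicates and reindex.
    return array_values( array_unique( $urls ) );
}
```

One way to use it would be to select every element that carries a style attribute (for example with DOMXPath and the query //*[@style]) and feed each attribute value to this function, adding the results to the same list the collector builds.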

Normalization helper

Normalize URLs for deduplication (strip fragments, trailing slash optionally, lowercase scheme/host). Store raw and normalized versions.
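
The helper referenced elsewhere as normalize_url_for_db() is not shown in this section. A minimal sketch that follows the description above might look like this; the $strip_trailing_slash parameter is an assumption, and credentials embedded in a URL (user:pass@host) are deliberately dropped from the normalized form.

```php
<?php
/**
 * Normalize a URL for deduplication (sketch).
 * Lowercases scheme and host, removes the fragment and, optionally,
 * a trailing slash. URLs without a host (relative paths, mailto:, #anchors)
 * are returned with only the fragment removed.
 */
function normalize_url_for_db( $url, $strip_trailing_slash = true ) {
    $url   = trim( (string) $url );
    $parts = parse_url( $url );
    if ( false === $parts || empty( $parts['host'] ) ) {
        $hash = strpos( $url, '#' );
        // Keep pure fragments ("#top") as they are so they do not collapse to "".
        return ( false === $hash || 0 === $hash ) ? $url : substr( $url, 0, $hash );
    }
    $scheme = isset( $parts['scheme'] ) ? strtolower( $parts['scheme'] ) . '://' : '//';
    $host   = strtolower( $parts['host'] );
    $port   = isset( $parts['port'] ) ? ':' . $parts['port'] : '';
    $path   = isset( $parts['path'] ) ? $parts['path'] : '';
    if ( $strip_trailing_slash ) {
        $path = rtrim( $path, '/' );
    }
    $query = isset( $parts['query'] ) ? '?' . $parts['query'] : '';
    return $scheme . $host . $port . $path . $query;
}
```

With these rules, HTTP://Example.COM/a/#frag and http://example.com/a map to the same key, while the raw URL is still stored unchanged in the url column.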


HTTP checking: robust status detection

Strategy:

  1. Try a HEAD request with wp_remote_head. Many servers respond properly; others deny HEAD.
  2. If HEAD fails or returns a code indicating that the server blocks HEAD (for example 405), fall back to GET with a short timeout and a limited body download.
  3. Do not follow redirects automatically when you need to inspect intermediate codes, but optionally allow following them to find final_url.
  4. Treat 2xx as OK, 3xx as redirect (log Location header and optionally follow to final), 4xx and 5xx as broken.
  5. Handle timeouts, SSL errors and DNS errors: mark as unknown and record WP_Error message.
function bl_check_url_status( $url, $opts = array() ) {
    $defaults = array(
        'timeout'          => 8,
        'user-agent'       => 'WP-BrokenLinkChecker/1.0 (+https://example.com)',
        'follow_redirects' => false,
        'max_redirects'    => 5,
        'allow_head'       => true,
    );
    $opts   = wp_parse_args( $opts, $defaults );
    $result = array(
        'url'       => $url,
        'http_code' => null,
        'status'    => 'unknown',
        'final_url' => null,
        'error'     => null,
    );
    // Skip local anchors and protocols we don't check.
    if ( strpos( $url, '#' ) === 0 || stripos( $url, 'mailto:' ) === 0 || stripos( $url, 'javascript:' ) === 0 || stripos( $url, 'data:' ) === 0 ) {
        $result['status'] = 'skip';
        return $result;
    }
    $args = array(
        'timeout'     => $opts['timeout'],
        'redirection' => $opts['follow_redirects'] ? $opts['max_redirects'] : 0,
        'user-agent'  => $opts['user-agent'],
        'sslverify'   => true,
    );
    // Try HEAD first; fall back to GET on a transport error or when the
    // server rejects the method (405 Method Not Allowed, 501 Not Implemented).
    $response = $opts['allow_head'] ? wp_remote_head( $url, $args ) : wp_remote_get( $url, $args );
    if ( $opts['allow_head'] && ( is_wp_error( $response ) || in_array( (int) wp_remote_retrieve_response_code( $response ), array( 405, 501 ), true ) ) ) {
        $response = wp_remote_get( $url, $args );
    }
    if ( is_wp_error( $response ) ) {
        // Timeouts, DNS and SSL failures: status stays 'unknown'.
        $result['error'] = $response->get_error_message();
        return $result;
    }
    $result['http_code'] = (int) wp_remote_retrieve_response_code( $response );
    // With redirects disabled, a 3xx response names its target in the Location header.
    // When redirects are followed, the headers do not expose the final URL,
    // so final_url falls back to the requested URL.
    $location            = wp_remote_retrieve_header( $response, 'location' );
    $result['final_url'] = $location ? $location : $url;

    if ( $result['http_code'] >= 200 && $result['http_code'] < 300 ) {
        $result['status'] = 'ok';
    } elseif ( $result['http_code'] >= 300 && $result['http_code'] < 400 ) {
        $result['status'] = 'redirect';
        // Optionally follow redirects to the final status using a separate call.
    } elseif ( $result['http_code'] >= 400 && $result['http_code'] < 600 ) {
        $result['status'] = 'broken';
    } else {
        $result['status'] = 'unknown';
    }
    return $result;
}
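
Strategy item 3 and the "optionally follow redirects" note in the checker can be handled by walking the redirect chain one hop at a time. The sketch below is one possible shape, not part of the original plugin: bl_follow_redirects and the $fetch callable are hypothetical. $fetch is assumed to take a URL and return array( 'code' => int, 'location' => string|null ), and Location values are assumed to be absolute URLs. In a plugin, $fetch could wrap wp_remote_head() with 'redirection' => 0.

```php
<?php
/**
 * Follow a redirect chain manually and record every hop (sketch).
 * Stops on the first non-3xx response, on a loop, or after $max_redirects hops.
 */
function bl_follow_redirects( $url, callable $fetch, $max_redirects = 5 ) {
    $hops    = array();
    $current = $url;
    for ( $i = 0; $i <= $max_redirects; $i++ ) {
        $res    = $fetch( $current );
        $code   = (int) $res['code'];
        $hops[] = array( 'url' => $current, 'code' => $code );

        $is_redirect = $code >= 300 && $code < 400 && ! empty( $res['location'] );
        if ( ! $is_redirect ) {
            return array( 'final_url' => $current, 'final_code' => $code, 'hops' => $hops, 'note' => '' );
        }
        // Detect loops: the next target was already visited.
        foreach ( $hops as $hop ) {
            if ( $hop['url'] === $res['location'] ) {
                return array( 'final_url' => $current, 'final_code' => $code, 'hops' => $hops, 'note' => 'redirect loop' );
            }
        }
        $current = $res['location'];
    }
    // The chain is longer than allowed; the last target was never fetched.
    return array( 'final_url' => $current, 'final_code' => null, 'hops' => $hops, 'note' => 'too many redirects' );
}
```

The final_code value is what decides whether a redirecting link is actually broken: a 301 that ends in a 404 should be reported, while a 301 that ends in a 200 is only a candidate for updating the stored URL.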

Storing results (insert/update)

Upsert logic: if a row for the same normalized_url and context already exists, update it; otherwise insert a new one. Increment check_count and set last_checked on every write.

function bl_store_check_result( $data ) {
    global $wpdb;
    $table      = $wpdb->prefix . 'broken_links';
    $normalized = normalize_url_for_db( $data['url'] );
    $context_id = isset( $data['context_id'] ) ? (int) $data['context_id'] : 0;
    // Check for an existing row for the same URL + context.
    $row = $wpdb->get_row( $wpdb->prepare(
        "SELECT * FROM $table WHERE normalized_url = %s AND context_type = %s AND context_id = %d",
        $normalized, $data['context_type'], $context_id
    ) );
    $now = current_time( 'mysql' );
    // When a re-check does not supply anchor text or an excerpt, keep the stored values.
    $insert_data = array(
        'url'            => $data['url'],
        'normalized_url' => $normalized,
        'context_type'   => $data['context_type'],
        'context_id'     => $context_id,
        'anchor_text'    => isset( $data['anchor_text'] ) ? $data['anchor_text'] : ( $row ? $row->anchor_text : '' ),
        'source_excerpt' => isset( $data['source_excerpt'] ) ? $data['source_excerpt'] : ( $row ? $row->source_excerpt : '' ),
        'http_code'      => isset( $data['http_code'] ) ? (int) $data['http_code'] : null,
        'status'         => isset( $data['status'] ) ? $data['status'] : 'unknown',
        'final_url'      => isset( $data['final_url'] ) ? $data['final_url'] : null,
        'last_checked'   => $now,
        'notes'          => isset( $data['notes'] ) ? $data['notes'] : '',
    );
    if ( $row ) {
        $wpdb->update(
            $table,
            array_merge( $insert_data, array( 'check_count' => $row->check_count + 1 ) ),
            array( 'id' => $row->id ),
            null,
            array( '%d' )
        );
        return $row->id;
    } else {
        $insert_data['check_count'] = 1;
        $wpdb->insert( $table, $insert_data );
        return $wpdb->insert_id;
    }
}

Scanning posts/pages/media in batches

Perform a scheduled batch that queries posts by offset/limit. Each batch extracts URLs for each post and enqueues checks for those URLs or immediately processes them in small sub-batches. Use wp_schedule_single_event to re-schedule the next batch to avoid long-running requests.

// Queue the first batch if none is pending. The function name is illustrative;
// call it from the activation hook or from an admin action.
// Cron arguments are passed positionally, so plain lists are used instead of keyed arrays.
function bl_schedule_initial_scan() {
    if ( ! wp_next_scheduled( 'bl_scan_batch', array( 0, 50 ) ) ) {
        wp_schedule_single_event( time(), 'bl_scan_batch', array( 0, 50 ) );
    }
}
add_action( 'bl_scan_batch', 'bl_scan_batch_handler', 10, 2 );

function bl_scan_batch_handler( $offset = 0, $per_page = 50 ) {
    $args = array(
        'post_type'      => array( 'post', 'page' ),
        'post_status'    => 'publish',
        'posts_per_page' => $per_page,
        'offset'         => $offset,
        'orderby'        => 'ID', // stable order so offsets do not skip posts
        'order'          => 'ASC',
        'fields'         => 'ids',
    );
    $query = new WP_Query( $args );
    if ( ! $query->have_posts() ) {
        // Finished; optionally schedule the next complete scan in one week.
        wp_schedule_single_event( time() + WEEK_IN_SECONDS, 'bl_scan_batch', array( 0, $per_page ) );
        return;
    }
    foreach ( $query->posts as $post_id ) {
        $content = get_post_field( 'post_content', $post_id );
        $excerpt = wp_trim_words( wp_strip_all_tags( $content ), 20 );
        $urls    = bl_extract_urls_from_html( $content );
        foreach ( $urls as $u ) {
            // Prepare storage, then either check immediately or push onto the queue.
            $entry = array(
                'url'            => $u['url'],
                'context_type'   => 'post',
                'context_id'     => $post_id,
                'anchor_text'    => $u['anchor'],
                'source_excerpt' => $excerpt,
            );
            // Store now as unknown, then enqueue the check.
            bl_store_check_result( $entry );
            // Enqueue check: schedule a single event to check this URL (throttled).
            wp_schedule_single_event( time() + 2, 'bl_check_url_event', array( $u['url'], 'post', $post_id ) );
        }
    }
    // Schedule the next batch with an increased offset.
    $next_offset = $offset + $per_page;
    wp_schedule_single_event( time() + 3, 'bl_scan_batch', array( $next_offset, $per_page ) );
}

Checking queued URLs via scheduled events

Rather than firing a large chain of immediate HTTP requests, schedule individual checks at small intervals to spread load. Provide a separate handler for the check event that calls the HTTP checker and stores results.

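add_action( 'bl_check_url_event', 'bl_check_url_event_handler', 10, 3 );

// Cron callback for a single URL check. The handler name is illustrative.
function bl_check_url_event_handler( $url, $context_type = 'post', $context_id = 0 ) {
    $result = bl_check_url_status( $url, array( 'timeout' => 10, 'allow_head' => true ) );
    // anchor_text and source_excerpt are not passed with the event;
    // they were stored when the URL was first collected.
    $data = array(
        'url'          => $url,
        'context_type' => $context_type,
        'context_id'   => $context_id,
        'http_code'    => $result['http_code'],
        'status'       => $result['status'],
        'final_url'    => $result['final_url'],
        'notes'        => isset( $result['error'] ) ? $result['error'] : '',
    );
    bl_store_check_result( $data );
}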
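
To spread checks out at small intervals, as described above, the run time of each check can be planned before any event is scheduled. This is a minimal sketch under assumed defaults (20 URLs per slot, 30 seconds between slots); bl_plan_check_times is a hypothetical helper, not part of WordPress.

```php
<?php
/**
 * Assign each unique URL a run timestamp so that at most $per_slot
 * checks share the same time slot and slots are $gap seconds apart (sketch).
 *
 * @return array Map of URL => Unix timestamp.
 */
function bl_plan_check_times( array $urls, $start, $per_slot = 20, $gap = 30 ) {
    $per_slot = max( 1, (int) $per_slot );
    $unique   = array_values( array_unique( $urls ) );
    $plan     = array();
    foreach ( array_chunk( $unique, $per_slot ) as $slot => $chunk ) {
        foreach ( $chunk as $url ) {
            $plan[ $url ] = $start + $slot * $gap;
        }
    }
    return $plan;
}
```

Each entry of the returned map could then be passed to wp_schedule_single_event() with its own timestamp, instead of scheduling every check a fixed two seconds ahead.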

Admin UI: menu, capabilities and listing

Provide an admin page under Tools or its own top-level menu. Only users with manage_options (or a more granular capability) can access. Implement pagination, filters and nonces for actions.

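add_action( 'admin_menu', 'bl_register_admin_page' );

// Function names, the page slug and the bl_action button name are illustrative.
function bl_register_admin_page() {
    add_management_page( 'Broken Links', 'Broken Links', 'manage_options', 'bl-broken-links', 'bl_render_admin_page' );
}

function bl_render_admin_page() {
    if ( ! current_user_can( 'manage_options' ) ) {
        return;
    }
    global $wpdb;
    $table    = $wpdb->prefix . 'broken_links';
    $paged    = max( 1, intval( $_GET['paged'] ?? 1 ) );
    $per_page = 25;
    $offset   = ( $paged - 1 ) * $per_page;
    $status   = sanitize_key( $_GET['status'] ?? '' );
    $where    = 'WHERE 1=1';
    // Only accept known status values.
    if ( in_array( $status, array( 'ok', 'redirect', 'broken', 'unknown' ), true ) ) {
        $where .= $wpdb->prepare( ' AND status = %s', $status );
    }
    $total = (int) $wpdb->get_var( "SELECT COUNT(*) FROM $table $where" );
    $rows  = $wpdb->get_results( $wpdb->prepare( "SELECT * FROM $table $where ORDER BY last_checked DESC LIMIT %d OFFSET %d", $per_page, $offset ) );
    $base  = menu_page_url( 'bl-broken-links', false );
    // Render the filter links, table header and rows.
    echo '<div class="wrap"><h1>Broken Links</h1><form method="post">';
    wp_nonce_field( 'bl_admin_action', 'bl_admin_nonce' );
    echo '<p>Filter: <a href="' . esc_url( add_query_arg( 'status', 'broken', $base ) ) . '">Broken</a> | <a href="' . esc_url( $base ) . '">All</a></p>';
    echo '<table class="widefat striped"><thead><tr><th>URL</th><th>Context</th><th>Code</th><th>Status</th><th>Last checked</th><th>Actions</th></tr></thead><tbody>';
    if ( $rows ) {
        foreach ( $rows as $r ) {
            echo '<tr><td><a href="' . esc_url( $r->url ) . '" target="_blank" rel="noopener">' . esc_html( $r->url ) . '</a></td>';
            echo '<td>' . esc_html( $r->context_type . ' #' . $r->context_id ) . '</td>';
            echo '<td>' . esc_html( (string) $r->http_code ) . '</td>';
            echo '<td>' . esc_html( $r->status ) . '</td>';
            echo '<td>' . esc_html( (string) $r->last_checked ) . '</td>';
            echo '<td><button class="button" name="bl_action" value="recheck_' . (int) $r->id . '">Re-check</button> ';
            echo '<button class="button" name="bl_action" value="mark_fixed_' . (int) $r->id . '">Mark fixed</button></td></tr>';
        }
    } else {
        echo '<tr><td colspan="6">No results</td></tr>';
    }
    echo '</tbody></table></form>';
    // pagination (simple)
    $total_pages = (int) ceil( $total / $per_page );
    if ( $total_pages > 1 ) {
        echo '<p>Page: ';
        for ( $i = 1; $i <= $total_pages; $i++ ) {
            if ( $i === $paged ) {
                echo '<strong>' . $i . '</strong> ';
            } else {
                echo '<a href="' . esc_url( add_query_arg( array_filter( array( 'paged' => $i, 'status' => $status ) ), $base ) ) . '">' . $i . '</a> ';
            }
        }
        echo '</p>';
    }
    echo '</div>';
}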
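
The architecture lists an internal vs. external filter for the admin table. One way to support it is to classify each URL against the site host. The sketch below is an assumption about how that could work: bl_is_internal_url is a hypothetical helper, a leading "www." is ignored on both sides, and non-HTTP schemes such as mailto: are treated as external.

```php
<?php
/**
 * Decide whether a URL points at the site itself (sketch).
 * URLs without a host (relative paths, fragments) count as internal.
 */
function bl_is_internal_url( $url, $site_host ) {
    $parts = parse_url( trim( (string) $url ) );
    if ( false === $parts ) {
        return false;
    }
    // Anything that is not http(s) cannot be a page of this site.
    if ( isset( $parts['scheme'] ) && ! in_array( strtolower( $parts['scheme'] ), array( 'http', 'https' ), true ) ) {
        return false;
    }
    if ( empty( $parts['host'] ) ) {
        return true;
    }
    $strip_www = function ( $host ) {
        return preg_replace( '/^www\./', '', strtolower( $host ) );
    };
    return $strip_www( $parts['host'] ) === $strip_www( $site_host );
}
```

Inside WordPress the second argument could come from wp_parse_url( home_url(), PHP_URL_HOST ). The result could either be computed while rendering or stored in an extra column so the filter can run in SQL.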

Admin actions processing (handling recheck and mark fixed)

Validate nonce, verify capability, parse action button value, then call check or update DB accordingly. For recheck, schedule a check event rather than performing heavy HTTP in the request.

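add_action( 'admin_notices', 'bl_handle_admin_actions' );

// The hook, the function name, the bl_action button name and the recheck_ prefix are illustrative.
function bl_handle_admin_actions() {
    if ( empty( $_POST['bl_action'] ) || ! current_user_can( 'manage_options' ) ) {
        return;
    }
    check_admin_referer( 'bl_admin_action', 'bl_admin_nonce' );
    global $wpdb;
    $table  = $wpdb->prefix . 'broken_links';
    $action = sanitize_text_field( wp_unslash( $_POST['bl_action'] ) );
    if ( strpos( $action, 'recheck_' ) === 0 ) {
        $id  = intval( substr( $action, 8 ) );
        $row = $wpdb->get_row( $wpdb->prepare( "SELECT * FROM $table WHERE id = %d", $id ) );
        if ( $row ) {
            wp_schedule_single_event( time() + 1, 'bl_check_url_event', array( $row->url, $row->context_type, (int) $row->context_id ) );
            echo '<div class="updated"><p>Re-check scheduled.</p></div>';
        }
    } elseif ( strpos( $action, 'mark_fixed_' ) === 0 ) {
        $id = intval( substr( $action, 11 ) );
        $wpdb->update(
            $table,
            array( 'status' => 'ok', 'notes' => 'Manually marked fixed', 'last_checked' => current_time( 'mysql' ) ),
            array( 'id' => $id )
        );
        echo '<div class="updated"><p>Marked fixed.</p></div>';
    }
}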

AJAX support for on-demand re-checks

Add a secure AJAX action for re-checking a specific row. Use wp_ajax_ hooks and verify nonce and capability. Return JSON result with updated status.

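add_action( 'wp_ajax_bl_recheck', 'bl_ajax_recheck' );

// The action name, the nonce field name and the id parameter are illustrative.
function bl_ajax_recheck() {
    if ( ! current_user_can( 'manage_options' ) ) {
        wp_send_json_error( 'forbidden', 403 );
    }
    check_ajax_referer( 'bl_admin_action', 'nonce' );
    $id = isset( $_POST['id'] ) ? intval( $_POST['id'] ) : 0;
    global $wpdb;
    $table = $wpdb->prefix . 'broken_links';
    $row   = $wpdb->get_row( $wpdb->prepare( "SELECT * FROM $table WHERE id = %d", $id ) );
    if ( ! $row ) {
        wp_send_json_error( 'not_found', 404 );
    }
    // Perform a synchronous check (consider doing this asynchronously to avoid long requests).
    $res  = bl_check_url_status( $row->url );
    $data = array(
        'url'       => $row->url,
        'http_code' => $res['http_code'],
        'status'    => $res['status'],
        'final_url' => $res['final_url'],
        'error'     => isset( $res['error'] ) ? $res['error'] : '',
    );
    // Store the error message in the notes column alongside the new status.
    bl_store_check_result( array_merge( (array) $row, $data, array( 'notes' => $data['error'] ) ) );
    wp_send_json_success( $data );
}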

Export CSV report

Provide a simple export endpoint that streams CSV headers and rows. Ensure capability check and nonce before streaming. Offer filters (status, date range).

add_action( 'admin_post_bl_export_csv', 'bl_export_csv' );

// The hook and function name are illustrative.
function bl_export_csv() {
    if ( ! current_user_can( 'manage_options' ) ) {
        wp_die( 'Insufficient permissions', '', array( 'response' => 403 ) );
    }
    if ( ! check_admin_referer( 'bl_export_csv', 'bl_export_nonce' ) ) {
        wp_die( 'Invalid nonce', '', array( 'response' => 403 ) );
    }
    global $wpdb;
    $table = $wpdb->prefix . 'broken_links';
    $rows  = $wpdb->get_results( "SELECT * FROM $table ORDER BY last_checked DESC" );
    header( 'Content-Type: text/csv; charset=utf-8' );
    header( 'Content-Disposition: attachment; filename=broken-links.csv' );
    $output = fopen( 'php://output', 'w' );
    fputcsv( $output, array( 'id', 'url', 'context_type', 'context_id', 'http_code', 'status', 'final_url', 'last_checked', 'check_count', 'notes' ) );
    foreach ( $rows as $r ) {
        fputcsv( $output, array( $r->id, $r->url, $r->context_type, $r->context_id, $r->http_code, $r->status, $r->final_url, $r->last_checked, $r->check_count, $r->notes ) );
    }
    fclose( $output );
    exit;
}
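
The export above always dumps the whole table. The status and date-range filters mentioned earlier could be added by building the WHERE clause separately. The following sketch makes several assumptions: bl_build_export_where is a hypothetical helper, the filter keys status, from and to are invented for illustration, and dates are expected as YYYY-MM-DD in the same timezone as last_checked.

```php
<?php
/**
 * Build a WHERE clause with placeholders for the export filters (sketch).
 *
 * @return array Two items: the SQL fragment and the list of values for the placeholders.
 */
function bl_build_export_where( array $filters ) {
    $clauses = array( '1=1' );
    $params  = array();
    $allowed = array( 'ok', 'redirect', 'broken', 'unknown' );

    if ( isset( $filters['status'] ) && in_array( $filters['status'], $allowed, true ) ) {
        $clauses[] = 'status = %s';
        $params[]  = $filters['status'];
    }
    // "from" starts at midnight, "to" runs until the end of that day.
    $bounds = array(
        'from' => array( '>=', ' 00:00:00' ),
        'to'   => array( '<=', ' 23:59:59' ),
    );
    foreach ( $bounds as $key => $rule ) {
        $day = isset( $filters[ $key ] ) ? (string) $filters[ $key ] : '';
        if ( preg_match( '/^(\d{4})-(\d{2})-(\d{2})$/', $day, $m ) && checkdate( (int) $m[2], (int) $m[3], (int) $m[1] ) ) {
            $clauses[] = 'last_checked ' . $rule[0] . ' %s';
            $params[]  = $day . $rule[1];
        }
    }
    return array( 'WHERE ' . implode( ' AND ', $clauses ), $params );
}
```

The returned fragment can be appended to the SELECT statement and the values passed to $wpdb->prepare(). When the value list is empty, the query should be run without prepare(), since prepare() expects at least one placeholder.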

Scheduled weekly summary and alerting

Schedule a weekly job that queries the table for broken links found in the last week and emails an admin report. Optionally call a webhook/Slack integration for urgent notifications.

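add_action( 'bl_weekly_report', 'bl_send_weekly_report' );

// The callback name is illustrative.
function bl_send_weekly_report() {
    global $wpdb;
    $table = $wpdb->prefix . 'broken_links';
    // last_checked is written with current_time( 'mysql' ), i.e. site-local time,
    // so the cutoff is built in the site timezone as well.
    $since = wp_date( 'Y-m-d H:i:s', time() - WEEK_IN_SECONDS );
    $rows  = $wpdb->get_results( $wpdb->prepare( "SELECT * FROM $table WHERE status = %s AND last_checked >= %s ORDER BY last_checked DESC", 'broken', $since ) );
    if ( empty( $rows ) ) {
        return;
    }
    $message = "Weekly broken links report\n\n";
    foreach ( $rows as $r ) {
        $message .= sprintf( "%s (code: %s), context: %s #%d, last checked: %s\n", $r->url, $r->http_code, $r->context_type, $r->context_id, $r->last_checked );
    }
    $admin_email = get_option( 'admin_email' );
    wp_mail( $admin_email, 'Weekly Broken Links Report', $message );
}
// Schedule on plugin init if not already scheduled.
if ( ! wp_next_scheduled( 'bl_weekly_report' ) ) {
    wp_schedule_event( time(), 'weekly', 'bl_weekly_report' );
}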

Performance and operational considerations

  • Rate limiting and politeness: Add configurable delay between requests to the same host. Honor robots or opt-out domains if required.
  • Caching: Use transients to avoid repeatedly checking popular external resources within a short window. E.g., cache a successful check for 24 hours.
  • HTTP concurrency: The wp_remote_* functions are synchronous by default. For larger sites, consider Action Scheduler, an async HTTP client, or a third-party library to parallelize requests, with care and throttling.
  • Retries: Implement exponential backoff for transient network errors and attempt 2-3 checks before marking permanently broken.
  • Relative URLs: Resolve relative URLs to absolute using the post/site context prior to checking.
  • Link frequency: Only re-check links that are stale or changed (compare a hash of content or last_checked timestamp) to reduce unnecessary checks.
  • Large sites: For very large sites, consider indexing unique external domains and checking domain-level health first (DNS, TLS), then per-URL checks.
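
The "Relative URLs" point matters because the collector returns href and src values exactly as written, and a value such as /about or ../img/a.png cannot be requested on its own. Below is a simplified sketch of resolving a reference against the URL of the page it was found on. bl_resolve_url is a hypothetical helper and covers only the common cases (absolute, protocol-relative, root-relative, path-relative, query-only and fragment-only references); it is not a complete RFC 3986 implementation. WordPress core also provides WP_Http::make_absolute_url(), which may be preferable inside a plugin.

```php
<?php
/**
 * Resolve a possibly relative reference against a base URL (simplified sketch).
 */
function bl_resolve_url( $relative, $base ) {
    $relative = trim( (string) $relative );
    if ( '' === $relative ) {
        return $base;
    }
    // Already absolute: starts with a scheme such as https: or mailto:.
    if ( preg_match( '#^[a-z][a-z0-9+.\-]*:#i', $relative ) ) {
        return $relative;
    }
    $b = parse_url( $base );
    if ( empty( $b['scheme'] ) || empty( $b['host'] ) ) {
        return $relative; // Unusable base; leave the input untouched.
    }
    // Protocol-relative reference: reuse the scheme of the base.
    if ( 0 === strpos( $relative, '//' ) ) {
        return $b['scheme'] . ':' . $relative;
    }
    $origin    = $b['scheme'] . '://' . $b['host'] . ( isset( $b['port'] ) ? ':' . $b['port'] : '' );
    $base_path = isset( $b['path'] ) ? $b['path'] : '/';
    if ( '#' === $relative[0] ) {
        $hash = strpos( $base, '#' );
        return ( false === $hash ? $base : substr( $base, 0, $hash ) ) . $relative;
    }
    if ( '?' === $relative[0] ) {
        return $origin . $base_path . $relative;
    }
    // Root-relative paths replace the base path; others replace its last segment.
    $target = ( '/' === $relative[0] )
        ? $relative
        : substr( $base_path, 0, strrpos( $base_path, '/' ) + 1 ) . $relative;
    // Split off the query/fragment so dot segments are only removed from the path.
    $cut    = strcspn( $target, '?#' );
    $suffix = (string) substr( $target, $cut );
    $path   = substr( $target, 0, $cut );

    $segments = explode( '/', $path );
    $last     = end( $segments );
    $out      = array();
    foreach ( $segments as $seg ) {
        if ( '.' === $seg ) {
            continue;
        }
        if ( '..' === $seg ) {
            if ( count( $out ) > 1 ) {
                array_pop( $out ); // Never pop the leading empty segment (the root).
            }
            continue;
        }
        $out[] = $seg;
    }
    if ( '.' === $last || '..' === $last ) {
        $out[] = ''; // A trailing dot segment refers to a directory.
    }
    return $origin . implode( '/', $out ) . $suffix;
}
```

In the batch scanner, the base would typically be the permalink of the post being scanned (get_permalink( $post_id )), and the resolved URL is what gets stored and checked.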

Security and hardening

  • Always verify nonces for admin actions and AJAX endpoints.
  • Ensure proper capability checks for any administration or export operations.
  • Escape output in admin UI with esc_html/esc_url. Use prepared SQL for queries.
  • Limit direct outbound connections to avoid being used as a proxy. Avoid returning raw HTTP responses to users.
  • Be careful when checking internal-only URLs behind authentication: treat 401/403 as potentially broken or special-case internal checks with credentials (rare).
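
To limit outbound connections as suggested above, the checker can refuse URLs that resolve to loopback, private or reserved addresses before making a request. The sketch below is one possible guard under stated assumptions: bl_is_public_ip and bl_is_checkable_url are hypothetical names, the default resolver gethostbynamel() returns IPv4 addresses only, and the $resolver parameter exists so the lookup can be replaced (for example in tests). WordPress core offers wp_safe_remote_get() and wp_http_validate_url() for a similar purpose.

```php
<?php
/** True when the address is neither private nor reserved (covers loopback and link-local IPv4). */
function bl_is_public_ip( $ip ) {
    return false !== filter_var( $ip, FILTER_VALIDATE_IP, FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE );
}

/**
 * Decide whether a URL is safe to request from the server (sketch).
 * Only http/https URLs whose host resolves exclusively to public addresses pass.
 */
function bl_is_checkable_url( $url, $resolver = 'gethostbynamel' ) {
    $parts = parse_url( (string) $url );
    if ( empty( $parts['scheme'] ) || empty( $parts['host'] ) ) {
        return false;
    }
    if ( ! in_array( strtolower( $parts['scheme'] ), array( 'http', 'https' ), true ) ) {
        return false;
    }
    $host = trim( $parts['host'], '[]' ); // IPv6 literals arrive in brackets.
    if ( false !== filter_var( $host, FILTER_VALIDATE_IP ) ) {
        return bl_is_public_ip( $host );
    }
    $ips = call_user_func( $resolver, $host );
    if ( empty( $ips ) ) {
        return false; // Unresolvable host.
    }
    foreach ( $ips as $ip ) {
        if ( ! bl_is_public_ip( $ip ) ) {
            return false;
        }
    }
    return true;
}
```

A site that itself runs on a private network would fail this check for its own internal links, so such hosts need an explicit allowlist. The guard also does not protect against a host that changes its DNS answer between the check and the request.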

Advanced features and extensions

  • Automatic rewrite suggestions: when 301/302 point to a new final_url, optionally offer to update posts or create a redirect rule.
  • Integration with editors: inline context links in the post edit screen highlighting broken links in content and allowing one-click fixes.
  • Support for custom fields and ACF: scan meta_value fields for URLs by scanning postmeta or using field definitions.
  • Use Action Scheduler library for scalable background processing (recommended for sites with high volume).
  • Store HTTP response headers for deeper diagnostics (cache-control, content-type, server).

Troubleshooting tips

  1. If many HEAD requests fail: some hosts block HEAD requests. Fall back to GET or set allow_head false for problematic domains.
  2. If timeouts are common: increase timeout slightly or process fewer URLs per batch. Consider using WP CLI for manual scans on demand.
  3. SSL verification failures: record the error message and treat separately. Do not disable sslverify globally in production.
  4. Missing URLs from content: ensure you also scan serialized data in postmeta and widget outputs if necessary.
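
For the timeout and transient-failure cases above, retries with exponential backoff keep a temporary outage from being reported as a broken link. The following is a minimal sketch; bl_retry_delay and bl_should_retry are hypothetical helpers, the 60-second base and one-hour cap are assumed defaults, and treating 5xx responses as transient is a design choice rather than something the original code does.

```php
<?php
/**
 * Delay in seconds before retry number $attempt (1-based), doubling each time (sketch).
 */
function bl_retry_delay( $attempt, $base = 60, $cap = 3600 ) {
    $attempt = max( 1, (int) $attempt );
    return (int) min( $cap, $base * pow( 2, $attempt - 1 ) );
}

/**
 * Retry only results that look transient: no response at all ('unknown')
 * or a server-side 5xx error, and only while attempts remain.
 */
function bl_should_retry( $status, $http_code, $attempts_made, $max_attempts = 3 ) {
    $transient = ( 'unknown' === $status ) || ( (int) $http_code >= 500 );
    return $transient && $attempts_made < $max_attempts;
}
```

When bl_should_retry() returns true, the check event could be rescheduled at time() + bl_retry_delay( $attempts_made ) and the row left unreported until the attempts are exhausted.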

Example: Minimal end-to-end flow summary

  1. On activation, create the table.
  2. Schedule a scan job that reads 50 posts at a time and extracts URLs.
  3. For every extracted URL, insert or update a record and schedule a check event.
  4. The check event runs, performs HEAD/GET, and updates status and last_checked.
  5. The admin page shows a summary and allows re-check, mark fixed, and CSV export.
  6. A weekly report is emailed to the admin with a summary of broken links.

Complete code notes and packaging

Organize code into a plugin with clear separation:

  • includes/schema.php: table creation and upgrade logic
  • includes/collector.php: extraction helpers
  • includes/checker.php: HTTP checking and normalization
  • includes/storage.php: DB insert/update helpers
  • includes/scheduler.php: batch scan and cron scheduling
  • admin/admin-page.php: admin listing, export, action handling, AJAX endpoints
  • assets/js/admin.js: optional AJAX recheck UI

Test thoroughly on a staging server before running full scans on production. For very large sites, prefer CLI scans via WP-CLI to avoid web server/cron limits.

Short example: key snippets recap

Normalization, extraction, checking, storing, and admin listing are shown above as full snippets. Use them as the foundation and extend for your environment. The approach balances correctness and scalability: chunked scans, scheduled checks, robust HTTP detection and cautious defaults (timeouts, user-agent) to avoid abuse.

End of tutorial


