Introduction
This article is a complete, detailed tutorial showing how to audit broken links in a WordPress site and present a comprehensive report in the WordPress admin using PHP. It covers architecture, best practices, security considerations, performance optimizations, a robust chunked/batched scanning implementation, storage in a custom table, the admin UI, AJAX re-checks, scheduled scans, CSV export, email notifications, and example code for each part. All code examples are ready to drop into a plugin file or a persistent mu-plugin; adapt the file organization to your project structure.
High-level architecture and flow
- Scanner/Collector: Extracts every URL (anchors, images, scripts, CSS links, iframe src, srcset items, inline styles) from posts, pages, custom fields, widgets, and optionally theme/plugin files.
- Queue Batch Processor: Processes URLs in small batches to avoid timeouts and rate limits. Uses WP Cron or scheduled single events to continue processing.
- HTTP Checker: Performs HEAD or lightweight GET requests with careful redirect handling and timeouts to determine status codes and final destination for redirects.
- Storage: Custom database table (via dbDelta) storing each found URL, context (post ID, field, anchor text), status, last checked timestamp, HTTP code, final redirect target, and notes.
- Admin UI: WP admin page showing paginated table of broken links with filters (site internal vs external, status code, last checked), actions (re-check, mark fixed, view context, open URL), bulk export to CSV.
- Notifications: Scheduled email reports and optional Slack/webhook notifications for critical failures.
Design decisions and constraints
- Prefer wp_remote_head first to minimize bandwidth; fall back to wp_remote_get for servers that block HEAD.
- Track redirects explicitly: treat 3xx as not necessarily broken but log the final URL and final status.
- Use chunked scanning (e.g., 50 posts per batch, 20 URLs per HTTP batch) to avoid PHP timeouts and WP Cron overlaps.
- Respect remote sites: add a configurable delay between requests and an adjustable user-agent string.
- Sanitize and escape all input/output in the admin UI; require a capability check (manage_options) for management actions.
Database schema
Create a custom table to store link scan results. Key columns: id, url, normalized_url, context_type (post, media, comment, meta), context_id, anchor_text, source_excerpt, http_code, status (ok, redirect, broken, unknown), final_url, last_checked (datetime), check_count, notes.
CREATE TABLE wp_broken_links (
    id BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
    url TEXT NOT NULL,
    normalized_url TEXT NOT NULL,
    context_type VARCHAR(50) NOT NULL,
    context_id BIGINT(20) UNSIGNED DEFAULT 0,
    anchor_text TEXT DEFAULT NULL,
    source_excerpt TEXT DEFAULT NULL,
    http_code SMALLINT(5) DEFAULT NULL,
    status VARCHAR(20) DEFAULT 'unknown',
    final_url TEXT DEFAULT NULL,
    last_checked DATETIME DEFAULT NULL,
    check_count INT DEFAULT 0,
    notes TEXT DEFAULT NULL,
    PRIMARY KEY (id),
    KEY normalized_url_idx (normalized_url(191)),
    KEY status_idx (status),
    KEY last_checked_idx (last_checked)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Plugin bootstrap and activation (create table)
Use register_activation_hook to create the table with dbDelta. Ensure the correct charset/collation and require the upgrade functions file (wp-admin/includes/upgrade.php).
register_activation_hook( __FILE__, 'bl_create_table' );

function bl_create_table() {
    global $wpdb;
    $table           = $wpdb->prefix . 'broken_links';
    $charset_collate = $wpdb->get_charset_collate();

    // Note: dbDelta is picky about formatting, e.g. two spaces after PRIMARY KEY.
    $sql = "CREATE TABLE $table (
        id BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
        url TEXT NOT NULL,
        normalized_url TEXT NOT NULL,
        context_type VARCHAR(50) NOT NULL,
        context_id BIGINT(20) UNSIGNED DEFAULT 0,
        anchor_text TEXT DEFAULT NULL,
        source_excerpt TEXT DEFAULT NULL,
        http_code SMALLINT(5) DEFAULT NULL,
        status VARCHAR(20) DEFAULT 'unknown',
        final_url TEXT DEFAULT NULL,
        last_checked DATETIME DEFAULT NULL,
        check_count INT DEFAULT 0,
        notes TEXT DEFAULT NULL,
        PRIMARY KEY  (id),
        KEY normalized_url_idx (normalized_url(191)),
        KEY status_idx (status),
        KEY last_checked_idx (last_checked)
    ) $charset_collate;";

    require_once ABSPATH . 'wp-admin/includes/upgrade.php';
    dbDelta( $sql );
}
Link extraction (collector)
Parsing HTML with DOMDocument is robust compared to regex. Handle libxml errors, UTF-8, and collect URLs from common attributes: a[href], img[src], img[srcset], script[src], link[href] (stylesheets), iframe[src], inline styles (background-image), and within srcset values.
function bl_extract_urls_from_html( $html ) {
    $urls = array();
    if ( '' === trim( (string) $html ) ) {
        return $urls;
    }

    libxml_use_internal_errors( true );
    $dom = new DOMDocument();
    // mb_convert_encoding to HTML-ENTITIES is deprecated on PHP 8.2+;
    // prepend a UTF-8 meta tag instead if you target newer PHP.
    $dom->loadHTML(
        mb_convert_encoding( $html, 'HTML-ENTITIES', 'UTF-8' ),
        LIBXML_NOWARNING | LIBXML_NOERROR
    );

    // anchors
    foreach ( $dom->getElementsByTagName( 'a' ) as $a ) {
        $href = $a->getAttribute( 'href' );
        if ( $href ) {
            $urls[] = array( 'url' => $href, 'anchor' => $a->textContent );
        }
    }

    // images (src and srcset)
    foreach ( $dom->getElementsByTagName( 'img' ) as $img ) {
        if ( $img->hasAttribute( 'src' ) ) {
            $urls[] = array( 'url' => $img->getAttribute( 'src' ), 'anchor' => '' );
        }
        if ( $img->hasAttribute( 'srcset' ) ) {
            // parse srcset entries ("url width, url width, ...")
            $parts = preg_split( '/\s*,\s*/', $img->getAttribute( 'srcset' ) );
            foreach ( $parts as $part ) {
                $p = preg_split( '/\s+/', trim( $part ) );
                if ( ! empty( $p[0] ) ) {
                    $urls[] = array( 'url' => $p[0], 'anchor' => '' );
                }
            }
        }
    }

    // script src
    foreach ( $dom->getElementsByTagName( 'script' ) as $s ) {
        if ( $s->hasAttribute( 'src' ) ) {
            $urls[] = array( 'url' => $s->getAttribute( 'src' ), 'anchor' => '' );
        }
    }

    // link href (stylesheets etc.)
    foreach ( $dom->getElementsByTagName( 'link' ) as $l ) {
        if ( $l->hasAttribute( 'href' ) ) {
            $urls[] = array( 'url' => $l->getAttribute( 'href' ), 'anchor' => '' );
        }
    }

    // iframes ("if" is a PHP reserved word, so use $frame)
    foreach ( $dom->getElementsByTagName( 'iframe' ) as $frame ) {
        if ( $frame->hasAttribute( 'src' ) ) {
            $urls[] = array( 'url' => $frame->getAttribute( 'src' ), 'anchor' => '' );
        }
    }
    libxml_clear_errors();

    // normalize and dedupe
    $normalized = array();
    foreach ( $urls as $u ) {
        $u['url'] = trim( $u['url'] );
        if ( ! $u['url'] ) {
            continue;
        }
        $normalized[ normalize_url_for_db( $u['url'] ) ] = $u;
    }
    return array_values( $normalized );
}
Normalization helper
Normalize URLs for deduplication (strip fragments, trailing slash optionally, lowercase scheme/host). Store raw and normalized versions.
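A minimal sketch of what a normalize_url_for_db() helper along these lines can look like; the exact rules (for example whether to strip a trailing slash) are configurable choices, and relative URLs are passed through untouched so they can be resolved against the site URL later:

```php
function normalize_url_for_db( $url ) {
    $url = trim( $url );

    // Strip the fragment: it never affects the HTTP response.
    $hash = strpos( $url, '#' );
    if ( false !== $hash ) {
        $url = substr( $url, 0, $hash );
    }

    $parts = parse_url( $url );
    if ( empty( $parts['host'] ) ) {
        // Relative URL: keep as-is; resolve before checking.
        return $url;
    }

    // Lowercase scheme and host; keep path and query as-is apart
    // from the optional trailing slash.
    $scheme = strtolower( isset( $parts['scheme'] ) ? $parts['scheme'] : 'http' );
    $host   = strtolower( $parts['host'] );
    $path   = isset( $parts['path'] ) ? rtrim( $parts['path'], '/' ) : '';
    $query  = isset( $parts['query'] ) ? '?' . $parts['query'] : '';

    return $scheme . '://' . $host . $path . $query;
}
```

Store both the raw URL and this normalized form so reports can still show the link exactly as it appears in the content.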
HTTP checking: robust status detection
Strategy:
- Try a HEAD request with wp_remote_head. Many servers respond properly; others deny HEAD.
- If HEAD fails or returns an error code indicating server blocks HEAD, fallback to GET with small timeout and limited body download.
- Do not automatically follow redirects when you need to inspect intermediate codes, but optionally allow following to discover the final_url.
- Treat 2xx as OK, 3xx as redirect (log Location header and optionally follow to final), 4xx and 5xx as broken.
- Handle timeouts, SSL errors and DNS errors: mark as unknown and record WP_Error message.
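The optional "follow to final" step can be a thin wrapper over the checker. This sketch assumes the bl_check_url_status() function used throughout this article and simply re-runs the request with redirection enabled to get the landing status:

```php
// Follow the redirect chain (up to 5 hops) and report the final status.
function bl_resolve_final_status( $url ) {
    return bl_check_url_status( $url, array(
        'follow_redirects' => true,
        'max_redirects'    => 5,
    ) );
}
```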
function bl_check_url_status( $url, $opts = array() ) {
    $defaults = array(
        'timeout'          => 8,
        'user-agent'       => 'WP-BrokenLinkChecker/1.0 (+https://example.com)',
        'follow_redirects' => false,
        'max_redirects'    => 5,
        'allow_head'       => true,
    );
    $opts = wp_parse_args( $opts, $defaults );

    $result = array(
        'url'       => $url,
        'http_code' => null,
        'status'    => 'unknown',
        'final_url' => null,
        'error'     => null,
    );

    // Skip local anchors and protocols we don't check.
    if ( 0 === strpos( $url, '#' )
        || 0 === stripos( $url, 'mailto:' )
        || 0 === stripos( $url, 'javascript:' )
        || 0 === strpos( $url, 'data:' ) ) {
        $result['status'] = 'skip';
        return $result;
    }

    $args = array(
        'timeout'     => $opts['timeout'],
        'redirection' => $opts['follow_redirects'] ? $opts['max_redirects'] : 0,
        'user-agent'  => $opts['user-agent'],
        'sslverify'   => true,
    );

    // Try HEAD first.
    if ( $opts['allow_head'] ) {
        $head = wp_remote_head( $url, $args );
        if ( is_wp_error( $head ) ) {
            // Fall back to GET for servers that block HEAD.
            $get = wp_remote_get( $url, $args );
            if ( is_wp_error( $get ) ) {
                $result['error']  = $get->get_error_message();
                $result['status'] = 'unknown';
                return $result;
            }
            $result['http_code'] = (int) wp_remote_retrieve_response_code( $get );
            $result['final_url'] = $url;
        } else {
            $result['http_code'] = (int) wp_remote_retrieve_response_code( $head );
            // With redirection disabled, the Location header is the redirect target.
            $loc = wp_remote_retrieve_header( $head, 'location' );
            $result['final_url'] = $loc ? $loc : $url;
        }
    } else {
        $get = wp_remote_get( $url, $args );
        if ( is_wp_error( $get ) ) {
            $result['error']  = $get->get_error_message();
            $result['status'] = 'unknown';
            return $result;
        }
        $result['http_code'] = (int) wp_remote_retrieve_response_code( $get );
        $result['final_url'] = $url;
    }

    if ( $result['http_code'] >= 200 && $result['http_code'] < 300 ) {
        $result['status'] = 'ok';
    } elseif ( $result['http_code'] >= 300 && $result['http_code'] < 400 ) {
        $result['status'] = 'redirect'; // optionally follow to the final status in a separate call
    } elseif ( $result['http_code'] >= 400 && $result['http_code'] < 600 ) {
        $result['status'] = 'broken';
    } else {
        $result['status'] = 'unknown';
    }
    return $result;
}
Storing results (insert/update)
Upsert logic: if a row for the same normalized_url and context already exists, update it; otherwise insert a new one. Increment check_count and set last_checked.
function bl_store_check_result( $data ) {
    global $wpdb;
    $table      = $wpdb->prefix . 'broken_links';
    $normalized = normalize_url_for_db( $data['url'] );

    // Check for an existing row for the same URL + context.
    $row = $wpdb->get_row( $wpdb->prepare(
        "SELECT * FROM $table WHERE normalized_url = %s AND context_type = %s AND context_id = %d",
        $normalized,
        $data['context_type'],
        isset( $data['context_id'] ) ? (int) $data['context_id'] : 0
    ) );

    $now = current_time( 'mysql' );
    $insert_data = array(
        'url'            => $data['url'],
        'normalized_url' => $normalized,
        'context_type'   => $data['context_type'],
        'context_id'     => isset( $data['context_id'] ) ? (int) $data['context_id'] : 0,
        'anchor_text'    => isset( $data['anchor_text'] ) ? $data['anchor_text'] : '',
        'source_excerpt' => isset( $data['source_excerpt'] ) ? $data['source_excerpt'] : '',
        'http_code'      => isset( $data['http_code'] ) ? (int) $data['http_code'] : null,
        'status'         => isset( $data['status'] ) ? $data['status'] : 'unknown',
        'final_url'      => isset( $data['final_url'] ) ? $data['final_url'] : null,
        'last_checked'   => $now,
        'notes'          => isset( $data['notes'] ) ? $data['notes'] : '',
    );

    if ( $row ) {
        $wpdb->update(
            $table,
            array_merge( $insert_data, array( 'check_count' => $row->check_count + 1 ) ),
            array( 'id' => $row->id )
        );
        return (int) $row->id;
    }

    $insert_data['check_count'] = 1;
    $wpdb->insert( $table, $insert_data );
    return (int) $wpdb->insert_id;
}
Scanning posts/pages/media in batches
Perform a scheduled batch that queries posts by offset/limit. Each batch extracts URLs for each post and enqueues checks for those URLs or immediately processes them in small sub-batches. Use wp_schedule_single_event to re-schedule the next batch to avoid long-running requests.
function bl_start_scan() {
    if ( ! wp_next_scheduled( 'bl_scan_batch' ) ) {
        // Cron args are positional: offset 0, 50 posts per batch.
        wp_schedule_single_event( time(), 'bl_scan_batch', array( 0, 50 ) );
    }
}

add_action( 'bl_scan_batch', 'bl_scan_batch_handler', 10, 2 );

function bl_scan_batch_handler( $offset = 0, $per_page = 50 ) {
    $args = array(
        'post_type'      => array( 'post', 'page' ),
        'post_status'    => 'publish',
        'posts_per_page' => $per_page,
        'offset'         => $offset,
        'fields'         => 'ids',
    );
    $query = new WP_Query( $args );

    if ( ! $query->have_posts() ) {
        // Finished; optionally schedule the next complete scan in one week.
        wp_schedule_single_event( time() + WEEK_IN_SECONDS, 'bl_scan_batch', array( 0, $per_page ) );
        return;
    }

    foreach ( $query->posts as $post_id ) {
        $content = get_post_field( 'post_content', $post_id );
        $urls    = bl_extract_urls_from_html( $content );
        foreach ( $urls as $u ) {
            $entry = array(
                'url'            => $u['url'],
                'context_type'   => 'post',
                'context_id'     => $post_id,
                'anchor_text'    => $u['anchor'],
                'source_excerpt' => wp_trim_words( strip_tags( $content ), 20 ),
            );
            // Store now as unknown, then enqueue the actual check (throttled).
            bl_store_check_result( $entry );
            wp_schedule_single_event( time() + 2, 'bl_check_url_event', array( $u['url'], 'post', $post_id ) );
        }
    }

    // Schedule the next batch with an increased offset.
    wp_schedule_single_event( time() + 3, 'bl_scan_batch', array( $offset + $per_page, $per_page ) );
}
Checking queued URLs via scheduled events
Rather than firing a large chain of immediate HTTP requests, schedule individual checks at small intervals to spread load. Provide a separate handler for the check event that calls the HTTP checker and stores results.
add_action( 'bl_check_url_event', 'bl_check_url_event_handler', 10, 3 );

function bl_check_url_event_handler( $url, $context_type = 'post', $context_id = 0 ) {
    $result = bl_check_url_status( $url, array( 'timeout' => 10, 'allow_head' => true ) );
    $data   = array(
        'url'          => $url,
        'context_type' => $context_type,
        'context_id'   => $context_id,
        'http_code'    => $result['http_code'],
        'status'       => $result['status'],
        'final_url'    => $result['final_url'],
        'notes'        => isset( $result['error'] ) ? (string) $result['error'] : '',
    );
    bl_store_check_result( $data );
}
Admin UI: listing broken links
Provide an admin page under Tools or its own top-level menu. Only users with manage_options (or a more granular capability) can access it. Implement pagination, filters, and nonces for actions.
function bl_render_admin_page() {
    if ( ! current_user_can( 'manage_options' ) ) {
        wp_die( 'Insufficient permissions' );
    }
    global $wpdb;
    $table    = $wpdb->prefix . 'broken_links';
    $paged    = max( 1, intval( $_GET['paged'] ?? 1 ) );
    $per_page = 25;
    $offset   = ( $paged - 1 ) * $per_page;

    $where = 'WHERE 1=1';
    if ( ! empty( $_GET['status'] ) ) {
        $where .= $wpdb->prepare( ' AND status = %s', sanitize_text_field( $_GET['status'] ) );
    }

    $total = (int) $wpdb->get_var( "SELECT COUNT(*) FROM $table $where" );
    $rows  = $wpdb->get_results( $wpdb->prepare(
        "SELECT * FROM $table $where ORDER BY last_checked DESC LIMIT %d OFFSET %d",
        $per_page,
        $offset
    ) );

    // Render table header and rows; escape every value on output.
    echo '<div class="wrap"><h1>Broken Links</h1>';
    echo '<table class="widefat striped"><thead><tr>';
    echo '<th>URL</th><th>Status</th><th>Code</th><th>Context</th><th>Last checked</th>';
    echo '</tr></thead><tbody>';
    foreach ( $rows as $r ) {
        printf(
            '<tr><td><a href="%s" target="_blank" rel="noopener">%s</a></td><td>%s</td><td>%s</td><td>%s #%d</td><td>%s</td></tr>',
            esc_url( $r->url ),
            esc_html( $r->url ),
            esc_html( $r->status ),
            esc_html( (string) $r->http_code ),
            esc_html( $r->context_type ),
            (int) $r->context_id,
            esc_html( (string) $r->last_checked )
        );
    }
    echo '</tbody></table>';

    // Pagination (simple).
    $total_pages = (int) ceil( $total / $per_page );
    if ( $total_pages > 1 ) {
        echo '<p>Page: ';
        for ( $i = 1; $i <= $total_pages; $i++ ) {
            if ( $i === $paged ) {
                echo '<strong>' . intval( $i ) . '</strong> ';
            } else {
                printf( '<a href="%s">%d</a> ', esc_url( add_query_arg( 'paged', $i ) ), $i );
            }
        }
        echo '</p>';
    }
    echo '</div>';
}
Admin actions processing (handling recheck and mark fixed)
Validate nonce, verify capability, parse action button value, then call check or update DB accordingly. For recheck, schedule a check event rather than performing heavy HTTP in the request.
function bl_handle_admin_actions() {
    if ( ! current_user_can( 'manage_options' ) ) {
        return;
    }
    $action = isset( $_POST['bl_action'] ) ? sanitize_text_field( $_POST['bl_action'] ) : '';
    if ( ! $action || ! check_admin_referer( 'bl_admin_action', 'bl_nonce' ) ) {
        return;
    }

    global $wpdb;
    $table = $wpdb->prefix . 'broken_links';

    if ( 0 === strpos( $action, 'recheck_' ) ) {
        $id  = intval( substr( $action, 8 ) ); // strlen( 'recheck_' ) === 8
        $row = $wpdb->get_row( $wpdb->prepare( "SELECT * FROM $table WHERE id = %d", $id ) );
        if ( $row ) {
            wp_schedule_single_event(
                time() + 1,
                'bl_check_url_event',
                array( $row->url, $row->context_type, $row->context_id )
            );
            echo '<div class="notice notice-info"><p>Re-check scheduled.</p></div>';
        }
    } elseif ( 0 === strpos( $action, 'mark_fixed_' ) ) {
        $id = intval( substr( $action, 11 ) ); // strlen( 'mark_fixed_' ) === 11
        $wpdb->update(
            $table,
            array(
                'status'       => 'ok',
                'notes'        => 'Manually marked fixed',
                'last_checked' => current_time( 'mysql' ),
            ),
            array( 'id' => $id )
        );
        echo '<div class="notice notice-success"><p>Marked fixed.</p></div>';
    }
}
AJAX support for on-demand re-checks
Add a secure AJAX action for re-checking a specific row. Use wp_ajax_ hooks and verify nonce and capability. Return JSON result with updated status.
add_action( 'wp_ajax_bl_recheck', 'bl_ajax_recheck' );

function bl_ajax_recheck() {
    check_ajax_referer( 'bl_ajax', 'nonce' );
    if ( ! current_user_can( 'manage_options' ) ) {
        wp_send_json_error( 'forbidden', 403 );
    }

    $id = isset( $_POST['id'] ) ? intval( $_POST['id'] ) : 0;
    global $wpdb;
    $table = $wpdb->prefix . 'broken_links';
    $row   = $wpdb->get_row( $wpdb->prepare( "SELECT * FROM $table WHERE id = %d", $id ) );
    if ( ! $row ) {
        wp_send_json_error( 'not_found', 404 );
    }

    // Perform a synchronous check (consider async to avoid long requests).
    $res  = bl_check_url_status( $row->url );
    $data = array(
        'url'       => $row->url,
        'http_code' => $res['http_code'],
        'status'    => $res['status'],
        'final_url' => $res['final_url'],
        'error'     => isset( $res['error'] ) ? $res['error'] : '',
    );
    bl_store_check_result( array_merge( (array) $row, $data ) );
    wp_send_json_success( $data );
}
Export CSV report
Provide a simple export endpoint that streams CSV headers and rows. Ensure capability check and nonce before streaming. Offer filters (status, date range).
add_action( 'admin_post_bl_export_csv', 'bl_export_csv' );

function bl_export_csv() {
    if ( ! current_user_can( 'manage_options' ) ) {
        wp_die( 'Insufficient permissions', '', array( 'response' => 403 ) );
    }
    if ( ! check_admin_referer( 'bl_export_csv', 'bl_export_nonce' ) ) {
        wp_die( 'Invalid nonce', '', array( 'response' => 403 ) );
    }

    global $wpdb;
    $table = $wpdb->prefix . 'broken_links';
    $rows  = $wpdb->get_results( "SELECT * FROM $table ORDER BY last_checked DESC" );

    header( 'Content-Type: text/csv; charset=utf-8' );
    header( 'Content-Disposition: attachment; filename=broken-links.csv' );

    $output = fopen( 'php://output', 'w' );
    fputcsv( $output, array( 'id', 'url', 'context_type', 'context_id', 'http_code', 'status', 'final_url', 'last_checked', 'check_count', 'notes' ) );
    foreach ( $rows as $r ) {
        fputcsv( $output, array(
            $r->id, $r->url, $r->context_type, $r->context_id, $r->http_code,
            $r->status, $r->final_url, $r->last_checked, $r->check_count, $r->notes,
        ) );
    }
    fclose( $output );
    exit;
}
Scheduled weekly summary and alerting
Schedule a weekly job that queries the table for broken links found in the last week and emails an admin report. Optionally call a webhook/Slack integration for urgent notifications.
add_action( 'bl_weekly_report', 'bl_send_weekly_report' );

function bl_send_weekly_report() {
    global $wpdb;
    $table = $wpdb->prefix . 'broken_links';
    $since = date( 'Y-m-d H:i:s', strtotime( '-7 days' ) );
    $rows  = $wpdb->get_results( $wpdb->prepare(
        "SELECT * FROM $table WHERE status = %s AND last_checked >= %s ORDER BY last_checked DESC",
        'broken',
        $since
    ) );
    if ( empty( $rows ) ) {
        return;
    }

    $message = "Weekly broken links report\n\n";
    foreach ( $rows as $r ) {
        $message .= sprintf(
            "%s (code: %s), context: %s #%d, last checked: %s\n",
            $r->url, $r->http_code, $r->context_type, $r->context_id, $r->last_checked
        );
    }
    wp_mail( get_option( 'admin_email' ), 'Weekly Broken Links Report', $message );
}

// Schedule on plugin init if not already scheduled.
if ( ! wp_next_scheduled( 'bl_weekly_report' ) ) {
    wp_schedule_event( time(), 'weekly', 'bl_weekly_report' );
}
Performance and operational considerations
- Rate limiting and politeness: Add configurable delay between requests to the same host. Honor robots or opt-out domains if required.
- Caching: Use transients to avoid repeatedly checking popular external resources within a short window. E.g., cache a successful check for 24 hours.
- HTTP concurrency: wp_remote_* calls are synchronous by default. For larger sites, consider Action Scheduler, an async HTTP client, or third-party libraries to parallelize, with care and throttling.
- Retries: Implement exponential backoff for transient network errors and attempt 2-3 checks before marking permanently broken.
- Relative URLs: Resolve relative URLs to absolute using the post/site context prior to checking.
- Link frequency: Only re-check links that are stale or changed (compare a hash of content or last_checked timestamp) to reduce unnecessary checks.
- Large sites: For very large sites, consider indexing unique external domains and checking domain-level health first (DNS, TLS), then per-URL checks.
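As an example of the caching point above, here is a sketch of a transient-backed wrapper around the checker. The key scheme and the 24-hour TTL are illustrative choices, and only successful results are cached so failures get re-checked sooner:

```php
// Reuse a successful check of the same normalized URL for 24 hours.
function bl_check_url_status_cached( $url, $opts = array() ) {
    $key    = 'bl_chk_' . md5( normalize_url_for_db( $url ) );
    $cached = get_transient( $key );
    if ( false !== $cached ) {
        return $cached;
    }

    $result = bl_check_url_status( $url, $opts );

    // Only cache positive results; broken/unknown should be retried.
    if ( 'ok' === $result['status'] ) {
        set_transient( $key, $result, DAY_IN_SECONDS );
    }
    return $result;
}
```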
Security and hardening
- Always verify nonces for admin actions and AJAX endpoints.
- Ensure proper capability checks for any administration or export operations.
- Escape output in admin UI with esc_html/esc_url. Use prepared SQL for queries.
- Limit direct outbound connections to avoid being used as a proxy. Avoid returning raw HTTP responses to users.
- Be careful when checking internal-only URLs behind authentication: treat 401/403 as potentially broken or special-case internal checks with credentials (rare).
Advanced features and extensions
- Automatic rewrite suggestions: when 301/302 point to a new final_url, optionally offer to update posts or create a redirect rule.
- Integration with editors: inline context links in the post edit screen highlighting broken links in content and allowing one-click fixes.
- Support for custom fields and ACF: scan meta_value fields for URLs by scanning postmeta or using field definitions.
- Use Action Scheduler library for scalable background processing (recommended for sites with high volume).
- Store HTTP response headers for deeper diagnostics (cache-control, content-type, server).
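For the custom fields/ACF point, one possible sketch of a postmeta scanner. The regex-based URL grab is a deliberate simplification: it catches absolute http(s) URLs in plain and serialized values but misses relative URLs, so treat it as a starting point:

```php
// Collect absolute http(s) URLs found in any meta value of a post.
function bl_scan_post_meta( $post_id ) {
    $found = array();
    foreach ( get_post_meta( $post_id ) as $meta_key => $values ) {
        foreach ( $values as $value ) {
            if ( ! is_string( $value ) ) {
                continue;
            }
            if ( preg_match_all( '#https?://[^\s"\'<>]+#i', $value, $m ) ) {
                foreach ( $m[0] as $url ) {
                    $found[] = array(
                        'url'          => $url,
                        'context_type' => 'meta',
                        'context_id'   => $post_id,
                        'anchor_text'  => $meta_key,
                    );
                }
            }
        }
    }
    return $found;
}
```

Entries in this shape can be passed straight to bl_store_check_result().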
Troubleshooting tips
- If many HEAD requests fail: some hosts block HEAD requests. Fall back to GET or set allow_head false for problematic domains.
- If timeouts are common: increase timeout slightly or process fewer URLs per batch. Consider using WP CLI for manual scans on demand.
- SSL verification failures: record the error message and treat separately. Do not disable sslverify globally in production.
- Missing URLs from content: ensure you also scan serialized data in postmeta and widget outputs if necessary.
Example: Minimal end-to-end flow summary
1) On activation create table. 2) Schedule a scan job that reads 50 posts at a time and extracts URLs. 3) For every extracted URL insert/update a record and schedule a check event. 4) Check event runs, performs HEAD/GET, updates status and last_checked. 5) Admin page shows summary and allows re-check, mark fixed, and export CSV. 6) Weekly report emailed to admin with a summary of broken links.
Complete code notes and packaging
Organize code into a plugin with clear separation:
- includes/schema.php — table creation and upgrade logic
- includes/collector.php — extraction helpers
- includes/checker.php — HTTP checking and normalization
- includes/storage.php — DB insert/update helpers
- includes/scheduler.php — batch scan and cron scheduling
- admin/admin-page.php — admin listing, export, action handling, AJAX endpoints
- assets/js/admin.js — optional AJAX recheck UI
Test thoroughly on a staging server before running full scans on production. For very large sites, prefer CLI scans via WP-CLI to avoid web server/cron limits.
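For the WP-CLI route, a hypothetical command registration; the "bl check-url" name is an assumption, and it reuses bl_check_url_status() from the checker section:

```php
// Usage: wp bl check-url https://example.com/some/link
if ( defined( 'WP_CLI' ) && WP_CLI ) {
    WP_CLI::add_command( 'bl check-url', function ( $args ) {
        $result = bl_check_url_status( $args[0] );
        WP_CLI::log( sprintf(
            '%s -> %s (HTTP %s)',
            $args[0],
            $result['status'],
            $result['http_code']
        ) );
    } );
}
```

Running scans this way avoids web-server timeouts and WP-Cron overlap entirely.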
Short example: key snippets recap
Normalization, extraction, checking, storing, and admin listing are shown above as full snippets. Use them as the foundation and extend for your environment. The approach balances correctness and scalability: chunked scans, scheduled checks, robust HTTP detection and cautious defaults (timeouts, user-agent) to avoid abuse.
End of tutorial