Most webmasters know that caching is important for website speed. And for sites running on WordPress, there are lots of plugins dedicated to page caching. But not a lot is said about cache warming (or cache preloading), which is just as important.
The PHP code shared below will warm up your website cache by crawling your entire sitemap using cURL. The code can be run from a Linux cron job.
I tested this with Yoast sitemaps. But I’m pretty sure it will work nicely with sitemaps generated by any other plugin as long as the generated sitemap follows standard conventions.
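For context, "standard conventions" here means the two document types described by the sitemaps.org protocol: a sitemap index (root element `sitemapindex`, listing child sitemaps) and a URL set (root element `urlset`, listing pages). The sketch below shows one way to tell them apart with SimpleXML. The function name `extract_locs` and the sample XML are made up for illustration and are not part of the crawler.

```php
<?php
/**
 * Hypothetical helper: return the root element name and every <loc> value
 * found in a sitemap index or a URL set.
 */
function extract_locs(string $xml_string): array
{
    // Throws an Exception if the XML is malformed
    $xml  = new SimpleXMLElement($xml_string);
    $type = $xml->getName();

    // A sitemap index nests <loc> under <sitemap>; a URL set nests it under <url>
    $entries = ('sitemapindex' === $type) ? $xml->sitemap : $xml->url;

    $locs = [];
    foreach ($entries as $entry) {
        $locs[] = trim((string) $entry->loc);
    }

    return ['type' => $type, 'locs' => $locs];
}

// Illustrative sample only; example.com stands in for a real site
$sample = '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    . '<sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>'
    . '<sitemap><loc>https://example.com/page-sitemap.xml</loc></sitemap>'
    . '</sitemapindex>';

print_r(extract_locs($sample));
```

If a plugin produces one of these two shapes, the crawler should be able to walk it; anything else falls through without adding URLs.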
I have added lots of comments to the code, so make sure to read them when working with it. You will also need to enter correct values for things like file paths, email address, domain, and so on.
This sitemap crawler works beautifully with Cloudflare. I have done some extensive testing with great results. If you use the Cloudflare “Cache Everything” page rule together with “Edge Cache TTL”, you can use this code to populate Cloudflare edge caches (POPs) by running it from different geographic locations worldwide.
The subject of how to run this code from different global geographic locations will be discussed in a future blog post.
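One way to see whether a page is being served from the edge is to look at the `cf-cache-status` response header that Cloudflare adds (values such as `HIT` or `MISS`). The following is a rough sketch of that check, not part of the crawler itself. Both function names are hypothetical, and the header block in the example is canned so it runs without network access.

```php
<?php
/** Hypothetical helper: pull the cf-cache-status value out of a raw header block. */
function cache_status_from_headers(string $raw_headers): ?string
{
    foreach (preg_split('/\r\n|\n/', $raw_headers) as $line) {
        // Header names are case-insensitive, so match without regard to case
        if (0 === stripos($line, 'cf-cache-status:')) {
            return strtoupper(trim(substr($line, strlen('cf-cache-status:'))));
        }
    }
    return null; // header absent (for example, the site is not behind Cloudflare)
}

/** Hypothetical helper: GET a URL and report the cache status header, if any. */
function fetch_cache_status(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADER, true); // include response headers in the returned string
    $response = curl_exec($ch);
    if (false === $response) {
        return null;
    }
    // Everything before this offset is the header block
    $header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    return cache_status_from_headers(substr($response, 0, $header_size));
}

// Canned example, no network needed
$headers = "HTTP/2 200\r\ncontent-type: text/html\r\ncf-cache-status: HIT\r\n\r\n";
echo cache_status_from_headers($headers), PHP_EOL;
```

Calling `fetch_cache_status()` on a few URLs before and after a warming run, from the same location, gives a quick indication of whether the edge cache there was populated.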
Here are some other things you may want to note:
- The code uses PHPMailer to send you email notifications if an error occurs. So don’t forget to enter the correct local path to the PHPMailer source code. If you don’t want the email notification features, you can remove the appropriate sections.
- This code also uses my random user agent generator. You can alternatively hardcode your own user agent. But to avoid the risk of getting blocked by firewalls, I recommend using my random user agent generator.
- I designed this to run from a Linux cron job, and all my tests ran from cron jobs as well. If you plan to run it some other way (for example, through a web server), watch out for PHP timeouts, especially when crawling large websites. With the PHP CLI, timeouts are not really a problem because the CLI has no execution time limit by default.
- The code uses flock() to prevent multiple instances of the same process from running at once. So if your cron jobs are scheduled close to each other and a sitemap crawl is already running, the new run will not start a second crawl. It sends a notification and exits instead.
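The locking idea from the last point can be tried in isolation. Below is a minimal sketch, assuming a Linux host where flock() locks belong to the open file handle. The helper name `acquire_single_instance_lock` and the temp-directory lock path are made up for this demo; the crawler itself uses its own lock.txt path.

```php
<?php
/**
 * Hypothetical helper: try to take an exclusive, non-blocking lock.
 * Returns the open file handle on success (keep it open for the whole run),
 * or null if the file cannot be opened or another holder has the lock.
 */
function acquire_single_instance_lock(string $lock_file)
{
    $fp = fopen($lock_file, 'c'); // 'c' creates the file if missing without truncating it
    if (false === $fp) {
        return null;
    }
    // LOCK_NB makes flock() return immediately instead of waiting for the lock
    if (!flock($fp, LOCK_EX | LOCK_NB)) {
        fclose($fp);
        return null;
    }
    return $fp;
}

$lock_file = sys_get_temp_dir() . '/warm-cache-demo.lock'; // demo path only

$first  = acquire_single_instance_lock($lock_file);
$second = acquire_single_instance_lock($lock_file); // simulates an overlapping cron run

echo (null !== $first)  ? "first run: lock acquired\n"  : "first run: lock busy\n";
echo (null !== $second) ? "second run: lock acquired\n" : "second run: lock busy\n";

// Closing the handle releases the lock
if (null !== $first) {
    fclose($first);
}
if (null !== $second) {
    fclose($second);
}
```

On Linux, the second attempt should report the lock as busy while the first handle is still open. Because the lock is non-blocking, an overlapping run gives up right away rather than queueing behind the running crawl.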
OK, enough preamble. Here’s the PHP sitemap crawler code:
<?php
// If you don't need email notifications of crawler errors,
// comment out the PHPMailer references and empty the body of send_alert()
use PHPMailer\PHPMailer\PHPMailer;
use PHPMailer\PHPMailer\Exception;

require '/your/full/path/phpmailer/src/Exception.php';
require '/your/full/path/phpmailer/src/PHPMailer.php';
require '/your/full/path/phpmailer/src/SMTP.php';

// You can optionally hardcode a single user agent instead of including this file
include '/your/full/path/warm-cache-folder/random-user-agent.php';

/**
 * Send an email notification. All SMTP settings live here, so you only edit them once.
 *
 * @param string $subject Subject line (the hostname is prepended)
 * @param string $body    Message body
 */
function send_alert( $subject, $body ) {
    $mail = new PHPMailer(true);
    $mail->SMTPDebug = 2;
    $mail->isSMTP();
    $mail->Host = 'smtp.your-host.com'; // Replace with your correct SMTP host
    $mail->SMTPAuth = true;
    $mail->Username = 'your_email@your-domain.com'; // Replace with your correct email address
    $mail->Password = 'your_password'; // Replace with your correct password
    $mail->SMTPSecure = 'tls'; // Enter the correct value
    $mail->Port = 587; // Enter the correct value
    $mail->setFrom('your_email@your-domain.com', 'Your Name');
    $mail->addAddress('your_email@your-domain.com', 'Your Name');
    $mail->isHTML(true);
    $mail->Subject = gethostname() . ' ' . $subject;
    $mail->Body = $body;
    $mail->send();
}

try {
    // If you're running this from a cron job, this flock() structure prevents
    // starting multiple instances of the same process
    $fp = fopen('/your/full/path/warm-cache-folder/lock.txt', 'w');
    if (false === $fp || !flock($fp, LOCK_EX | LOCK_NB)) {
        send_alert('Unable to obtain lock', 'Unable to obtain lock');
        echo 'Unable to obtain lock';
        exit(-1);
    }

    // My code here...
    $timezone = new DateTimeZone('America/Toronto'); // Replace with your own time zone (optional)
    $date_start = new DateTime();
    $date_start->setTimezone($timezone);
    file_put_contents('/your/full/path/warm-cache-folder/run-history.txt', "Cache warming started at:\t\t\t\t" . $date_start->format('M j, Y - g:i:s a') . PHP_EOL, FILE_APPEND);

    $sitemaps = array(
        'https://your-domain.com/sitemap_index.xml', // Enter your sitemap URL
    );
    $crawler = new Ehi_Crawler( $sitemaps );
    $crawler->run();

    $date_stop = new DateTime();
    $date_stop->setTimezone($timezone);
    $diff = $date_stop->getTimestamp() - $date_start->getTimestamp();
    file_put_contents('/your/full/path/warm-cache-folder/run-history.txt', "Cache warming finished/terminated at:\t" . $date_stop->format('M j, Y - g:i:s a') . PHP_EOL . "Total time taken:\t\t\t\t\t\t" . $diff . ' seconds' . PHP_EOL . PHP_EOL, FILE_APPEND);

    // Continue flock() code: closing the file releases the lock
    sleep(5);
    fclose($fp);
} catch (\Exception $e) {
    echo 'Caught exception: ', $e->getMessage(), "\n";
    send_alert('Exception', 'Error Occurred: ' . $e->getMessage());
    echo 'Message has been sent';
}

/**
 * Crawler class
 */
class Ehi_Crawler {
    protected $_sitemaps = null;
    protected $_urls = null;

    /**
     * Constructor
     *
     * @param array|string $sitemaps A string with a URL to an XML sitemap, or an array with URLs to XML sitemaps. Sitemap index files work well too.
     */
    function __construct( $sitemaps = null ) {
        $this->_sitemaps = [];
        $this->_urls = [];
        if ( ! is_null( $sitemaps ) ) {
            if ( ! is_array( $sitemaps ) ) {
                $sitemaps = array( $sitemaps );
            }
            foreach ( $sitemaps as $sitemap ) {
                $this->add_sitemap( $sitemap );
            }
        }
    }

    /**
     * Add a sitemap URL to our crawl stack. Sitemap index files work too.
     *
     * @param string $sitemapurl URL to an XML sitemap or sitemap index
     */
    public function add_sitemap( $sitemapurl ) {
        if ( in_array( $sitemapurl, $this->_sitemaps ) ) {
            return;
        }
        $this->_sitemaps[] = $sitemapurl;

        $ua = random_user_agent();
        $ch = curl_init();
        curl_setopt( $ch, CURLOPT_USERAGENT, $ua );
        // To prevent grabbing a cached sitemap page, I append the UNIX timestamp as a parameter to the sitemap URL
        // Since this value changes every second, the retrieved sitemap will always be fresh (not cached)
        // If you don't want this, tweak it accordingly
        $sitemapurl_nocache = $sitemapurl . '?current_time=' . time();
        curl_setopt( $ch, CURLOPT_URL, $sitemapurl_nocache );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
        $content = curl_exec( $ch );
        $http_return_code = curl_getinfo( $ch, CURLINFO_HTTP_CODE );

        if ( 200 !== (int) $http_return_code ) {
            send_alert( 'Invalid Return Code', 'Invalid Return Code: ' . $http_return_code );
            echo 'Message has been sent';
            return false;
        }

        // SimpleXMLElement throws an Exception on malformed XML,
        // which the try/catch at the top of this file reports by email
        $xml = new SimpleXMLElement( $content, LIBXML_NOBLANKS );

        switch ( $xml->getName() ) {
            case 'sitemapindex':
                foreach ( $xml->sitemap as $sitemap ) {
                    // Cast to string to get the text of the <loc> element
                    $this->add_sitemap( trim( (string) $sitemap->loc ) );
                }
                break;
            case 'urlset':
                foreach ( $xml->url as $url ) {
                    $this->add_url( trim( (string) $url->loc ) );
                }
                break;
            default:
                break;
        }
    }

    /**
     * Add a URL to our crawl stack
     *
     * @param string $url URL to check
     */
    public function add_url( $url ) {
        if ( ! in_array( $url, $this->_urls ) ) {
            $this->_urls[] = $url;
        }
    }

    /**
     * Run the crawl
     */
    public function run() {
        // All sitemap URLs are inside the $this->_urls array
        // Split our URLs into chunks of 5 URLs to use with curl multi
        $chunks = array_chunk( $this->_urls, 5 );
        foreach ( $chunks as $chunk ) {
            $mh = curl_multi_init();
            $handles = [];
            foreach ( $chunk as $url ) {
                $ua = random_user_agent();
                $ch = curl_init();
                curl_setopt( $ch, CURLOPT_USERAGENT, $ua );
                curl_setopt( $ch, CURLOPT_URL, $url );
                curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
                curl_multi_add_handle( $mh, $ch );
                $handles[] = $ch;
            }

            // Drive every transfer in this chunk to completion
            $active = null;
            do {
                $mrc = curl_multi_exec( $mh, $active );
                if ( $active ) {
                    // Wait for activity on any connection instead of busy-looping
                    curl_multi_select( $mh );
                }
            } while ( $active && CURLM_OK == $mrc );

            // Collect the results; curl_multi_info_read() returns false once the queue is empty
            $failed = false;
            while ( ( $info = curl_multi_info_read( $mh ) ) !== false ) {
                if ( CURLE_OK !== $info['result'] ) {
                    $failed = true;
                }
            }

            // Release the handles before moving on to the next chunk
            foreach ( $handles as $ch ) {
                curl_multi_remove_handle( $mh, $ch );
            }
            curl_multi_close( $mh );

            if ( $failed ) {
                send_alert( 'cURL multi failed.', 'cURL multi failed somewhere.' );
                echo 'cURL multi failed somewhere';
                return;
            }
        }
    }
}
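The crawler calls `random_user_agent()`, which comes from the included random-user-agent.php file and is not reproduced in this post. If you only want to try the crawler out, a simplified stand-in along these lines should be enough. This is a sketch rather than the full generator, and the user agent strings below are illustrative examples you may want to replace with current ones.

```php
<?php
// Simplified stand-in for random-user-agent.php (an assumption, not the full generator).
// The guard avoids a "cannot redeclare" error if the real file is also included.
if (!function_exists('random_user_agent')) {
    function random_user_agent(): string
    {
        // Illustrative desktop browser strings; swap in whatever suits your site
        $agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
            'Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0',
        ];

        // array_rand() returns a random key from the array
        return $agents[array_rand($agents)];
    }
}

echo random_user_agent(), PHP_EOL;
```

Save it under the path used in the `include` line, or paste the function above the crawler class and drop the `include`.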