
Technical Notes Of
Ehi Kioya


PHP Sitemap Crawler For Full Page Cache Warming

November 14, 2019 by Ehi Kioya

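Most webmasters know that caching is important for website speed, and for sites running on WordPress there are lots of plugins dedicated to page caching. But not a lot is said about cache warming (or cache preloading), which is just as important: a page cache only helps after a page has been requested at least once, so until then visitors still get the slow, uncached version.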

The PHP code shared below warms up your website cache by crawling your entire sitemap using cURL. The code can be run from a Linux cron job.

I tested this with Yoast sitemaps. But I’m pretty sure it will work nicely with sitemaps generated by any other plugin as long as the generated sitemap follows standard conventions.
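
For reference, the "standard conventions" here are the sitemaps.org XML format: a sitemap index has a sitemapindex root whose sitemap entries each carry a loc, and a regular sitemap has a urlset root whose url entries each carry a loc. The following is a minimal sketch, separate from the crawler itself, of how SimpleXML can tell the two apart. The function name extract_locs() and the example.com URLs are made up for illustration.

```php
<?php
// Illustrative sketch only: extract_locs() is a hypothetical helper, not part of the crawler.

/**
 * Return the root element name and every <loc> value found in a sitemap document.
 *
 * @param string $xml_string Raw XML of a sitemap or sitemap index
 * @return array [ 'type' => root element name, 'locs' => list of URLs ]
 */
function extract_locs( $xml_string ) {
	$xml  = new SimpleXMLElement( $xml_string );
	$type = $xml->getName();
	$locs = array();

	// A sitemap index lists child sitemaps; a urlset lists pages
	$entries = ( 'sitemapindex' === $type ) ? $xml->sitemap : $xml->url;
	foreach ( $entries as $entry ) {
		// Cast to string to get the text content of <loc>
		$locs[] = trim( (string) $entry->loc );
	}

	return array( 'type' => $type, 'locs' => $locs );
}

// Made-up sitemap index used only to demonstrate the helper
$index = '<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
	<sitemap><loc>https://example.com/page-sitemap.xml</loc></sitemap>
</sitemapindex>';

$result = extract_locs( $index );
echo $result['type'] . ': ' . implode( ', ', $result['locs'] ) . PHP_EOL;
```

If a plugin produces a different root element, the sketch simply returns an empty list of URLs, which is a quick way to spot a non-standard sitemap.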

I have added lots of comments to the code, so make sure to read them when working with it. You will also need to enter correct values for things like file paths, SMTP settings, email address, and sitemap URL.

This sitemap crawler works beautifully with Cloudflare. I have done some extensive testing with great results. If you use the Cloudflare “Cache Everything” page rule together with “Edge Cache TTL”, you can use this code to populate Cloudflare edge caches (POPs) by running it from different geographic locations worldwide.

The subject of how to run this code from different global geographic locations will be discussed in a future blog post.
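
One way to check whether a warmed page is actually being served from the edge is to look at the CF-Cache-Status response header, which Cloudflare sets to values such as HIT or MISS. The sketch below is only an illustration and is not used by the crawler; both function names are hypothetical. The first function parses a raw header block, and the second fetches a URL with cURL and passes its headers to the first.

```php
<?php
// Illustrative sketch only: both function names are hypothetical helpers.

/**
 * Pull the CF-Cache-Status value out of a raw HTTP header block.
 * Returns null when the header is absent (for example, a site not behind Cloudflare).
 */
function parse_cf_cache_status( $raw_headers ) {
	$name = 'cf-cache-status:';
	foreach ( preg_split( '/\r\n|\n/', $raw_headers ) as $line ) {
		// Header names are case-insensitive, so compare without regard to case
		if ( 0 === stripos( $line, $name ) ) {
			return strtoupper( trim( substr( $line, strlen( $name ) ) ) );
		}
	}
	return null;
}

/**
 * Request a URL and report its CF-Cache-Status header (null on failure or if absent).
 */
function fetch_cf_cache_status( $url ) {
	$ch = curl_init( $url );
	curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
	curl_setopt( $ch, CURLOPT_HEADER, true ); // include response headers in the returned string
	$response = curl_exec( $ch );
	if ( false === $response ) {
		return null;
	}
	$header_size = curl_getinfo( $ch, CURLINFO_HEADER_SIZE );
	return parse_cf_cache_status( substr( $response, 0, $header_size ) );
}

// Example (commented out so the sketch makes no network request by itself):
// var_dump( fetch_cf_cache_status( 'https://your-domain.com/' ) );
```

Requesting the same page twice from the same location and seeing the status go from MISS to HIT is a reasonable sign that the warm-up request populated that edge cache.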

Here are some other things you may want to note:

  1. The code uses PHPMailer to send you email notifications if an error occurs. So don’t forget to enter the correct local path to the PHPMailer source code. If you don’t want the email notification features, you can remove the appropriate sections.
  2. This code also uses my random user agent generator. You can alternatively hardcode your own user agent. But to avoid the risk of getting blocked by firewalls, I recommend using my random user agent generator.
  3. I designed this to be run from a Linux cron job, and all my tests were from cron jobs as well. If you plan to run it some other way (for example, through a web request), be wary of PHP timeouts, especially when crawling large websites. With PHP CLI, timeouts are not really a problem because max_execution_time defaults to 0 (no limit) there.
  4. The code uses flock() to prevent starting multiple instances of the same process, which matters if your cron jobs are scheduled close to each other. If a sitemap crawl is already running, a new run does not start a second crawl; it sends the "Unable to obtain lock" email and exits.
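
To see the flock() behaviour from point 4 in isolation, here is a minimal sketch that tries to take the same lock twice. It is not part of the crawler: the helper name acquire_lock() and the temporary lock file name are assumptions made for the demo. On Linux, where flock() locks belong to the open file handle, the second attempt is expected to be refused while the first handle still holds the lock.

```php
<?php
// Illustrative sketch only: acquire_lock() is a hypothetical helper name.

/**
 * Try to take an exclusive, non-blocking lock on a file.
 *
 * @param string $path Lock file path (created if missing)
 * @return resource|false Open handle holding the lock, or false if it could not be taken
 */
function acquire_lock( $path ) {
	// Mode 'c' creates the file if needed but does not truncate it
	$fp = fopen( $path, 'c' );
	if ( false === $fp ) {
		return false;
	}
	// LOCK_NB makes flock() return immediately instead of waiting for the lock
	if ( ! flock( $fp, LOCK_EX | LOCK_NB ) ) {
		fclose( $fp );
		return false;
	}
	return $fp;
}

// Demo lock file in the system temp directory (name chosen for this example)
$lock_file = sys_get_temp_dir() . '/warm-cache-demo-' . getmypid() . '.lock';

$first  = acquire_lock( $lock_file ); // nobody holds the lock yet
$second = acquire_lock( $lock_file ); // expected to be refused while $first holds the lock

$first_ok  = ( false !== $first );
$second_ok = ( false !== $second );
var_dump( $first_ok, $second_ok );

// Releasing the lock lets the next caller in
flock( $first, LOCK_UN );
fclose( $first );
unlink( $lock_file );
```

The lock is released automatically when the process ends, so a crashed crawl does not leave a stale lock behind.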

Ok, enough preamble. Here’s the PHP sitemap crawler code:

<?php

// If you don't need email notifications of crawler errors...
// ...just comment out the PHPMailer references and usage within the code
use PHPMailer\PHPMailer\PHPMailer;
use PHPMailer\PHPMailer\Exception;
require '/your/full/path/phpmailer/src/Exception.php';
require '/your/full/path/phpmailer/src/PHPMailer.php';
require '/your/full/path/phpmailer/src/SMTP.php';

// You can optionally hardcode a single user agent instead of including this file
include("/your/full/path/warm-cache-folder/random-user-agent.php");

try {
	// If you're running this from a cron job, this flock() structure will prevent starting multiple instances of the same process
	// Mode 'c' opens (or creates) the lock file without truncating it
	$fp = fopen('/your/full/path/warm-cache-folder/lock.txt', 'c');
	if($fp === false || !flock($fp, LOCK_EX | LOCK_NB)) {
		$mail = new PHPMailer(true);
		$mail->SMTPDebug = 2;
		$mail->isSMTP();
		$mail->Host = 'smtp.your-host.com'; // Replace with your correct SMTP host
		$mail->SMTPAuth = true;
		$mail->Username = 'your_email@your-domain.com'; // Replace with your correct email address
		$mail->Password = 'your_password';  // Replace with your correct password
		$mail->SMTPSecure = 'tls'; // Enter the correct value
		$mail->Port = 587; // Enter the correct value
		$mail->setFrom('your_email@your-domain.com', 'Your Name');
		$mail->addAddress('your_email@your-domain.com', 'Your Name');
		$mail->isHTML(true);
		$mail->Subject = gethostname() . ' Unable to obtain lock';
		$mail->Body    = 'Unable to obtain lock';
		$mail->send();
		echo 'Unable to obtain lock';
		exit(-1);
	}

	// Log the start time, then crawl the sitemap(s)
	$timezone = new DateTimeZone('America/Toronto'); // Replace with your own TimeZone (optional)
	$date_start = new DateTime();
	$date_start->setTimezone($timezone);
	file_put_contents('/your/full/path/warm-cache-folder/run-history.txt', "Cache warming started at:\t\t\t\t" . $date_start->format('M j, Y - g:i:s a') . PHP_EOL, FILE_APPEND);
	
	$sitemaps = array(
		'https://your-domain.com/sitemap_index.xml', // Enter your sitemap url
	);
	$crawler = new Ehi_Crawler( $sitemaps );
	$crawler->run();
	
	$date_stop = new DateTime();
	$date_stop->setTimezone($timezone);
	$diff = $date_stop->getTimestamp() - $date_start->getTimestamp();
	file_put_contents('/your/full/path/warm-cache-folder/run-history.txt',
						"Cache warming finished/terminated at:\t" . $date_stop->format('M j, Y - g:i:s a') . PHP_EOL . "Total time taken:\t\t\t\t\t\t" . $diff . ' seconds' . PHP_EOL . PHP_EOL,
						FILE_APPEND);
	
	// Release the lock so that the next scheduled run can start
	sleep(5);
	flock($fp, LOCK_UN);
	fclose($fp);
}
catch (\Exception $e) {
	echo 'Caught exception: ',  $e->getMessage(), "\n";
	$mail = new PHPMailer(true);
	$mail->SMTPDebug = 2;
	$mail->isSMTP();
	$mail->Host = 'smtp.your-host.com'; // Replace with your correct SMTP host
	$mail->SMTPAuth = true;
	$mail->Username = 'your_email@your-domain.com'; // Replace with your correct email address
	$mail->Password = 'your_password';  // Replace with your correct password
	$mail->SMTPSecure = 'tls'; // Enter the correct value
	$mail->Port = 587; // Enter the correct value
	$mail->setFrom('your_email@your-domain.com', 'Your Name');
	$mail->addAddress('your_email@your-domain.com', 'Your Name');
	$mail->isHTML(true);
	$mail->Subject = gethostname() . ' Exception';
	$mail->Body    = 'Error Occurred: ' . $e->getMessage();
	$mail->send();
	echo 'Message has been sent';
}

/**
 * Crawler class
 */
class Ehi_Crawler {
	protected $_sitemaps = null;
	protected $_urls = null;
	
	/**
	 * Constructor
	 *
	 * @param array|string $sitemaps A string with the URL of an XML sitemap, or an array of XML sitemap URLs. Sitemap index files work too.
	 *
	 */
	function __construct( $sitemaps = null ) {
		$this->_sitemaps = [];
		$this->_urls = [];
		if ( ! is_null( $sitemaps ) ) {
			if ( ! is_array( $sitemaps ) ) {
				$sitemaps = array( $sitemaps );
			}
			foreach ( $sitemaps as $sitemap ) {
				$this->add_sitemap( $sitemap );
			}
		}
	}
	
	/**
	 * Add a sitemap URL to our crawl stack. Sitemap index files work too.
	 *
	 * @param string $sitemapurl URL of an XML sitemap or sitemap index
	 */
	public function add_sitemap( $sitemapurl ) {
		if ( in_array( $sitemapurl, $this->_sitemaps ) ) {
			return;
		}
		$this->_sitemaps[] = $sitemapurl;
		$ua = random_user_agent();
		$ch = curl_init();
		curl_setopt($ch, CURLOPT_USERAGENT, $ua);
		
		// To prevent grabbing a cached sitemap page, I append the UNIX timestamp as a parameter to the sitemap URL
		// Since this value changes every second, the retrieved sitemap will always be fresh (not cached)
		// If you don't want this, tweak it accordingly
		// Use '&' instead of '?' when the sitemap URL already has a query string
		$separator = ( false === strpos( $sitemapurl, '?' ) ) ? '?' : '&';
		$sitemapurl_nocache = $sitemapurl . $separator . 'current_time=' . time();
		
		curl_setopt( $ch, CURLOPT_URL, $sitemapurl_nocache );
		curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
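		$content = curl_exec( $ch );
		$http_return_code = (int) curl_getinfo( $ch, CURLINFO_HTTP_CODE );
		// curl_exec() returns false on transport errors (DNS failure, timeout, etc.)
		if ( false === $content || 200 !== $http_return_code ) {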
			$mail = new PHPMailer(true);
			$mail->SMTPDebug = 2;
			$mail->isSMTP();
			$mail->Host = 'smtp.your-host.com'; // Replace with your correct SMTP host
			$mail->SMTPAuth = true;
			$mail->Username = 'your_email@your-domain.com'; // Replace with your correct email address
			$mail->Password = 'your_password';  // Replace with your correct password
			$mail->SMTPSecure = 'tls'; // Enter the correct value
			$mail->Port = 587; // Enter the correct value
			$mail->setFrom('your_email@your-domain.com', 'Your Name');
			$mail->addAddress('your_email@your-domain.com', 'Your Name');
			$mail->isHTML(true);
			$mail->Subject = gethostname() . ' Invalid Return Code';
			$mail->Body    = 'Invalid Return Code: ' . $http_return_code;
			$mail->send();
			echo 'Message has been sent';
			return false;
		}
		$xml = new SimpleXMLElement( $content, LIBXML_NOBLANKS );
		if ( ! $xml ) {
			return false;
		}
		switch ( $xml->getName() ) {
			case 'sitemapindex':
				foreach ( $xml->sitemap as $sitemap ) {
					// Cast to string to get the text of <loc>; reset() on objects is deprecated as of PHP 8.1
					$this->add_sitemap( trim( (string) $sitemap->loc ) );
				}
				break;
			case 'urlset':
				foreach ( $xml->url as $url ) {
					$this->add_url( trim( (string) $url->loc ) );
				}
				break;
			
			default:
				break;
		}
	}
	
	/**
	 * Add a URL to our crawl stack
	 *
	 * @param string $url URL to check
	 */
	public function add_url( $url ) {
		if ( ! in_array( $url, $this->_urls ) ) {
			$this->_urls[] = $url;
		}
	}
	
	/**
	 * Run the crawl
	 */
	public function run() {
		// All page URLs collected from the sitemaps are in the $this->_urls array
		
		// Split our URLs into chunks of 5 URLs to use with curl multi
		$chunks =  array_chunk($this->_urls, 5);
		foreach ($chunks as $chunk) {
			$mh = curl_multi_init();
			foreach ($chunk as $url) {
				$ua = random_user_agent();
				$ch = curl_init();
				curl_setopt($ch, CURLOPT_USERAGENT, $ua);
				curl_setopt($ch, CURLOPT_URL, $url);
				curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
				curl_multi_add_handle($mh, $ch);
			}
			// Drive every transfer in this chunk until none are still running
			$active = null;
			do {
				$mrc = curl_multi_exec($mh, $active);
				// Wait for activity instead of spinning; if select fails (-1), pause briefly and retry
				if ($active && CURLM_OK == $mrc && curl_multi_select($mh) == -1) {
					usleep(100000);
				}
			}
			while ($active && CURLM_OK == $mrc);
			// Check the result of each finished transfer and release its handle
			$failed = (CURLM_OK != $mrc);
			while (false !== ($info = curl_multi_info_read($mh))) {
				if (CURLE_OK != $info['result']) {
					$failed = true;
				}
				curl_multi_remove_handle($mh, $info['handle']);
			}
			curl_multi_close($mh);
			if ($failed) {
				$mail = new PHPMailer(true);
				$mail->SMTPDebug = 2;
				$mail->isSMTP();
				$mail->Host = 'smtp.your-host.com'; // Replace with your correct SMTP host
				$mail->SMTPAuth = true;
				$mail->Username = 'your_email@your-domain.com'; // Replace with your correct email address
				$mail->Password = 'your_password';  // Replace with your correct password
				$mail->SMTPSecure = 'tls'; // Enter the correct value
				$mail->Port = 587; // Enter the correct value
				$mail->setFrom('your_email@your-domain.com', 'Your Name');
				$mail->addAddress('your_email@your-domain.com', 'Your Name');
				$mail->isHTML(true);
				$mail->Subject = gethostname() . ' cURL multi failed.';
				$mail->Body    = 'cURL multi failed somewhere.';
				$mail->send();
				echo 'cURL multi failed somewhere';
				return;
			}
		}
	}
}
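
The crawler calls random_user_agent(), which comes from a separate file that is not reproduced in this post. If you just want to try the script, a minimal placeholder along these lines should be enough. It is only a sketch and not the actual random user agent generator; the user agent strings are arbitrary examples, and you can replace the list with whatever suits your site.

```php
<?php
// Hypothetical placeholder for random-user-agent.php.
// This is NOT the actual generator; the user agent strings below are arbitrary examples.

if ( ! function_exists( 'random_user_agent' ) ) {
	/**
	 * Return one user agent string picked at random from a small fixed list.
	 */
	function random_user_agent() {
		$agents = array(
			'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36',
			'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Safari/605.1.15',
			'Mozilla/5.0 (X11; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0',
		);
		// array_rand() returns a random key of the array
		return $agents[ array_rand( $agents ) ];
	}
}

echo random_user_agent() . PHP_EOL;
```

Save it at the path used in the include() line near the top of the script, or hardcode a single user agent as mentioned earlier.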



About Ehi Kioya

I am a Toronto-based Software Engineer. I run this website as part hobby and part business.
