[SPIP] 429 error message with crawlers like Google

One of our clients runs a SPIP website and began losing pages from Google’s index. Google Search Console reported 500 errors on many pages, even though those pages displayed normally for visitors. After investigating the logs and simulating a Googlebot user agent, I found that SPIP was sending HTTP error 429 (Too Many Requests) to Googlebot and other search engine crawlers. The cause is that SPIP checks the server load average to decide whether to let bots crawl, and shared hosting servers, despite their very powerful CPUs, usually run at a high load value.

spip_directory/config/ecran_securite.php

if (
    defined('_ECRAN_SECURITE_LOAD')
    and _ECRAN_SECURITE_LOAD > 0
    and _IS_BOT
    and !_IS_BOT_FRIEND
    and $_SERVER['REQUEST_METHOD'] === 'GET'
    and (
        (function_exists('sys_getloadavg')
            and $load = sys_getloadavg()
            and is_array($load)
            and $load = array_shift($load))
        or
        (@is_readable('/proc/loadavg')
            and $load = file_get_contents('/proc/loadavg')
            and $load = floatval($load))
    )
    and $load > _ECRAN_SECURITE_LOAD // skip the next evaluation if the load is below the limit anyway
    and rand(0, (int) ($load * $load)) > _ECRAN_SECURITE_LOAD * _ECRAN_SECURITE_LOAD
) {
    //https://webmasters.stackexchange.com/questions/65674/should-i-return-a-429-or-503-status-code-to-a-bot
    header("HTTP/1.0 429 Too Many Requests");
    header("Retry-After: 300");
    header("Expires: Wed, 11 Jan 1984 05:00:00 GMT");
    header("Cache-Control: no-cache, must-revalidate");
    header("Pragma: no-cache");
    header("Content-Type: text/html");
    header("Connection: close");
    die("<html><title>Status 429: Too Many Requests</title><body><h1>Status 429</h1><p>Too Many Requests (try again soon)</p></body></html>");
}
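To see what load value your own host reports, a small script can read it from the same two sources as the check above. This is only a sketch: `currentLoad()` is a hypothetical helper name, not part of SPIP, and it assumes a Unix-like server where `sys_getloadavg()` or `/proc/loadavg` is available.

```php
<?php
// Hypothetical helper (not part of SPIP): returns the 1-minute load
// average using the same two sources as the check above, or null
// when neither source is available.
function currentLoad(): ?float
{
    if (function_exists('sys_getloadavg')) {
        $load = sys_getloadavg();
        if (is_array($load) && isset($load[0])) {
            return (float) $load[0]; // first entry is the 1-minute average
        }
    }
    if (@is_readable('/proc/loadavg')) {
        $raw = file_get_contents('/proc/loadavg');
        if ($raw !== false) {
            return floatval($raw); // the file starts with the 1-minute average
        }
    }
    return null;
}

$load = currentLoad();
echo $load === null
    ? "load average unavailable\n"
    : sprintf("1-minute load: %.2f\n", $load);
```

Running it a few times during the day on the hosting account gives a rough idea of how often the reported load exceeds the configured limit.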

The default threshold in SPIP is 4, which is low for shared hosting:

if (!defined('_ECRAN_SECURITE_LOAD')) {
    define('_ECRAN_SECURITE_LOAD', 4);
}
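The check is probabilistic: once the load exceeds the threshold, a bot request is rejected only when `rand(0, (int) ($load * $load))` draws a value above the squared threshold. The sketch below estimates that rejection rate. `throttleProbability()` is a hypothetical helper written for illustration, and it assumes `rand()` draws uniformly over its inclusive range.

```php
<?php
// Hypothetical helper (not part of SPIP): estimates the share of
// non-friend bot GET requests that would receive a 429 at a given load.
function throttleProbability(float $load, float $threshold = 4.0): float
{
    // A disabled check (threshold <= 0) or a load at or below the
    // threshold never triggers a 429.
    if ($threshold <= 0 || $load <= $threshold) {
        return 0.0;
    }
    $max = (int) ($load * $load);      // upper bound passed to rand()
    $limit = $threshold * $threshold;  // value the draw must exceed
    // rand(0, $max) picks one of $max + 1 integers; those strictly
    // greater than $limit lead to the 429 response.
    $hits = $max - (int) floor($limit);
    return max(0, $hits) / ($max + 1);
}

printf("%.2f\n", throttleProbability(5.0));  // prints 0.35
printf("%.2f\n", throttleProbability(8.0));  // prints 0.74
printf("%.2f\n", throttleProbability(16.0)); // prints 0.93
```

With the default threshold of 4, a load of 8 already turns away roughly three out of four crawler requests, which is consistent with pages dropping out of the index.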

There are several ways to work around this issue. You can modify the previous file or, preferably, create a new configuration file:

config/ecran_securite_options.php

In this file, you can either:
1- Raise the _ECRAN_SECURITE_LOAD threshold, or set it to 0 to disable the check:

<?php
// 0 disables the load check; use a value above the default of 4 to raise the threshold instead.
define('_ECRAN_SECURITE_LOAD', 0);

2- Or treat specific user agents, such as Google’s crawlers, as friendly bots (e.g. Googlebot, Googlebot-News, Googlebot-Image, Googlebot-Video, Storebot-Google, GoogleOther, Google-Extended, etc.):

<?php
// Fall back to an empty string when the request carries no User-Agent header.
if (preg_match('/google/i', (string) ($_SERVER['HTTP_USER_AGENT'] ?? ''))) {
    define('_IS_BOT_FRIEND', true);
}
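To check how the site answers a crawler, you can request a page with a Googlebot user agent and look at the status code. The following is a minimal sketch: `statusForUserAgent()` is a hypothetical helper, it assumes the PHP cURL extension is installed, and the URL is whatever page you pass on the command line. Because the 429 only appears while the server load is above the threshold, and then only for a share of requests, a single 200 response does not prove much; repeat the request several times under load.

```php
<?php
// Hypothetical diagnostic helper (not part of SPIP): sends a GET request
// with the given user agent and returns the HTTP status code
// (0 if the connection failed).
function statusForUserAgent(string $url, string $userAgent): int
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_HTTPGET        => true, // the SPIP check only applies to GET
        CURLOPT_RETURNTRANSFER => true, // do not print the page body
        CURLOPT_TIMEOUT        => 10,
    ]);
    curl_exec($ch);
    return (int) curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
}

// Usage from the command line: php check.php <url>
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    $googlebot = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
    echo statusForUserAgent($argv[1], $googlebot), "\n";
}
```

Comparing the result with the same request sent under a regular browser user agent shows whether only crawlers are being throttled.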
