/magento-ce-ee-config-corruption-bug

Explanation and fix for the Magento Enterprise '100 router match iterations' / Community 'no 404 CMS page configured' bug

Primary LanguagePHP

Magento Bug - Corrupted Config Cache

http://ampersandcommerce.com

  1. Preface
  2. Mage_Core_Model_Config and Caching
  3. Symptoms of a Corrupted Config Cache
  4. Debugging the Issue
  5. The Problem & Explanation
  6. Replication
  7. PHPUnit Tests
  8. The Fix

If this document helps you out then please give it a star, thanks!

Update 3: Baked into the core

As of Magento 1.9.3.1 the fix discovered in this write up has been included in the core.

If you still suffering configuration cache corruption you should

  1. Upgrade to 1.9.3.1.
  2. Apply the available patches to your older version of Magento
    1. Apply the fix from this write up (SUPEE-4755).
    2. If you are running php-fpm read the following article and ensure the patch is applied.
  3. If you are still seeing frequent issues try the core modification described here.

Update 2: Further improvements

Since SUPEE-6788 I've been contacted a lot by people who have tried the the Fix described below with mixed results:

  • I applied your patch and it fixed everything thanks!
  • I applied your patch and it didn't work, do you have any other recommendations?
  • I applied your patch, dropped core_config_data, dropped core_url_rewrite, and now my website isn't working. Here is root SSH access please fix.

Anyway, I do have a few further improvements to Mage_Core_Model_Config that I've been using in production for a while now, they seem to be helping. Click through to read more.

  1. Config Cache Corruption Fix - 100 Router Match Iterations - Unreadable local.xml
  2. Config Generation Performance Fix - allowUseCacheForInit

Update: Good news! A patch from Magento

PATCH_SUPEE-4755_EE_1.13.1.0_v1.sh

This patch is the exact same as my fix which gives it some validity.

This patch isn't publicly listed anywhere because Magento don't do that for anything but security patches for some reason. I know the patch file says EE_1.13.1.0, but I tested it on community edition 1.9 and it applied fine.

... I guess that makes me a Magento 1 core contributor now?

Preface

The majority of my experimentation took place on EE 1.13.0.1 and while this bug does also affect CE 1.9.1.0 I will be referring to EE 1.13.0.1 code throughout the explanation.

Mage_Core_Model_Config and Caching

Mage_Core_Model_Config is the class responsible for merging all the system configuration files (config.xml, local.xml) and module configuration files (config.xml) into one object. There are 3 basic parts to the Mage_Core_Model_Config::init() call

  1. Loading the base config (config.xml and local.xml files).
  2. Attempting to load the rest of the config from cache.
  3. (On cache load failure) Load the remainder of the config from xml and the database, then save it into cache.
/**
 * Initialization of core configuration
 *
 * @return Mage_Core_Model_Config
 */
public function init($options=array())
{
    $this->setCacheChecksum(null);
    $this->_cacheLoadedSections = array();
    $this->setOptions($options);
    $this->loadBase();

    $cacheLoad = $this->loadModulesCache();
    if ($cacheLoad) {
        return $this;
    }
    $this->loadModules();
    $this->loadDb();
    $this->saveCache();
    return $this;
}

If at any point something were to silently go wrong within loadModules or loadDb, then corrupted configuration would be saved into cache, meaning that the following request would be served invalid configuration.

Mage_Core_Model_Config also has a protected variable $_useCache. When this flag is set Magento will attempt to load sections of the config from cache storage then persist them within the singleton itself.

Symptoms of a Corrupted Config Cache

The following symptoms would usually manifest when the website is experiencing high load, and very often after a cache flush was triggered. The symptoms persist until you flush the CONFIG cache.

Enterprise Edition

Your website produces nothing but Front controller reached 100 router match iterations reports. (Tested on Magento 1.12 and 1.13)

Community Edition

Your homepage fails to load the CSS, and a message is displayed saying There was no 404 CMS page configured or found.

I have not spent much time debugging the effects on the community edition, there are likely other symptoms. (Tested on Magento 1.9.1.0)

Debugging the Issue

Many sources correctly point out that the problem is caused by some of the routers disappearing from the configuration object, meaning there is no router available to match the request.

This error only occurs when the routers configuration was loaded from cache. To stop the bug from bringing down the website and to aid my debugging I overrode Mage_Core_Model_Cache::save() such that it would do some quick data validation, and prevent corrupted data being saved.

/**
 * Save data
 *
 * @param string $data
 * @param string $id
 * @param array $tags
 * @param int $lifeTime
 * @return bool
 */
public function save($data, $id, $tags = array(), $lifeTime = null)
{
    //Start patch for 100 routers problem
    if (strpos($id,'config_global_stores_') !== false) {
        $xml = new SimpleXMLElement($data);
        $xmlPath = $xml->xpath('web/routers/standard');
        if (count($xmlPath) != 1) {
            $e = new Exception();
            Mage::log($e->getTraceAsString(), null, 'corrupt-cache.log', true);
            //Returning true to prevent it from saving an incomplete cache entry
            return true;
        }
    }
    //End patch

    if ($this->_disallowSave) {
        return true;
    }

    /**
     * Add global magento cache tag to all cached data exclude config cache
     */
    if (!in_array(Mage_Core_Model_Config::CACHE_TAG, $tags)) {
        $tags[] = Mage_Core_Model_App::CACHE_TAG;
    }
    return $this->getFrontend()->save((string)$data, $this->_id($id), $this->_tags($tags), $lifeTime);
}

This code change did not 'solve' the issue, but it did stop corrupted configuration being saved and prevented the website crashing so much.

It was also useful as a point of debugging, as I now had a place from which I could monitor and log the issue. The stack trace logged here was vital in learning what was causing the issue.

The Problem

By using apache bench to stress my Magento instance along with a lot of file_put_contents debugging I was able to discover that the invalid configuration was generated in the loadDb method of Mage_Core_Model_Config, but only under the following conditions

  1. Mage_Core_Model_Config::init() has been called on the singleton twice.
  2. The first call to loadModulesCache must successfully load from cache and set $_useCache = true
  3. The second call must fail to retrieve the config from cache, and proceed to incorrectly rebuild the cache because $_useCache = true is still set.

Explanation

To understand how we get this flow we'll have to revisit Mage_Core_Model_Config::init() and a few other functions.

/**
 * Initialization of core configuration
 *
 * @return Mage_Core_Model_Config
 */
public function init($options=array())
{
    $this->setCacheChecksum(null);
    $this->_cacheLoadedSections = array();
    $this->setOptions($options);
    $this->loadBase();

    $cacheLoad = $this->loadModulesCache();
    if ($cacheLoad) {
        return $this;
    }
    $this->loadModules();
    $this->loadDb();
    $this->saveCache();
    return $this;
}

We're interested in $cacheLoad = $this->loadMoudulesCache(), the first call successfully retrieved something that resolved to true while the second call received something that resolved to false.

Digging deeper into the code.

/**
 * Load cached modules configuration
 *
 * @return bool
 */
public function loadModulesCache()
{
    if (Mage::isInstalled(array('etc_dir' => $this->getOptions()->getEtcDir()))) {
        if ($this->_canUseCacheForInit()) {
            Varien_Profiler::start('mage::app::init::config::load_cache');
            $loaded = $this->loadCache();
            Varien_Profiler::stop('mage::app::init::config::load_cache');
            if ($loaded) {
                $this->_useCache = true;
                return true;
            }
        }
    }
    return false;
}

loadModulesCache attempts to load the configuration from cache, if it is loaded it sets $_useCache = true and returns true so that we do not continue to regenerate the cache in the init method. The main points for this call failing would be that loadCache or _canUseCacheForInit returns false.

Again, digging deeper into the code.

/**
 * Check if cache can be used for config initialization
 *
 * @return bool
 */
protected function _canUseCacheForInit()
{
    return Mage::app()->useCache('config') && $this->_allowCacheForInit
        && !$this->_loadCache($this->_getCacheLockId());
}

_canUseCacheForInit ensures the cache is enabled and that it is not locked. For some reason Magento actually uses the cache to lock itself $this->_loadCache($this->_getCacheLockId()).

The problem in our case was that on the second run of Mage_Core_Model_Config::init() we were failing the _canUseCacheForInit call because the cache was locked. This meant that we would proceed to regenerate and save the CONFIG cache while the singletons $_useCache was erroneously still set to true.

Cache Lock Generation

As far as I am aware the cache is locked only within the saveCache call of the Mage_Core_Model_Config singleton. The cache is locked after the configuration has been generated and before the calls to _saveCache which will save the config for config_global, config_websites and config_stores_{stores}. The cache lock is removed when all the configuration has been saved in cache.

/**
 * Save configuration cache
 *
 * @param   array $tags cache tags
 * @return  Mage_Core_Model_Config
 */
public function saveCache($tags=array())
{
    if (!Mage::app()->useCache('config')) {
        return $this;
    }
    if (!in_array(self::CACHE_TAG, $tags)) {
        $tags[] = self::CACHE_TAG;
    }
    $cacheLockId = $this->_getCacheLockId();
    if ($this->_loadCache($cacheLockId)) {
        return $this;
    }

    if (!empty($this->_cacheSections)) {
        $xml = clone $this->_xml;
        foreach ($this->_cacheSections as $sectionName => $level) {
            $this->_saveSectionCache($this->getCacheId(), $sectionName, $xml, $level, $tags);
            unset($xml->$sectionName);
        }
        $this->_cachePartsForSave[$this->getCacheId()] = $xml->asNiceXml('', false);
    } else {
        return parent::saveCache($tags);
    }

    $this->_saveCache(time(), $cacheLockId, array(), 60);
    $this->removeCache();
    foreach ($this->_cachePartsForSave as $cacheId => $cacheData) {
        $this->_saveCache($cacheData, $cacheId, $tags, $this->getCacheLifetime());
    }
    unset($this->_cachePartsForSave);
    $this->_removeCache($cacheLockId);
    return $this;
}

Step-By-Step

What was happening was likely due to the shared cache storage between multiple servers, but could also have been caused by multiple processes running on one server.

Here's a step-by-step of what was happening in our instance, we had a cronjob which was calling Mage::app()->init() multiple times which accounted for the repeated initialisation of Mage_Core_Model_Config.

Time Process 1 Process 2 Shared Cache Lock
1 init() - loadModulesCache succeeds and sets $_useCache = true. -
2 Some code is executed init() - loadModulesCache fails because someone has hit Flush Cache in the admin panel
3 Some code is executed init() - loadModules, loadDb work as expected.
4 Some code calls Mage::app()->init() init() - saveCache is initiated
5 Mage::app()->init() calls Mage_Core_Model_Config::init() saveCache() - sets the cache lock LOCKED
6 init() - loadModulesCache fails as cache lock is present saveCache() - saves config to cache LOCKED
7 init() - calls loadModules and loadDb with $_useCache = true, generated invalid config saveCache() - removes the cache lock, init completes.
7 init() - calls saveCache with incorrectly generated data, causing the errors. Some code is executed

Replication

If you have a look at 100-router-script.php you can see a simple script which should allow you to reproduce the bug on a Magento instance. Simply download it to the root of your Magento instance and run it. Alternatively you can specify the location of Mage.php using an environment variable.

MAGE_PATH="/path/to/magento/app/Mage.php" php 100-router-script.php

I was unable to easily reproduce the time sensitive cache hit on global_config.lock, however I was able to emulate it by making loadModulesCache fail to load config_global on the second call.

PHPUnit Tests

The phpunit tests simulate the time sensitive cache hit on global_config.lock in two different ways:

  • By using Convenient_Core_Model_Config, a custom configuration model which returns true for the second call to loadCache for the config_global.lock entry.
  • By removing config_global from the cache as in the 100-router-script.php file.

These tests enable the CONFIG cache.

Run the unit tests by specifying an environment variable pointing your Magento installation.

git clone https://github.com/AmpersandHQ/magento-ce-ee-config-corruption-bug.git
composer install
MAGE_PATH="/path/to/magento/app/Mage.php" vendor/bin/phpunit

On an unpatched instance you should see the following output

PHPUnit 4.6.4 by Sebastian Bergmann and contributors.

Configuration read from /lukerodgers.co.uk/magento-ce-ee-config-corruption-bug/phpunit.xml

FF

Time: 2.53 seconds, Memory: 16.25Mb

There were 2 failures:

1) ConfigurationTest::reinitWithAlternativeConfigModel
Failed asserting that two strings are equal.

/lukerodgers.co.uk/magento-ce-ee-config-corruption-bug/tests/ConfigurationTest.php:78

2) ConfigurationTest::reinitMissingCacheEntry
Failed asserting that two strings are equal.

/lukerodgers.co.uk/magento-ce-ee-config-corruption-bug/tests/ConfigurationTest.php:121

On a patched instance you should see:

PHPUnit 4.6.4 by Sebastian Bergmann and contributors.

Configuration read from /lukerodgers.co.uk/magento-ce-ee-config-corruption-bug/phpunit.xml

..

Time: 2.87 seconds, Memory: 16.50Mb

OK (2 tests, 2 assertions)

The Fix

2 weeks of work and all this for a 1 line fix.

I overrode Mage_Core_Model_Config and forced $_useCache = false when regenerating the config. This fix is validated by the replication script/phpunit tests above, and completely stopped this bug from occurring in our instances.

/**
 * Initialization of core configuration
 *
 * @return Mage_Core_Model_Config
 */
public function init($options=array())
{
    $this->setCacheChecksum(null);
    $this->_cacheLoadedSections = array();
    $this->setOptions($options);
    $this->loadBase();

    $cacheLoad = $this->loadModulesCache();
    if ($cacheLoad) {
        return $this;
    }
    //100 Router Fix Start
    $this->_useCache = false;
    //100 Router Fix End
    $this->loadModules();
    $this->loadDb();
    $this->saveCache();
    return $this;
}

It would be naive of me to believe that this will completely solve the issue for everyones Magento setup. If it does not work for you then I recommend you

  1. Implement the logic mentioned in the Debugging the Issue section.
  2. Wait for some log data to appear.
  3. Contact me!

Performance

I do not believe that this fix will affect performance in any negative way as the usual flow for loadModules, loadDb and saveCache is for $_useCache to be false.