pbauer

Obstacles on the road towards Plone 2020

A "short" story about debugging issues with test isolation

During the sprint at the Plone Conference 2017 in Barcelona, Plone achieved a major milestone towards what is often called "Plone 2020". This is basically the effort to modernize Plone's backend and achieve Python 3 compatibility. In 2020, support for Python 2.7 will officially end, hence Plone 2020.

A necessary part of that effort was to migrate Zope to Python 3, a daunting task that was only possible by a flurry of activity that combined the efforts of many stakeholders (not only the Plone Community). Learn more about that in Hanno Schlichting's talk once the video is on the website, and on many blog posts on the Gocept Blog.

Getting Plone to run on that newest version of Zope (currently Zope 4.0b2) was another story and took a lot of work (some details are in my post here. Finally in Barcelona, in a daring move we merged all the work that had been done for that PLIP https://github.com/plone/Products.CMFPlone/issues/1351 and decided that the result will be called Plone 5.2. But by that time not all tests were green (that's why it was daring). We worked hard to get the tests to pass and to fix some issues we found when testing manually.

By the way: At the same sprint we started to prepare Plone itself for Python 3 by fixing all imports to work in both Python 2 and Python 3. But that is a tale for another blog post.

So, despite out best efforts, even one week after the conference I was not yet able to fix all the tests, and so I created at ticket to track the remaining issues.

Here this story about two erroring tests in Products.CMFFormController actually begins. Here is the spoiler: I did not really solve the issue but finally worked around it. But I still think the approach I took might be of interest to some.

The two breaking tests, test_attacker_redirect and test_regression, were passing when I ran them in isolation or when I ran all test of Products.CMFFormController with ./bin/test -s Products.CMFFormController. To add insult to injury, Products.CMFFormController is basically dead code but is still used by some of our legacy ControllerPageTemplates.

So how could I find the issue since the traceback was not really helpful?

Here is the relevant part of the log from jenkins:

#### Running tests for group Archetypes ####
Running Products.Archetypes.tests.attestcase.Archetypes:Functional tests:

[...]

Running plone.app.testing.bbb.PloneTestCase:Functional tests:
  Tear down Testing.ZopeTestCase.layer.ZopeLite in 0.000 seconds.
  Set up plone.testing.zca.LayerCleanup in 0.000 seconds.
  Set up plone.testing.z2.Startup in 0.101 seconds.
  Set up plone.app.testing.layers.PloneFixture in 9.722 seconds.
  Set up plone.app.testing.bbb.PloneTestCaseFixture in 2.628 seconds.
  Set up plone.app.testing.bbb.PloneTestCase:Functional in 0.000 seconds.


Error in test test_attacker_redirect (Products.CMFFormController.tests.testRedirectTo.TestRedirectToFunctional)
Traceback (most recent call last):
  File "/usr/lib/python2.7/unittest/case.py", line 329, in run
    testMethod()
  File "/home/jenkins/workspace/plone-5.2-python-2.7-at/src/Products.CMFFormController/Products/CMFFormController/tests/testRedirectTo.py", line 97, in test_attacker_redirect
    handle_errors=False,
  File "/home/jenkins/workspace/plone-5.2-python-2.7-at/src/Zope/src/Testing/ZopeTestCase/functional.py", line 43, in wrapped_func
    return func(*args, **kw)
  File "/home/jenkins/workspace/plone-5.2-python-2.7-at/src/Zope/src/Testing/ZopeTestCase/functional.py", line 127, in publish
    wsgi_result = publish(env, start_response)
  File "/home/jenkins/workspace/plone-5.2-python-2.7-at/src/Zope/src/ZPublisher/WSGIPublisher.py", line 254, in publish_module
    with load_app(module_info) as new_mod_info:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/jenkins/workspace/plone-5.2-python-2.7-at/src/Zope/src/Testing/ZopeTestCase/sandbox.py", line 73, in load_app
    with ZPublisher.WSGIPublisher.__old_load_app__(module_info) as ret:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/jenkins/workspace/plone-5.2-python-2.7-at/src/Zope/src/ZPublisher/WSGIPublisher.py", line 220, in load_app
    app = app_wrapper()
  File "/home/jenkins/workspace/plone-5.2-python-2.7-at/src/Zope/src/App/ZApplication.py", line 78, in __call__
    return connection.root()[self._name]
  File "/home/jenkins/shiningpanda/jobs/2fa08faf/virtualenvs/d41d8cd9/lib/python2.7/UserDict.py", line 40, in __getitem__
    raise KeyError(key)
KeyError: 'Application'



Error in test test_regression (Products.CMFFormController.tests.testRedirectTo.TestRedirectToFunctional)
Traceback (most recent call last):

[...]

    raise KeyError(key)
KeyError: 'Application'

  Ran 68 tests with 0 failures, 2 errors and 0 skipped in 1.626 seconds.
Running plone.app.folder.tests.layer.plone.app.folder testing:Integration tests:
  Set up plone.app.folder.tests.layer.IntegrationFixture in 0.027 seconds.
  Set up plone.app.folder.tests.layer.plone.app.folder testing:Integration in 0.000 seconds.
  Ran 27 tests with 0 failures, 0 errors and 0 skipped in 9.033 seconds.

[...]

Tearing down left over layers:
  Tear down zope.testrunner.layer.UnitTests in 0.000 seconds.
Total: 733 tests, 0 failures, 2 errors and 0 skipped in 3 minutes 10.739 seconds.
#### Finished tests for group Archetypes ####

What? Why does connection.root() have no Application? This makes no sense to me, and a pdb there did not help to shed light on it at all.

First I reproduced the error by testing all packages in the test group Archetypes (where the error occurs):

./bin/test \
-s Products.Archetypes \
-s Products.CMFFormController \
-s Products.MimetypesRegistry \
-s Products.PortalTransforms \
-s Products.statusmessages \
-s Products.validation \
-s plone.app.folder

Then I only used the test layers that actually got set up according to the output:

./bin/test --layer Products.Archetypes.tests.attestcase.Archetypes \
--layer Products.PortalTransforms.testing.PortalTransformsLayer \
--layer Testing.ZopeTestCase.layer.ZopeLite \
--layer plone.app.testing.bbb.PloneTestCase \
-s Products.Archetypes \
-s Products.CMFFormController \
-s Products.MimetypesRegistry \
-s Products.PortalTransforms \
-s Products.statusmessages \
-s Products.validation \
-s plone.app.folder

That worked, I see the error. But I will not try to read 733 tests and wait for more than 3 minutes each time I think I may have fixed something!

Thus I used the divide-and-conquer strategy to figure out which combination produced the failing tests: remove half of the packages layers and see if it still fails. If they pass, try the other half. Do the same with the layers.

Remember to keep --layer plone.app.testing.bbb.PloneTestCase and -s Products.CMFFormController in order not to skip the tests that expose the issue.

It turned out that the following combination reproduced the issue:

./bin/test \
--layer Products.Archetypes.tests.attestcase.Archetypes \
--layer Testing.ZopeTestCase.layer.ZopeLite \
--layer plone.app.testing.bbb.PloneTestCase \
-s Products.Archetypes \
-s Products.CMFFormController

Still way too many tests to have a look, most of them in Products.Archetypes. So I removed (actually, moved the .py files to some temp folder) all python tests and kept the doctests (and their setup). The only reason was that I hate doctests and consequently it must be a doctest that created trouble. I was right.

So I kept only one doctest that produced the issue by commenting out the others in test_doctest.py of Products.Archetypes.

Now I needed to find a combination of three tests from these layers that still exposed the issue. To to that, I added the option -vv to the testrunner to see the names and python path of all tests that still ran.

./bin/test --layer Products.Archetypes.tests.attestcase.Archetypes --layer Testing.ZopeTestCase.layer.ZopeLite --layer plone.app.testing.bbb.PloneTestCase -s Products.Archetypes -s Products.CMFFormController -vv
Running tests at level 1
Running Products.Archetypes.tests.attestcase.Archetypes:Functional tests:
  Set up plone.testing.zca.LayerCleanup in 0.000 seconds.
  Set up plone.testing.z2.Startup in 0.157 seconds.
  Set up plone.app.testing.layers.PloneFixture in 10.252 seconds.
  Set up plone.app.testing.bbb.PloneTestCaseFixture in 1.871 seconds.
  Set up Products.Archetypes.tests.attestcase.ATTestCaseFixture in 0.647 seconds.
  Set up Products.Archetypes.tests.attestcase.Archetypes:Functional in 0.000 seconds.
  Running:
    1/1 (100.0%) /Users/pbauer/workspace/coredev/src/Products.Archetypes/Products/Archetypes/tests/traversal_4981.txt

  Ran 1 tests with 0 failures, 0 errors, 0 skipped in 0.269 seconds.
Running Testing.ZopeTestCase.layer.ZopeLite tests:
  Tear down Products.Archetypes.tests.attestcase.Archetypes:Functional in 0.000 seconds.
  Tear down Products.Archetypes.tests.attestcase.ATTestCaseFixture in 0.010 seconds.
  Tear down plone.app.testing.bbb.PloneTestCaseFixture in 0.009 seconds.
  Tear down plone.app.testing.layers.PloneFixture in 0.065 seconds.
  Tear down plone.testing.z2.Startup in 0.004 seconds.
  Tear down plone.testing.zca.LayerCleanup in 0.001 seconds.
  Set up Testing.ZopeTestCase.layer.ZopeLite in 0.009 seconds.
  Running:
    1/5 (20.0%) test_parseXML_empty (Products.CMFFormController.tests.test_exportimport.CMFFormControllerImportConfiguratorTests)
    2/5 (40.0%) test_parseXML_with_info (Products.CMFFormController.tests.test_exportimport.CMFFormControllerImportConfiguratorTests)
    3/5 (60.0%) test_action_not_unicode (Products.CMFFormController.tests.test_exportimport.Test_importCMFFormController)
    4/5 (80.0%) test_normal (Products.CMFFormController.tests.test_exportimport.Test_importCMFFormController)
    5/5 (100.0%) test_partial (Products.CMFFormController.tests.test_exportimport.Test_importCMFFormController)

  Ran 5 tests with 0 failures, 0 errors, 0 skipped in 0.023 seconds.
Running plone.app.testing.bbb.PloneTestCase:Functional tests:
  Tear down Testing.ZopeTestCase.layer.ZopeLite in 0.000 seconds.
  Set up plone.testing.zca.LayerCleanup in 0.000 seconds.
  Set up plone.testing.z2.Startup in 0.092 seconds.
  Set up plone.app.testing.layers.PloneFixture in 7.227 seconds.
  Set up plone.app.testing.bbb.PloneTestCaseFixture in 2.087 seconds.
  Set up plone.app.testing.bbb.PloneTestCase:Functional in 0.000 seconds.
  Running:
    1/4 (25.0%) testCopy (Products.CMFFormController.tests.testCopyRename.TestCopyRename)
    2/4 (50.0%) testRename (Products.CMFFormController.tests.testCopyRename.TestCopyRename)
    3/4 (75.0%) test_attacker_redirect (Products.CMFFormController.tests.testRedirectTo.TestRedirectToFunctional)


Error in test test_attacker_redirect (Products.CMFFormController.tests.testRedirectTo.TestRedirectToFunctional)
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.13_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/case.py", line 329, in run
    testMethod()
  File "/Users/pbauer/workspace/coredev/src/Products.CMFFormController/Products/CMFFormController/tests/testRedirectTo.py", line 97, in test_attacker_redirect
    handle_errors=False,
  File "/Users/pbauer/workspace/coredev/src/Zope/src/Testing/ZopeTestCase/functional.py", line 43, in wrapped_func
    return func(*args, **kw)
  File "/Users/pbauer/workspace/coredev/src/Zope/src/Testing/ZopeTestCase/functional.py", line 127, in publish
    wsgi_result = publish(env, start_response)
  File "/Users/pbauer/workspace/coredev/src/Zope/src/ZPublisher/WSGIPublisher.py", line 254, in publish_module
    with load_app(module_info) as new_mod_info:
  File "/usr/local/Cellar/python/2.7.13_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/Users/pbauer/workspace/coredev/src/Zope/src/Testing/ZopeTestCase/sandbox.py", line 73, in load_app
    with ZPublisher.WSGIPublisher.__old_load_app__(module_info) as ret:
  File "/usr/local/Cellar/python/2.7.13_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/Users/pbauer/workspace/coredev/src/Zope/src/ZPublisher/WSGIPublisher.py", line 220, in load_app
    app = app_wrapper()
  File "/Users/pbauer/workspace/coredev/src/Zope/src/App/ZApplication.py", line 78, in __call__
    return connection.root()[self._name]
  File "/Users/pbauer/workspace/coredev/bin/../lib/python2.7/UserDict.py", line 40, in __getitem__
    raise KeyError(key)
KeyError: 'Application'

    4/4 (100.0%) test_regression (Products.CMFFormController.tests.testRedirectTo.TestRedirectToFunctional)


Error in test test_regression (Products.CMFFormController.tests.testRedirectTo.TestRedirectToFunctional)
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.13_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/case.py", line 329, in run
    testMethod()
  File "/Users/pbauer/workspace/coredev/src/Products.CMFFormController/Products/CMFFormController/tests/testRedirectTo.py", line 71, in test_regression
    handle_errors=False,
  File "/Users/pbauer/workspace/coredev/src/Zope/src/Testing/ZopeTestCase/functional.py", line 43, in wrapped_func
    return func(*args, **kw)
  File "/Users/pbauer/workspace/coredev/src/Zope/src/Testing/ZopeTestCase/functional.py", line 127, in publish
    wsgi_result = publish(env, start_response)
  File "/Users/pbauer/workspace/coredev/src/Zope/src/ZPublisher/WSGIPublisher.py", line 254, in publish_module
    with load_app(module_info) as new_mod_info:
  File "/usr/local/Cellar/python/2.7.13_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/Users/pbauer/workspace/coredev/src/Zope/src/Testing/ZopeTestCase/sandbox.py", line 73, in load_app
    with ZPublisher.WSGIPublisher.__old_load_app__(module_info) as ret:
  File "/usr/local/Cellar/python/2.7.13_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/Users/pbauer/workspace/coredev/src/Zope/src/ZPublisher/WSGIPublisher.py", line 220, in load_app
    app = app_wrapper()
  File "/Users/pbauer/workspace/coredev/src/Zope/src/App/ZApplication.py", line 78, in __call__
    return connection.root()[self._name]
  File "/Users/pbauer/workspace/coredev/bin/../lib/python2.7/UserDict.py", line 40, in __getitem__
    raise KeyError(key)
KeyError: 'Application'


  Ran 4 tests with 0 failures, 2 errors, 0 skipped in 0.403 seconds.
Tearing down left over layers:
  Tear down plone.app.testing.bbb.PloneTestCase:Functional in 0.000 seconds.
  Tear down plone.app.testing.bbb.PloneTestCaseFixture in 0.010 seconds.
  Tear down plone.app.testing.layers.PloneFixture in 0.068 seconds.
  Tear down plone.testing.z2.Startup in 0.007 seconds.
  Tear down plone.testing.zca.LayerCleanup in 0.001 seconds.

Tests with errors:
   test_attacker_redirect (Products.CMFFormController.tests.testRedirectTo.TestRedirectToFunctional)
   test_regression (Products.CMFFormController.tests.testRedirectTo.TestRedirectToFunctional)
Total: 10 tests, 0 failures, 2 errors, 0 skipped in 24.082 seconds.

24 seconds? I can work with that.

Still, I removed tests from each layer until I only had three tests left and reverted my changes to Products.Archetypes.

The result is the following:

./bin/test \
--layer Products.Archetypes.tests.attestcase.Archetypes \
--layer Testing.ZopeTestCase.layer.ZopeLite \
--layer plone.app.testing.bbb.PloneTestCase \
-s Products.Archetypes \
-s Products.CMFFormController \
-t test_parseXML_empty \
-t traversal_4981 \
-t test_attacker_redirect \
-vv

Since more than one test still exposed the issue, I kept only very simple ones because I guessed that the issue is actually in the setup or teardown.

So next I changed the test test_parseXML_empty to a simple return. The error is still there. Trying the same with traversal_4981 makes it go away.

At this point I could skip reducing the layers since I only run three tests from two packages.

It was time to actually read what the remaining tests are doing. I stripped down all tests and their setup to the base minimum that still breaks the test run and could not find anything. I turn edCMFFormControllerImportConfiguratorTests into a ZopeTestCase and a PloneTestCase and realized that the error disappears when it is a PloneTestCase. Bad. Migrating the whole test to PloneTestCase or plone.app.testing would be a lot of work since CMFFormControllerImportConfiguratorTests inherits from Products.GenericSetup.tests.common.BaseRegistryTests and does a lot of additional magic.

So the test layers for the two tests that did not fail or error by themselves but triggered the issue in the failing tests (traversal_4981 and test_parseXML_empty) seemed to be out of the scope of what I could do so I took a closer look at the failing tests themselves. I quickly found that I hate them but what they do is actually quite simple. Why do I hate them? Because they use the publish method of ZopeTestCase.Functional. That method (and its evil doctest-cousin Testing.ZopeTestCase.zopedoctest.functional.http) are way too clever helper methods that make things harder, not easier. I prefer to use restrictedTraverse or the testbrowser any time since both are much closer to what actually happens in the application.

This was the moment when I decided to migrate the tests in question to proper plone.app.testing tests. It took me about 1 hour to create a pull-request which resolves the issue. The rest of the day was spent on a fruitless attempt to find the issue that must still be lurking somewhere between the three tests and their layers.

I hope that monster will never rear its ugly head again until CMFFormController is finally removed from the coredev. The PLIP 2092 by @esteele and me will remove the last remaining ControllerPageTemplates but there are some more left in Archetypes.

I fear it will be quite some time until all ZopeTestCase and PloneTestCase tests are migrated to plone.app.testing. The remaining happy thought is that many will not need to be migrated since they are part of Archetypes and will go awaaaaay with it.