Proxmox quality assurance

A glimpse at Proxmox Quality Assurance

December 26, 2024

This post follows up on the previous finding that there is no difference in the eventual content of the no-subscription and test software repositories as publicly made available by Proxmox.

Routine

Every software house has some sort of testing routine (QA) to ensure that obviously bad versions of their packages never reach their users.

It starts with rudimentary unit tests that a developer is supposed to write and ship alongside their newly written code. These also help catch regressions, i.e. unintended bugs that cause previously dependable features to stop working as they did before. Without them, an individual developer would typically test only the part they were newly implementing.
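To make the idea concrete, here is a minimal sketch of what a unit test with a regression case can look like. It is written in Python purely for illustration and is not taken from any Proxmox package: the function `crm_should_stay_active`, its parameters and the `IDLE_TIMEOUT_S` threshold are hypothetical, loosely inspired by the kind of idle-timeout rule that shows up in the commit log further down.

```python
import unittest

# Hypothetical threshold: go idle after ~15 minutes without any work.
IDLE_TIMEOUT_S = 15 * 60


def crm_should_stay_active(configured_services, pending_commands, idle_seconds):
    """Hypothetical decision function, NOT actual Proxmox code.

    Stay active while there is anything to manage or any command queued;
    otherwise stay active only until the idle timeout has elapsed.
    """
    if configured_services > 0:
        return True
    if pending_commands > 0:
        return True
    return idle_seconds < IDLE_TIMEOUT_S


class CrmIdleTests(unittest.TestCase):
    def test_stays_active_while_services_configured(self):
        self.assertTrue(crm_should_stay_active(1, 0, 3600))

    def test_still_active_just_before_timeout(self):
        self.assertTrue(crm_should_stay_active(0, 0, IDLE_TIMEOUT_S - 1))

    def test_goes_idle_once_timeout_elapsed(self):
        self.assertFalse(crm_should_stay_active(0, 0, IDLE_TIMEOUT_S))

    def test_regression_pending_command_keeps_it_active(self):
        # Regression case: a queued command must not be ignored just
        # because no service is configured and the timeout has passed.
        self.assertTrue(crm_should_stay_active(0, 1, IDLE_TIMEOUT_S + 1))
```

Saved to a file, this can be run with `python -m unittest`. The point of the last case is that it stays in the suite forever: if a later change reintroduces the old behaviour, the build fails before the package goes anywhere.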

Further integration testing would typically cover any unintended interactions across interfaces. These tests can still be automated scripts run routinely on every new build, but they can also be manual.

Then there are system tests, performed on the full suite by actual testers, i.e. dedicated personnel who do not share the bias of the original developers. These may also involve automation, but should closely resemble the behaviour of real users.

This all happens before the final User Acceptance Test (UAT), something only a customer (in a typical scenario) can sign off on.

How well the first three are embedded in Proxmox culture is hard to determine, but following individual bug reports, it becomes clear there are some deficiencies.

Proxmox do have a public Bugzilla instance, 1 but it is apparent that there is no fixed process, uniformly followed once bugs get fixed, to ensure full end-to-end testing in every individual case. The quality of work of individual developers can also vary vastly, e.g. there are rigorous unit tests written for some new work, while other work has none at all, at least none published.

Unit tests

A prime example is pve-ha-manager. Its recent git log reads: 2

commit 34fe8e59eacb9107c76962ed12f6bea69195eb74 (HEAD -> master, origin/master, origin/HEAD)
Author: ---8<---
Date:   Sun Nov 17 20:36:27 2024 +0100

    bump version to 4.0.6
    
---8<---

commit 977ae288497fde04fb67bf25417ce54e77a29a63
Author: ---8<---
Date:   Sun Nov 17 17:23:01 2024 +0100

    crm: get active if there are nodes that need to leave maintenance
    
---8<---

commit 73f93a4f6b6662d106c32b433efabcc1f10dbc3a
Author: ---8<---
Date:   Sun Nov 17 17:01:37 2024 +0100

    crm: get active if there are pending CRM commands
    
---8<---

commit d0979e6dd064e6dc5a1292aa2c9b25c244500043
Author: ---8<---
Date:   Sun Nov 17 16:35:22 2024 +0100

    env: add any_pending_crm_command method

---8<---

commit afbfa9bafca0237785badb96f589524749fc937a
Author: ---8<---
Date:   Sun Nov 17 16:34:48 2024 +0100

    tests: add more crm idle situations
    
    To test the behavior for when a CRM should get active or stay active
    (for a bit longer).
    
    These cases show the status quo, which will be improved on in the next
    commits.

---8<---

commit ddd56db3463c3c7716072f6011070109df4a577a
Author: ---8<---
Date:   Fri Oct 25 16:34:02 2024 +0200

    fix #5243: make CRM go idle after ~15 min of no service being configured

---8<---

The bugfix 3 touched a non-trivial component relating to High Availability and was committed on October 25, 2024. Unit tests were supplied only on November 17, 2024, just over three weeks later, in the same sweep as further changes and, finally, the “bump version” commit, which followed the last of those changes by just over 3 hours. The package was made public soon after.
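The gaps can be checked directly from the `Date:` lines in the log above. A minimal sketch in Python follows; the variable names are my own labels for the four relevant commits, and the format string assumes git's default date output in an English locale.

```python
from datetime import datetime

# Format of git's default "Date:" line, e.g. "Sun Nov 17 20:36:27 2024 +0100"
GIT_DATE_FORMAT = "%a %b %d %H:%M:%S %Y %z"


def parse_git_date(text):
    """Parse a git log date string into a timezone-aware datetime."""
    return datetime.strptime(text, GIT_DATE_FORMAT)


# Dates copied from the log above; the labels are mine.
bugfix = parse_git_date("Fri Oct 25 16:34:02 2024 +0200")        # fix #5243
tests_added = parse_git_date("Sun Nov 17 16:34:48 2024 +0100")   # tests: add more crm idle situations
last_change = parse_git_date("Sun Nov 17 17:23:01 2024 +0100")   # last crm change
version_bump = parse_git_date("Sun Nov 17 20:36:27 2024 +0100")  # bump version to 4.0.6

# Timezone-aware subtraction accounts for the +0200 -> +0100 switch.
fix_to_tests = tests_added - bugfix
change_to_bump = version_bump - last_change

print(fix_to_tests)    # 23 days, 1:00:46
print(change_to_bump)  # 3:13:26
```

So the fix sat for 23 days before the additional tests arrived, and the version bump followed the last code change by a little over three hours.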

What about the… tests

In another instance, an SSH bugfix 4 that aimed to go all-in on a new intra-cluster communication setup (with impact on migrations, replications and the GUI proxying of console/shell connections, so quite a bit) 5 was made in January 2024. A regular member of the development team (i.e. not a dedicated tester) was tasked with manually testing a colleague's work on an ad hoc basis: 6

 > Tested cluster creation with three new nodes on 8.1 and the patches                                         
 > Cluster creation and further ssh communication (eq. migration) worked                                       
 > flawless                                                                                                    
 >                                                                                                             
 > Tested-by: ---8<---
 
 What about the reinstallation of an existing node, or replacing                                               
 one, while keeping the same nodename scenario?                                                                
                                                                                                               
 As that was one of the main original reasons for this change here                                             
 in the first place.                                                                                           
                                                                                                               
 For the removal you could play through the documented procedure                                               
 and send a patch for update it accordingly, as e.g., the part                                                 
 about the node’s SSH keys remaining in the pmxcfs authorized_key                                              
 file would need some change to reflect that this is not true                                                  
 for newer setups (once this series is applied and the respective                                              
 packages got bumped and released).              

This was then applied to public repositories in April 2024. 7

Then in May 2024, a user filed a bug report on a regression in QDevice setup 8 regarding a “typo in command”, which was fixed in the next minor version, also in May 2024.

Another bug, in closely related code that someone had forgotten to change, was found only in October, 9 and fixed the same day, 10 but as of December 2024 the fix has still not been included at all. 11

The takeaway

These are some of the testing procedures Proxmox use before releasing anything into their public repositories. However, the distinction between test packages and what makes its way into the no-subscription repository is blurred: eventually, both contain identical packages, after all. The final acceptance test (UAT) thus inevitably happens with the public, the widest user base possible, to offset any deficiencies that may have been overlooked. This is part of the actual business model of Proxmox, and it helps the product stay free of any monetary cost to the user.