A glimpse at Proxmox Quality Assurance
This post follows up on the previous finding that there is no difference in the eventual content of the no-subscription and test software repositories as made publicly available by Proxmox.
Routine
Every software house has some sort of quality assurance (QA) routine to ensure that obviously bad versions of their packages never reach their users.
It starts with rudimentary unit tests that a developer is supposed to write and have accompany their newly written code. These also help catch regressions - unintended bugs that cause previously dependable features to stop working correctly as they did before. Without them, an individual developer would typically only be testing the part that they were implementing anew.
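To illustrate the idea, a minimal sketch of such a regression test in Python - the helper function and the bug it guards against are made up purely for illustration:

import unittest

def parse_size(value: str) -> int:
    """Hypothetical helper: parse '4G', '512M' etc. into bytes."""
    units = {"M": 1024**2, "G": 1024**3}
    if value and value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)

class TestParseSize(unittest.TestCase):
    def test_plain_bytes(self):
        self.assertEqual(parse_size("1024"), 1024)

    def test_units(self):
        self.assertEqual(parse_size("4G"), 4 * 1024**3)

    def test_regression_empty_input(self):
        # Pins down a previously fixed (hypothetical) bug: empty
        # input used to crash with IndexError rather than raise
        # ValueError - if anyone reintroduces it, this test fails.
        with self.assertRaises(ValueError):
            parse_size("")

if __name__ == "__main__":
    unittest.main()

Once such a test exists, the fixed behaviour is pinned down for good; the suite fails the moment someone reintroduces the old bug.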
Further integration testing would typically cover any unintended interactions across interfaces; these can still be automated scripts routinely run on every new build, but they can also be manual - a sketch of the automated kind follows below.
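As a hedged sketch - the two components below are invented for illustration - an integration test exercises the seam between units that may each pass their own tests in isolation:

import json
import unittest

def render_config(name: str, memory_mb: int) -> str:
    """Component A: emit a guest config as JSON."""
    return json.dumps({"name": name, "memory": memory_mb})

def load_config(raw: str) -> dict:
    """Component B: consume the config produced by component A."""
    cfg = json.loads(raw)
    if cfg["memory"] <= 0:
        raise ValueError("memory must be positive")
    return cfg

class TestConfigRoundTrip(unittest.TestCase):
    def test_producer_output_accepted_by_consumer(self):
        # Each side may pass its own unit tests in isolation;
        # this test exercises the shared format both must agree on.
        cfg = load_config(render_config("vm100", 2048))
        self.assertEqual(cfg["name"], "vm100")

if __name__ == "__main__":
    unittest.main()

Scripts of this kind are cheap to wire into a build pipeline, so every new package can run them automatically before it goes anywhere.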
Then there are system tests performed on the full suite by actual testers, i.e. dedicated personnel who do not share the bias of the original developers; these may also involve automation, but should closely resemble the behaviour of real users.
All of this comes before the final User Acceptance Test (UAT) - something that, in a typical scenario, only a customer can sign off on.
How well the first three are part of Proxmox culture is hard to determine, but following individual bug reports, it becomes clear there are some deficiencies.
Proxmox do have a public Bugzilla instance, 1 but it is apparent there is no fixed process that, once bugs get fixed, uniformly ensures full end-to-end testing in every individual case. The quality of work of individual developers can also vary vastly, e.g. some new work comes with rigorous unit tests, while other work has none at all, at least none published.
Unit tests
A prime example is pve-ha-manager - looking at its recent git log: 2
commit 34fe8e59eacb9107c76962ed12f6bea69195eb74 (HEAD -> master, origin/master, origin/HEAD)
Author: ---8<---
Date: Sun Nov 17 20:36:27 2024 +0100
bump version to 4.0.6
---8<---
commit 977ae288497fde04fb67bf25417ce54e77a29a63
Author: ---8<---
Date: Sun Nov 17 17:23:01 2024 +0100
crm: get active if there are nodes that need to leave maintenance
---8<---
commit 73f93a4f6b6662d106c32b433efabcc1f10dbc3a
Author: ---8<---
Date: Sun Nov 17 17:01:37 2024 +0100
crm: get active if there are pending CRM commands
---8<---
commit d0979e6dd064e6dc5a1292aa2c9b25c244500043
Author: ---8<---
Date: Sun Nov 17 16:35:22 2024 +0100
env: add any_pending_crm_command method
---8<---
commit afbfa9bafca0237785badb96f589524749fc937a
Author: ---8<---
Date: Sun Nov 17 16:34:48 2024 +0100
tests: add more crm idle situations
To test the behavior for when a CRM should get active or stay active
(for a bit longer).
These cases show the status quo, which will be improved on in the next
commits.
---8<---
commit ddd56db3463c3c7716072f6011070109df4a577a
Author: ---8<---
Date: Fri Oct 25 16:34:02 2024 +0200
fix #5243: make CRM go idle after ~15 min of no service being configured
---8<---
This was a bugfix 3 in a non-trivial component relating to High Availability, committed October 25, 2024. Almost a month later, unit tests were supplied - but in the same swoop came further changes and finally a “bump version” commit, i.e. releasing the package to the public, just 3 hours after the last changes of November 17, 2024. The package was indeed made public soon after.
What about the… tests
In another instance, an SSH bugfix 4 that aimed to go all-in on a new intra-cluster communication setup (with impact on migrations, replications and the GUI proxying of console/shell connections, so quite a bit) 5 was made in January 2024, and a regular member of the development team (i.e. not a dedicated tester) was tasked to manually, ad hoc, test another one’s work: 6
> Tested cluster creation with three new nodes on 8.1 and the patches
> Cluster creation and further ssh communication (eq. migration) worked
> flawless
>
> Tested-by: ---8<---
What about the reinstallation of an existing node, or replacing
one, while keeping the same nodename scenario?
As that was one of the main original reasons for this change here
in the first place.
For the removal you could play through the documented procedure
and send a patch for update it accordingly, as e.g., the part
about the node’s SSH keys remaining in the pmxcfs authorized_key
file would need some change to reflect that this is not true
for newer setups (once this series is applied and the respective
packages got bumped and released).
This was then applied to public repositories in April 2024. 7
Then in May 2024, a user filed a bug report on a regression in the QDevice setup 8 regarding a “typo in command” - fixed in the next minor version, still in May 2024.
Another bug, in closely related code that had been forgotten about and left unchanged, was found only in October, 9 fixed the same day, 10 but as of December 2024 it has not been included in any release at all. 11
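Notably, a regression of the “typo in command” kind is exactly what even the most trivial automated smoke test - one that merely exercises the code path once - could plausibly have caught before release. A minimal sketch; the command list here is a placeholder, not Proxmox’s actual tooling:

import shutil
import unittest

# External commands a (hypothetical) module shells out to.
REQUIRED_COMMANDS = ["ssh", "ssh-keygen"]

class SmokeTest(unittest.TestCase):
    def test_external_commands_exist(self):
        # A misspelled command name fails here immediately,
        # instead of surfacing later in a user's bug report.
        for cmd in REQUIRED_COMMANDS:
            self.assertIsNotNone(shutil.which(cmd), f"missing: {cmd}")

if __name__ == "__main__":
    unittest.main()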
The takeaway
These are some of the testing procedures Proxmox use before releasing anything into their public repositories. However, the distinction between what test packages are and what makes its way into the no-subscription repository is blurry - eventually, they contain identical packages, after all. The final acceptance test (UAT) thus inevitably happens with the public - the widest user base possible - to offset any deficiencies that may have been overlooked. This is part of the actual business model of Proxmox, and it is what helps it stay free of any monetary cost to the user.
https://pve.proxmox.com/wiki/Cluster_Manager#_role_of_ssh_in_proxmox_ve_clusters ↩︎
https://lists.proxmox.com/pipermail/pve-devel/2024-January/061372.html ↩︎
https://lists.proxmox.com/pipermail/pve-devel/2024-April/063379.html ↩︎
https://forum.proxmox.com/threads/bug-new-lxcs-tofu-are-you-sure-you-want-to-continue-connecting-yes-no-fingerprint.156714/ ↩︎
https://lists.proxmox.com/pipermail/pve-devel/2024-October/065853.html ↩︎
https://github.com/proxmox/pve-container/commits/master/src/PVE/API2/LXC.pm ↩︎