Agreed. If you took 10 different receivers and hooked them, one after the other, to the same antenna on a test range, with a calibrated signal generator at the other end, you would likely get 10 different readings. S meters are only so-so calibrated even at the S9 reading, and no one can agree on what each S unit should be. They vary by as much as 100 percent, and they are not linear to boot.
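To put rough numbers on that, here is a small sketch. It assumes the nominal HF convention from the IARU Region 1 recommendation (S9 = -73 dBm, 6 dB per S unit), which real meters often do not follow. The function names and the 3 dB per S unit figure for the second meter are made up for illustration, not measurements of any particular rig.

```python
# Nominal HF convention (IARU Region 1 recommendation): S9 = -73 dBm, 6 dB per S unit.
# Real meters frequently deviate from this; treat it as a reference point only.
S9_DBM_HF = -73.0
NOMINAL_DB_PER_S_UNIT = 6.0


def nominal_dbm(s_units, db_over_s9=0.0):
    """Input level in dBm that a meter following the nominal convention
    would display as this reading (e.g. S7, or S9 plus 20 dB)."""
    return S9_DBM_HF - (9 - s_units) * NOMINAL_DB_PER_S_UNIT + db_over_s9


def s_reading(dbm, db_per_s_unit=NOMINAL_DB_PER_S_UNIT):
    """S-unit reading for a given input level on a hypothetical meter that is
    correct at S9 but uses db_per_s_unit for each S unit."""
    return 9 - (S9_DBM_HF - dbm) / db_per_s_unit


signal = -91.0  # the same input level fed to both meters
print(s_reading(signal))       # meter using 6 dB per S unit
print(s_reading(signal, 3.0))  # hypothetical meter using 3 dB per S unit
```

With the same -91 dBm input, the first meter shows S6 and the second shows S3, even though both agree at S9.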
When people compare antenna A to antenna B across thousands of miles, the result has no meaning, and neither does a side-by-side comparison of two different stations. Propagation changes second by second, so these types of tests, while fun to do, are pretty meaningless unless they are done consistently over a long period of time and Station A is always stronger than Station B. A simple Station A to Station B test done over a period of seconds means nothing.
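If you do want to run the long-term version of the test, one way to keep score is to log paired readings and look at how often A beats B, not at any single reading. This is only a sketch: `summarize_ab` is a hypothetical helper, it assumes both readings in a pair were taken at the same moment and are already in dB (or dBm), and the sample numbers are invented purely to show the call.

```python
from statistics import mean, stdev


def summarize_ab(pairs):
    """Summarize paired (station_a_db, station_b_db) readings taken at the same time.

    Returns the number of trials, the fraction of trials in which A was
    stronger, the mean A-minus-B difference in dB, and the spread of that
    difference (sample standard deviation).
    """
    diffs = [a - b for a, b in pairs]
    return {
        "trials": len(diffs),
        "a_stronger_fraction": sum(d > 0 for d in diffs) / len(diffs),
        "mean_diff_db": mean(diffs),
        "spread_db": stdev(diffs) if len(diffs) > 1 else 0.0,
    }


# Invented readings for illustration only, not real measurements.
log = [(-80.0, -83.0), (-90.0, -88.0), (-85.0, -86.0)]
print(summarize_ab(log))
```

A handful of pairs like this proves nothing. The point is that only a large log, collected over days or weeks, with A ahead nearly every time and a mean difference well clear of the spread, starts to say something about the stations rather than about propagation.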