IMO, the larger issue sfraser raises is whether blind testing is an accurate or meaningful arbiter for these types of debates. Rather than trying to summarize a complex scientific review of the matter, I'll link to a discussion that summarizes the conclusions of two French researchers:
https://www.audiosciencereview.com/forum/index.php?threads/limitations-of-blind-testing-procedures.1254/
In all manner of blind tests, we are confronted with conclusions that confound our perceptions of reality. Examples include:
Gordon Ramsay blind testing other chefs with pureed apples and pears (or similar sorts of comparisons) to see if they can tell the difference under the stress of testing. They often fail.
Professional violinists have been blind tested to see if they could distinguish a Stradivarius from a modern violin. They couldn't.
Dr. N. Ramanjaneyalu and research assistants conducted a carefully controlled, peer-reviewed blind test in 2013 to see if consumers could tell the difference between Coke and Pepsi. They couldn't.
Non-professional wine consumers were tested to see if they could tell cheap wine from expensive wine. They couldn't.
https://www.forbes.com/sites/thomaspellechia/2019/02/05/a-new-study-asks-answers-the-question-does-blind-wine-tasting-work/#4aa5fb442eab
Now the last one, regarding wine discrimination, is very interesting. Researchers have studied professional sommeliers and amateur consumers and found marked differences in how accurately the two groups discern wines. Different parts of the brain are activated during testing in pros and in amateurs. Is there a possibility that untrained listeners could be subject to similar limitations in similar test conditions?
No one would suggest there are no differences between apples and pears, different wines, Coke and Pepsi, or violins. There are obvious, objectively measurable differences; we just fail to discern them in testing situations.
The takeaway for me in all of this is that blind testing is very difficult to do correctly, and the results can be hard to interpret meaningfully. While it is a very useful tool in science, I suspect it has limitations when used to determine whether listeners can accurately discern differences under test conditions as they are usually constructed.
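To put a rough number on why a "fail" can be hard to interpret, here is a minimal sketch of my own, not taken from any of the studies linked above. It assumes a simple forced-choice test where pure guessing is right half the time, and the function names and the 70% "true" hit rate are made up purely for illustration:

```python
from math import comb


def binom_tail(n, k, p):
    """Probability of k or more correct answers in n trials with hit rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def min_correct(n, alpha=0.05):
    """Smallest score out of n that guessing (p = 0.5) reaches with
    probability <= alpha. Returns None if even a perfect score is not enough."""
    for k in range(n + 1):
        if binom_tail(n, k, 0.5) <= alpha:
            return k
    return None


def power(n, true_rate, alpha=0.05):
    """Chance that a listener with the given real hit rate passes the test."""
    k = min_correct(n, alpha)
    return 0.0 if k is None else binom_tail(n, k, true_rate)


if __name__ == "__main__":
    # Hypothetical listener who really does hear the difference 70% of the time.
    for trials in (10, 20, 50):
        print(trials, min_correct(trials), round(power(trials, 0.7), 3))
```

Under these assumptions, a 10-trial test requires 9 correct answers to beat guessing at the 5% level, and a listener who genuinely hears the difference 70% of the time would pass only about 15% of the time. So a short test reporting "they couldn't tell" may say more about the test than about the listener.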