Garth (above) makes some excellent points about positive and negative controls. He must have some experience as a lab scientist. I'd like to expand on his ideas.
Any measuring method is only as good as its controls show it to be. Before the A-B box can be used to compare a known sound (an existing amp) with an unknown sound (a newly designed amp), it has to be calibrated against known standards. With the appropriate positive and negative controls, you can define how useful this new A-B box is as a measuring tool.
Frank already ran one negative control, comparing the A-B box to a direct connection. He and others (how many others?) reported no difference.
In an A-B test of whether two devices sound different, it is important to limit the listeners' responses to "Yes, I hear a difference" or "No, I don't." Leave preference out; obviously, you cannot have a preference if you can't hear a difference. Human perception is not a simple physical or electrical parameter that can be easily and reliably measured with the appropriate instrument. It's a complex (read: messy) biological function, and the results may not be clear-cut. It's not known whether different listeners will respond similarly or differently, so the number of listeners tested matters: the more, the better. I can easily imagine that with Frank's negative control test (above), if enough listeners were tested, some would answer "No" and some would answer "Yes." It's possible the split could be 50% Yes and 50% No. That may be as clean a result as you can get, but it needs to be determined with real tests.
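To put a number on that "50% Yes could just be guessing" point, here is a rough sketch of my own (nothing Frank has actually run) using only Python's standard library. It asks: if every listener were effectively flipping a coin, how surprising would a given count of "Yes" answers be? The listener counts are invented for illustration.

```python
from math import comb

def p_at_least(k, n, p=0.5):
    """Probability of k or more 'Yes' answers out of n listeners
    if each listener is effectively guessing (chance = p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# With 10 listeners, 7 "Yes" answers are unremarkable under pure guessing:
print(round(p_at_least(7, 10), 3))   # 0.172

# With 20 listeners, 15 "Yes" answers (the same rough fraction) would
# happen by guessing only about 2% of the time:
print(round(p_at_least(15, 20), 3))  # 0.021
```

The practical upshot is the same as in the paragraph above: a small panel cannot distinguish guessing from genuine detection, so the more listeners, the better.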
Another negative control would use the A-B box to compare two identical amplifiers. Don't assume that no one will hear a difference; measure how many do and how many don't.
The more important test of the A-B box is the positive control. Think of it as a calibration curve: test several sounds that are known to be different, and ask how many listeners hear a difference. What kinds of sounds should be tested as positive controls? I really don't know, but here is one simple idea. Add various amounts of white or pink noise to a digital file of a short passage of cleanly recorded music. You end up with a series of recordings of the same short musical passage containing added noise at 0%, 2.5%, 5%, 7.5%, 10%, and so on.
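One way to generate that noise series is sketched below in Python with numpy. Note that "% added noise" isn't a standard quantity; defining it as noise RMS relative to signal RMS is my own assumption, and the sine wave is just a stand-in for a real recording.

```python
import numpy as np

def add_white_noise(samples, percent, seed=0):
    """Mix white noise into an audio signal. 'percent' is defined here
    as noise RMS relative to signal RMS -- one possible definition of
    '% added noise', an assumption rather than a standard."""
    rng = np.random.default_rng(seed)
    signal_rms = np.sqrt(np.mean(samples**2))
    noise = rng.standard_normal(len(samples))
    noise *= (percent / 100.0) * signal_rms / np.sqrt(np.mean(noise**2))
    return samples + noise

# Stand-in for a cleanly recorded passage: one second of 440 Hz at 44.1 kHz
clean = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)

# Build the series: 0%, 2.5%, 5%, 7.5%, 10% added noise
series = {pct: add_white_noise(clean, pct) for pct in (0, 2.5, 5, 7.5, 10)}
```

For a real test you'd load the music from a file and write each noisy version back out, but the mixing step itself is this simple.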
Use the A-B box to test each listener on the noise series against the unmodified recording (0% added noise). At some level of added noise, all or most listeners will hear a difference; at some lower level, few or none will. The percentage of listeners who hear a difference at each level of added noise is the key number. This way you can get quantitative data out of what is essentially a Yes-No test. I can imagine the result as a two-dimensional graph: the X axis shows the % added noise, and the Y axis shows the % of listeners who hear a difference. With the right levels of added noise, there ought to be a linear relationship between % added noise and the % of listeners who can hear it.
That audibility curve from the positive control tests defines the real sensitivity of the A-B box. Hopefully the results of testing two of Frank's amps will fall within the range where the positive controls show that listeners can hear differences. But it is also possible that such a test would require more sensitivity than human listeners can deliver with the A-B box. I don't know the answer now.
My idea of adding % noise to a clean musical passage is just one untested idea off the top of my head. Does anyone have other ideas for positive control tests that could be used to calibrate the sensitivity of the A-B test?