the semblance of objectivity in numbers

I just received my first-ever rejection of a first-authored conference paper, this one from FSE. The primary reasons, quoted from the reviews, include:

  • “The qualitative nature of the study … is liable to misinterpretation and bias.”
  • “I was expecting a quantitative analysis: is there any correlation between some of the characteristics and between [the results] and the time a bug takes to resolve and its resolution status?”
  • “I would have thought that what types of elements to look for in discussion should be decided before by the researchers as it should be based on the problem”
  • “I was expecting concrete advice on HOW the tools should structure the discussion.”

I was hoping the reviewers would have been more epistemologically informed. The first and second quotes are telling: they imply that some forms of empiricism are not subject to misinterpretation or bias. But quantitative empirical measures are just as subject to bias as any other measure. Suppose I had counted certain kinds of data and run correlations between these counts and other outcome measures. Roughly one in twenty of those correlations would come out “statistically significant” by chance alone at the conventional 0.05 threshold. Beyond that, whether the variables meant anything would depend on the construct validity of the quantitative measurements. If I had correlated hyperboles with bug resolution time, for instance, the hyperbole measure would have the same limitations it had as a qualitative classification. Bug resolution time is also shaped by so many contextual factors that it would be a poor reflection of a hyperbole’s impact on consensus. Transforming empirical observations into numbers does NOT make them objective, nor does it prevent bias and misinterpretation.
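The “one in twenty” figure above is the expected number of false positives. The sketch below makes the arithmetic concrete. It assumes the following:

  • every null hypothesis is true;
  • the tests are independent;
  • the threshold is α = 0.05;
  • under the null, p-values are uniform on [0, 1], which lets the simulation skip computing real correlations.

The function names are hypothetical, chosen for illustration. Under these assumptions, twenty correlations yield one spurious “significant” result on average. The chance of getting at least one is 1 − 0.95^20, roughly 64%.

```python
import random


def expected_false_positives(num_tests: int, alpha: float = 0.05) -> float:
    """Expected count of 'significant' results when every null hypothesis is true."""
    return num_tests * alpha


def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def simulate_null_correlations(num_tests: int, alpha: float = 0.05,
                               trials: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the familywise error rate.

    Assumption: under a true null (with a continuous test statistic) a p-value
    is uniform on [0, 1], so we draw p-values directly rather than computing
    actual correlations on data.
    """
    rng = random.Random(seed)
    trials_with_false_positive = 0
    for _ in range(trials):
        # One "study": num_tests null correlations; did any look significant?
        if any(rng.random() < alpha for _ in range(num_tests)):
            trials_with_false_positive += 1
    return trials_with_false_positive / trials


if __name__ == "__main__":
    print(expected_false_positives(20))           # 1.0
    print(round(familywise_error_rate(20), 3))    # 0.642
    print(simulate_null_correlations(20))         # close to 0.64
```

Correcting the threshold for multiple comparisons (for example, with a Bonferroni correction) addresses the chance problem, but not the construct-validity problem described above.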

The third quote is ironic: this reviewer seems to believe that the only way to analyze a problem is to make assumptions about its nature upfront. The whole point of qualitative research is that the more assumptions you make upfront, the more you bias your findings. What this reviewer proposes would have lessened the objectivity of the results and prevented us from uncovering the trends we did.

The last quote reveals the systemic bias in software engineering research (and in some HCI venues): qualitative studies are only valuable if they explicitly inform design. What this really reduces to is a view that producing material goods is real work, while the production of knowledge comes for free. On this view, building a system or automating some activity is more valuable than understanding the real world, even if the system and automation are entirely impractical in the real world. The comment also reveals the reviewer’s lack of understanding about design: innovations don’t come from studies; they come from people. Studies can support design decisions, but they cannot generate ideas. (The results in our rejected submission have in fact been quite valuable in our current design efforts.) People generate ideas.

Had I really wanted the paper in, I would have littered the submission with arbitrary but seemingly objective quantifications and correlations of our data. That is what most quantifications in software engineering papers are. This has worked in past papers and is a tried-and-true workaround for the software engineering community’s lack of experience with qualitative methods. Reviewers would have thought, “I don’t get all of this qualitative stuff, but these numbers are great.” I decided not to do this on principle, since doing so would only have made the results seem more objective without adding any real objectivity.

So much for principle. Time to start correlating things!

10 thoughts on “the semblance of objectivity in numbers”

  1. Pingback: Bad surveys « Catenary

  2. I have been a member of the PC that so frustrated you, and as future ESEC/FSE program chair, I am concerned about your implications. I cannot confirm a systemic bias against qualitative research (nor against quantitative research, for that matter). What I am looking for in a PC is causation, not correlation. For causation, you need not only a well-established (quantitative) correlation, but also a (qualitative) theory of why things are as they seem. And the reason we’re looking for causation is that, indeed, we’d like results to be actionable, so that people can change the cause in order to influence the effect. (In another sense, these papers carry information rather than just data.)

    Few papers excel at establishing causality, but those that do tend to dominate papers that are less actionable. And if a paper provides only the initial steps towards further research, it will be dominated by research that has gone all the way.

    (I don’t have access to your submission or its reviews, as I have a conflict with UW, so none of this need apply to your paper. But if you send the material to me, I’ll be happy to give you some extra advice.)

    Best — Andreas

    • Thanks for your insights, Andreas. I completely agree about the importance of establishing some form of causation. Software engineering, as a design discipline, is about changing and improving practice. Understanding the causal relationships between the variables we see in practice is the most reliable way to impact practice. I fully support this perspective.

      Unfortunately, I’ve found the software engineering research community to have a very limited notion of causality. First, many of the reviews I’ve seen seem to think that one can *establish* causality. Empiricism can do no such thing; we can only gain confidence in it. As any good epistemologist will tell you, most of our confidence comes from convergent validity. That means using a variety of methods and measures to study the same phenomenon and finding the same underlying truth through each lens. If the software engineering research community continues to limit publishable methods to quantitative empiricism, we will have a very skewed (and, I would argue, shallow) understanding of practice.

      And it’s not as if the quantitative empiricism in software engineering research is that good (though it is getting better). Testing theories and demonstrating causality require experiments and replications of experiments. These rarely get published because of concerns that nothing new is being established. A well-designed study with a null result is often quite important in confirming our understanding of the world. Without these works, we practice weak empiricism.

      Yet there is a larger, more systemic problem in what work the community values. Currently, the community only incentivizes work that *tests* theories, and it has little interest in (qualitative) work that generates new theories. The whole point of qualitative research is to rigorously extract new theories from empirical observation. This was the goal of my paper. The reviewers, however, were of the mind, “These theories are great and we will use them to inform tool and process design, but I don’t see any real work here.” Apparently, three months of rigorous analysis of over 10,000 bug report comments isn’t work. Unless those analyses involved numbers :)

      I think it’s perfectly reasonable to have work that “goes all the way” dominate, especially work that transforms our understanding of how to improve software engineering. But so much of this work offers no substantive shift in our view of how software engineering work happens and how it can be improved. These shifts in understanding don’t come from narrow experiments or clever automations. They come from deep, rigorous analysis of what makes software engineering challenging.

      This brings me to a larger critique of PCs and conferences in general. Our work is archived in conferences, which have limited slots. As a result, the goal of a PC is unfortunately to choose the “best” of the submitted work, rather than all work above some level of quality. If we were a discipline interested in creating broad, generalizable knowledge about software engineering, we would structure our modes of dissemination to ensure that all high-quality knowledge reaches the larger community. Currently, the artificial cutoffs imposed by conference venues mean that we only spread the most conservative of high-quality work. Work that applies new methods or tests a plausible but exotic theory rarely finds a home.

      I’m actually not bitter; I predicted the paper would be rejected for the methods it used. I just hope for a day when software engineering research uses a wider variety of methods to understand and impact software engineering work. From my view, the field is still in its infancy from a scientific perspective, with most researchers confusing “qualitative” with “subjective”.

  3. Your third point has been bothering me for the past month: are studies worthwhile if they fail to inform design?

    My current work has no clear path to informing design (right now). In the long term, it may, but I can’t articulate how at the moment.

    Is it worth pursuing? I think so. But without a clear motivation in terms of informing design, how can I sell it to a committee or funding source?

    • I think the issue is that studies always have implications, but not always implications for design. Sometimes the implications are for process, organization, or training. The problem is that most SE conferences don’t really view these other kinds of implications as interesting or relevant.

  4. I suspect that ESEM would have been more receptive to this work, especially given Carolyn Seaman’s involvement in the community. Admittedly, ESEM is not considered as prestigious as FSE, but you get to maintain your principles!

    • That’s a good suggestion. We just submitted a revised version to CSCW today, but ESEM would be a good place if that doesn’t work out. Thanks!

    • That’s what I usually do. This time I didn’t do it out of principle, since it shouldn’t be necessary :) So much for principle, back to pragmatism.

  5. Pingback: Andy Ko and the semblance of objectivity in numbers « Catenary
