Research Themes

I study how to better build (and understand) interesting systems using humans and AI.

General Themes

Open Science

As of this moment, two of the largest organizational experiments underway on this planet are (1) the international open source movement and (2) the open science movement. I contribute to both.

Teams of Humans+AI

I try to understand how humans+AI can work better together to organize software systems.

Where I differ from other researchers is that I think humans, with all their flaws and imperfections, are central to this organization. So here is what I do not think is true:

  • The notion of 'user' cannot be precisely defined, and therefore has no place in CS or SE.
    --Edsger Dijkstra, ICSE, 1979

What I do think is true is that better conclusions about software projects arise from:

  • More data: In the age of the Internet, we can access more data than ever before. Certainly, SE research is now collecting and using more of that data than it ever has.
  • More CPU: Given recent trends in cloud computing, it is now possible to apply more CPU to any given problem. For example, search-based SE methods treat SE problems as optimization problems, then apply large amounts of CPU to understand the trade-offs that exist between competing goals (see the sketch after this list).
  • Human analysts: who find and structure better questions. That structure offers a "landscape" of possible solutions. Note that these analysts are what I think our current generation of programmers will evolve into.
  • Automatic systems: that better understand those questions and use that insight to explore, and find shortcuts across, that "landscape".
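To make those last two bullets concrete, here is a minimal sketch (toy model and invented numbers; not any particular tool of mine) of search-based SE as optimization: a black-box project model with two competing goals, a large batch of random samples bought with cheap CPU, and a Pareto filter that exposes the trade-off "landscape" an analyst could then discuss.

    import random

    def project_model(x):
        """Toy black-box project model: maps a test-effort setting in [0,1]
        to two competing goals (both to be minimized)."""
        cost    = 10 * x + random.gauss(0, 0.5)       # more testing costs more
        defects = 8 * (1 - x) + random.gauss(0, 0.5)  # more testing, fewer defects
        return cost, defects

    def dominates(a, b):
        """True if goal vector a is at least as good as b on every goal
        and strictly better on at least one (minimization)."""
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    # Throw CPU at the problem: sample many candidate settings ...
    candidates = [(x, project_model(x)) for x in (random.random() for _ in range(1000))]

    # ... then keep only the non-dominated ones: the trade-off "landscape".
    front = [c for c in candidates
             if not any(dominates(o[1], c[1]) for o in candidates if o is not c)]

    for x, (cost, defects) in sorted(front):
        print(f"test effort={x:.2f}  cost={cost:.2f}  defects={defects:.2f}")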

Note my emphasis on this "landscape". I think the shape of the user preference space can tell us a lot about how to better solve problems since:

  • A mathematical definition of "user": the force that changes the geometry of search space.
    --Me, 2014

How NOT To Do It

How best to organize our software systems? Well, here are some ideas that I have tried and now think are not the right approach.

 

Not Formal Logics

For years I worked with specialized modeling languages, with the dream of coercing business users and programmers into using my kind of "better language".

Nowadays I think that we have to reason about all manner of idiosyncratically written models (written in FORTRAN or EXCEL or whatever), since those are the kinds of models that are most common. Hence, I use data mining and optimization methods that treat models as black boxes mapping inputs to outputs, where the goal is to find better inputs that yield better outputs.
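As a minimal sketch of what "models as black boxes" means in practice (the legacy_model function and its inputs below are invented for illustration; in real work that call might shell out to a FORTRAN binary or a spreadsheet), the search below only ever sees inputs going in and a score coming out:

    import random

    def legacy_model(inputs):
        """Stand-in for some idiosyncratic legacy model. The optimizer below
        never looks inside this function; it only maps inputs to an output
        score (here, lower is better)."""
        a, b, c = inputs
        return (a - 3) ** 2 + (b + 1) ** 2 + 2 * abs(c)

    def mutate(inputs, step=0.5):
        """Nudge one randomly chosen input."""
        out = list(inputs)
        i = random.randrange(len(out))
        out[i] += random.uniform(-step, step)
        return tuple(out)

    # Black-box search: keep any mutation that yields a better output.
    best = (0.0, 0.0, 0.0)
    best_score = legacy_model(best)
    for _ in range(5000):
        candidate = mutate(best)
        score = legacy_model(candidate)
        if score < best_score:
            best, best_score = candidate, score

    print("best inputs:", tuple(round(v, 2) for v in best),
          "score:", round(best_score, 3))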

(And to change my view on this point, all you'd have to do is show me some formal logical modeling language that is in wider use than FORTRAN or EXCEL. Good luck with that.)

Not Automatic Programming

The old dream of fully automatic programming (e.g. "write down the axioms, then auto-generate the system") may not work. Deterministic deductive synthesis is an interesting technique, but it struggles when there are competing goals or any degree of uncertainty in the background knowledge: the search space explodes, and it becomes hard (or impossible) to maintain invariants in assemblies of partially specified components.

How To Do It

If the above is not recommended, then what do I think are useful directions?

 

Induction

For problems with competing goals, or ones that rely on incomplete knowledge, model creation is a heuristic, inductive process in which some learner explores and culls a large space of possible models.

Heuristic induction should be conducted in partnership with the users, since the biases of those users are needed to guide the learner to a result that is acceptable to those users. In that partnership, it is important that the automatic process does not overwhelm the humans with too many options. Happily, if we exploit user biases, we can readily build range selection, feature selection, and instance subset selection methods that greatly reduce the space of options the user needs to discuss.
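To illustrate (with synthetic data and invented column names; this is a sketch of the idea, not any particular tool of mine): given a table of project examples, a few lines of feature selection plus instance subset selection can shrink what the human has to look at from hundreds of rows and many columns to a handful of each.

    import random
    from statistics import mean

    def reduce_table(rows, target, keep_features=2, keep_rows=10):
        """Toy reduction of a table of examples:
        (1) feature selection: keep the features whose mean value differs most
            between the better and worse halves of the data (ranked on target);
        (2) instance selection: keep a small random sample of rows.
        'rows' is a list of dicts; 'target' is the column to minimize."""
        ranked = sorted(rows, key=lambda r: r[target])
        best, rest = ranked[: len(rows) // 2], ranked[len(rows) // 2:]
        features = [f for f in rows[0] if f != target]
        score = {f: abs(mean(r[f] for r in best) - mean(r[f] for r in rest))
                 for f in features}
        kept = sorted(score, key=score.get, reverse=True)[:keep_features]
        sample = random.sample(rows, min(keep_rows, len(rows)))
        return [{f: r[f] for f in kept + [target]} for r in sample]

    # Tiny synthetic example: 100 projects, 4 features, one effort column.
    random.seed(1)
    data = [{"loc": random.random(), "team": random.random(),
             "tools": random.random(), "noise": random.random()}
            for _ in range(100)]
    for r in data:
        r["effort"] = 2 * r["loc"] + r["team"] + random.gauss(0, 0.1)

    small = reduce_table(data, target="effort")
    print(len(small), "rows x", len(small[0]), "columns (was 100 x 5)")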

Exploring Biases

Inductive search is inherently biased, so it is important that we study and document those biases, since different biases lead to radically different models.

Note that we cannot, and should not, avoid bias, since without it there would be no way to decide which bits are most important and which can be ignored. Paradoxically, bias blinds us to some things while letting us see (predict) the future. So all models are biased (but only some admit it). If we understand and use that bias, we can generate models that are most useful to the user:

  • By pruning away irrelevancies (where "relevance" is defined by the biases), we can produce models that perform "better", where "better" means better predictions, lower variance, or both.
  • Another advantage of this kind of pruning is increased privacy. If data from N projects is pruned back to M (where M<N) then, by definition, we are ensuring the privacy of N-M projects. Further, if we add in some feature selection so we are only using (say) 25% of the features, then we are ignoring even more data and achieving greater privacy (see the sketch after this list).
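As a back-of-the-envelope view of that privacy argument (all the numbers below are illustrative), the fraction of the original data that ever gets shared shrinks multiplicatively once instance and feature pruning are combined:

    # Pruning as privacy: numbers are invented, just to show the arithmetic.
    n_projects, kept_projects = 100, 20    # instance selection: N -> M
    n_features, kept_features = 20, 5      # feature selection: keep 25%

    cells_before = n_projects * n_features
    cells_after  = kept_projects * kept_features
    print(f"{n_projects - kept_projects} projects never leave their owners")
    print(f"shared cells: {cells_after}/{cells_before} "
          f"= {cells_after / cells_before:.0%} of the original data")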

Learning "NICHE" Contexts

Another way to describe the above is "how do we learn contexts?" (where each "context" is a region where a particular set of biases is most true). Contextual reasoning is very important in software engineering since software projects are very different and constantly changing. One automatic method for finding contexts is something I call NICHE (it used to be called IDEA, then WHERE, but I found a 1989 article that convinced me that "NICHE" was a better term).
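To show the flavor of what "finding contexts" can mean (this is an illustrative sketch of one simple recursive, distance-based split, not the NICHE code itself), the idea is to keep dividing the data at its two most separated examples until each region is small enough to carry its own biases:

    import math
    import random

    def contexts(rows, min_size=8):
        """Recursively split numeric examples into regions ("contexts"):
        pick two far-apart examples, assign every row to the nearer of the
        two, and recurse until the regions get small. Illustrative only."""
        if len(rows) < 2 * min_size:
            return [rows]
        pivot = random.choice(rows)
        east = max(rows, key=lambda r: math.dist(r, pivot))  # far from pivot
        west = max(rows, key=lambda r: math.dist(r, east))   # far from east
        half1 = [r for r in rows if math.dist(r, east) <= math.dist(r, west)]
        half2 = [r for r in rows if math.dist(r, east) > math.dist(r, west)]
        if not half1 or not half2:
            return [rows]
        return contexts(half1, min_size) + contexts(half2, min_size)

    # Synthetic data drawn from three clumps; the splitter should rediscover them.
    random.seed(1)
    data = [(random.gauss(cx, 1), random.gauss(cy, 1))
            for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(30)]
    for i, region in enumerate(contexts(data)):
        print(f"context {i}: {len(region)} examples")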