At the end of this draft, we offered a reply to
Reviewer 2's comment that ``the main limitation of
the research to me is the use of just two data
sets (coming from 1981 and 1993). Even though
there are subsets created, this still does not
change anything on this limitation. This is
especially true for the type of conclusions drawn
to demonstrate conclusion stability.  I can not
see the reason why further data sets are (not)
used to validate the stated proposition.''.

Politeness prevents us from giving a more honest
reply to Reviewer 2. The empirical basis of our
paper is as strong as that of several prior
prominent TSE papers:

1) Chulani and Boehm's 1999 paper on Bayes tuning
in COCOMO used 141 records. We have 156.

2) Shepperd's 2001 TSE paper is based on 6 data
sets. We have 19.  Yes, there is some overlap in
our 19 data sets but, as observed in Figures 3 and
4 of this paper, that overlap is not large.

3) The core empirical results of Shepperd's 2002
and 2005 TSE papers come from artificial data sets
that were generated using distributions pulled
from one data set per publication.

4) Our own prior TSE paper (October 2006) on this
topic was accepted after a considered examination
of two rounds of TSE reviews.  None of those
reviewers felt that we were over-generalizing our
conclusions from the small data sets used in that
paper.

The industrial reality is that this kind of data
is as rare as hen's teeth. In an ideal world,
we would have more than 156 records divided into
19 subsets. But Boehm has been trying for a decade
to extend that set, with no success.

Given the data poverty, it is important that we
can demonstrate stable conclusions with the
available data. Hence, this paper.