EDITOR'S SUMMARY: There is seemingly a lot of work to do here. In particular, you need to focus on stating the problem more clearly and on clearly outlining your novel contribution(s). Part of that will include relating better to your own recent work mentioned by Reviewer #1. Please also make clear what you mean by "Monte Carlo", as both reviewers seem confused on this point, and make clear to readers exactly which mission-critical failures your tool diagnosed. In addition, please address in detail all the comments of both reviewers.

REVIEWER #1:

Summary: The paper experimentally evaluates two related treatment learners developed by the authors, TAR3 and the more recent TAR4.1, against two optimization algorithms, Simulated Annealing and Quasi-Newton, using similar objective functions, on three datasets. The first two of these are simulations from NASA systems; the third is data from a bicycle ride. The paper reports three findings. The first is that treatment learners are much faster than the SA and QN optimization algorithms. This confirms earlier results by the second author on treatment learners. The second finding is that TAR3's treatments are more precise than those of SA and QN. The third finding is that TAR4.1 has a higher recall and a lower false positive rate than SA or QN. The paper's contribution is its experimental results showing that the two treatment learning algorithms out-perform two standard optimization algorithms on the datasets. The reported results are interesting and have important implications for the expanded use of treatment learners in analyzing datasets produced by simulations of large, complex systems.

Major Comments:

1. Two of the datasets were from physics simulations for launch-abort and reentry of a vehicle, and both of these are certainly mission critical. However, contrary to the expectation raised by the title, the paper does not show whether or how the treatment learner results actually assisted in diagnosing any mission-critical failures in these cases, nor were any examples given. The main experiment looked for any failure type, not an individual failure type. Fig. 9 and the accompanying text describe an experiment for "a specific error type", but no information is given as to what that was. The abstract refers to "fault points" and "settings most likely to cause a mission-critical failure", but again, this seemed to motivate the work on treatment learners rather than reflect results reported in this paper.

-- The wording in the paper has been revised and strengthened. We have provided details on the failures being searched for. We have also attempted to clarify that these treatment learners look only for mission-critical failures, not just any type of failure. --

2. The fact that TAR3's treatments perform better than standard techniques is not a new finding. Two of the authors of this paper were co-authors on a 2009 paper, "Software V&V Support by Parametric Analysis of Large Software Simulation Systems," that explored combinations of two learners, TAR3 and AUTOBAYES. The first two authors are co-authors on a 2010 Automated Software Engineering paper, "Finding Robust Solutions in Requirements Models," that uses BORE and TAR, and describes a comparison with simulated annealing in a different context. These papers should both be cited here as related work.
-- This paper had already been submitted prior to the acceptance of the 2010 ASE paper. A description of that experiment has been added to the "Related Work" section, and the 2009 paper is now referenced. --

3. The paper is unnecessarily hard to read in several places and needs to be carefully edited to fill in logical gaps (see comments) and provide continuity across what seem to be separately authored sections that sometimes lack cohesion.

-- This new version of the paper has been substantially edited. --

4. Page 16 shows TAR4.1 and QN as tied on recall (i.e., within parentheses), but Figs. 6-8, as well as the introduction, indicate that TAR4.1 is better than QN on recall.

-- TAR4.1 and QN tied in terms of statistical ranking, though TAR4.1 had a better median value. This has been clarified in the text. --

5. The discussion of TAR3 vs. TAR4.1 on p. 21 can probably be strengthened a bit. For example, can anything regarding the choice made by the Margins Analysis Project between TAR3 and TAR4.1 be included in the discussion on p. 21? The only guidance given is that "The final recommendation on which treatment learner to use comes down to the priorities of a particular project." I would also guess that the data mining and IR fields have "weighed in" on whether precision or recall matters most for mission-critical systems.

-- Additional discussion has been added on the importance of precision vs. recall. --

6. Page 21: At least a brief explanation is needed as to why these two algorithms were chosen for the benchmark comparison.

-- A justification has been added. Thank you for pointing out this oversight. --

Minor:

1. The paragraph in the Introduction, beginning "Ultimately, classifiers" on page 2, should be rewritten to be clearer to a reader without a background in this area. Is a "category" here the same as a "Class" in the previous paragraph? What is a "smallest" rule? The previous paragraph referred instead to a set of rules. What is evidence? What is a complex target? Try to keep your reader with you through the Introduction by defining your terms, using consistent terminology and/or abstracting out the details. A small example would help here.

-- The introduction has been revised with more consistent terminology. Additionally, an example of a situation where treatment learning would be used has been added. --

2. It's not clear what Fig. 1 adds, and it can be deleted without loss.

-- This figure served no real purpose and has been removed. --

3. The first paragraph of 4.3 does not define its terms. What is C? What is c? What does "Some subset of e subset of E" mean?

-- The terms used in this paragraph have been clarified. --

4. Why does 4.4 begin with an explanation of what a data miner is and why TAR3 isn't one?

-- This was a piece of rhetoric to motivate the creation of a more efficient treatment learner. It has been removed, and the following text has been clarified to better introduce that motivation. --

5. In the Future Works section, the future work in Margins Analysis does not seem to be directly related to the work described here, so it can be deleted.

-- All future work not directly related to the treatment learning stage has been removed. --

6. Missing noun in all figures' captions: "iff their their are".

-- Fixed, thank you for pointing this out. --

7. Page 11 introduces another learner, TAR4, that preceded TAR4.1. The innovation of TAR4 was the use of a Naïve Bayesian classifier to improve runtime and lessen memory. Does TAR4.1 share this? If so, it may be better to omit the discussion of TAR4.

-- TAR4 and TAR4.1 are essentially identical except for a squared probability term in TAR4.1. TAR4 was mentioned only to explain why that term is squared. That introduction was deemed unnecessary, and all mentions of TAR4 have been removed. --
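-- For the reviewer's convenience, here is a minimal sketch of the kind of squared-probability, Naive Bayes-style ranking we mean (illustrative only; the function name, argument names, and example counts are ours for this letter rather than taken from the paper):

    # Illustrative sketch, not the paper's code: a candidate treatment is scored
    # by its likelihood among the "best" examples, squared, normalised by its
    # total likelihood.  Squaring the numerator is the change from TAR4 to TAR4.1.
    def squared_likelihood_score(freq_in_best, n_best, freq_in_rest, n_rest):
        prior_best = n_best / (n_best + n_rest)
        prior_rest = n_rest / (n_best + n_rest)
        like_best = prior_best * freq_in_best / n_best
        like_rest = prior_rest * freq_in_rest / n_rest
        return like_best ** 2 / (like_best + like_rest + 1e-12)

    # e.g. a treatment seen in 40 of 50 "best" runs but in only 100 of 950 other
    # runs: squared_likelihood_score(40, 50, 100, 950) is roughly 0.011.

The exact definitions appear in the revised text. --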
8. What does the "new" in the title refer to, both TARs or just TAR4.1? TAR3 has already been evaluated in other papers; TAR4.1 builds on TAR3, so it may not merit the claim of novelty.

-- The title has been changed. --

9. p. 21: typo: "for an some".

-- Fixed. Thank you. --

10. Reference 18 should be updated. NASA should be capitalized in Ref. 31, RAID in Ref. 33.

-- Thanks for pointing out the oversight. All three have been corrected. --

REVIEWER #2:

The comments/remarks below are intended for both the author and the editor, and are addressed to the author of the article.

1) "Monte Carlo filtering" in the article's title is never defined or explained later in the article.

-- This connection was not made clear enough in the previous draft. The concept is now explained earlier in the paper. --

2) The connection/difference between rule induction and treatment learning needs to be shown. Is treatment learning a new term introduced by the authors? If so, what is the necessity for it?

3) The goal and contribution of the article should be stated unambiguously; any specifics of the data that the authors work with should be explained (or at least stated) in the abstract. The novel element (research) should be clearly distinguished - if there is any.

-- The goals of this research have been more thoroughly explained. --

4) p. 5: RSE - what is it? Is RSE = RSA?

-- RSE refers to the Robust Software Engineering Group at NASA, as introduced in the previous section. This has been made clearer. --

5) p. 6-8: The authors show a decision tree and the result of treatment learning and say the decision tree is more difficult to interpret. This is not a correct comparison, since both methods have different goals, and the terminal nodes of the tree ("rules") can be sorted so that rules/nodes with high support/lift are on top - the same thing that "treatment learning" does. Also, it is not clear whether the goal of treatment learning is to find only one big region with a high "fail count" or to describe all faults. The tree seems more advantageous because it can do both and needs no discretization of numeric variables.

6) p. 8-9: BORE classification - is it the only special problem that is targeted by your approach? What is the origin of this problem? The response generation method seems to come out of nowhere - and obviously makes the problem unsupervised.

-- BORE is a generic classification system. The data we use is initially given only a continuous score, and BORE is used to convert those scores into discrete class values. The text now clarifies this. --
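-- To illustrate what we mean by converting continuous scores into discrete classes, here is a minimal sketch (the quantile cut points and class labels are made up for this letter, not taken from the paper, and BORE is not restricted to a two-way split):

    # Illustrative sketch of the BORE ("best or rest") idea: each Monte Carlo run
    # receives a continuous score, and the scores are binned into discrete classes
    # by quantile.
    import numpy as np

    def bore_classes(scores, cut_quantiles=(0.5, 0.9), labels=("low", "mid", "best")):
        scores = np.asarray(scores, dtype=float)
        cuts = np.quantile(scores, cut_quantiles)            # score thresholds
        return np.array(labels)[np.digitize(scores, cuts)]   # one label per example
--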
Was this "research" done on the same datasets with application of other RI(=rule induction) methods? Or it is just end user concern that a rule should not be too long? -- This is more of a rule of thumb intended to keep rules understandable by humans. The wording has been clarified. -- 10) "TAR3 determines ... by first determinimg the lift of each individual attribute" - lift of an attribute is not defined. -- The word "range" was left out of this sentence. The sentence has been rewritten to make its point clearer. -- 11) p.11. "TAR3 is not a data miner" - what is the meaning of this statement? As I see term "data miner" is not a formal term, five bullets below also are not formal/unambiguius definitions, at least most classifiers (linear regression ,trees, KNN, SVN( or rule induction (apriori, PRIM) obviously do not satisfy all five conditions below. Also bullets "online, anytime algorithm" and "are suspendable, stoppable and resumably" are not clear, and seem impossible to formalize. -- This was a bit of unscientific rhetoric to show that TAR3 could be more efficient. It has been deemed unneccesary and has been removed. -- 12) "Domingos and Pazzini have shown ... that independence assumption is a problem only in vanishingly small percent of cases - again too general statement to be true in real life. That is probably a consequence of obvious fact that "real life datasets" are vanishingly small portion of "general" distribution of datasets (at least because they are non-random in some way). In my experience most real datasets contain a lot of highly correlated variables, for example pixel colors in face and text recognition, etc. -- This statement was primarly included as an interesting footnote to explain the name of the algorithm, and as such has little bearing on the research work being conducted here. We have removed the statement, as it adds nothing of value to the paper. -- 13) p. 12 "Each example ... adds down_i and up_i" -> how it adds? this has to be wtitten in more formal way. F(R_i|base) and F(R_i|apex) are not defined. -- This section has been rewritten to make each item more clear. -- 14) p. 14 - Gradient Based optimization - it is unclear how attributes (variables) for intervals in the "treatment" (decision rule) are selected and optimized using continuous optimization method. Need to formailize. -- This entire section has been rewritten and clarified. -- 15) p. 16 - why you selected so many criteria for estimation of algoritm quality? Wny not use just misclassification error or balanced error rate (in case you have problems with imbalanced classes)? Anyway this is usually solved with specifying costs or priors for classes in the rule quality function - that does not depend on chocie of rule optimization approach. 16) QN is not self explaining - before you talk of gradient based optimization. -- QN is short for Quasi-Newton, a form of gradient-based optimization. The term Quasi-Newton is used throughout the paper, and it has been made clearer that QN is a shortened form of that. -- 17) Results table are noninformative : a) time requerements - what units do you use (sec, min, hours)? also you cannot directly compare times because probably you used different implemetations of the algorithms, also parameters (like number of iterations and stopping criteria, or cooling/reheating strategy for SA) affect running time a lot, and you do not even mention parameter values! b) "50%" column - it is not clear what value is there, and in what measurment units. 
18) p. 23 - "External Validity" - what do you mean by external validity? Because there is no clear problem statement in the article, and you do not define what kind of data you work with, it is very vague.

-- In this paper, we have used two data sets from NASA. Our techniques work well on this data. A third, non-NASA data set was incorporated to show that our results are not specific to NASA data. --

19) p. 24 - How does the optimization of code relate to the theme of the article? Is it a research article or an experiment report/application article? Everything below the words "Margins Analysis" in the conclusion is unintelligible or not connected with the main theme of the article - it should be explained for non-NASA readers or dropped (I suggest the latter).

-- The section on Margins Analysis has been dropped, as it is unrelated to the treatment learning work. --