Gmail - Tim Menzies
------------------------------------------------------------------------
lost: tse reviews
------------------------------------------------------------------------
David Owen - Sat, Jun 27, 2009 at 12:34 PM
To: tim@menzies.us

Tim,

Hope you're feeling better. Here are the reviews:

RE: TSESI-2008-10-0347, "The Benefits of Combining Diverse Model Translation and Verification Strategies"
Manuscript Type: Special Issue on Software Dependability

Dear Prof. Cukic,

We have completed the review process of the above-referenced paper for the IEEE Transactions on Software Engineering. Enclosed are your reviews. The Special Issue on Software Dependability Associate Editors, Dr. Goseva and Dr. Kanoun, have recommended to the Editor-in-Chief that your paper undergo a major revision. If you choose to revise your paper, please prepare a separate document describing how each of the reviewers' comments is responded to in your revision, and submit it within three months.

To revise your manuscript, log into https://mc.manuscriptcentral.com/tse-cs and enter your Author Center, where you will find your manuscript title listed under "Manuscripts with Decisions." Under "Actions," click on "Create a Revision." Your manuscript number has been appended to denote a revision. Once the revised manuscript is prepared, you can upload it and submit it through your Author Center.

When submitting your revised manuscript, you will be able to respond to the comments made by the reviewer(s) in the space provided. You can use this space to document any changes you make to the original manuscript. In order to expedite the processing of the revised manuscript, please be as specific as possible in your responses to the reviewer(s)' questions and comments. You may also upload your responses as separate files for review along with your revision. If you choose to do this, please choose "Summary of Changes" as the file designation.

IMPORTANT: Your original files are available to you when you upload your revised manuscript. Please delete any redundant files before completing the submission.

When the submission process is complete, you will receive an automated confirmation email immediately. If you do not receive that email, your submission is not yet complete. The journal's publication coordinator will contact you should we have any concerns or questions regarding your revision. Otherwise, your revision will be forwarded to the assigned associate editor with a request to begin the second round of reviews.

Our page limitation and formatting guidelines for TSE can be found at http://www.computer.org/portal/site/transactions/menuitem.eda2ca84d8d67764cfe79d108bcd45f3/index.jsp?&pName=transactions_level1&path=transactions/tse/mc&file=author.xml&xsl=article.xsl&#manuscript

Please do not hesitate to contact us should you have any questions about our process or experience technical difficulties. You can reach me at tse@computer.org. Thank you for your contribution to TSE, and we look forward to receiving your revised manuscript.

Thank you,
Mrs. Joyce Arnold
TSE Assistant
IEEE Transactions on Software Engineering
IEEE Computer Society
10662 Los Vaqueros Circle
Los Alamitos, CA 90720 USA
Voice: 714.821.8380
Fax: 714.821.9975
tse@computer.org

********************
Editor Comments

Dear Authors:

We have received four substantive reviews for your paper. The reviewers describe many valid concerns with the technical content that need to be remedied.
They also make some very useful suggestions for improvement of the paper and clarification of its technical contribution. Therefore, we are going to recommend that this paper undergo a major revision. Please revise the paper carefully, responding to the reviewers' recommendations and detailed comments. To keep the Special Issue on schedule, please submit the revised version of the paper by July 8, 2009 at the latest.

********************
Reviewer Comments

Reviewer: 1

Recommendation: Author Should Prepare A Minor Revision

Comments:
The paper advocates the use of multiple formal verification tools to adequately verify a formal model against requirements. The authors consider a formal specification written in SCR (a Personnel Access Control System), generate fault-seeded versions of the specification, and use several verification tools (Salsa, Cadence SMV, NuSMV, SPIN, Lurch) to verify properties/detect errors in these specifications. From the results of these experiments and the performance of the tools, they define a strategy for efficiently combining the tools.

The idea of combining different verification tools is a very interesting one and has not yet been explored much. The paper gives an informative example of what a verification strategy for an SCR specification could be; however, it remains only an example from which it is difficult to infer a generic approach, and more work still needs to be done for that purpose. The paper is clear and easy to read, provides sufficient information on the experiments, and includes interesting discussions. Related work is sufficient and appropriate. The paper could be improved on the following points.

* Your motivating examples did not really convince me and are even sometimes counterproductive. In the first one, the problem for me is really a translation problem, and it is indeed an important issue if you want to advocate the use of multiple tools working on different input languages. There is some work to be done on the semantics of the different languages used, the similarities, the differences, and what can be handled by which tool. You discuss this a bit in section 4, but I think it is a central issue in your approach. In the second example, the problem comes from the fact that the SCR specification was not correct. And the third example is not really convincing either. The general idea I got from this section is that because there can be mistakes at every level (spec, translation, tool), it is better to use several different tools to have a better chance of detecting and diagnosing the error. That is not false, but it is an argument more usually made for informal verification techniques, and it is much less convincing than the results you give on your experiment. So maybe this section should be improved.

* You propose a verification strategy based on the results you obtained on a single specification. Do you think the results would be the same for another SCR specification (with different characteristics) or different kinds of properties? In other words, does the strategy depend only on the capabilities and complementarity of the tools? It would be interesting to discuss this point.

* I agree with the general conclusion that it is interesting to combine different verification tools, but I also think that using even one formal verification tool can be complex for an industrial user, because formal tools in general lack support for everything that is not purely the verification itself (understanding of counterexamples, management of verification configurations, and so on).
A verification framework including several tools and verification strategies will be even more difficult for a non-specialist to use. I think there is really a lot of work to be done in this area.

Details:
p6: "patrial order" instead of "partial order"
p8: one word missing ("we showing")

How relevant is this manuscript to the readers of this periodical? Please explain under the Public Comments section below.: Very Relevant
Is the manuscript technically sound? Please explain under the Public Comments section below.: Yes
1. Are the title, abstract, and keywords appropriate? Please explain under the Public Comments section below.: Yes
2. Does the manuscript contain sufficient and appropriate references? Please explain under the Public Comments section below.: References are sufficient and appropriate
3. Please rate the organization and readability of this manuscript. Please explain under the Public Comments section below.: Easy to read
Please rate the manuscript. Explain your rating under the Public Comments section below.: Good

********************
Reviewer: 2

Recommendation: Author Should Prepare A Minor Revision

Comments:
Reader Interest - This paper describes the authors' experiences applying several verification tools to an SCR requirements model. This topic is relevant to TSE readers. The paper makes several interesting points, but also makes some claims I feel are quite misleading.

1) The authors state that the translation from SCR to SMV provided with the SCR toolset does not support the translation of assertions involving NEXT states. For this reason, only 9 of the 16 assertions in the original SCR specification are checked by (Nu)SMV. This may be true of the translator provided to the authors, but there is no reason why a translator cannot be written from SCR to SMV that supports such assertions. Both SCR and SMV are two-state models. At worst, this could be accomplished by adding a few additional state variables, as was done for SPIN (though it is only necessary to add additional state variables for those used in assertions involving NEXT, not for every variable in the specification). This would not be a problem except that the authors repeatedly imply that this is a shortcoming of (Nu)SMV, not the translator. For example, on page 22 they refer to "this limitation to the effectiveness of SMV"; on page 24 they state that "in our experimental framework only SPIN can be used to fully verify all 16 of the assertions"; on page 26 they state that "Cadence SMV and NuSMV could not check 6 of the 15 properties"; on page 28 they state that "the only tool capable of fully verifying all types of assertions present in an SCR specification is SPIN"; on page 34 they state that "SPIN ... is the only tool capable of fully verifying SCR specifications". Sometimes their statements are accompanied by qualifiers attributing the limitations to their translator, but just as often they are not. The overall effect is to convey the impression that SMV and NuSMV cannot verify properties involving more than a single state. Our experiences have been quite different: with an effective translator, (Nu)SMV should have been able to check all 16 properties, and, for a synchronous example as small as the one presented by the authors, in roughly the same time as that required for the single-state assertions.

In similar fashion, the translator provided to the authors from SCR to SPIN makes two copies of every variable in the model to support assertions using NEXT. This is not necessary: only the variables referenced using NEXT need to be duplicated.
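To make the encoding the reviewer describes concrete, here is a minimal illustrative sketch (in Python rather than SMV or SCR, and not taken from the paper's toolchain; the model, the variable names gate and request, and the property are all invented for this illustration). It shows how a two-state NEXT assertion can be checked as an ordinary one-state invariant once the model is augmented with shadow copies of only the variables the assertion reads in the previous state; the sketch checks that invariant by explicit-state reachability.

def step(gate, request):
    # Deterministic update of the controlled variable:
    # the gate unlocks exactly when a request is pending.
    return "unlocked" if request else "locked"

def successors(state):
    # Augmented state: (prev_gate, prev_request, gate, request).
    # Only gate and request get "prev_" shadows, because only they are
    # referenced in the old-state part of the assertion below.
    prev_gate, prev_request, gate, request = state
    next_gate = step(gate, request)
    for next_request in (False, True):          # nondeterministic input
        yield (gate, request, next_gate, next_request)

def invariant(state):
    # Original two-state assertion: "the gate may only become unlocked
    # if a request was pending"; rewritten over the shadow variables,
    # it needs no NEXT operator.
    prev_gate, prev_request, gate, request = state
    return not (prev_gate == "locked" and gate == "unlocked") or prev_request

def check():
    # Initial states: gate locked, request arbitrary; shadows mirror them.
    init = [("locked", r, "locked", r) for r in (False, True)]
    seen, frontier = set(init), list(init)
    while frontier:
        state = frontier.pop()
        if not invariant(state):
            return state                        # counterexample state
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return None

if __name__ == "__main__":
    bad = check()
    if bad:
        print("invariant violated in state:", bad)
    else:
        print("invariant holds on all reachable states")

Under this (assumed) scheme only the NEXT-referenced variables acquire shadows, so the augmented state space stays close to the original; that is the reviewer's point about the SCR-to-SMV encoding and about the SCR-to-SPIN translator as well.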
Duplicating every variable in this way greatly slows down SPIN, and the authors frequently comment on how slow SPIN is compared to the other verification tools. Again, it is not clear whether this is due to the translator or to SPIN. At the very least, the authors need to carefully rewrite these statements to make it perfectly clear that the limitations of the translators are not being attributed to (Nu)SMV and SPIN.

2) The authors state that (to the best of their knowledge) mutant specifications have not been used in the realm of formal verification. This is incorrect. Hanna Chockler wrote her dissertation on this topic, "Coverage Metrics for Temporal Logic Model Checking", and published a paper "Coverage Metrics for Temporal Logic Model Checking", Formal Methods in System Design, Volume 28, #3, 2006. Mats Heimdahl and Mike Whalen also used mutations and model checkers to check the effectiveness of varying test coverage criteria given different, functionally equivalent models in an ICSE 2008 paper: "The Effect of Program and Model Structure on MC/DC Test Adequacy Coverage".

3) I am suspicious of the authors' explanation of why SPEC and INVARSPEC are treated differently in NuSMV. The SPEC keyword in NuSMV has been deprecated for quite a while now in favor of CTLSPEC. It is unclear from the paper whether the authors were using NuSMV as a bounded model checker or as an implicit-state (BDD-based) model checker. That information would help me understand why SPEC was being improperly handled by NuSMV.

4) In general, the authors' conclusion that using several back-end tools to find translation errors is not a realistic one. Real users aren't going to have the luxury of verifying a specification several times. Having different tools catch different errors is only useful if different translators will be used for each verification tool and the same error will not be made in each translator. Even in this paper, they used part of the same translation chain for SPIN and Lurch. The real lesson here isn't for users to use multiple verification tools - it's for the tool developers to get their translators right and to make them generate efficient input code for the verification tools.

5) I can't figure out how to relate the numbers in the first paragraph of page 23 (lines 4-10) to Figure 8.

6) The authors claim on page 31, lines 29-33, that automatically generated models tend to be less efficient and use more memory than carefully handwritten models. This has not been our experience. We have found that well-written translators do an excellent job of producing efficient models for model checking, usually producing more efficient models than we would have produced by hand. In any case, the use of handwritten models is a non-starter. Models describing real systems are becoming so large that re-specifying them in the language of the model checker by hand will be prohibitively expensive. Even if a handwritten model were created, nobody would trust it to really describe the system being developed.

Presentation - Keywords
I'm not sure that the keywords "fault tolerance" and "software performance" are all that helpful here. Fault tolerance is usually associated with techniques to detect or mitigate faults in hardware; here the authors are using diverse tools to detect errors some of the tools miss. "Fault detection" might be more appropriate. Software performance is usually associated with techniques to improve the performance of software; here the authors are comparing the performance of several verification tools.
"Tool Performance" might be more appropriate. Presentation - References The authors need to cite the work of Hanna Chockler, Mats Heimdahl, and Mike Whalen mentioned above. The authors do not cite the work sponsored by the NASA Langely Research Center on model checking (http://shemesh.larc.nasa.gov/fm/fm-flightdeck.html ) conducted by both Honeywell and Rockwell Collins. The work by Rockwell Collins and the University of Minnesota on translating RSML, Simulink, and SCADE models to NuSMV and Prover are particularly relevant to their research. Presentation - Readability The English in the paper is excellent and it is generally very well written It does tend to get rather verbose, particularly in the conclusions. It would be easier to read if it were made shorter and tighter, but that's a fairly significant rewrite. Evaluation This paper is one of the more difficult papers I've had to evaluate. It is well written and is on a highly relevant topic. However, the quality of the translators used (which the authors may not have had a choice about) greatly weaken the contribution of the paper. If good translators had been available, this paper could have presented some useful comparisons of SMV, NuSMV, SPIN, and Salsa. However, the results are so dominated by the quality of translators that its hard to conclude anything useful about the individual verification tools. The authors conclusions that diverse verification strategies are necessary is one that I agree with, but for different problem domains, not to compensate for tool deficiencies. The real lesson here is the importance of designing tools (particularly translators) that are correct and generate efficient specifications for the verification tools. If real users have to go to the time and expense of using multiple verification tools to compensate for translator errors and shortcomings, they're simply going to revert back to testing. The authors do make a valid point that it may be more efficient to use tools that are fast first to weed out errors those tools can find, but only if this speeds up the use of subsequent tools to find other classes of errors. If not, you might as well just use the slower but more complete tool in the ir - its acceptable, but not of the level I would expect in an IEEE TSE paper. If the paper is accepted, the authors should attempt to rewrite all sections that make statements about (Nu)SMV and SPIN to make perfectly clear that many of the results are due to the translators used, not the verification tools themselves. I would also recommend shortening the conclusions sections. How relevant is this manuscript to the readers of this periodical? Please explain under the Public Comments section below.: Very Relevant Is the manuscript technically sound? Please explain under the Public Comments section below.: Partially 1. Are the title, abstract, and keywords appropriate? Please explain under the Public Comments section below.: No 2. Does the manuscript contain sufficient and appropriate references? Please explain under the Public Comments section below.: Important references are missing; more references are needed 3. Please rate the organization and readability of this manuscript. Please explain under the Public Comments section below.: Easy to read Please rate the manuscript. 
Please rate the manuscript. Explain your rating under the Public Comments section below.: Fair

********************
Reviewer: 3

Recommendation: Author Should Prepare A Minor Revision

Comments:
Summary: The paper looks at the issue of using various verification and testing tools on the same example to determine how such tools can achieve the best synergy. Specifically, the authors use an SCR model of an access controller and analyze it with an invariant checker, two related symbolic model checkers, an explicit-state model checker, and lastly a random testing tool. The ultimate conclusion of the work is that the combination of tools works better than any one individual tool. The authors present a methodology for using the tools in sequence to achieve optimal results in terms of time; memory was not really an issue in any of the experiments presented.

Verdict: I support acceptance with some revisions outlined below in the Evaluation section.

For and Against
+ Well written, easy to read and technically sound.
+ Interesting results for others in the field
- Scope of the results is limited
- Title does not reflect this limited scope
- Need more details on the tool performance

Evaluation: This is a nicely written paper that is easy to read and follow. The main drawback as I see it is that the results have very limited scope: they pertain only to analyzing SCR models using the translation tools available for Salsa, (Nu)SMV, SPIN and Lurch. If one wants to be even more specific, they really pertain to only one example as well. The authors must change the title to reflect the true scope of this work: adding "…for SCR" to the end of the title would be the minimal change required. What is really done here is to compare the different translation tools for the back-end analyzers. In most previous studies that compared tools, the focus is on the analyzers, but here the focus is almost exclusively on the translators. The reason why the tools are not actually being analyzed is that, unlike in other studies, the example is "small" enough that all the tools can analyze it, although SPIN could not do that with its vanilla settings and had to be helped along a bit.

What I would like to see added to the paper is a more in-depth discussion of the differences in behavior of the tools. For example, the SMVs could not find errors that SPIN could. What was the reason for this? Was it due to the fact that the SPIN errors were spurious from not considering NATURE constraints, was it due to the properties requiring NEXT and thus not being suitable for SMV, or was there a fundamental issue with SMV (for example, maybe it needed some fine-tuning, as was done for SPIN, to find more errors)? Why did Lurch miss some errors that SPIN found? Exactly what made these hard to find for the random search? This is a question many people are interested in! How many of the SPIN errors were spurious due to not looking at NATURE etc., and what made them harder to find?

Some more specific issues (I will give page and line numbers in the P5L34(-39) format):
1. P4L17 Why is it that one would want random search to confirm the results of model checking? It seems impossible when there are no errors and unnecessary when there is an error.
2. P8L38 "disconfirmed" should be "discarded".
3. P8L36-42 There is a whole body of research devoted to vacuity checking that addresses the shortcoming of not having any feedback when a model checker claims a formula holds. Please make reference to this work; it is mostly for symbolic model checking though.
4. P8L47 "we showing that a violation reported" - typo; should be "...show...".
5. P16L13-31 For Salsa, what are generic disjointness and coverage properties? I think the former of the two is mentioned somewhere, but not the latter. Please state it here too so that this section becomes more self-contained. Also, give a reason and an example of why only two states are given as a counterexample. Later on it is alluded to that it is hard to determine whether a failure is spurious or not - why?
6. P17L10 Typo, extra ")" at the end of the paragraph.
7. P17L17 "Artefact" should be "artifact".
8. P17L19-20 Are you sure such an experiment has not been performed in verification? I'm pretty sure lots of mutation-related experiments have been done in verification, but to be honest I can only think of a Paul Black paper at ASE around 2000 that used mutation, though not exactly in this fashion.
9. P18L30 Typo, "user' name" should be "user's name".
10. P20L22-33 This whole paragraph should be rewritten to make it clearer. I think I know what you are trying to say, but it is not overly obvious. It is also not obvious that getting the same coverage after mutation is even relevant; isn't that a function of the test suite more than the mutated program?
11. P20L38 Typo, "In SCR Toolset" should be "In the SCR toolset".
12. P21L20-41 What are the assertions and what are the generic properties used, and where do they come from? Also, why is it necessary to subdivide into these categories, since it seems this plays little or no role in the rest of the paper? It is stated that 16 assertions are checked in the rest of the paper, so no generic properties?
13. P22L3-21 In Fig. 8, the one error on the right that Lurch never found - was that an actual error, and why did Lurch not find it? The same really goes for the one found only once: what made it so hard to find?
14. P26L48 Is this the only time the number of assertions is referred to as 15 and not 16?
15. P30L15-19 How many of the errors found by SPIN/Lurch were due to not having the NATURE constraints, i.e., are false positives?
16. P34L24-26 What were these errors that were so hard to find for SPIN, and why were they hard to find?

How relevant is this manuscript to the readers of this periodical? Please explain under the Public Comments section below.: Relevant
Is the manuscript technically sound? Please explain under the Public Comments section below.: Yes
1. Are the title, abstract, and keywords appropriate? Please explain under the Public Comments section below.: No
2. Does the manuscript contain sufficient and appropriate references? Please explain under the Public Comments section below.: References are sufficient and appropriate
3. Please rate the organization and readability of this manuscript. Please explain under the Public Comments section below.: Easy to read
Please rate the manuscript. Explain your rating under the Public Comments section below.: Good

********************
Reviewer: 4

Recommendation: Reject

Comments:
The authors present work in which they attempt to verify a small model expressed in the SCR modeling notation using several diverse verification tools. From their experiences, they attempt to provide an approach where the diverse tools are used together to complement each other, in an attempt to take advantage of the strengths while eliminating the weaknesses. They also provide some performance data for the tools. Regrettably, the experiences in this paper do not support the conclusions, and the approach proposed really has nothing to do with the verification techniques used.
First, the study performed exposed no problems with the verification tools. It exposed several faults with either the translation or the assumptions made about the environment or the semantics of the language being translated. Thus, the title and general direction of the paper are quite misleading. What they are arguing for in this work is really n-version programming in the translation portion of the verification tool chain. They could have gotten exactly the same results with several translators targeting the same verification tool. The paper does not discuss this aspect of the problem, and there is no reference to the n-version programming studies by John Knight and Nancy Leveson back in the 80s.

Second, the performance data is misleading at best. The translation of a model to the input language of a model checker is hugely important for the performance of the analysis. As an example, in our own work we have seen an order of magnitude performance difference between two different translations from one language to the same model checker. Thus, performing a performance evaluation of the model checkers based on arbitrary translations makes little sense. In addition, the verification can be highly sensitive to how the properties are structured, as well as to the settings of the model checker - for example, performing a backward versus a forward search in a symbolic model checker, or reusing a previous variable ordering for the BDDs (either by reading it from a file or by verifying many properties in the same run). In addition, the analysis is only performed on one example, providing no statistical significance. In sum, the performance results are not telling us much.

These two weaknesses do, in my opinion, make the paper unpalatable for TSE. In addition, there are quite a few editorial and clarification issues that should be resolved.

Page 2: You mention that something is to be reduced "as much as possible"; that is an ambitious goal… I think the goal is typically to reduce something to an acceptable level.

Middle of page 3: You mention that complementary performance implies that a problem will be easy for at least one tool. I see no evidence of this anywhere. Scalability and undecidability are the bane of verification and, I assert, a problem too large for one tool is very likely to be too large for all other tools also. There are problems that do not scale in one tool but scale well in another, but I do not think that is the norm. This paper did nothing to change that view.

The related work section is highly unfocused and is difficult to follow. I suggest you reorganize it around the various verification technologies (explicit state, symbolic, SMT-based, etc.). You also have some inappropriate statements in this section that you do not back up. For example, you claim SMV is the best implementation of a symbolic model checker; I am pretty sure there are several groups with a different opinion (Cadence being one of them).

How relevant is this manuscript to the readers of this periodical? Please explain under the Public Comments section below.: Relevant
Is the manuscript technically sound? Please explain under the Public Comments section below.: No
1. Are the title, abstract, and keywords appropriate? Please explain under the Public Comments section below.: No
2. Does the manuscript contain sufficient and appropriate references? Please explain under the Public Comments section below.: References are sufficient and appropriate
3. Please rate the organization and readability of this manuscript. Please explain under the Public Comments section below.: Readable - but requires some effort to understand
Please rate the manuscript. Explain your rating under the Public Comments section below.: Poor

>>> Tim Menzies > 06/26/09 9:55 PM >>>
dear david, can you send along the reviews on the tse article? my email has EAtimm?
[Quoted text hidden]
------------------------------------------------------------------------