I write to protest, in the strongest possible terms, at the reviews of my recent submission to TKDE. While there are certain comments of merit in the reviewer remarks, on the whole the reviewers missed the point. To some extent, the problem lies in the presentation of the paper. My goal was to produce a short TKDE paper, so perhaps I was too terse in some areas. A rewrite of this paper, perhaps with two extra pages, is absolutely required, and that rewrite would address certain issues raised by the reviewers. Nevertheless, I strongly believe that the reviewers fixated on surface features of the paper and never really considered the paper's main message.

----------------------------------

Reviewer one remarks: "However, this interesting idea has not been studied gracefully enough to justify its publication." In this regard, I just don't understand "gracefully enough". The paper presents algorithms, discusses related work, then analyzes the performance of those algorithms on dozens of data sets (20 from UC Irvine, some KDD cup data, and an F18 flight simulator data set). Limitations of the analysis are then discussed. What more does Reviewer One want?

----------------------------------

Reviewer one also comments: "There exist a large number of discretization methods for naive-Bayes as well as concept drift learning algorithm. However no empirical results are presented that compare SPADE or SAWTOOTH against its alternatives. This leaves readers wonder why they can be claimed to work ``very well``."

This remark makes no sense at all. In the paper we state: "Provost and Kolluri [2, p22] comment that sequential learning strategies like windowing usually performs worse than learning from the total set". In Figure 3 we compare our incremental method (which suffers from the Provost and Kolluri warning) to a widely used non-incremental method (kernel estimation) and we do (nearly) as well as kernel estimation. But kernel estimation requires N passes through the data, so it won't scale to large data sets. Our method requires one pass and has a low memory footprint, so it will scale to very large data sets. If our system does as well as a non-incremental scheme, why do we need to empirically compare it against the LOWER baseline of other incremental schemes? To be fair to reviewer one, the current text does not place enough stress on the above point. This could be fixed in a rewrite.

-| 1 |--------------------------------

Reviewer one also comments: "In the conclusion section, the paper claims that one advantage is that ``In Figiure 3... This discretizer performed nearly as well as other discretization methods without requiring multiple passes through the data``. However, in Figure 3, SPADE is only compared with naive-Bayes with kernel estimation, which does not involve discretization at all. Where is the conclusion drawn from then?"

The following point is not stressed in the paper and could be fixed in a rewrite. The paper references the following, widely cited, publication: James Dougherty, Ron Kohavi, and Mehran Sahami, "Supervised and unsupervised discretization of continuous features", in International Conference on Machine Learning, 1995, pp. 194-202.

XXX andres: do we compare our stuff with n-bins etc?
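(For orientation, by "n-bins" we mean the simple unsupervised equal-width baseline surveyed by Dougherty et al. A minimal sketch of that baseline follows; the function names are ours, for illustration only, and are not from the paper:)

    # Minimal sketch of the unsupervised equal-width "n-bins" baseline surveyed
    # by Dougherty, Kohavi and Sahami (1995). Function names are ours, for
    # illustration only. Note that it needs the whole column up front, which is
    # exactly what a one-pass scheme cannot assume.
    def n_bins_cutpoints(values, n=10):
        lo, hi = min(values), max(values)
        width = (hi - lo) / n or 1.0          # guard against a constant column
        return [lo + i * width for i in range(1, n)]

    def to_bin(x, cutpoints):
        # Index of the first cut point that x falls below (else the last bin).
        for i, cut in enumerate(cutpoints):
            if x < cut:
                return i
        return len(cutpoints)

Such a baseline needs the full range of the data before it can place a single cut point, which is the very assumption our one-pass setting drops.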
-| 2 |---------------------------------

Reviewer one comments: "The understating of (naive) Bayes classifiers is far less than accurate. In the first paragraph on page 6, it is said that ``Bayes classifiers are called naive``. This expression is misleading. Bayes classifiers have a very big family. Naive Bayes is only one member out of it. Nobody calls Bayes classifiers naive except for naive Bayes."

Reviewer one is being very unkind here. Many authors use the term Naive Bayes: Langley, Pazzani and Domingos, and the whole WEKA team. Domingos and Pazzani comment: "The classifier obtained by using this set of discriminant functions, and estimating the relevant probabilities from the training set, is often called the naive Bayesian classifier."

@misc{domingos97optimality,
  author = "P. Domingos and M. Pazzani",
  title  = "On the optimality of the simple Bayesian classifier under zero-one loss",
  text   = "Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103--130.",
  year   = "1997",
  url    = "citeseer.ifi.unizh.ch/domingos97optimality.html"
}

We considered adopting the Domingos and Pazzani renaming (call them "Simple" not "Naive") but a review of public domain sources showed that "Naive" was a more common term than "Simple" (e.g. Wikipedia has an entry for "naive Bayes" but not for "simple Bayes"). So we stayed with the common parlance.

-| 3 |---------------------------------

Reviewer one then comments: "The paper then goes on by saying ``since they assume that the frequencies of different attributes are independent``. This statement is wrong. Instead, naive Bayes' `attribute independence assumption` is: ``attributes are independent of each other given the class``."

This is a minor typo and we thank reviewer one for that correction. Our understanding of the class dependencies is clearly shown in fig 2 (the classify function, where the frequency counts from different class hypotheses are added to separate parts of the frequency table).
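To make the per-class counting point concrete, the following is a minimal sketch of the kind of incremental train/classify loop we have in mind. It is an illustration only (the class and method names below are ours), not the exact pseudocode of fig 2:

    # Minimal sketch of an incremental Naive Bayes learner. Frequency counts
    # for each class hypothesis live in their own table, and the independence
    # assumption is applied only GIVEN the class. Illustrative, not fig 2.
    from collections import defaultdict

    class SimpleBayes:
        def __init__(self):
            self.n = 0                                    # total instances seen
            self.class_n = defaultdict(int)               # instances per class hypothesis
            self.freq = defaultdict(lambda: defaultdict(int))  # per-class (attr, value) counts

        def train(self, attrs, klass):
            # attrs: dict mapping attribute name -> (discretized) value
            self.n += 1
            self.class_n[klass] += 1
            for a, v in attrs.items():
                self.freq[klass][(a, v)] += 1

        def classify(self, attrs):
            best, best_like = None, 0.0
            for klass, cn in self.class_n.items():
                like = cn / self.n                        # prior for this class hypothesis
                for a, v in attrs.items():                # independence assumed GIVEN the class
                    like *= (self.freq[klass][(a, v)] + 1) / (cn + 2)  # simple Laplace smoothing
                if like > best_like:
                    best, best_like = klass, like
            return best

The point is simply that nothing here assumes attributes are independent of each other in general; the conditional frequencies are always taken per class.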
-| 4 |---------------------------------

Reviewer one then comments: "SPADE is interesting since it does not need to repeat scanning the data. This will be useful in applications where one can not retain the whole historical data. However, there are two potential pitfalls that the paper fails to address:

>>> first on the merge mechanism. It produces new cut points from the old cut points. For example, the old discretization of age is (, [30, 39], [40, 49], ). Merging the two intervals will still retain the old cut points like 30 and 49. But what if should the appropriate cut points be 35 and 45 instead?

>>> second on lacking a split mechanism. Although the paper has mentioned it is because ``do not know how to best divide up a bin without keeping per-bin data`` and ``experiments suggested that adding SubBins=5 new bins between old ranges and newly arrived out-of-range values was enough to adequately divide the range``, those arguments can not trade-off the need of a split operator. For example, the instances are patients coming into a clinic one after the other. The first one is an infant while the second one is an old lady. In the two first instances, one has seen the two far ends of the age attribute [1, 90]. SPADE will produce 1+5 intervals by now and forever (assume the oldest is 90 years old). The reason behind this sub-optimality is that the attribute values do not necessarily gradually change, they can abruptly shift."

How does Reviewer One reconcile their theoretical concerns with SPADE and our experimental results? Is the reviewer saying that our experimental methods are somehow in error? We would be more than happy to supply more information on those experiments. But we should add that when we first designed SPADE, we shared the above concerns. However, on experimentation (and those experiments are clearly described in the paper), those concerns turned out to be irrelevant. EVERY learning method has a search bias and, once that bias is known, it is possible to create an example that defeats that method (as Reviewer One does in the above paragraph). For example:

** Naive Bayes (which we will call "Simple Bayes" in future drafts) assumes independence between attributes (given the same class) and, with that knowledge, it is possible to devise examples that confuse that classifier. However, in practice, those kinds of examples have yet to be seen in naturally occurring data sets. In fact, that algorithm works astonishingly well:

** witness the good performance of naive Bayes shown in the above Domingos and Pazzani paper;

** see also our own experiments at http://www.cs.pdx.edu/~timm/scant.org/2/xval.html.

So, to assess a learner (or a discretization scheme), it is not enough merely to conduct small experiments on one made-up example. Instead, we need to explore real-world data in all its glorious complexity. And this is what this paper does (see fig 3 of our paper).

-| 5 |------------------------

Reviewer One says: "3. The paper mentions the MaxBins parameters is by default set to be the square root of all the instances seen to date. If the paper wants to justify this setting, it may help by citing a causal paper: Ying Yang and Geoff Webb, Proportional k-interval discretization for naive-Bayes classifiers, ECML 2001."

This remark is incredible. Did Reviewer One give this paper any more than a cursory read? We are well aware of the Yang and Webb work. In the current draft of the paper, we even cite a paper that is NEWER than the one mentioned by reviewer one: Y. Yang and G. Webb, "Weighted proportional k-interval discretization for naive-Bayes classifiers", in Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2003), 2003. Available from http://www.cs.uvm.edu/yyang/wpkid.pdf.
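For readers who want the flavour of the one-pass behaviour under discussion (SubBins=5 seeding of out-of-range values, no per-bin data, merging down towards MaxBins = the square root of instances seen so far), a rough sketch follows. This is our illustration only, under our own guess about the merge policy; it is NOT the SPADE algorithm as published.

    # Rough sketch only, in the spirit of (but not identical to) SPADE:
    # one pass, no per-bin data, SubBins new cut points added when a value
    # arrives outside the known range, and neighbouring cut points merged
    # once the count exceeds MaxBins = sqrt(instances seen so far).
    # The merge policy (drop one of the two closest cut points) is our guess.
    import math

    class OnePassDiscretizer:
        def __init__(self, sub_bins=5):
            self.sub_bins = sub_bins
            self.cuts = []           # sorted cut points defining the bins
            self.seen = 0

        def add(self, x):
            self.seen += 1
            if not self.cuts:
                self.cuts = [float(x)]
            elif x < self.cuts[0] or x > self.cuts[-1]:
                old = self.cuts[0] if x < self.cuts[0] else self.cuts[-1]
                step = (x - old) / self.sub_bins
                self.cuts.extend(old + i * step for i in range(1, self.sub_bins + 1))
                self.cuts.sort()
            max_bins = max(2, int(math.sqrt(self.seen)))
            while len(self.cuts) > max_bins:
                gaps = [(self.cuts[i + 1] - self.cuts[i], i) for i in range(len(self.cuts) - 1)]
                _, i = min(gaps)     # merge: drop one of the two closest cut points
                del self.cuts[i + 1]

        def bin(self, x):
            # Index of the interval that x falls into.
            return sum(1 for c in self.cuts if x >= c)

The sketch is only meant to make the earlier discussion of SubBins and MaxBins easier to follow; the empirical question of whether such a scheme hurts classification accuracy on real data is what Figure 3 of the paper addresses.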
-| 6 |--------------------------

Reviewer Two says: "However, we now know that, roughly speaking, getting 90% of the best possible performance is quite easy, but getting that last 10% can be quite hard. Therefore, the results on the KDDCUP dataset presented in this paper are not surprising. They're close to, but not as good as, the results from the winning system which was much more complicated."

This comment is a misreading of the paper. Firstly, we aren't showing a 90-10 rule using simple methods. We are showing more like 99-1 results using very, very simple methods:

** the average difference between schemes in Figure 3 is -1.1%;

** there is barely any difference between SPADE/SAWTOOTH and the KDD cup winner in fig 4;

** in fig 5, we are within 3% of standard methods on UCI data.

-| 7 |--------------------------

Reviewer Two says: "The observations in section II on finding plateaus, and the method used, do not seem to constitute a novel contribution. As the authors acknowledge, the fact that relatively few instances often suffice has been noticed by others before. Figure 1 confirms this observation yet again."

No, this is not a novel contribution. The paper does not claim that it is. But it does set the stage for the rest of the paper. Assuming plateaus, we don't need to work on mega-induction. Instead, for domains where the data-generating phenomena change SLOWER than time-to-plateau, the induction problem just becomes "learn what you can till plateau, disable learning while on the plateau, then reactivate learning only if you fall off the plateau". And once that is clear, the next thing that follows is that standard methods, with minor modifications, will scale to very large data sets. The above paragraph was the line of reasoning that led to this paper. So we tried the SIMPLEST method we could think of (Naive Bayes) and it worked very well.

-| 8 |-------------------------------------

Reviewer 2 comments: "The use of sliding windows to deal with non-stationarity is not new, though the use of equation 1 to control window growth may be. However, that equation is presented without discussion as to its derivation and appears to be ad hoc. That's not necessarily a bad thing, but some discussion of why equation 1 is expected to be useful is in order."

Equation 1 comes from standard sampling theory: it just compares the means of the SAME phenomenon at different times (whereas a Student t-test compares the means of different phenomena). We can spell this out more in a longer draft.
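To be explicit about what "standard sampling theory" means here, the comparison has the familiar textbook shape below. This is given only as orientation and is not necessarily the exact form of Equation 1 in the paper:

    \left|\,\bar{x}_1 - \bar{x}_2\,\right| \;\le\; z_{\alpha}\,\sigma\,\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}

where \bar{x}_1 and \bar{x}_2 are the means of the same measure (e.g. classification accuracy) taken at two different times, n_1 and n_2 are the corresponding sample sizes, \sigma is the pooled standard deviation, and z_{\alpha} is the usual normal critical value. Roughly, the window is allowed to grow while the two means remain indistinguishable in this sense.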
-| 9 |-------------------------------------

Reviewer 2 comments: "Section IV is just a review of NB, and section V presents SPADE. Figure 3 suggests that SPADE performs roughly as well as John and Langley's method, which is true of a large number of other discretization methods. There's nothing particularly new or insightful about the approach."

We don't understand this remark at all. Using the simplest method imaginable, we work as well as widely used methods. Better yet, we scale up to large data sets (because we are one-pass). And even better than that, we get a built-in confidence measure on our learners: our learners produce conclusions and (if we track the average likelihood, as in Figure 7) we get a second measure showing how much we can trust those conclusions.

----------------------------------
The TKDE decision letter and the full reviews follow.
----------------------------------

RE: TKDE-0074-0305, "Incremental Discretization and Bayes Classifiers Handles Concept Drift and Scales Very Well"
Manuscript Type: Concise

Dear Dr. Menzies,

We have completed the review process of the above referenced paper that was submitted to the IEEE Transactions on Knowledge and Data Engineering. Enclosed are your reviews. We hope that you will find the editor's and reviewers' comments and suggestions helpful.

I regret to inform you that based on the reviewer feedback, Associate Editor, Dr. Qiang Yang could not recommend publishing your paper to our Editor-in-Chief. Final decisions on acceptance are based on the referees' reviews and such factors as restriction of space, topic, and the overall balance of articles.

We hope that this decision does not deter you from submitting to us again. Thank you for your interest in the IEEE Transactions on Knowledge and Data Engineering.

Sincerely,

Ms. Susan Miller
Transactions Assistant
IEEE Computer Society
10662 Los Vaqueros Circle
Los Alamitos, CA 90720
USA
tkde@computer.org
Phone: +714.821.8380
Fax: +714.821.9975

***********
Editor Comments

Reviewer 2 raised serious concern over the novelty of the work as well as provided many good suggestions (same as reviewer one). On the basis of their reviews, I have to recommend rejection of the paper.

***********************

Reviewer Comments

Please note that some reviewers may have included additional comments in a separate file. If a review contains the note "see the attached file" under Section III A - Public Comments, you will need to log on to Manuscript Central to view the file. After logging in to Manuscript Central, enter the Author Center. Then, click on Submitted Manuscripts and find the correct paper and click on "View Letter". Scroll down to the bottom of the decision letter and click on the file attachment link. This will pop-up the file that the reviewer included for you along with their review.

***********************
Reviewer 1

Section I. Overview

A. Reader Interest

1. Which category describes this manuscript?

( ) Practice / Application / Case Study / Experience Report
(X) Research / Technology
( ) Survey / Tutorial / How-To

2. How relevant is this manuscript to the readers of this periodical? Please explain your rating under IIIA. Public Comments.

( ) Very Relevant
(X) Relevant
( ) Interesting - but not very relevant
( ) Irrelevant

B. Content

1. Please explain how this manuscript advances this field of research and / or contributes something new to the literature. Please explain your answer under IIIA. Public Comments.

2. Is the manuscript technically sound? Please explain your answer under IIIA. Public Comments.

( ) Yes
( ) Appears to be - but didn't check completely
( ) Partially
(X) No

C. Presentation

1. Are the title, abstract, and keywords appropriate? Please explain your answer under IIIA. Public Comments.

( ) Yes
(X) No

2. Does the manuscript contain sufficient and appropriate references? Please explain your answer under IIIA. Public Comments.

( ) References are sufficient and appropriate
(X) Important references are missing; more references are needed
( ) Number of references are excessive

3. Does the introduction state the objectives of the manuscript in terms that encourage the reader to read on? Please explain your answer under IIIA. Public Comments.

(X) Yes
( ) Could be improved
( ) No

4. How would you rate the organization of the manuscript? Is it focused? Is the length appropriate for the topic? Please explain your answer under IIIA. Public Comments.

( ) Satisfactory
(X) Could be improved
( ) Poor

5. Please rate and comment on the readability of this manuscript. Please explain your answer under IIIA. Public Comments.

( ) Easy to read
(X) Readable - but requires some effort to understand
( ) Difficult to read and understand
( ) Unreadable

Section II. Summary and Recommendation

A. Evaluation

Please rate the manuscript. Please explain your answer under IIIA. Public Comments.

( ) Award Quality
( ) Excellent
( ) Good
(X) Fair
( ) Poor

B. Recommendation

Please make your recommendation. Please explain your answer under IIIA. Public Comments.

( ) Accept with no changes
( ) Author should prepare a minor revision
(X) Author should prepare a major revision for a second review
( ) Reject
Section III. Detailed Comments

A. Public Comments (these will be made available to the author)

Incremental discretization is enchanting when put into the context of concept drift. However, this interesting idea has not been studied gracefully enough to justify its publication. [Author's note: I don't understand "gracefully enough".]

The paper title claims that incremental discretization and Bayes classifiers handle concept drift very well. There exist a large number of discretization methods for naive-Bayes as well as concept drift learning algorithm. However no empirical results are presented that compare SPADE or SAWTOOTH against its alternatives. This leaves readers wonder why they can be claimed to work ``very well``. [Author's note: there are numerous experiments comparing SPADE/SAWTOOTH against its alternatives.]

In the conclusion section, the paper claims that one advantage is that ``In Figiure 3... This discretizer performed nearly as well as other discretization methods without requiring multiple passes through the data``. However, in Figure 3, SPADE is only compared with naive-Bayes with kernel estimation, which does not involve discretization at all. Where is the conclusion drawn from then?

The understating of (naive) Bayes classifiers is far less than accurate. In the first paragraph on page 6, it is said that ``Bayes classifiers are called naive``. This expression is misleading. Bayes classifiers have a very big family. Naive Bayes is only one member out of it. Nobody calls Bayes classifiers naive except for naive Bayes. The paper then goes on by saying ``since they assume that the frequencies of different attributes are independent``. This statement is wrong. Instead, naive Bayes' `attribute independence assumption` is: ``attributes are independent of each other given the class``.

SPADE is interesting since it does not need to repeat scanning the data. This will be useful in applications where one can not retain the whole historical data. However, there are two potential pitfalls that the paper fails to address:

>>> first on the merge mechanism. It produces new cut points from the old cut points. For example, the old discretization of age is (, [30, 39], [40, 49], ). Merging the two intervals will still retain the old cut points like 30 and 49. But what if should the appropriate cut points be 35 and 45 instead?

>>> second on lacking a split mechanism. Although the paper has mentioned it is because ``do not know how to best divide up a bin without keeping per-bin data`` and ``experiments suggested that adding SubBins=5 new bins between old ranges and newly arrived out-of-range values was enough to adequately divide the range``, those arguments can not trade-off the need of a split operator. For example, the instances are patients coming into a clinic one after the other. The first one is an infant while the second one is an old lady. In the two first instances, one has seen the two far ends of the age attribute [1, 90]. SPADE will produce 1+5 intervals by now and forever (assume the oldest is 90 years old). The reason behind this sub-optimality is that the attribute values do not necessarily gradually change, they can abruptly shift.
In the second to last paragraph of Section V, the paper claims that SPADE is good because it outperforms dealing with numeric attributes by normal or kernel probability estimation. However, the observation that discretization is better than probability estimation has long been established. Mentioning it here only again proves that discretization is better, but not that SPADE itself is good discretization. A much convincing way is to compare SPADE with peer discretization methods.

At the end of section C in experiments, it is said that ``SAWTOOTH can retain knowledge of old contexts and reuse that knowledge when contexts re-occur``. But the paper does not mention before any mechanism to retain old concepts or identify re-appearing concepts at all. How did this achievement happen then?

Other minor comments:

1. Is WASTOOTH a method newly proposed in this paper or it is only reused by this paper? It does not hurt to clarify this point. If it is new, should emphasize more; if not, should give a reference.

2. At the end of this paper, in the conclusion section, the term ``V & V`` agent is mentioned for the first time. What does it mean?

3. The paper mentions the MaxBins parameters is by default set to be the square root of all the instances seen to date. If the paper wants to justify this setting, it may help by citing a causal paper: Ying Yang and Geoff Webb, Proportional k-interval discretization for naive-Bayes classifiers, ECML 2001.

***********************
Reviewer 2

Section I. Overview

A. Reader Interest

1. Which category describes this manuscript?

(X) Practice / Application / Case Study / Experience Report
( ) Research / Technology
( ) Survey / Tutorial / How-To

2. How relevant is this manuscript to the readers of this periodical? Please explain your rating under IIIA. Public Comments.

( ) Very Relevant
(X) Relevant
( ) Interesting - but not very relevant
( ) Irrelevant

B. Content

1. Please explain how this manuscript advances this field of research and / or contributes something new to the literature. Please explain your answer under IIIA. Public Comments.

2. Is the manuscript technically sound? Please explain your answer under IIIA. Public Comments.

(X) Yes
( ) Appears to be - but didn't check completely
( ) Partially
( ) No

C. Presentation

1. Are the title, abstract, and keywords appropriate? Please explain your answer under IIIA. Public Comments.

(X) Yes
( ) No

2. Does the manuscript contain sufficient and appropriate references? Please explain your answer under IIIA. Public Comments.

(X) References are sufficient and appropriate
( ) Important references are missing; more references are needed
( ) Number of references are excessive

3. Does the introduction state the objectives of the manuscript in terms that encourage the reader to read on? Please explain your answer under IIIA. Public Comments.

(X) Yes
( ) Could be improved
( ) No

4. How would you rate the organization of the manuscript? Is it focused? Is the length appropriate for the topic? Please explain your answer under IIIA. Public Comments.
(X) Satisfactory
( ) Could be improved
( ) Poor

5. Please rate and comment on the readability of this manuscript. Please explain your answer under IIIA. Public Comments.

(X) Easy to read
( ) Readable - but requires some effort to understand
( ) Difficult to read and understand
( ) Unreadable

Section II. Summary and Recommendation

A. Evaluation

Please rate the manuscript. Please explain your answer under IIIA. Public Comments.

( ) Award Quality
( ) Excellent
( ) Good
(X) Fair
( ) Poor

B. Recommendation

Please make your recommendation. Please explain your answer under IIIA. Public Comments.

( ) Accept with no changes
( ) Author should prepare a minor revision
( ) Author should prepare a major revision for a second review
(X) Reject

Section III. Detailed Comments

A. Public Comments (these will be made available to the author)

This paper describes SAWTOOTH and SPADE - the former is an implementation of a Naive Bayes (NB) classifier that does windowing on the input data, and the latter is a one-pass discretization algorithm. It is a bit difficult to ascertain the contribution of the paper. It could be, and the introduction leads one to believe that the authors consider it to be at least in part, the observation that simple systems can perform well on large datasets (such as the 1999 KDDCUP dataset). When Rob Holte made this observation over a decade ago, it was surprising to many. However, we now know that, roughly speaking, getting 90% of the best possible performance is quite easy, but getting that last 10% can be quite hard. Therefore, the results on the KDDCUP dataset presented in this paper are not surprising. They're close to, but not as good as, the results from the winning system which was much more complicated.

The observations in section II on finding plateaus, and the method used, do not seem to constitute a novel contribution. As the authors acknowledge, the fact that relatively few instances often suffice has been noticed by others before. Figure 1 confirms this observation yet again. Also, there's a paper from KDD by Provost, Jensen, and Oates on progressive sampling in which issues related to determining when learning curves have plateaued that's relevant. The problem is fairly difficult.

The use of sliding windows to deal with non-stationarity is not new, though the use of equation 1 to control window growth may be. However, that equation is presented without discussion as to its derivation and appears to be ad hoc. That's not necessarily a bad thing, but some discussion of why equation 1 is expected to be useful is in order.

Section IV is just a review of NB, and section V presents SPADE. Figure 3 suggests that SPADE performs roughly as well as John and Langley's method, which is true of a large number of other discretization methods. There's nothing particularly new or insightful about the approach.

Finally, the experiments are mostly done well, though there is a complete lack of information about variance in the paper. Are any of the results statistically significant?
I suspect in the end that the answer may not be relevant - some results will be, and some won't, and SAWTOOTH/SPADE will enter the pack of other algorithms/approaches that exhibit similar behavior, though on different datasets. There is no free lunch in machine learning.

Section VI-C describes an experiment in which the ability of SAWTOOTH to deal with concept drift is explored. However, very little information about the simulator is provided and, more significantly, the paper never says precisely how SAWTOOTH "retain[s] knowledge of old contexts".

In summary, there's nothing really new in this paper.