Measures, Metrics and Indicators Derived from the Ubiquitous Two-by-two Contingency Table, Part I: Background
Asian Journal of Medical Principles and Clinical Practice
This paper (the first of two sibling parts) provides a tutorial exposition of indicators derived from the ubiquitous two-by-two contingency table (confusion matrix), which has widespread applications in many fields, in particular binary classification and clinical or epidemiological testing. These indicators include the eight most prominent indicators used in diagnostic testing, namely the Sensitivity or True Positive Rate (TPR), the Specificity or True Negative Rate (TNR), and the Positive and Negative Predictive Values (PPV and NPV), together with their respective complements, namely the False Negative Rate (FNR), the False Positive Rate (FPR), the False Discovery Rate (FDR) and the False Omission Rate (FOR). We also consider some other indicators, such as the total error and accuracy, the pre-test prevalence, the diagnostic odds ratio (DOR), the inverse DOR, the F-scores, Youden’s Index (Informedness), Markedness and the Index of Association (Matthews Correlation Coefficient (MCC)). We review recent studies asserting that the MCC is the most reliable single metric derivable from the contingency matrix. We suggest that any mean (signed geometric mean, arithmetic mean, or harmonic mean) of Informedness and Markedness might be as effective as the MCC in summarizing the contingency matrix into a single value. We set criteria, in terms of basic and composite indicators, for identifying the quality of binary classification, ranging from the perfect type down to the completely-contradictory type, with random-guessing-like classification marking the midpoint of the transition between good and bad classification. In a sequel paper, we present a potpourri of example or test cases to reveal and unravel many of the properties and inter-relationships among binary and composite indicators.
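As a rough illustration of the indicators named in the abstract, the following minimal Python sketch computes them from the four cell counts of a two-by-two table using their standard textbook definitions. The function name `indicators`, the dictionary keys, and the example counts are hypothetical choices made for this sketch and are not taken from the paper; the code also assumes that no row or column of the table is empty, so that no denominator is zero.

```python
from math import copysign, sqrt


def indicators(tp, fp, fn, tn):
    """Indicators of a 2x2 contingency table (assumes no empty row or column)."""
    n = tp + fp + fn + tn
    tpr = tp / (tp + fn)  # Sensitivity
    tnr = tn / (tn + fp)  # Specificity
    ppv = tp / (tp + fp)  # Positive Predictive Value
    npv = tn / (tn + fn)  # Negative Predictive Value
    informedness = tpr + tnr - 1  # Youden's Index
    markedness = ppv + npv - 1
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Informedness and Markedness share the sign of (tp*tn - fp*fn),
    # so their product is non-negative and the means below are well defined.
    product = informedness * markedness
    total = informedness + markedness
    return {
        "TPR": tpr, "TNR": tnr, "PPV": ppv, "NPV": npv,
        # The four complements of the basic indicators.
        "FNR": 1 - tpr, "FPR": 1 - tnr, "FDR": 1 - ppv, "FOR": 1 - npv,
        "accuracy": (tp + tn) / n,
        "prevalence": (tp + fn) / n,
        "informedness": informedness,
        "markedness": markedness,
        "MCC": mcc,
        # Three candidate means of Informedness and Markedness.
        "arithmetic_mean": total / 2,
        "geometric_mean": copysign(sqrt(abs(product)), total),
        "harmonic_mean": 2 * product / total if total else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical counts: 100 diseased and 900 healthy subjects.
    for name, value in indicators(tp=80, fp=90, fn=20, tn=810).items():
        print(f"{name:>16}: {value:.4f}")
```

With these definitions, the signed geometric mean of Informedness and Markedness coincides algebraically with the MCC, while the arithmetic and harmonic means bracket it from above and below in magnitude.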
- Diagnostic testing
- binary classification
- predictive values
- F scores
- Matthews correlation coefficient
- means of Informedness and Markedness
- Muzainah Ali Rushdi