Recursive partitioning: Difference between revisions

From Citizendium
Jump to navigation Jump to search
imported>Subpagination Bot
m (Add {{subpages}} and remove any categories (details))
mNo edit summary
 
(28 intermediate revisions by 4 users not shown)
Line 1: Line 1:
{{subpages}}
{{subpages}}{{TOC|right}}
 
In [[statistics]], '''recursive partitioning''' is a method for the multivariable analysis of [[diagnostic test]]s.<ref name="isbn0-412-04841-8">{{cite book |author=Breiman, Leo |title=Classification and Regression Trees |publisher=Chapman & Hall/CRC |location=Boca Raton |year=1984 |pages= |isbn=0-412-04841-8 |oclc= |doi=}}</ref><ref name="Lewis2000">Lewis RJ (2000). [http://www.saem.org/download/lewis1.pdf An introduction to classification and regression tree (CART) analysis]. Accessed March 12, 2008</ref><ref name="pmid19968396">{{cite journal| author=Strobl C, Malley J, Tutz G| title=An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. | journal=Psychol Methods | year= 2009 | volume= 14 | issue= 4 | pages= 323-48 | pmid=19968396 | doi=10.1037/a0016973 | pmc=PMC2927982 | url=http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&tool=sumsearch.org/cite&retmode=ref&cmd=prlinks&id=19968396  }} </ref><ref>Yohannes Y, Hoddinott J (1999). [http://www.ifpri.org/themes/mp18/techguid/tg03.pdf Classification and regression trees: an introduction]. Accessed March 12, 2008</ref>. Recursive partitioning creates a decision tree that strives to correctly classify members of the population based on a dichotomous dependent variable. Compared to other multivariable:
Recursive partitioning is a statistical method for the multivariable analysis of diagnostic tests.<ref name="isbn0-412-04841-8">{{cite book |author=Breiman, Leo |title=Classification and Regression Trees |publisher=Chapman & Hall/CRC |location=Boca Raton |year=1984 |pages= |isbn=0-412-04841-8 |oclc= |doi=}}</ref>. Recursive partitioning creates a decision tree that strives to correctly classify members of the population based on a dichotomous dependent variable. Compared to other multivariable methods, recursive partitioning:
*Advantages are:
*Advantages are:
**Generates clinically more intuitive models that do not require the user to perform calculations.<ref name="pmid16149128">{{cite journal |author=James KE, White RF, Kraemer HC |title=Repeated split sample validation to assess logistic regression and recursive partitioning: an application to the prediction of cognitive impairment |journal=Statistics in medicine |volume=24 |issue=19 |pages=3019-35 |year=2005 |id=PMID 16149128 |doi=10.1002/sim.2154}}</ref>
**Generates clinically more intuitive models that do not require the user to perform calculations.<ref name="pmid16149128">{{cite journal |author=James KE, White RF, Kraemer HC |title=Repeated split sample validation to assess logistic regression and recursive partitioning: an application to the prediction of cognitive impairment |journal=Statistics in medicine |volume=24 |issue=19 |pages=3019-35 |year=2005 |id=PMID 16149128 |doi=10.1002/sim.2154}}</ref>
**Allows varying prioritizing of missclassifications in order to create a decision rule hat has more [[sensitivity (tests)|sensitivity]] or [[specificity (tests)|specificity]].<ref name="pmid6501544">{{cite journal |author=Cook EF, Goldman L |title=Empiric comparison of multivariate analytic techniques: advantages and disadvantages of recursive partitioning analysis |journal=Journal of chronic diseases |volume=37 |issue=9-10 |pages=721-31 |year=1984 |id=PMID 6501544 |doi=}}</ref>
**Allows "adjustable misclassification penalties" or misclassification  costs in order to create a decision rule that has more [[sensitivity (tests)|sensitivity]] or [[specificity (tests)|specificity]]. This has also been called the "diversity index" which is the sum of the false-negatives and false-positives. Either the false-negatives and false-positives can be weighted in order to preferentially reduce their occurrence.<ref name="pmid6501544">{{cite journal |author=Cook EF, Goldman L |title=Empiric comparison of multivariate analytic techniques: advantages and disadvantages of recursive partitioning analysis |journal=Journal of chronic diseases |volume=37 |issue=9-10 |pages=721-31 |year=1984 |id=PMID 6501544 |doi=}}</ref><ref name="pmid9495685">{{cite journal |author=Nelson LM, Bloch DA, Longstreth WT, Shi H |title=Recursive partitioning for the identification of disease risk subgroups: a case-control study of subarachnoid hemorrhage |journal=J Clin Epidemiol |volume=51 |issue=3 |pages=199–209 |year=1998 |month=March |pmid=9495685 |doi= |url=http://linkinghub.elsevier.com/retrieve/pii/S0895-4356(97)00268-0 |issn=}}</ref>
**May be more accurate.<ref name="pmid9790741">{{cite journal |author=Kattan MW, Hess KR, Beck JR |title=Experiments to determine whether recursive partitioning (CART) or an artificial neural network overcomes theoretical limitations of Cox proportional hazards regression |journal=Comput. Biomed. Res. |volume=31 |issue=5 |pages=363-73 |year=1998 |id=PMID 9790741 |doi=}}</ref>
 
*Disadvantages are:
*Disadvantages are:
** Does not work well for continuous variables<ref name="pmid16482368">{{cite journal |author=Lee JW, Um SH, Lee JB, Mun J, Cho H |title=Scoring and staging systems using cox linear regression modeling and recursive partitioning |journal=Methods of information in medicine |volume=45 |issue=1 |pages=37-43 |year=2006 |id=PMID 16482368 |doi=}}</ref>
** Does not work well for continuous variables<ref name="pmid16482368">{{cite journal |author=Lee JW, Um SH, Lee JB, Mun J, Cho H |title=Scoring and staging systems using cox linear regression modeling and recursive partitioning |journal=Methods of information in medicine |volume=45 |issue=1 |pages=37-43 |year=2006 |id=PMID 16482368 |doi=}}</ref>
** May overfit data.
** May overfit data.
Studies comparing the predictive accuracy of recursive partitioning with [[logistic regression]] have reported no difference<ref name="pmid16149128"/>, recursive partitioning to be more accurate<ref name="pmid9790741">{{cite journal |author=Kattan MW, Hess KR, Beck JR |title=Experiments to determine whether recursive partitioning (CART) or an artificial neural network overcomes theoretical limitations of Cox proportional hazards regression |journal=Comput. Biomed. Res. |volume=31 |issue=5 |pages=363-73 |year=1998 |id=PMID 9790741 |doi=}}</ref> and [[logistic regression]] to be more accurate.<ref name="pmid15687312">{{cite journal |author=Fonarow GC, Adams KF, Abraham WT, Yancy CW, Boscardin WJ |title=Risk stratification for in-hospital mortality in acutely decompensated heart failure: classification and regression tree analysis |journal=JAMA |volume=293 |issue=5 |pages=572-80 |year=2005 |id=PMID 15687312 |doi=10.1001/jama.293.5.572}}</ref>


A variation is 'Cox linear recursive partitioning'.<ref name="pmid6501544"/>
A variation is 'Cox linear recursive partitioning'.<ref name="pmid6501544"/>


Examples are available of using recursive partitioning in research of diagnostic tests.<ref name="pmid15687312">{{cite journal |author=Fonarow GC, Adams KF, Abraham WT, Yancy CW, Boscardin WJ |title=Risk stratification for in-hospital mortality in acutely decompensated heart failure: classification and regression tree analysis |journal=JAMA |volume=293 |issue=5 |pages=572-80 |year=2005 |id=PMID 15687312 |doi=10.1001/jama.293.5.572}}</ref><ref name="pmid11597285">{{cite journal |author=Stiell IG, Wells GA, Vandemheen KL, ''et al'' |title=The Canadian C-spine rule for radiography in alert and stable trauma patients |journal=JAMA |volume=286 |issue=15 |pages=1841-8 |year=2001 |id=PMID 11597285 |doi=}}</ref><ref name="pmid10891517">{{cite journal |author=Haydel MJ, Preston CA, Mills TJ, Luber S, Blaudeau E, DeBlieux PM |title=Indications for computed tomography in patients with minor head injury |journal=N. Engl. J. Med. |volume=343 |issue=2 |pages=100-5 |year=2000 |id=PMID 10891517 |doi=}}</ref><ref name="pmid8594242">{{cite journal |author=Stiell IG, Greenberg GH, Wells GA, ''et al'' |title=Prospective validation of a decision rule for the use of radiography in acute knee injuries |journal=JAMA |volume=275 |issue=8 |pages=611-5 |year=1996 |id=PMID 8594242 |doi=}}</ref><ref name="pmid7110205">{{cite journal |author=Goldman L, Weinberg M, Weisberg M, ''et al'' |title=A computer-derived protocol to aid in the diagnosis of emergency room patients with acute chest pain |journal=N. Engl. J. Med. |volume=307 |issue=10 |pages=588-96 |year=1982 |id=PMID 7110205 |doi=}}</ref> Goldman used recursive partitioning to prioritize [[sensitivity (tests)|sensitivity]] in the diagnosis of [[myocardial infarction]] among patients with chest pain in the emergency room.<ref name="pmid7110205"/>
Examples are available of using recursive partitioning in research of diagnostic tests.<ref name="pmid15687312">{{cite journal |author=Fonarow GC, Adams KF, Abraham WT, Yancy CW, Boscardin WJ |title=Risk stratification for in-hospital mortality in acutely decompensated heart failure: classification and regression tree analysis |journal=JAMA |volume=293 |issue=5 |pages=572-80 |year=2005 |id=PMID 15687312 |doi=10.1001/jama.293.5.572}}</ref><ref name="pmid11597285">{{cite journal |author=Stiell IG, Wells GA, Vandemheen KL, ''et al'' |title=The Canadian C-spine rule for radiography in alert and stable trauma patients |journal=JAMA |volume=286 |issue=15 |pages=1841-8 |year=2001 |id=PMID 11597285 |doi=}}</ref><ref name="pmid10891517">{{cite journal |author=Haydel MJ, Preston CA, Mills TJ, Luber S, Blaudeau E, DeBlieux PM |title=Indications for computed tomography in patients with minor head injury |journal=N. Engl. J. Med. |volume=343 |issue=2 |pages=100-5 |year=2000 |id=PMID 10891517 |doi=}}</ref><ref name="pmid8594242">{{cite journal |author=Stiell IG, Greenberg GH, Wells GA, ''et al'' |title=Prospective validation of a decision rule for the use of radiography in acute knee injuries |journal=JAMA |volume=275 |issue=8 |pages=611-5 |year=1996 |id=PMID 8594242 |doi=}}</ref><ref name="pmid7110205">{{cite journal |author=Goldman L, Weinberg M, Weisberg M, ''et al'' |title=A computer-derived protocol to aid in the diagnosis of emergency room patients with acute chest pain |journal=N. Engl. J. Med. |volume=307 |issue=10 |pages=588-96 |year=1982 |id=PMID 7110205 |doi=}}</ref><ref name="pmid18070993">{{cite journal |author=Heikes KE, Eddy DM, Arondekar B, Schlessinger L |title=Diabetes Risk Calculator: a simple tool for detecting undiagnosed diabetes and pre-diabetes |journal=Diabetes Care |volume=31 |issue=5 |pages=1040–5 |year=2008 |month=May |pmid=18070993 |doi=10.2337/dc07-1150 |url=http://care.diabetesjournals.org/cgi/pmidlookup?view=long&pmid=18070993 |issn=}}</ref>  Goldman used recursive partitioning to prioritize [[sensitivity (tests)|sensitivity]] in the diagnosis of [[myocardial infarction]] among patients with chest pain in the emergency room.<ref name="pmid7110205"/>
 
Additional alternatives to recursive partitioning include [[latent class analysis]].<ref>                                  McCutcheon AL. Latent Class Analysis. Beverly Hills, CA: Sage Publications; 1987.</ref><ref name="pmid17503107">{{cite journal |author=Schur EA, Afari N, Furberg H, ''et al'' |title=Feeling bad in more ways than one: comorbidity patterns of medically unexplained and psychiatric conditions |journal=Journal of general internal medicine : official journal of the Society for Research and Education in Primary Care Internal Medicine |volume=22 |issue=6 |pages=818–21 |year=2007 |pmid=17503107 |doi=10.1007/s11606-007-0140-5}}</ref>
 
==Calculations==
===Unweighted analysis===
An 'Initial misclassification rate' is calculated by assigning all cases to a single class. Using the example in the CART Manual, in trying to diagnose 59 patients who have a disease of interest among a  sample of 189 patients, arbitrarily classifying all patients as normal yields:<ref name="CART_Manual">{{cite book |author=Steinberg, Dan and Phillip Colla. |title=CART: Tree-Structured Non-
Parametric Data Analysis |publisher=Salford Systems |location=San Diego, CA |year=1995|pages=39 |isbn= |oclc= |doi=}}</ref><br/>
'''For the top node, also called node 1:'''
:<math>\mbox{Initial misclassification rate} =\left (\frac{\mbox{Total with disease}}{\mbox{Total sample size}}\right)=\left (\frac{\mbox{59}}{\mbox{189}}\right)=\mbox{0.312}</math>
 
'''For the next nodes:'''
The next best independent, classifier variable creates two nodes:
*Left node: 159 patients; 118 have disease and 41 are false positives who do not have the disease
*Right node: 30 patients; 12 have disease and false negatives
 
:<math>\mbox{Misclassification costs} =\left (\frac{\mbox{41}}{\mbox{159}}\right) + \left (\frac{\mbox{12}}{\mbox{30}}\right)=\left (\mbox{0.258}\right) + \left(\mbox{0.400}\right)=\mbox{0.658}</math>
 
:<math>\mbox{Relative misclassification costs} =\left (\frac{\mbox{0.658}}{\mbox{0.312}}\right)=\mbox{2.109}</math>
 
===Weighted analysis===
In this analysis, the [[sensitivity]] is preferenced by weighted it by 2.2, which is the ratio of patients without disease over patients with disease.<br/>
'''For the top node:'''
:<math>\mbox{Initial misclassification rate} =2.2 \times \left (\frac{\mbox{Total with disease}}{\mbox{Total sample size}}\right)=2.2 \times \left (\frac{59}{189}\right)=2.2 \times 0.312 = 0.686</math>
 
'''For the next nodes:'''
The next best independent, classifier variable creates two nodes:
*Left node: 159 patients; 118 have disease and 41 are false positives who do not have the disease
*Right node: 30 patients; 12 have disease and false negatives
 
:<math>\mbox{Misclassification costs} =2.2 \times \left (\frac{41}{159}\right) + \left (\frac{12}{30}\right)=2.2 \times 0.258 + 0.400 = 0.568 + 0.400 = 0.968</math>
 
:<math>\mbox{Relative misclassification costs} =\left (\frac{0.968}{0.686}\right)= 1.411</math>
 
===Stopping rules===
Various methods have been proposed to avoid overfitting of data.
* One method is to grow the tree as far as possible, then to 'prune' the tree. Pruning is based on a balance between misclassification costs and tree complexity.<ref name="Lewis2000">Lewis RJ (2000). [http://www.saem.org/download/lewis1.pdf An introduction to classification and regression tree (CART) analysis]. Accessed March 12, 2008</ref><ref>Yohannes Y, Hoddinott J (1999). [http://www.ifpri.org/themes/mp18/techguid/tg03.pdf Classification and regression trees: an introduction]. Accessed March 12, 2008</ref>
 
Another method is to grow the tree "until p<0.05 or until one or more of the marginal counts in the two-by-two table is less than 10."<ref name="pmid16149128"/>
 
==Software for recursive partitioning==
Software available for recursive partitioning includes:
* [http://www.salfordsystems.com/cart.php CART]
* [[R (programming language)|R programming language]]:
** [http://cran.r-project.org/web/packages/rpart/ rpart package]
** [http://cran.r-project.org/web/packages/HSAUR2/ HSAUR2] interactive package with a chapter containing sample demonstrations, "[http://cran.r-project.org/web/packages/HSAUR2/vignettes/Ch_recursive_partitioning.pdf Recursive Partitioning: Predicting Body Fat and Glaucoma Diagnosis]" in "A Handbook of Statistical Analyses Using R".<ref name="urlCRAN - Package HSAUR2">{{cite web |url=http://cran.r-project.org/web/packages/HSAUR2/ |title=CRAN - Package HSAUR, 2nd ed |author=Torsten Hothorn; Everitt, Brian |authorlink= |coauthors= |date= |format= |work= |publisher= |pages= |language= |archiveurl= |archivedate= |quote= |accessdate=}}</ref>


==References==
==References==
<references/>
<references/>


<!--
==External links==
[[Category:Statistical theory]]
* [http://www.salfordsystems.com/cart.php CART website][[Category:Suggestion Bot Tag]]
[[Category:Biostatistics]]
-->

Latest revision as of 11:01, 10 October 2024

This article is a stub and thus not approved.
Main Article
Discussion
Related Articles  [?]
Bibliography  [?]
External Links  [?]
Citable Version  [?]
 
This editable Main Article is under development and subject to a disclaimer.

In statistics, recursive partitioning is a method for the multivariable analysis of diagnostic tests.[1][2][3][4]. Recursive partitioning creates a decision tree that strives to correctly classify members of the population based on a dichotomous dependent variable. Compared to other multivariable:

  • Advantages are:
    • Generates clinically more intuitive models that do not require the user to perform calculations.[5]
    • Allows "adjustable misclassification penalties" or misclassification costs in order to create a decision rule that has more sensitivity or specificity. This has also been called the "diversity index" which is the sum of the false-negatives and false-positives. Either the false-negatives and false-positives can be weighted in order to preferentially reduce their occurrence.[6][7]
  • Disadvantages are:
    • Does not work well for continuous variables[8]
    • May overfit data.

Studies comparing the predictive accuracy of recursive partitioning with logistic regression have reported no difference[5], recursive partitioning to be more accurate[9] and logistic regression to be more accurate.[10]

A variation is 'Cox linear recursive partitioning'.[6]

Examples are available of using recursive partitioning in research of diagnostic tests.[10][11][12][13][14][15] Goldman used recursive partitioning to prioritize sensitivity in the diagnosis of myocardial infarction among patients with chest pain in the emergency room.[14]

Additional alternatives to recursive partitioning include latent class analysis.[16][17]

Calculations

Unweighted analysis

An 'Initial misclassification rate' is calculated by assigning all cases to a single class. Using the example in the CART Manual, in trying to diagnose 59 patients who have a disease of interest among a sample of 189 patients, arbitrarily classifying all patients as normal yields:[18]
For the top node, also called node 1:

For the next nodes: The next best independent, classifier variable creates two nodes:

  • Left node: 159 patients; 118 have disease and 41 are false positives who do not have the disease
  • Right node: 30 patients; 12 have disease and false negatives

Weighted analysis

In this analysis, the sensitivity is preferenced by weighted it by 2.2, which is the ratio of patients without disease over patients with disease.
For the top node:

For the next nodes: The next best independent, classifier variable creates two nodes:

  • Left node: 159 patients; 118 have disease and 41 are false positives who do not have the disease
  • Right node: 30 patients; 12 have disease and false negatives

Stopping rules

Various methods have been proposed to avoid overfitting of data.

  • One method is to grow the tree as far as possible, then to 'prune' the tree. Pruning is based on a balance between misclassification costs and tree complexity.[2][19]

Another method is to grow the tree "until p<0.05 or until one or more of the marginal counts in the two-by-two table is less than 10."[5]

Software for recursive partitioning

Software available for recursive partitioning includes:

References

  1. Breiman, Leo (1984). Classification and Regression Trees. Boca Raton: Chapman & Hall/CRC. ISBN 0-412-04841-8. 
  2. 2.0 2.1 Lewis RJ (2000). An introduction to classification and regression tree (CART) analysis. Accessed March 12, 2008
  3. Strobl C, Malley J, Tutz G (2009). "An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests.". Psychol Methods 14 (4): 323-48. DOI:10.1037/a0016973. PMID 19968396. PMC PMC2927982. Research Blogging.
  4. Yohannes Y, Hoddinott J (1999). Classification and regression trees: an introduction. Accessed March 12, 2008
  5. 5.0 5.1 5.2 James KE, White RF, Kraemer HC (2005). "Repeated split sample validation to assess logistic regression and recursive partitioning: an application to the prediction of cognitive impairment". Statistics in medicine 24 (19): 3019-35. DOI:10.1002/sim.2154. PMID 16149128. Research Blogging.
  6. 6.0 6.1 Cook EF, Goldman L (1984). "Empiric comparison of multivariate analytic techniques: advantages and disadvantages of recursive partitioning analysis". Journal of chronic diseases 37 (9-10): 721-31. PMID 6501544[e]
  7. Nelson LM, Bloch DA, Longstreth WT, Shi H (March 1998). "Recursive partitioning for the identification of disease risk subgroups: a case-control study of subarachnoid hemorrhage". J Clin Epidemiol 51 (3): 199–209. PMID 9495685[e]
  8. Lee JW, Um SH, Lee JB, Mun J, Cho H (2006). "Scoring and staging systems using cox linear regression modeling and recursive partitioning". Methods of information in medicine 45 (1): 37-43. PMID 16482368[e]
  9. Kattan MW, Hess KR, Beck JR (1998). "Experiments to determine whether recursive partitioning (CART) or an artificial neural network overcomes theoretical limitations of Cox proportional hazards regression". Comput. Biomed. Res. 31 (5): 363-73. PMID 9790741[e]
  10. 10.0 10.1 Fonarow GC, Adams KF, Abraham WT, Yancy CW, Boscardin WJ (2005). "Risk stratification for in-hospital mortality in acutely decompensated heart failure: classification and regression tree analysis". JAMA 293 (5): 572-80. DOI:10.1001/jama.293.5.572. PMID 15687312. Research Blogging.
  11. Stiell IG, Wells GA, Vandemheen KL, et al (2001). "The Canadian C-spine rule for radiography in alert and stable trauma patients". JAMA 286 (15): 1841-8. PMID 11597285[e]
  12. Haydel MJ, Preston CA, Mills TJ, Luber S, Blaudeau E, DeBlieux PM (2000). "Indications for computed tomography in patients with minor head injury". N. Engl. J. Med. 343 (2): 100-5. PMID 10891517[e]
  13. Stiell IG, Greenberg GH, Wells GA, et al (1996). "Prospective validation of a decision rule for the use of radiography in acute knee injuries". JAMA 275 (8): 611-5. PMID 8594242[e]
  14. 14.0 14.1 Goldman L, Weinberg M, Weisberg M, et al (1982). "A computer-derived protocol to aid in the diagnosis of emergency room patients with acute chest pain". N. Engl. J. Med. 307 (10): 588-96. PMID 7110205[e]
  15. Heikes KE, Eddy DM, Arondekar B, Schlessinger L (May 2008). "Diabetes Risk Calculator: a simple tool for detecting undiagnosed diabetes and pre-diabetes". Diabetes Care 31 (5): 1040–5. DOI:10.2337/dc07-1150. PMID 18070993. Research Blogging.
  16. McCutcheon AL. Latent Class Analysis. Beverly Hills, CA: Sage Publications; 1987.
  17. Schur EA, Afari N, Furberg H, et al (2007). "Feeling bad in more ways than one: comorbidity patterns of medically unexplained and psychiatric conditions". Journal of general internal medicine : official journal of the Society for Research and Education in Primary Care Internal Medicine 22 (6): 818–21. DOI:10.1007/s11606-007-0140-5. PMID 17503107. Research Blogging.
  18. Steinberg, Dan and Phillip Colla. (1995). CART: Tree-Structured Non- Parametric Data Analysis. San Diego, CA: Salford Systems, 39. 
  19. Yohannes Y, Hoddinott J (1999). Classification and regression trees: an introduction. Accessed March 12, 2008
  20. Torsten Hothorn; Everitt, Brian. CRAN - Package HSAUR, 2nd ed.

External links