Talk:Applied statistics


Help welcomed

My personal experience of this subject does not extend beyond economic statistics and quality control, so I should welcome inputs that put me right on other aspects. Also, I have had no mathematical training beyond the basic necessities of a professional engineer, so mathematical input for the same purpose would also be welcome. By way of explanation of my drafting, I have in mind a readership of graduates and undergraduates who are neither statisticians nor mathematicians but may want to use statistics in their work or leisure. In view of the magnitude of the subject, I am trying to touch briefly upon most aspects, at just enough length to convey its flavour and provide links to authoritative references. I think that my cautionary notes in the text are justified by the damage done by the misuse of statistics, especially by my mathematical fellow-economists and their dramatic misapplication of statistics to financial risk assessment. Nick Gardner 14:46, 28 June 2009 (UTC)

My thanks for help with definitions - and I look forward to further initiatives from mathematicians in creating more definitions. I would point out, however, that a definition that contains undefined terms is unhelpful. The current definition of "sample" is a case in point. In my opinion it cannot be allowed to stand. Nick Gardner 06:00, 30 June 2009 (UTC)

I should like to add a note to the glossary title on the Related Articles subpage to the effect that for mathematically precise explanations of the concepts, the reader should refer to the Statistics theory article. This does not seem helpful in the present state of that article, but perhaps I should do so in anticipation of its further development? Nick Gardner 10:53, 30 June 2009 (UTC)

Using "Related Articles" as a glossary works quite well for some terms, but not for all. There are some terms which never will get a page under this title, e.g. (Mean which needs disambiguation, or others for which the explanation is too long, or only suitable in a certain context.
(And a definition probably should not contain a displayed formula?) Peter Schmitt 12:45, 1 July 2009 (UTC)
There are lots of other glossary definitions that will never have their own pages. Is there any harm in that? Point taken about the formula. I'll delete it. Nick Gardner 17:20, 1 July 2009 (UTC)

Category

I don't understand the choice "Library and information science". Shouldn't it be Mathematics, Sociology, Health Sciences, Geography, ... Peter Schmitt 12:54, 1 July 2009 (UTC)

I didn't know where to put it. There seems to be no obvious place for decision theory or information technology, which I am inclined to think are near neighbours. I did not want, for reasons that I have explained, to give the impression that it should be treated as a branch of mathematics. It is true that it uses a lot of mathematical theorems, but so does engineering - perhaps I should put it there? I think I will add it there, but I am open to suggestions. Nick Gardner 17:10, 1 July 2009 (UTC)

Definitions / Glossary

Nick, your extended definition

  • Confidence interval [r]: the range of a random variable, such as the mean of a sample, that — with a specified probability — contains the true value for the population. [e]

shows the problem of combining it with a glossary: It certainly is suitable in the context of the article. However, as a definition it has to stand independently of the article — and then "population" is not correct. Or would you use "population" when you calculate the confidence interval of the measurements of a distance? Peter Schmitt 20:22, 1 July 2009 (UTC)

Thanks. I don't see why not, but please make whatever addition or qualification that you consider appropriate. Nick Gardner 06:24, 2 July 2009 (UTC)
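To make the distance example concrete, here is a minimal sketch (the figures are invented, and normally distributed measurement errors are assumed): the confidence interval is for a fixed true distance estimated from repeated measurements, rather than for a parameter of a sampled population.

```python
import math
import statistics

# Eight repeated measurements of one fixed distance, in metres (invented figures).
measurements = [100.12, 99.97, 100.05, 100.21, 99.88, 100.03, 100.10, 99.95]

n = len(measurements)
mean = statistics.mean(measurements)
sd = statistics.stdev(measurements)   # sample standard deviation
t_975 = 2.365                         # tabulated t quantile, 95% level, 7 degrees of freedom

half_width = t_975 * sd / math.sqrt(n)
print(f"95% confidence interval for the true distance: "
      f"{mean - half_width:.3f} m to {mean + half_width:.3f} m")
# There is no "population" here: the randomness lies in the measurement errors,
# and the interval covers the fixed true distance with the stated probability
# under repeated runs of the measurement process.
```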

Another point: The guidelines (which appear on new Definition pages) ask that definitions start with a capital letter. (I probably would also use lower case. However, I understand that it is necessary to keep a uniform appearance.) Peter Schmitt 20:29, 1 July 2009 (UTC)

Oh dear! You have spotted a rule that I have broken several hundred times in the course of the last year! I think I will leave it to someone else to go back over my work and put it all right. I might be running out of time! Nick Gardner 06:24, 2 July 2009 (UTC)
Thanks for your comments, Peter. You have convinced me that the definitions were more trouble than they were worth, and I have deleted them - as far as I can. I hope someone can tidy the matter up by deleting them fully or putting them right. Nick Gardner 16:05, 2 July 2009 (UTC)

Ready?

This article is now complete to the extent that I have been able to make it so (although I have no doubt that it could be augmented by contributions from professional statisticians). Nick Gardner 11:51, 15 January 2010 (UTC)

Tutorials: The prosecutor's fallacy

I think it is less simple; several scenarios should be treated separately.

Scenario A: the investigator has, or is able to obtain, the DNA of all half a million people. Then it may indeed happen that he just took the first matching one, and then indeed the probability of error is close to 1.

Scenario B: the investigator is able to check only one man (well, maybe two or three) for DNA. Here we have two sub-scenarios.

B1: The investigator, when seeing a man, is able to guess before the test whether his DNA matches or not. (Quite a fantastic assumption, isn't it?) Then maybe, seeing a lot of men, he chooses one that should fit, gets a positive result from the test, and indeed it is a false positive with probability close to 1 (as in scenario A).

B2: The investigator, when seeing a man, cannot guess before the test whether his DNA matches or not. (Quite a realistic assumption, isn't it?) Maybe he chooses a man for the test on the basis of wild ideas (say: I do not like his face, therefore he is probably guilty). Anyway, assume that the DNA test gives a positive result. Now the chance of a false positive is close to 0 (in contrast to scenarios A and B1).

In fact, in the B2 scenario, a positive test result means one of two things: either it is just a false positive (but this is quite improbable), or the investigator was successful in guessing who is guilty. (Maybe he is Sherlock Holmes...)

Boris Tsirelson 17:58, 16 January 2010 (UTC)

An interesting point. I accept that a less simple approach to the matter would be interesting, and I believe that there is a lot of investigative work (such as http://papers.ssrn.com/sol3/papers.cfm?abstract_id=462880) that could be called on. But I am inclined to think that it would be better to present a fuller analysis in a separate article.
For the purpose of the present article, I think that it would be sufficient to put more stress on the qualification "in the absence of other evidence". Your case B2 does, of course, involve other evidence - namely the fact that the DNA match was obtained after testing only one person (or two or three people). (Additionally, one would have to suppose that there had been a reason for choosing the subject of the test, but that is not essential to the point).
However, inclusion of the prosecutor's fallacy in this article is not essential, and little would be lost if it were deleted. Its inclusion served only to illustrate two simple points: namely that statistics is not just a branch of mathematics, and that intuitive interpretations, even by professionals, are often wrong. The point is well enough made by the Sally Clark case - although it could be argued that what the article says about that case is an over-simplification (I give a fuller account of the case in chapter 7 of my book on Mistakes). Nick Gardner 06:08, 17 January 2010 (UTC)
I see. Yes, B2 involves other evidence. But I guess B2 is the closest to reality of the three scenarios. And if so, then it should be noted explicitly that the conclusion about a highly probable error refers to the case in which a lot of tests are made. Just because statistics is not just a branch of mathematics! Boris Tsirelson 07:12, 17 January 2010 (UTC)
I have amended the text to meet your point. Nick Gardner 08:25, 17 January 2010 (UTC)
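For readers who want to see the contrast between the scenarios in numbers, here is a rough sketch of the arithmetic. All figures are invented for illustration: a pool of half a million people, a one-in-100,000 chance that an innocent person's DNA matches, and an assumed prior probability q that the investigator's hunch in scenario B2 is right.

```python
pool = 500_000    # people who could in principle be tested (invented figure)
p_false = 1e-5    # chance that an innocent person's DNA matches (invented figure)

# Scenario A: trawl the whole pool and report the first match found.
# Roughly (pool * p_false) innocent people match, plus the one guilty person,
# so the chance that a reported match is innocent is about:
innocent_matches = (pool - 1) * p_false
p_error_A = innocent_matches / (innocent_matches + 1)
print(f"Scenario A: P(reported match is innocent) ~ {p_error_A:.2f}")        # about 0.83

# Scenario B2: one suspect is chosen on other grounds, with prior probability
# q of actually being guilty (q = 0.1 is an assumption, not from the discussion).
q = 0.1
p_error_B2 = (1 - q) * p_false / ((1 - q) * p_false + q)
print(f"Scenario B2: P(suspect is innocent | positive test) ~ {p_error_B2:.6f}")  # about 0.00009
```

On these assumptions the same positive result is almost certainly misleading in scenario A and almost certainly informative in scenario B2, which is the contrast the scenarios above describe.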

Machine learning?

I am always semi-shaky when I get near statistics, so I don't know for sure, but there possibly ought to be a useful tie-in from this topic to "machine learning" (http://en.citizendium.org/wiki/Machine_Learning), a subfield of artificial intelligence that uses statistics to infer all kinds of interesting and surprising things. For example, they can listen to the sound of a person typing at a keyboard and use the frequency of letters in English to infer, given enough keypresses, which sound each key makes, and thus what the person is typing. They can infer all kinds of information about a database from how long it takes to respond over the network: even if it answers "not found", a quick response probably means the item WAS found but the system doesn't intend to admit it, whereas a long response probably means an exhaustive search in which the search term really was not found. Etc. Wikipedia's article on machine learning is currently much more complete than CZ's. Anyway, I might be tempted to add the Computers workgroup to this article and see if anyone will come along, while browsing in Computers, and contribute to it. Pat Palmer 23:11, 12 March 2011 (UTC)
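As a toy illustration of the timing idea described above (the timing distributions, threshold rule, and figures are invented for the sketch; a real analysis would need a proper statistical model of the server):

```python
import random
import statistics

random.seed(1)

# Invented timing model: a query that actually finds the record returns quickly,
# while a genuinely exhaustive (failed) search takes much longer.
def response_time_ms(found: bool) -> float:
    return random.gauss(5.0, 1.0) if found else random.gauss(40.0, 8.0)

# Calibration runs with known outcomes.
hit_times = [response_time_ms(True) for _ in range(200)]
miss_times = [response_time_ms(False) for _ in range(200)]
threshold = (statistics.mean(hit_times) + statistics.mean(miss_times)) / 2

# A new query answers "not found", but its timing suggests otherwise.
observed = response_time_ms(found=True)
verdict = "probably present" if observed < threshold else "probably absent"
print(f"'not found' answered in {observed:.1f} ms -> record {verdict}")
```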

An interesting subject, Pat. There seem to be a number of ways of finding and using regularities in complex sets of data. I believe that weather forecasters try to fit one of the recognised patterns of atmospheric data to current observations in order to make forward projections. I know that econometricians try to fit equations to sequences of economic statistics, for use in economic and forecasting models. The implicit assumption that a pattern is invariable (so that the sequence a, b, c is always followed by the sequence d, e, f, for example) can be dangerous, however. It is believed to be responsible for the mistaken financial forecasts that led to the current recession. I have very little understanding of these topics, but I agree that there may well be someone in the computers workgroup who knows about them. Nick Gardner 07:17, 13 March 2011 (UTC)