Welcome to the MacNN Forums.


Any statisticians? Anybody take statistics?
awaspaas
Mac Elite
Join Date: Apr 2001
Location: Minneapolis, MN
Apr 25, 2005, 09:44 PM
 
I'm trying to find the best way to correlate groups of data. For example, I have one column of experimental data and 4 columns of computed data. Only one of these columns should be a theoretical match to the experimental data. The values are fairly evenly distributed between about 3 and 12, and must be correctly paired (the 3rd value of the experimental column must match the 3rd value of the computed column, and so on).

Since the data are evenly/randomly distributed, as I understand it, tests such as ANOVA and the t-test are not appropriate, since they compare means, standard deviations, and so forth. It also seems to me that chi-square, Pearson correlation, and other least-squares methods are not appropriate, since you're always dividing each difference by an expected value. I would think that disproportionately weights the smaller values (the same difference divided by a smaller expected value makes a larger contribution to the overall chi-square for a column!)

Is a simple sum of the squares of the differences the most appropriate method for this? That way, it doesn't matter how big or small a value is, just how close it is to the expected value. Columns with large deviations at each data point end up with a larger sum and a less favorable match, sort of like a chi-square. Is this the best way to do it? Is there a name for this method?

An example of analogous data is temperature readings taken throughout the day. If you have a week's worth of this data, how would you tell which day has the closest temperature readings to an average day's temperature readings? We would only be interested in how much a particular day's values differ from the average day's values at all the points through the day.
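For the temperature analogy, the comparison described above can be sketched in a few lines of Python (the readings below are invented purely for illustration):

```python
# Sum-of-squared-differences score between an "average day" profile and
# each candidate day's readings. Hypothetical hourly temperatures.

def ssd(observed, expected):
    """Sum of squared differences between two equal-length sequences."""
    return sum((o - e) ** 2 for o, e in zip(observed, expected))

average_day = [10.0, 14.0, 19.0, 16.0, 11.0]
monday      = [10.5, 13.8, 19.4, 15.9, 11.2]
tuesday     = [ 7.0, 18.0, 15.0, 20.0,  9.0]

# The day with the smaller score tracks the average profile more closely.
scores = {"monday": ssd(monday, average_day),
          "tuesday": ssd(tuesday, average_day)}
best = min(scores, key=scores.get)
print(best)  # monday
```

Note that, unlike chi-square, nothing is divided by an expected value, so small and large readings contribute on the same scale.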
( Last edited by awaspaas; Apr 25, 2005 at 10:01 PM. )
     
ManOfSteal
Addicted to MacNN
Join Date: Aug 2004
Location: Outfield - #24
Apr 25, 2005, 10:38 PM
 
My head is spinning; however, I do study baseball stats...does that count?
     
TailsToo
Mac Elite
Join Date: Jun 2004
Location: Westside Island
Apr 25, 2005, 11:53 PM
 
I'm in stats now, but I've only covered ANOVAs and z&t-tests so far... maybe I can help you in a couple of weeks.
     
TheIceMan
Mac Elite
Join Date: Dec 2002
Location: Trapped in the depths of my mind
Apr 26, 2005, 04:14 AM
 
Ah yes, the dreaded statistics, or as we called it, "sadistics." Good luck. Sorry, I absolutely hated that class.

[Edit:] Maybe this might help. http://davidmlane.com/hyperstat/
     
philm
Mac Elite
Join Date: May 2001
Location: Manchester, UK
Apr 26, 2005, 05:42 AM
 
My first reaction is that your sum of squares of the differences for each time point (for your temperature analogy, at least) would be the way to go. Usually, the best statistical approach is the simplest one, and I think this is quite an elegant approach.
     
awaspaas  (op)
Mac Elite
Join Date: Apr 2001
Location: Minneapolis, MN
Apr 26, 2005, 10:37 AM
 
That's actually the test I've been using for a while. I called it chi-square prime for lack of a better name, since I surprisingly can't find any mention of that specific method anywhere. Weird.
     
turtle777
Clinically Insane
Join Date: Jun 2001
Location: planning a comeback !
Apr 26, 2005, 10:57 AM
 
Ahh, the art of lying with facts.

-t
     
Cipher13
Registered User
Join Date: Apr 2000
Apr 26, 2005, 12:13 PM
 
Only to people ignorant of how stats work.

A simple "show me the data" usually fixes that.
     
strictlyplaid
Senior User
Join Date: Jun 2004
Apr 26, 2005, 10:59 PM
 
Well, I'm having a hard time figuring out exactly what it is you want to do here, but here's my suggestion: try a Wilcoxon Rank Sum test (a.k.a. the Mann-Whitney U test). It's a non-parametric test designed to detect whether two samples of data are from the same distribution. The form of the distribution need not be known, and the "non-parametric" designation means that it isn't dependent on sample moments (mean and variance). Most statistical packages will perform this test, including Excel.

Here's a helpful link: http://www.netnam.vn/unescocourse/statistics/13_3.htm.
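A minimal sketch of this test, assuming SciPy is available (the sample values are made up; note also that this version of the test treats the two samples as independent draws and ignores any row-by-row pairing):

```python
# Mann-Whitney U / Wilcoxon Rank Sum test via SciPy (assumed available).
from scipy.stats import mannwhitneyu

sample_a = [3.1, 5.2, 7.8, 9.4, 11.0]  # hypothetical experimental values
sample_b = [3.0, 5.5, 7.5, 9.1, 11.3]  # hypothetical computed values

stat, p = mannwhitneyu(sample_a, sample_b, alternative="two-sided")

# A large p-value means we cannot reject the hypothesis that both
# samples come from the same distribution.
print(stat, p)
```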

Hope that helps!
     
strictlyplaid
Senior User
Join Date: Jun 2004
Apr 26, 2005, 11:02 PM
 
Originally Posted by turtle777
Ahh, the art of lying with facts.

-t
Maybe, but it's still a bit better than the non-quantitative folks whose lies are completely unencumbered by fact.
     
TailsToo
Mac Elite
Join Date: Jun 2004
Location: Westside Island
Apr 26, 2005, 11:22 PM
 
Originally Posted by strictlyplaid
Maybe, but it's still a bit better than the non-quantitative folks whose lies are completely unencumbered by fact.
Sounds like my workplace!
     
awaspaas  (op)
Mac Elite
Join Date: Apr 2001
Location: Minneapolis, MN
Apr 27, 2005, 02:03 AM
 
Okay, how about some example data. Those of you who know organic chemistry know that coupling constants (J values) give lots of information about the structure of a molecule. We have experimental J values for a molecule, and corresponding calculated J values for 4 similar molecules, one of which is the same molecule as the experimental. By doing that sum of the squares analysis, I see that compound 2 is very close to the experimental values, whereas 1, 3, and 4 are nowhere near. Compound 2 is therefore the same molecule that the experiment was run on.



I want to know if there's an actual bona-fide statistical method for quantitatively determining which of the 4 calculated data groups is the most similar to the experimental data, and if it is significantly close enough to say that it's a "match."
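The selection step just described can be sketched as follows. Since the original table of J values is not reproduced here, the numbers below are invented placeholders; only the procedure (score every candidate column, pick the smallest sum) is what matters:

```python
# Sum-of-squares comparison across 4 candidate columns, picking the one
# closest to the experimental column. All J values here are hypothetical.

experimental = [3.2, 7.1, 9.8, 11.5, 4.4]

candidates = {
    "compound 1": [4.0, 6.0, 11.0, 10.0, 6.0],
    "compound 2": [3.3, 7.0,  9.9, 11.4, 4.5],
    "compound 3": [5.5, 8.5,  7.5, 12.5, 3.0],
    "compound 4": [3.0, 9.0,  8.0, 13.0, 5.5],
}

scores = {
    name: sum((c - e) ** 2 for c, e in zip(col, experimental))
    for name, col in candidates.items()
}
best_match = min(scores, key=scores.get)
print(best_match)  # compound 2
```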

Thanks for your help!!
( Last edited by awaspaas; Apr 27, 2005 at 02:09 AM. )
     
ghporter
Administrator
Join Date: Apr 2001
Location: San Antonio TX USA
Apr 27, 2005, 09:33 AM
 
By simple observation the second calculated group is far closer than any of the others. Of course you saw that, and you're looking for a mathematical way to show it.

I think you may not be looking at the problem quite right. Maybe you're looking for a way to show that method two produced closer results than all the others. For that, shouldn't the deltas between experimental and calculated be sufficient? Simply showing that the variance between calculated and experimental was very small (within a certain SD) should show that method two is the "correct" calculation method. Use the common correlation that best fits those deltas: the deltas are your data here.

P.S., I loved my Stats class. Stats for Engineers was taught by Dr. Ron Reagan when I attended USM (Long Beach) in the early '90s, and he made it interesting by getting us to apply it. One guy's project was an analysis of the frequency of different colors of M&Ms in retail packages, both plain and peanut. Let's just say that the whole class's sweet tooth was satisfied for quite some time after that presentation!

Glenn -----OTR/L, MOT, Tx
     
strictlyplaid
Senior User
Join Date: Jun 2004
Apr 28, 2005, 01:14 AM
 
Well, considering that you've got pairs of data (one experimental J matched with each known J forms a single observation) standard OLS regression would be one way to go here. You'd be looking to test the hypothesis that the beta-coefficient = 1; you should be able to reject that hypothesis for all but the correct column of data. Run OLS separately for each sample, using your calculated J for the X (independent variable) and your experimental J for the Y, then use the standard t-test.

The t is based on means and standard deviations, but I don't see why that's a problem given this data structure. Let me elaborate: the data structure you've got here is forecasts vs. observed values, such that OLS is going to construct a line that minimizes the sum of squared forecast errors. You're looking for beta = 1 because you want the model that forecasts your data accurately, i.e. if your predicted J is A then your experimental J had better be close to A, not (for instance) beta*A where beta is not equal to one. If your forecasts are no good, you'll end up with white noise (forecasts not at all corresponding to observations) and OLS will reflect that by giving you insignificant t-tests and betas near zero.

Alternatively, you could do multivariate OLS, plugging each one of your sample columns in as an X and using the experimental data as a Y. The best match of your experimental columns should show up statistically significant and beta about equal to one, with the rest insignificant.

If you really want the non-parametric test, I think there is some form of a paired Wilcoxon Rank Sum that you can use -- check your friendly neighborhood statistical manual for details on that one, as I've never used it. But don't use the regular Wilcoxon/Mann-Whitney, as I believe that's for random draws out of a distribution, which your data most definitely are not.
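The per-column OLS approach described above can be sketched like this, assuming SciPy and NumPy are available (the J values are invented for illustration; this tests H0: beta = 1 rather than the usual H0: beta = 0):

```python
# Regress experimental J on one candidate column and t-test the slope
# against 1. Data below are hypothetical; the candidate shown is a
# near-match, so the slope should come out close to 1.
import numpy as np
from scipy import stats

experimental = np.array([3.2, 7.1, 9.8, 11.5, 4.4, 6.3, 8.9])
candidate    = np.array([3.3, 7.0, 9.9, 11.4, 4.5, 6.2, 9.0])

res = stats.linregress(candidate, experimental)  # x = calculated, y = experimental
n, k = len(experimental), 2                      # 2 regressors: slope + intercept

# Test H0: beta = 1 (a perfect forecast) using the reported slope stderr.
t_stat = (res.slope - 1.0) / res.stderr
p_value = 2 * stats.t.sf(abs(t_stat), df=n - k)

# For the true match we expect to *fail* to reject beta = 1 (large p);
# for the wrong candidates the slope and fit should fall apart.
print(round(res.slope, 3), round(p_value, 3))
```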
( Last edited by strictlyplaid; Apr 28, 2005 at 01:23 AM. )
     
awaspaas  (op)
Mac Elite
Join Date: Apr 2001
Location: Minneapolis, MN
Apr 28, 2005, 09:44 AM
 
Wow, now that's helpful to me. The OLS test is just the Regression command in Excel, right? It's just a linear regression. So if the data's a match, the x coefficient should be about 1 (that's your beta value, right?), the y-intercept should be close to zero as well, AND if the data were identical, R-squared would be 1 too. Would we be increasing the power of the comparison if we somehow included all of these parameters in a test for matching? The comparison needs to be as powerful as possible, since in other systems the matches are not nearly as clear-cut. Also, I'm a little confused about what to run the t-test on: the beta values? The columns of data themselves? If the former, how can you get meaningful results from only one value? If the latter, what's the point of running the OLS in the first place?

Thank you SO much for your help!

Edit: hey that's post 1500, cool!
     
strictlyplaid
Senior User
Join Date: Jun 2004
Apr 29, 2005, 02:50 AM
 
Originally Posted by awaspaas
Wow, now that's helpful to me. The OLS test is just the Regression command in Excel, right? It's just a linear regression. So, if the data's a match, of course the x coefficient should be about 1 (that's your beta value right?) but the y-intercept should be close to zero as well, AND if the data were identical, R-squared would be 1 as well.
Yes, to all those questions. But if you have other options, I'd recommend against Excel. Use one of the professional statistical packages, like Stata, SPSS, SAS, or whatever. Any of those packages will report results in a much more interpretable form than Excel, and they'll give you more information too.

Originally Posted by awaspaas
Would we be increasing the power of the comparison if we included all of these parameters into a test for matching somehow? The comparison needs to be absolutely as powerful as possible since in other systems, the matches are not nearly as clear-cut. Also I'm a little confused about what to run the t-test on - the beta values? The columns of data themselves? If the former, how can you get meaningful results from only one value? If the latter, what's the point of running the OLS in the first place?
The t-test is indeed on the beta values; it's a test of whether a restriction on the value of beta can be rejected, according to a threshold probability that the observed value of beta is consistent with that restriction. So if you test the restriction that beta = 0, you are testing whether the X variable has any effect on Y at all. Under this circumstance, the statistic beta/(standard error) is distributed according to the Student-t with n-k degrees of freedom, where k is the number of regressors (two in this case: X and the constant). You find the critical value of t for a defined level of statistical significance (say, .05, indicating that you're willing to reject the null hypothesis that beta = 0 whenever there is less than a 5% chance that your observed beta is consistent with the true beta being zero), then see whether your value of t exceeds that critical value. I'm leaving out a lot of the details here, but you can check a basic stats book for more.

Also: the t-test isn't exactly on "one value" per se, as the standard error of beta is the square root of the sum of squared errors divided by n-k. That incorporates all of the observations. Furthermore, the beta value is calculated using all the observed data points.
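To make the point concrete, here is a small pure-Python check of those formulas (the x/y numbers are made up): beta and its standard error are each computed from all the observations, and the t statistic is just beta over that standard error.

```python
# Slope, its standard error, and the t statistic computed by hand.
# se(beta) = sqrt( SSE / (n - k) / Sxx ), which uses every observation.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n, k = len(x), 2
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
alpha = my - beta * mx  # intercept

# Residual sum of squares over all data points.
sse = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
se_beta = math.sqrt(sse / (n - k) / sxx)

t = beta / se_beta  # compare to the Student-t critical value with n-k df
print(round(beta, 3), round(se_beta, 4))
```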

Good luck on your project!
     
   
 
All contents of these forums © 1995-2017 MacNN. All rights reserved.