One of the most common questions that comes up when researchers use PROC GLIMMIX is: What distribution should I use? So, let’s take a look at this question from a very practical, applied viewpoint.
First of all – what the heck is “Non-Gaussian”???
Gaussian is a fancy word for Normal. So if I have normally distributed data, or data that conforms to a Normal distribution, my data is Gaussian, and you don’t have to worry about the DIST= option in PROC GLIMMIX (assuming, of course, that your residuals are normal and all that other good stuff too!).
But if you’ve collected data that is categorical in nature, a proportion, or a time to event, then you’ve got non-normal or non-Gaussian data, and YES! you have to figure out what distribution is appropriate when running your model using PROC GLIMMIX.
Based on the paper by Walter W. Stroup (2015), Rethinking the Analysis of Non-Normal Data in Plant and Soil Science (Agronomy Journal 107(2): 811-827), I’m recommending that you use examples as your starting place. My goal is to keep adding examples to this post so that it can serve as a guide to help you determine the appropriate distribution. Remember! When you use the DIST= option in your SAS PROC GLIMMIX code, make sure that you are using the appropriate LINK= option as well. These two go hand-in-hand.
Below are brief descriptions of several of the distributions currently available through PROC GLIMMIX, with examples listed below each. As you work with GLIMMIX and the different distributions, please pass on your examples to me, so that I can add them below. Also, note that there are relationships between and among many of the distributions, and that is why you will see people comparing distributions for the best fit. Please use this as a guide to help you determine a starting point for your analysis.
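To see how DIST= and LINK= sit together, here is a minimal sketch of a MODEL statement with both options spelled out. The data set name (mydata) and the variables (yield, trt, block) are made up for illustration. The normal distribution with the identity link is what PROC GLIMMIX falls back on for a continuous response, so writing it out here only makes the pairing visible.

```sas
/* Hypothetical data set and variable names, for illustration only */
proc glimmix data=mydata;
  class trt block;
  /* DIST= names the distribution, LINK= names the link function */
  model yield = trt / dist=normal link=identity;
  random block;
run;
```

For non-normal data the layout stays the same; only the DIST= and LINK= values change.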
Binomial Distribution
A binomial distribution is one that we are all familiar with, believe it or not! Remember way back in introductory statistics class, the flip-the-coin exercise? Yup, that’s a binomial distribution. There are only 2 possible outcomes for each trial, and you are looking at the proportion of “individuals” that end up in one category or the other. Remember p and q? p = the proportion of individuals in one category, and q = 1 - p, the proportion of individuals in the other category.
Does the data that you’ve collected fall under this distribution?
- % of seeds that germinate – seeds germinate or not
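For the germination example, a starting point might look like the sketch below. The data set (germ) and the variables (n_germ, n_seeds, trt, block) are hypothetical, and the design is assumed to be a simple blocked experiment. Your own random effects will depend on how the study was laid out.

```sas
/* Hypothetical names: n_germ = seeds germinated, n_seeds = seeds planted */
proc glimmix data=germ;
  class trt block;
  /* events/trials syntax: number germinated out of number planted */
  model n_germ/n_seeds = trt / dist=binomial link=logit;
  random block;
  /* ILINK also reports the treatment means on the proportion scale */
  lsmeans trt / ilink;
run;
```

The logit link models the log odds of germinating, which is why the ILINK option is handy for getting results back as proportions.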
Poisson or Negative Binomial Distribution
When we think about count data we tend to think of a Poisson distribution. The Poisson distribution is often used where we count events that occur randomly over space or time. There is usually no fixed number of trials or events, so we don’t know ahead of time whether we will count an event 5 times, twice, or even 30 times.
One key attribute of the Poisson distribution is that the variance is equal to the mean. So, when we have a high mean value (µ = 100), we expect to see more variability in our sample counts, and conversely, with a low mean value (µ = 5), we expect to see less variability. There are times when the variance exceeds the mean. This is referred to as overdispersion: essentially, the Poisson distribution is not doing a great job of describing the variation in our measure.
This situation is where the Negative Binomial distribution, an extension of the Poisson distribution, comes in handy. The Negative Binomial distribution has an extra scale parameter that lets the variance be larger than the mean, so it can account for that extra variation.
If you are collecting count data for your project, you have 2 options as a starting point: the Poisson distribution, if your variance is roughly equal to your mean, and the Negative Binomial distribution, if your variance is greater than your mean.
- Weed count per plot
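For the weed count example, one way to compare the two options is to fit both and look at the fit statistics. This is only a sketch: the data set (weeds) and the variables (weed_count, trt, block) are hypothetical, and METHOD=LAPLACE is my assumption here, chosen so that the two fits can be compared on likelihood-based statistics.

```sas
/* Hypothetical names; start with the Poisson distribution */
proc glimmix data=weeds method=laplace;
  class trt block;
  model weed_count = trt / dist=poisson link=log;
  random block;
run;

/* Same model, switching to the Negative Binomial distribution */
proc glimmix data=weeds method=laplace;
  class trt block;
  model weed_count = trt / dist=negbin link=log;
  random block;
run;
```

A commonly used hint of overdispersion in the Poisson fit is a Pearson chi-square / DF value well above 1 in the fit statistics. If that is what you see, the Negative Binomial fit is worth a look.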
Exponential or Gamma Distribution
The Exponential distribution is a model for the time until an event occurs, in its basic form assuming that the chance of the event happening stays constant over time. Survival times are a great example of data that can follow an exponential distribution. The Exponential distribution is a special form of the Gamma distribution (a Gamma with the shape fixed at 1). The Gamma distribution has a shape parameter, which controls the form of the curve (is there a peak in the curve and if so, how peaked or how skewed), and a scale parameter, which stretches or compresses the curve, almost like the range of the observations. Both are continuous distributions for positive values and are described by Probability Density Functions.
- Time to flowering
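For the time to flowering example, a Gamma model could be set up along these lines. The data set (flower) and the variables (days_to_flower, trt, block) are hypothetical, and the log link is one reasonable choice, not the only one.

```sas
/* Hypothetical names: days_to_flower must be positive */
proc glimmix data=flower;
  class trt block;
  model days_to_flower = trt / dist=gamma link=log;
  random block;
  /* ILINK also reports the treatment means in days */
  lsmeans trt / ilink;
run;
```

Swapping dist=gamma for dist=exponential gives the Exponential version of the same model.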
Multinomial Distribution
The Multinomial distribution covers the situation where you have more than 2 possible outcomes. A great example of this is a rating scale. Each observation falls into exactly one of several groups, and each group has its own probability (the groups do not need to be equally likely). If you think back to your types of data, this would be nominal or ordinal data. You may also look at this data from a count perspective, and there is a strong relationship between the Poisson distribution and the Multinomial distribution, so you need to think about what your measure is to determine the most appropriate starting distribution.
- Disease Rating Category
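For the disease rating example, and assuming the rating categories are ordered (for example 1 = none through 5 = severe), a cumulative logit model is a common starting point. The data set (disease) and the variables (rating, trt, block) are hypothetical.

```sas
/* Hypothetical names: rating is an ordered category such as 1 to 5 */
proc glimmix data=disease;
  class trt block;
  /* cumulative logit link treats the categories as ordinal */
  model rating = trt / dist=multinomial link=cumlogit;
  random block;
run;
```

If the categories have no natural order, the generalized logit (link=glogit) is the usual counterpart, although it needs a bit more setup.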
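Beta Distribution
The Beta distribution is another example of a distribution described by a Probability Density Function, similar to the Exponential and Gamma distributions. The Beta distribution is used when the measure is a continuous proportion, a value that falls between 0 and 1, rather than a count of successes out of a number of trials.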
- Proportion of leaf area affected
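For the leaf area example, a Beta model might start out like the sketch below. The data set (leaf) and the variables (prop_area, trt, block) are hypothetical, and prop_area is assumed to be recorded as a proportion (0.25, not 25).

```sas
/* Hypothetical names: prop_area is a proportion strictly between 0 and 1 */
proc glimmix data=leaf;
  class trt block;
  model prop_area = trt / dist=beta link=logit;
  random block;
  /* ILINK also reports the treatment means on the proportion scale */
  lsmeans trt / ilink;
run;
```

One thing to watch for: the Beta distribution does not cover values of exactly 0 or exactly 1, so leaves with no damage or complete damage need some thought before you run this.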