This will be the first in a mini series of blog posts discussing the differences between R and SAS.
The biggest challenge of supporting the two programs (R and SAS) is not the difference in the coding. I think of the coding differences as being like the differences between spoken languages. We may be fluent in English, but we may also be able to speak French or another language. The coding for SAS and R is no different. I may be fluent in SAS, but now I can also speak, or rather code, in R. Same concepts – just a slightly different language. Many of us who speak more than one language can see the similarities between the languages. This is also true of SAS and R. There are many similarities; you just need to learn the nuances of each language.
So, yes, learning the language can be challenging, but I’m finding that the differences in the outputs can be equally challenging. I remember the first time I saw the output for LMER and thought – where is my ANOVA table??? and my p-values? Heck, SAS gives it all to me – maybe I should stick to SAS. But the industry and many of my students and researchers are moving to R – for many, many great reasons. Ok – time to admit that I’m falling in love with R too (shhh… don’t let SAS know 🙂 ). So, I think it’s time to dig into the output differences.
Before we get too deep into our example – we need to take a step back and talk about “statistical models”. Understanding why these are so important, and why they are the key to our analyses, will help us better understand the differences we may see between the SAS and R outputs.
If you’ve taken a workshop or a class with me – you know that I am a firm believer in experimental designs and statistical models. Once you have a research question, you can design your experiment – with your experimental design, you know what your statistical model is – with your statistical model in hand, you know what data you will be collecting – with all this information, you know what statistical analyses you will be conducting, and how you will be presenting your results. Phew! and I haven’t even collected my data yet! Just for giggles try it out with your current project and let me know how it works out.
The example dataset we will be using for this blog post and the following one comes from the Kuehl textbook, Design of Experiments: Statistical Principles of Research Design and Analysis (Example 8.1).
We will first be using the data as if it were collected from a CRD or a completely randomized design. In other words we have 24 experimental units that were randomly assigned to 6 treatment groups. With this design our statistical model is:
$\text{Nitrate}_{ij} = \mu + \text{trmt}_i + e_{ij}$
$\text{Nitrate}_{ij}$ = Stem nitrate amount of the jth observation in the ith trmt
$\mu$ = Overall mean or model intercept
$\text{trmt}_i$ = the effect of the ith treatment group
$e_{ij}$ = random error or experimental error
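To make the "randomly assigned" part of the CRD concrete, here is a minimal R sketch of how 24 plots could be allocated to 6 treatments. It is only an illustration: the treatment labels, the seed, and the balanced 4-plots-per-treatment split are my assumptions, not details taken from the Kuehl example.

```r
# Hypothetical CRD layout: 24 plots, 6 treatments.
# Assumption: a balanced design with 4 plots per treatment.
set.seed(2024)                        # arbitrary seed, only for reproducibility

trmt_labels <- paste0("T", 1:6)       # placeholder treatment names

# Build a balanced vector of labels (each treatment 4 times), then shuffle it
# so that every plot receives a treatment completely at random.
layout <- data.frame(
  plot = 1:24,
  trmt = factor(sample(rep(trmt_labels, each = 4)))
)

head(layout)
table(layout$trmt)                    # counts of plots per treatment
```

Because nothing restricts which plot gets which treatment, the only structure in the design is the treatment itself, which is why the model has just one term besides the error.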
Let’s just break this down a little bit more before we look at the data.
We have a number of observations where we’ve measured the stem nitrate content of wheat within the experimental unit – a plot in a field. We have 24 of these plots and we randomly assigned the plots to receive one of 6 treatments. In an ideal world, each of the 24 plots would be identical – but we know that this just isn’t possible. There may be differences between the plots due to their location in the field: maybe some receive more sun than others do, and we know that the soil in a plot can vary a lot for many reasons. In the end, we know that there are inherent differences between our plots – but we are confident that they aren’t “THAT” different and that we can safely assume they are similar enough to use in this experiment. Now, let’s turn our attention to our treatments. As researchers, we will do our best to ensure that the treatments are applied to our experimental units as similarly as possible. We know that it is almost impossible to ensure that the treatments are applied identically to all plots! We do our best though!
Can you see where I’m going with this?
The goal of our experiment is to have experimental units that are as similar as possible and to apply the treatments as similarly as possible, so that when we see any differences at the end of our trial, we are confident that those differences are due to the treatments applied. However, we know that this isn’t possible – that we have other sources of variation that may come into play – experimental units are not identical, applying the treatments was not perfect, etc…
When we conduct our analysis – note that we are doing an ANOVA – analysis of variance. Yes – we are looking at the variation in the measures we took – stem nitrate content in this example – and we are analyzing it or better yet – we are partitioning or breaking apart the overall variation we see in our stem nitrate measures into its components. This is why knowing the statistical model is so important!!
In our CRD – we are partitioning the variation of our nitrate measures into our treatments – nothing else! However, we recognize that we cannot explain it all and that’s why we will always have that experimental error or random error. This is the part of the variation in our measure that we just cannot explain with our data – that bit where our experimental units are not identical or that bit where we could not apply our treatments identically. In other words – if you think of the nitrate measures that were collected from this study and visualize them as a cloud of data – the variation of the measures is the cloud. The ANOVA will look at this cloud and determine if we can pull apart the different treatments – can we see a clumping of that cloud in one area that represents a treatment? Or maybe all the treatments overlap and we cannot pull them apart or partition the variation of the different treatment levels.
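For anyone who likes to see the partitioning written out, this is the standard sums-of-squares identity for a one-way layout like our CRD. The notation is mine, introduced only for illustration: $y_{ij}$ is the nitrate measure, $\bar{y}_{i.}$ is the mean of treatment $i$, $\bar{y}_{..}$ is the overall mean, and $n_i$ is the number of plots on treatment $i$.

```latex
\underbrace{\sum_{i=1}^{6}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_{..}\right)^2}_{\text{total variation (the cloud)}}
=
\underbrace{\sum_{i=1}^{6} n_i\left(\bar{y}_{i.}-\bar{y}_{..}\right)^2}_{\text{treatments}}
+
\underbrace{\sum_{i=1}^{6}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_{i.}\right)^2}_{\text{experimental error}}
```

With 24 plots and 6 treatments, the degrees of freedom split the same way: 23 in total, 5 for treatments, and 18 for error.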
Our statistical model allows us to look at our data in a couple of ways. First it helps us identify the different sources of variation in our measures, and second, it also allows us to predict values. Hmm.. what? Remember that we always need to check the assumptions of our model – and the assumptions all deal with the model residuals. How do we define residuals again? Observed minus predicted values. Predicted values come from our statistical model.
Let’s take another look at our statistical model.
$\text{Nitrate}_{ij} = \mu + \text{trmt}_i + e_{ij}$
What it’s saying is: our stem nitrate measure is made up of the overall mean + the effect of the treatment it was on + some random error. Once I run my ANOVA, I should be able to tell you what the predicted value of an observation on any given treatment should be. In other words, I could break up the measure I took for stem nitrate from my trial, and tell you what the overall mean for the trial was, and how much of the measure was attributed to the treatment it was on. Cool eh??
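Here is a small R sketch of that "breaking up" of a measure into its pieces. Please note the numbers are simulated: the nitrate values, the treatment effects, and the treatment labels are all made up for illustration and are NOT the Kuehl data. I am also assuming a balanced layout with 4 plots per treatment.

```r
# Simulated stand-in data (NOT the Kuehl dataset).
set.seed(1)
trmt <- factor(rep(paste0("T", 1:6), each = 4))   # assumed: 4 plots per treatment
made_up_effect <- c(-3, -1, 0, 1, 1, 2)           # hypothetical treatment effects
nitrate <- 40 + made_up_effect[as.integer(trmt)] + rnorm(24, sd = 2)

# Pieces of the model, estimated from the data
grand_mean  <- mean(nitrate)                 # estimate of mu
trmt_mean   <- ave(nitrate, trmt)            # treatment mean, repeated per plot
trmt_effect <- trmt_mean - grand_mean        # estimate of trmt_i
predicted   <- grand_mean + trmt_effect      # predicted value for each plot
residual    <- nitrate - predicted           # observed minus predicted

# Every observation is rebuilt from overall mean + treatment effect + residual
pieces <- data.frame(trmt, nitrate, grand_mean, trmt_effect, residual)
head(pieces)

# The same predicted values come out of a fitted one-way model
fit <- lm(nitrate ~ trmt)
all.equal(predicted, unname(fitted(fit)))
```

In a balanced design like this one the estimated treatment effects sum to zero, so the overall mean really is the centre that each treatment is measured against.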
I hope this all makes sense – as it is important to be comfortable with this as we move along to talking about how SAS and R differ.
Research question -> experimental design
Experimental design -> statistical model
Statistical model -> ANOVA
ANOVA – partitioning of variation in our outcome measure – stem nitrate amount in this example. Think of all your data as a cloud – the ANOVA will tell you whether it is able to break apart the cloud (variation) into the treatment groups.
ANOVA – also tells you how much each treatment contributes to the outcome measures. Predicted values.
Coming up next in this mini series
- R vs. SAS Series: Getting the data ready – ANOVA
- R vs. SAS Series: Conducting the ANOVA
- R vs. SAS Series: Reading the ANOVA outputs
- R vs. SAS Series: RCBD – ANOVA
- R vs. SAS Series: RCBD – Reading the ANOVA outputs