The last workshop for the S18 SAS series will deal with estimating linear curves using PROC REG and PROC GLIMMIX and estimating non-linear curves using PROC NLIN and PROC NLMIXED. I will be using examples from Bowley, S.R. 2015. A Hitchhiker’s Guide to Statistics in Biology: Generalized Linear Mixed Model Edition. Plants et al. Inc., Guelph, Ontario.
It’s funny how many people I talk with that remember the following phrase:
Y = mX + b
Our math teachers in high school did a great job with this equation, since most of us remember it. We also tend to remember what the different parts mean:
m = slope or rise/run
b = Y-intercept or where our line crosses the Y-axis
As soon as we say Y=mX + b, the anxiety that accompanies statistics seems to disappear. Of course! we can do this! But, how in the world do we accomplish this in SAS. No worries, it is fairly straightforward, as long as you remember your goal of Y = mX + b.
We will be using Example 5.3 to demonstrate a Simple Linear Regression. As statistical procedures mature and new SAS PROCs are developed, it turns out that you can run this analysis in several PROCs. However, we will concentrate on 2: PROC REG and PROC GLIMMIX.
Proc reg data=linear_reg;
model absorb = glucose;
Proc reg data=linear_reg; Calls the PROCedure REG and using the linear_reg dataset
model absorb = glucose; This is our Y = mX + b – we only have the Y and X in our dataset – absorb = Y and glucose = X. e want SAS to estimate m and b
Quit; <- We need to add a Quit for PROC REG, it will keep running in the background until we give it more instructions. So let’s finish it off by adding the quit. It gives us results (if successful) regardless.
From our output our simple linear regression equation is:
Y = 0.00073*X + 0.0047 or absorb = 0.00073 * glucose + 0.0047
Now many of us are using PROC GLIMMIX and it is just one of the many PROCs in SAS that can also estimate a linear regression equation. To run this please use the following coding:
Proc glimmix data=linear_reg;
model absorb = glucose / solution;
random block / solution;
covtest “block=0” 0 .;
Notice that we have a random effect of Block? With PROC REG we cannot account for any random effects that may be in our design. This is one of the benefits of using PROC GLIMMIX, you can do a much better job at modelling your data correctly.
Proc glimmix data=linear_reg; Calls the PROC GLIMMIX and specifies the linear_reg dataset
class block; Lets SAS know that we have a classification or grouping variable: BLOCK
model absorb = glucose / solution; Here is our Y = mX + b. In order to obtain the estimates for our slope and our Y-intercept, we need to add the option SOLUTION after our model statement.
random block / solution; Our design had a random BLOCK, to better represent our design we are adding the random effect of BLOCK. By adding the option SOLUTION we will obtain an estimate for the BLOCK effect
covtest “block=0” 0 .; Of course, we want to know whether the BLOCK effect is significantly different from 0, so we will add the COVTEST statement here
Multiple Linear Regression
If you have several independent variables you want to use in a regression equation, or in other words you want to be able to predict your Y with several Xs, you would be estimating a multiple linear regression. You can use either of the 2 PROCs listed above. I, personally, prefer using PROC REG for these situations, as it allows me to use a number of selection methods to help me determine the best fitting equation. I will not cover these in this session, but may offer a workshop on this topic in the Fall.
Let’s be honest and ask ourselves, how many of us have only linear regressions? I would guess that many of us have non-linear regression curves, but shy away from working with them, because it just seems SOooo challenging! Well, let’s tackle the challenge head-on! What better way of doing that? With examples!
We will be working through 2 examples here Example 5.5 and Example 5.7.
Let’s start with Example 5.5 : Estimating a curve with a peak.
Let’s work through the Data Step together in class. When it comes to your own data and analysis, you may have the data in an Excel spreadsheet all set for the analysis, then by all means, bring in the data in that format. These examples showcase different ways of bringing data into SAS.
Take advantage of your PROC PRINT.
Proc print data=nonlin_peak;
One of the most basic ways of debugging your data entry in SAS. Take a look at it! For the first example we will use PROC NLIN and PROC NLMIXED. PROC NLIN is a very common PROC used to estimate non-linear regression curves, however, with the introduction of PROC NLMIXED, many are now moving to using it rather than NLIN. So, we’ll work through the first example with both PROCs and only use NLMIXED for the second example.
Challenges encountered when estimating non-linear curves are twofold:
- What curve to use?
- What starting parameters to use?
In order to use SAS to estimate non-linear curves, you need to have an idea as to what the curve you are trying to fit to your data. SAS in incapable of reading your data and fitting a curve to it, we need to guide it by providing a curve or an equation to work with. On top of this, we also need to guide SAS with starting values for the parameters we are trying to estimate. Unlike our Y = mX + b, where we are estimating the m(slope) and b(Y-intercept) parameters, we are estimating parameters that determine the rate, the curve, the intercept, etc… many different aspects of a curve.
A great resource for helping you decide on the curve to use is the Catalog of Curves for Curve Fitting, created by the Ministry of Forests for the Province of British Columbia. The authors, V. Sit and M. Poulin-Costello, identify a large number of curves and describe each curve with a different set of parameter estimates, providing the reader with a better understanding of each curve. To read and download this document, visit the Research Branch of Forests, Lands, Natural Resource Operations & Rural Development in British Columbia.
The curve we are going to fit to our data is:
As you can see we have 4 parameters that we need SAS to estimate. This means we also have 4 parameters that we need to find starting values for. The question that always comes up – where do I find starting values? Or can I just use 0? The idea behind selecting optimum starting values, is that you are providing SAS a place to start that is close to the “right” values for your dataset. I know, near impossible, especially if this is innovative research and the first time you are trying this curve with this data. So, where can you find starting values? Published papers – maybe one you’ve used in your literature review? The data will not match your data, the curve will, so use their parameters as starting values for your curve.
Another source for starting parameters – if you are familiar with the type of curve you are fitting and understand what each of the parameters represents, try plotting your data and estimating the parameters and use these as the starting values.
Last but not least, using 0 as a starting value is not recommended. SAS has no idea which way to go, so it may go positive the first time, run the results, then as a check it might go negative, no convergence is reached, so it goes positive again, then negative – you get the idea. It can bounce around a lot and it may never converge.
Selecting starting parameters can be a daunting task, as a last resort, select a number that should be in the range and let the program run its course.
Proc nlin data=nonlin_peak;
parms B = 0.02 C = 0.001 X0 = 10 Y0 = 1;
model y = Y0 * exp(-B*((x – X0)**2) + C*((x – X0)**3));
output out=nlin_out pred=pred;
Proc nlin data=nonlin_peak; We are using the PROCedure NLIN and specifying the dataset nonlin_peak.
parms B = 0.02 C = 0.001 X0 = 10 Y0 = 1; In this curve we have 4 parameters we need to estimate: B, C, X0, and Y0. The parms (short for parameters) provides SAS with the starting values.
model y = Y0 * exp(-B*((x – X0)**2) + C*((x – X0)**3)); This is the curve or equation we are fitting our data to. This should match the equation listed above – carefully translated into SAS.
output out=nlin_out pred=pred; Similar to GLIMMIX we are saving our predicted values into a dataset called nlin_out.
Why do we save the predicted values? What do we use them for? One of the most commonly used and probably best understood statistics for a regression is an R-square. However, when you’re working with non-linear curves, there is no R-square. So, a way around it is to calculate Effron’s Pseudo R-square. To do this, we will calculate the square of the correlation coefficient between the predicted values and the actual values of our Y. If you review the SAS code you downloaded you will see the following:
Proc corr data=nlin_out;
var y pred;
We are running a PROC CORR using the dataset that was saved from our PROC NLIN. Then we are calculating the correlation between our Y and the pred(icted). When you run this you obtain a correlation coefficient of 0.99467. When you square this you get: 0.989 – not bad!
There was also a plot included in the code – when you run it and observe it, you can see that YES! we did a great job with this curve.
Proc nlmixed data=nonlin_peak;
parms B = 0.02 C = 0.001 X0 = 10 Y0 = 1 s2e = 1;
LP = Y0 * exp(-B*((x – X0)**2) + C*((x – X0)**3));
eta = LP;
model y ~ normal(eta,s2e);
predict LP out=second;
Proc nlmixed data=nonlin_peak; Calling on the PROC NLMIXED and using the nonlin_peak data
parms B = 0.02 C = 0.001 X0 = 10 Y0 = 1 s2e = 1; Same as the PROC NLIN, but this time we also need to provide a starting value for the variance
LP = Y0 * exp(-B*((x – X0)**2) + C*((x – X0)**3)); This is the MODEL statement in PROC NLIN – in PROC NLMIXED we give it a name – here we’ve called it LP for Linear Predictor
eta = LP; creating a new variable called eta – that we will use in the model statement
model y ~ normal(eta,s2e); model statement in NLMIXED looks a little different – reason being that we can select different distributions for our data. In this case we are using the normal distribution. So we are telling SAS that we have a Y variable which we want to model using the eta (LP) with an estimated variance (s2e) of 1 from a normal distribution.
predict LP out=second; Similar to the OUTPUT statement in PROC NLIN. We are asking for the predicted values from LP to be saved in a dataset called second.
Similar to what we did with PROC NLIN, we will calculate the correlation coefficient between our actual and our predicted values to determine Effron’s Pseudo R-square.
Dose Response Example – 5.7
We’ll work through this one together in class.