This workshop will walk through merging datasets and creating new variables in SAS. You can choose to do all these steps in Excel and many people do. But sometimes it can be easier to do it all in SAS. Use Excel to enter your data, but then let SAS do the data manipulations and analysis. One of SAS’ strengths is in data manipulation!
Let’s start by bringing the data into SAS. Here is a link to an Excel file called Wksp2_data. Download the file and save it on your computer. There are 4 worksheets in this file. I will leave it up to you as to how you bring the data into SAS, but you will need to bring all 4 sheets in as 4 different SAS datasets. Let’s call the January worksheet – January, the February one February, and so on.
End goal: 4 SAS datasets: January, February, March, April
Review your LOG window after you bring each one in to make sure you haven’t missed anything. Then run a Proc Print to ensure the data looks like you were expecting it – in other words, it should look like the Excel worksheets.
Best Practice Note: Add a comment line before each Proc to describe what you’re doing! This way you’ll remember when you go back to review your syntax.
When you think about merging datasets there are 2 ways that you would want to merge. Across or down.
Across or adding Variables
We have 3 months of data where the IDs are the same: January, February, and March. We want to add the weights taken from February and March and add them to the end of the January file. So we’re adding variables in this case.
To accomplish this we need to take 2 steps. The first is to sort each dataset to make sure they are in the same order.
Proc sort data=January;
Proc sort data=February;
Proc sort data=March;
Create a new Dataset for the merged data. Remember that the Data statement saves the data using the name you give it – so let’s call it jan_mar – for January to March.
We’re then going to tell SAS that we want to merge the 3 datasets and we want to merge them by ID.
merge january february march;
What does your LOG say? Is it right? What is the best way to make sure SAS has done what you wanted it to do?
What happened? How should we fix this?
We had 3 files that had the same variable names in each. So merging them, by adding variables didn’t really work, because we were not adding new variables, but we were replacing the contents of the variables month and weight. To fix this, we need to call the variables something different in each of the months. I will add the _jan, _feb, _mar, and _apr to the weight variable in dataset. I will also change the variable month month_1, month_2, etc…
Make these changes and rerun. Did it work this time?
You should now see where there are missing data too. Something that was not apparent the first time we ran this.
Adding Observations or Down
We have a 4th dataset that contains weight measurements taken in January and April, but you’ll notice that these belong to individuals who were not included in the first 3 data files of the trial. So we will need to add these to the bottom of the dataset currently called jan_mar.
We already have a file called April that contains the April data and now we have one called jan_mar that contains the merged data from January to March. Since these individuals were not included in the original data, we do not need to sort them, since ID plays no role here.
To add the data to our merged dataset we use the SET command rather than the MERGE command:
set jan_mar april;
Proc print data=jan_apr;
Creating New Variables
We now have all the weight data for the individuals in our trial and would like to calculate the individual weight gains from January to March. As noted before we can calculate these in Excel, but let’s use SAS to do it, especially since we have a new merged dataset.
In order to calculate the weight gain, we will be touching the data, and whenever we touch the data, we need to work within a DATA step. So let’s create a new dataset and call it jan_apr_wtgain.
Since we will be using a dataset that is already available to us in SAS we use the SET command again to recall the jan_apr dataset.
Now we can create our new variable. Let’s call it wtgain and it will be the difference between the weight taken in January (weight_jan) and the weight taken in March (weight_mar).
wtgain = weight_mar – weight_jan;
Run a Proc Print to check your data. Did it work?
Recoding a variable
Sometimes we have a variable that we want to recode – so in our case we are going to create a new variable called wtclass that will take the weights measured in January and put them into 3 weight classes: 1 = 13-16; 2 = 17-20; 3 = 21-24
There are a number of different ways to accomplish this, I will post one here and depending on time in the workshop, I will show you others. I will also include the other options in the accompanying SAS syntax – to be posted after the workshop.
We are working with data again, so need to work within a DATA step. Let’s use the jan_apr_wtgain dataset.
wtgain = weight_mar – weight_jan;
if weight_jan < 17 then wtclass = 1;
if weight_jan ge 17 and weight_jan < 21 then wtclass = 2;
if weight_jan >20 then wtclass = 3;
Run a Proc Print and see what happened. Are we happy with these results?
What can SAS do?
There are so many more manipulations that SAS can do. These are just a couple of them and ones that may help you out as you start using SAS for your own research data. Check out others that have been discussed in the past on the SASsyFridays blog