Before I learn new software or a new skill, I like to do some homework and ask the simple questions (what, when, why, and how) to give me a base understanding of the software. So, let’s work through these questions for R.
What is R?
R is a system for statistical computation and graphics. It has a number of aspects to it, including a programming language, graphics, interfaces to other languages, and debugging capabilities. I have found that many people do not refer to R as a statistical software package, because it can do so much more.
What does this all mean? It means that R is a very robust program that folks use for a variety of reasons; it’s not just for statistical analysis!
Where did R come from?
The history of software packages can be quite interesting to learn about. For instance, R has been described as a “dialect” of S. Some of you may remember the statistical software called S-Plus, a commercial implementation of S? Well, that’s where R comes from. S was developed at Bell Labs in the 1970s, R itself first appeared in the early 1990s, and it has been one of the fastest growing open-source software packages since.
What does “open-source” mean?
I’m sure you’ve heard this term in the past or in different contexts. One thing that you will hear when people talk about R is that it is free or that it is open-source. Keep in mind that open-source means that the software is freely available for people to use, modify, and redistribute, which usually translates to: there is no cost to acquire and use the R software! Another aspect of open-source is that it is, or rather can be, community-driven. So, any and all modifications to the software and the subsequent documentation (if it exists) are driven by the community.
Please note that R has matured over the years. Today’s R community is extremely strong and encourages documentation for anything that is released, which makes R a very desirable product. This is not always the case with open-source software.
Who uses R?
Businesses, academics, statisticians, data miners, students, and the list goes on. Maybe we should ask who is NOT using R, and then ask why?
There are so many different statistical software options today, and which one you choose will depend on several factors:
- What does your field of study use?
- If you are a graduate student, what does your supervisor suggest and use?
- What type of analyses are you looking to perform and does your program of choice offer those analyses?
- What types of support do you have access to?
How does R work?
If you’re looking for a statistical package that is point and click, R is not for you! R is driven by code. YES! You will have to learn how to write syntax in R. You can use R interactively through RStudio, and you may never reach a point in your studies or your research where you need to move away from the interactive capabilities of R, so no big worries! Besides, today there are a lot of resources available to help you learn how to use R. So don’t let that stop you!
Base R and Packages
When you download and install R, the Base R program is installed. To run many analyses you may need to install a package. What is a package? It is a collection of functions, data, and documentation that extends the capabilities of the Base R program. Packages are what make R so versatile! As we work through our workshops and associated code, I will provide you with the name of each package. There are a number of ways to acquire and install packages, and we will review these as we work through them. Please note that several packages may perform a similar analysis, so please read the documentation before selecting a package to use.
I will add a page to this Blog in the near future (Summer 2018) that will list the packages and associated documentation that I have used and recommend.
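As a rough sketch of the install-then-load routine described above: the package name below is only a stand-in. I picked grid because it ships with every R installation, so the example runs without a network connection; substitute the name of the package you actually need.

```r
# "grid" is a stand-in chosen for illustration; it ships with R itself.
pkg <- "grid"

# Install only if the package is not already available on this machine.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Attach the package so its functions can be called by name.
library(pkg, character.only = TRUE)

# Peek at the first few functions the package provides.
head(ls(paste0("package:", pkg)))
```

You only need to install a package once per computer, but you need to call library( ) in every new R session that uses it.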
How do I acquire R? Where can I download it?
Visit the Comprehensive R Archive Network (CRAN) website to download the R software: https://cran.r-project.org/ Please note that this is also the website used to download the packages we will use in future analyses.
To download RStudio, visit the RStudio website at https://www.rstudio.com/
Both websites have comprehensive instructions to assist you with the installation on your own computer.
Let’s get started by reviewing some definitions
When you think about conducting any statistical analysis, your starting point is data. So let’s start with a few definitions of the different data types observed in R.
Numeric, Character, or Logical
A quick overview of the different types of data you can work with in R.
- Numeric = numbers
- Character = words
- Logical = TRUE or FALSE – not all data is in the form of numbers or letters; sometimes you might have data that has been collected as matching a criterion (TRUE) or not matching a criterion (FALSE). We’ll work through examples of this in another session; for now just be aware that this type of data is commonly used in R.
- How do you find out what form your data are in?
- One way is to pass your data to the class( ) function. The result of the second statement below will tell you what form your data are in; here it is "numeric".
testform <- c(12, 13, 15)
class(testform)
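To see the three data types side by side, here is a small sketch using class( ). The object names are made up for illustration.

```r
# One small vector of each basic type (names are illustrative only).
num_example <- c(12, 13, 15)
chr_example <- c("green", "blue")
lgl_example <- c(TRUE, FALSE)

class(num_example)       # "numeric"
class(chr_example)       # "character"
class(lgl_example)       # "logical"

# The is.*() helpers answer a yes/no question about the type.
is.numeric(num_example)  # TRUE
is.character(num_example)  # FALSE
```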
Numeric Classes in R
Numbers are handled in a couple of ways in R. These are referred to as the numeric classes of R, and the two that we will use are known as integer and double. Having a basic understanding of these numeric classes will come in handy.
- If you think back to high school math, you’ll probably remember the term “integer”. The first thing that comes to my mind when I think of an integer is a whole number: no fractions, no decimal places.
- As you can imagine, storing numeric data as integers does not require a lot of space. So, in terms of computing, if you do not foresee your analysis needing decimals, then integers are the way to go.
- Double precision floating point numbers – think of this as the decimals side of your numeric data.
- Storing double numeric data takes up more space than integer data. But sometimes you’re just not sure what you will need, so R will convert between the two numeric classes as your analysis requires.
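A short sketch of the two numeric classes in action. One detail worth knowing: a number typed without any suffix is stored as a double, and the L suffix is how you ask for an integer. The object names here are invented for the example.

```r
whole <- 5L          # the L suffix asks R for an integer
plain <- 5           # without the suffix, R stores a double

typeof(whole)        # "integer"
typeof(plain)        # "double"

# Division always returns a double, so R switches classes for us.
half <- whole / 2L
typeof(half)         # "double"
half                 # 2.5

# Sequences built with : are integers.
is.integer(1:3)      # TRUE
```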
Data Types in R
Let’s review the different data types available to you in R.
- Let’s not panic at some of these terms, but work through examples of each, starting with the vector. Think of a vector as a column of data or one variable.
- Vectors can be numeric, character, or logical.
- How to create a vector:
# a numeric vector
a = c(2, 4.5, 6, 12)
# a character vector
b = c("green", "blue", "yellow")
# a logical vector (the c( ) function is still needed here)
c = c(TRUE, TRUE, FALSE, TRUE)
a = , b = , and c = create vectors called a, b, and c respectively. Please note that a <- does the same job as a = here.
c(x, x, x) tells R that we are creating a vector, or a column, with the contents found in the parentheses. Each , separates one value from the next, so each value becomes the next row in the vector/column being created.
Character values must be contained in " ", but logical values must not be.
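Once a vector exists, you can ask R about it and pull values out of it. This is a minimal sketch with made-up object names; it also shows what happens if you mix types in one vector, since a vector can only hold one type.

```r
heights <- c(2, 4.5, 6, 12)

length(heights)   # 4 values in the vector
heights[2]        # 4.5, the second value
heights * 2       # 4 9 12 24, arithmetic is applied to every value

# Mixing types forces R to pick one type for the whole vector.
mixed <- c(1, "blue", TRUE)
class(mixed)      # "character"
mixed             # "1" "blue" "TRUE"
```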
- Think of a matrix as an object made up of rows and columns.
- The vectors within a matrix must all be the same type, so all numeric, or all character, or all logical.
- How to create a matrix:
# creates a 5 x 4 numeric matrix – 5 rows by 4 columns
y <- matrix(1:20, nrow = 5, ncol = 4)
y = or y <- – creates a matrix called y
matrix( ) – calls the function matrix to create the matrix y
1:20 – the values of the matrix, filled in column by column
nrow = – lets R know how many rows are in the matrix that you are creating
ncol = – lets R know how many columns are in the matrix that you are creating
Resulting matrix y will look like:
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
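A quick sketch of how to look inside the matrix once it has been created. The byrow = TRUE variant and the name y_by_row are additions for illustration; they show how to fill the matrix row by row instead of column by column.

```r
y <- matrix(1:20, nrow = 5, ncol = 4)

dim(y)        # 5 4, rows then columns
y[2, 3]       # 12, the value in row 2, column 3
y[, 1]        # 1 2 3 4 5, the whole first column

# Fill row by row instead of the default column by column.
y_by_row <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE)
y_by_row[2, 3]  # 7
```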
- Arrays are very similar to matrices. Think of an array as a matrix with an added dimension. For example, we may have a matrix that contains data for 2015, and we want to add the same data for 2016 in the same format. We can create an array made up of one matrix that contains the 2015 data and a second matrix that contains the 2016 data.
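The 2015 and 2016 example could be sketched like this. The numbers and object names are invented purely for illustration.

```r
# Two matrices with the same shape, one per year (values are made up).
counts_2015 <- matrix(1:6,  nrow = 2, ncol = 3)
counts_2016 <- matrix(7:12, nrow = 2, ncol = 3)

# Stack them into an array: 2 rows, 3 columns, 2 years.
yearly <- array(c(counts_2015, counts_2016),
                dim = c(2, 3, 2),
                dimnames = list(NULL, NULL, c("2015", "2016")))

dim(yearly)            # 2 3 2
yearly[, , "2016"]     # the whole 2016 matrix
yearly[1, 2, "2015"]   # 3, row 1, column 2 of the 2015 matrix
```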
- A Data Frame is a more general form of a matrix. What this really means is that a data frame is like the datasets we use in other programs such as SAS and SPSS. The columns or variables do not need to be the same type, as is required in a matrix.
- We can have one vector/column/variable in a data frame that is integer (numeric), followed by a second one that is character, followed by a third that is logical. But in a matrix, all three vectors/columns/variables must be the same type: numeric, character, or logical.
- How to create a data frame:
d <- c(10, 12, 31, 4)
e <- c("blue", "green", "red", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
sampledata <- data.frame(d, e, f)
names(sampledata) <- c("ID", "Colour", "Passed") # variable names
sampledata <- or sampledata = – the name of the data frame that we are creating
data.frame( ) – calling on the function that creates a data frame
d, e, f – tells R that we are creating the data frame with the 3 vectors in the order of d, followed by e, followed by f
names(sampledata) – providing variable names within the data frame
c("ID", "Colour", "Passed") – creating or identifying the 3 variable names within the data frame: ID, Colour, and Passed
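Here is a sketch of the same data frame built in one step, followed by a few ways to inspect it. Naming the columns inside data.frame( ) and setting stringsAsFactors = FALSE are my own choices for the example; the second one keeps Colour as character data regardless of which version of R you have installed.

```r
sampledata <- data.frame(
  ID     = c(10, 12, 31, 4),
  Colour = c("blue", "green", "red", NA),
  Passed = c(TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE   # keep Colour as character in every R version
)

sapply(sampledata, class)    # one type per column: numeric, character, logical
nrow(sampledata)             # 4 rows
sampledata$Colour            # pull out a single column by name

# Keep only the rows where Passed is TRUE.
passed_only <- sampledata[sampledata$Passed, ]
nrow(passed_only)            # 3
```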
- A list is an ordered collection of objects.
- objects in the list do not have to be the same type.
- You can create a list of objects and store them under one name.
- How to create a list:
# a string, a numeric vector, a matrix, and a scalar
wlist <- list(name = "Fred", mynumbers = a, mymatrix = y, age = 5.3)
wlist <- or wlist = – creating a list called wlist
list( ) – calling the function to create a list
name = "Fred", mynumbers = a, mymatrix = y, age = 5.3 – the values that are to be contained in the list called wlist
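To get objects back out of a list, you can use $ or [[ ]] with the name you gave each one. This sketch rebuilds a and y first so that it runs on its own.

```r
# Rebuild the objects that go into the list.
a <- c(2, 4.5, 6, 12)
y <- matrix(1:20, nrow = 5, ncol = 4)
wlist <- list(name = "Fred", mynumbers = a, mymatrix = y, age = 5.3)

names(wlist)              # "name" "mynumbers" "mymatrix" "age"
length(wlist)             # 4 objects in the list

wlist$name                # "Fred"
wlist[["mynumbers"]][2]   # 4.5, second value of the numeric vector
wlist$mymatrix[5, 4]      # 20, last cell of the matrix
```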
Factors are categorical variables in your data. You can have a nominal factor or an ordinal factor. Yup, those words again. Remember that nominal and ordinal data are categorical pieces of data, so each observation falls into one group or another. With nominal data there is no relationship or order among the categories, whereas with ordinal data there is an order to the different levels.
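A minimal sketch of both kinds of factor. The colour and size values are invented for illustration.

```r
# Nominal factor: categories with no order.
colour <- factor(c("blue", "green", "blue", "red"))
levels(colour)     # "blue" "green" "red" (sorted alphabetically by default)
table(colour)      # counts per category: blue 2, green 1, red 1

# Ordinal factor: we state the order of the levels ourselves.
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)
is.ordered(size)   # TRUE
size[1] < size[2]  # TRUE, because small comes before large
```

Comparisons such as < only make sense for the ordered factor; R will warn you if you try them on a nominal one.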