By Ivan Kisjes
R is a free, open source, flexible and data analysis package. It has powerful visualisation tools and is very versatile and customizable. You can use R for serious analysis of any data set, and because the way you work in R is programmatic, your results are reproducible and repeatable.
But your data is in Excel, or Access… how can you use R on it?
First you’ll want to install R (from here). I’d recommend picking the GUI version. For Windows, pick the ‘base‘ version.
If your data is in Excel:
1) Install Perl (Windows or OSX/Linux)
2) Start the program R
3) Install the ‘gdata’ package inside R (‘packages’ menu, option ‘Install package(s)’. It will ask you to pick a repository, so pick one close to you geographically. Then select ‘gdata’ from the list of packages presented and click ‘OK’). (Note: there are many other packages that can get your data from Excel into R, this is just one example.)
4) Type “require(gdata)” (without the quotes) into the “R Console” window, press Enter. This ‘loads’ the package gdata so you can use it.
5) Type the following line into the R console:
myData <- read.xls (“c:/tmp/Book1.xlsx”, sheet = 1, header = TRUE)
What this does is create a spot in the computers memory (a so-called ‘variable’) with the name “Mydata”, and assign the data from your Excel file to it (hence the little ‘arrow’ pointing left). It only reads the first sheet (sheet=1) of the Excel file. Of course substitute “c:/tmp/Book1.xlsx” with the path to your file. You do need to substitute the backslashes in the file path with forward slashes. The last bit, header=TRUE means that the top row of the Excel sheet contains the names of the columns. If you want to see what happened just type the name of the variable, in my case “myData”, and it shows everything in there. Now your data is inside R and you can use any R function on your data set. For example, you can tally counts of rows and columns with a simple command like:
myTable <- table(myData$Title, myData$Year)
This short command creates a new variable we call “myTable”, which is assigned a 2-way frequency table over the title and year inside the Excel sheet we imported into “myData” – a dollar sign separates the name of the variable that stores the sheet (myData) and the name of the column in Excel (Title, Year). If you then type “myTable” and press enter, R will show you the nicely tallied table.
If your data is in MS Access:
1) Start R
2) Install the ‘RODBC’ package inside R (‘packages’ menu, option ‘Install package(s)’. It will ask you to pick a repository, so pick one close to you geographically. Then select ‘RODBC’ from the list of packages presented and click ‘OK’).
3) type “library(RODBC)” (without the quotes) into the R Console window. This ‘loads’ the RODBC package so you can use it.
4) Type the following into the R Console, substituting the path to your own database:
channel <- odbcConnectAccess2007(“C:/tmp/Database201.accdb”)
Now your Access database is available inside R already, through the variable you just named “channel”! You can easily run queries (in SQL) on it by typing things like
data <- sqlQuery(channel, paste(“select * from Table1”))
into the R Console window. The above command populates a variable called “data” with the results of the sqlQuery “select * from Table1” on the data stored in the database connection “channel”. Table1 is the name of the table you want to query in your database, the asterisk means that all the columns are selected in the result.
If your data is in a MySQL database:
1) Start R
2) Install the ‘RMySQL’ package inside R (‘packages’ menu, option ‘Install package(s)’. It will ask you to pick a repository, so pick one close to you geographically. Then select ‘RMySQL’ from the list of packages presented and click ‘OK’).
3) Type “library(RMySQL)” (without the quotes) into the R Console window. This ‘loads’ the RMySQL package so you can use it. 4) Connect to the database you want to use by typing
mydb = dbConnect(MySQL(), user=’user’, password=’password’, dbname=’database_name’, host=’host’)
into the R Console, substituting the login information for your server between the single quotes. This creates a connection to your database, and it stores the connection into a variable we named “mydb”. So you can use this “mydb” to query the database, like this: dbSendQuery(mydb, ‘select * from Table1’)
Some links to documentation for R: