Making the Transition — A Guide for Switching from Stata to R

Track 03 · Practical Tooling · Tutorial

R
Stata
Transition
Tutorial
If you use Stata for analysis but are curious about R, this is a side-by-side guide to the equivalents — working directories, packages, data loading, exploration, missing data, renaming and labelling variables, merges, IDs, frequencies, and tabulation.
Author

Roa, J.

Published

April 23, 2023

Making the Transition — Stata to R

Roa, J.  ·  April 2023

Joseph Wright of Derby — A Philosopher Lecturing on the Orrery (c. 1766) Derby Museum and Art Gallery · Derby, England A philosopher in red robes lectures an intimate gathering by the glow of a single lamp placed in the position of the sun. Wright borrows Caravaggio’s chiaroscuro to stage Enlightenment science as theatre — each upturned face a study in wonder, calculation, or rapt attention. One of the defining images of the Age of Reason.

Who doesn’t remember their first love? In my case, in the coding world, my first love was Stata. I clearly remember my first job as a research assistant to Laura Atuesta. Learning Stata implied an important learning curve since this language was used in my university for my classes, my work, and research. I did my bachelor’s thesis with Stata, and, being honest, I just have great memories using Stata because it was the first time that I’ve feel capable of doing multiple things. There my curiosity for data science and coding started. However, like any other Story love (that may be doesn’t end well), I met R. I would say with this language my interest and passion for coding exploded. Since then, I stopped using Stata and fall in love completely for R. Syntaxis, open-source building community and the approach that you can have with this language are just one of the multiple reasons why I decided to keep learning with R. It doesn’t matter what happens, Stata will be always in the bottom of my heart and my first love since it was the first coding language that I learned.

Because of this and because I know that different people and scholars are using Stata and they are curious about what R offers, here is an essential guide of commands for doing some data wrangling, analysis, and running some models. Of course, I don’t cover plenty of things, but this is just to show the differences between using R and Stata. I’m sorry if I don’t use the new Stata approaches since I learned the version 11º or 12º and now the are in the 17º version, however, the commands still can be useful.

   

We are going to work a lot with the dplyrpackage. You need to load the package just once when you open your environment. If you have any question, please check our other material to start learning R. However, the best way to learn is try different approaches (;

Working directory


Stata logo

/*Working directory path*/
pwd 

/*Changes the working directory path*/
cd c:\myfolder\data 

Equivalent in R

RStudio logo

Show / hide code
#Working directory path
getwd()

#Changes the working directory path
setwd("my_path") 

 

Packages


Stata logo

/*Install package*/
ssc install abc

Once is installed, we don’t need to “load it”

Equivalent in R

RStudio logo

Show / hide code
#Working directory path
install.packages("tidyverse")

#Load library
library(tidyverse)

 

Help


Stata logo

/*Get help with the command regression*/
help regress

Equivalent in R

RStudio logo

Show / hide code
#Help with mutate function from dplyr package
?mutate()

 

Load Data


 

CSV Files

Stata logo

/*Import csv file*/
import delimited "my_data.csv", clear

Equivalent in R

RStudio logo

Show / hide code
library(readr)

#Use read_csv2
df_data <- read_csv2("data.csv")

#Using read.table function with comma separator
df_data <- read.table("my_data.csv", 
                      sep = ",", 
                      header = TRUE)

 

Excel Files

Stata logo

/*Import excel file*/
import excel "df_data.xlsx", 
sheet("Sheet1") firstrow clear

Equivalent in R

RStudio logo

Show / hide code
#With readxl package
library(readxl)
df_data <- read_excel("my_data.xlsx", 
                      sheet = "Sheet1")

#With openxlsx package
library(openxlsx)
df_data <- read.xlsx("my_data.xlsx", 
                     sheet = "Sheet1")

 

Stata Files (.dta)

Stata logo

/*Import excel file*/
use "df_data.dta", clear

Equivalent in R

RStudio logo

Show / hide code
#Load the package
library(haven)            

df_data <- read_dta("df_data.dta")

 

Explore Data


Remember the difference between R and Stata. While in R you need to specify which object (or data) you want to work with, in Stata, the variables are loaded, so it’s easier to work with them.

Stata logo

/* Provides the structure of the dataset */
describe

/*Basic descriptive statistics */
summarize variable1 variable2

/*Lists the variables in the dataset */
ds

/* First 10 observations */
list in 1/10 

/* Last 10 observations */
list in -1/10

/* Show first 10 observations of the */
/*  first three variables of our data */
list var1-var3 in 1/10






/*View data */
browse 

Equivalent in R

RStudio logo

Show / hide code
#Provides the structure of our data 
str(df_data)

#Basic descriptive statistics
summary(df_data)

#Lists the variables of our data 
names(df_data)

#Show first 10 observations of our data 
head(df_data, n = 10)

#Show last 10 observations of our data 
tail(df_data, n = 10)

#Show first 10 observations of the 
# first three variables of our data 
df_data[1:10, 1:3]

# With dplyr, this would be other option
library(dplyr)
df_data %>% slice(1:10) %>% select(1:3)

#View data
View(df_data) 

 

Missing Data

Stata logo

/* Provides the structure of the dataset */
missing variable1 variable2 variable3

Equivalent in R

RStudio logo

Show / hide code
#Check missing data for our entire database
is.na(df_data)

#Check missing data for our entire database
is.na(df_data$variable1)

 

Rename variables

Stata logo

/* How to rename */
rename oldname newname

rename lastname lastname2
rename firstname firstname2
rename studentstatus studentstatus2
rename averagescoregrade avgscore

Equivalent in R

RStudio logo

Show / hide code
#Rename the variables of our data
names(df_data) <- c("newvar1", "newvar2")

#With the dplyr package

library(dplyr)
df_data_final <- rename(df_data, 
                        newvar1 = var1, 
                        newvar2 = var2)

 

Label variables

Stata logo

label variable w "Weight"
label variable y "Output"
label variable x1 "Predictor 1"
label variable x2 "Predictor 2"
label variable age "Age"
label variable sex "Gender"

Equivalent in R

RStudio logo

Show / hide code
#Load labelled package
library(labelled)

#Define the value labels for the codes
labels <- c("Option 1", "Option 2", 
            "Option 3", "Option 4", 
            "Option 5")
df_data$code <- labelled(df_data$code, 
                         labels = labels)

 

Value labels

Stata logo

/* Value labels for the codes */
label define label1 1 "Option 1" 
2 "Option 2" 3 "Option 3" 4 "Option 4" 
5 "Option 5"

/* Assign the value labels to the codes */
label values code label1

Equivalent in R

RStudio logo

Show / hide code
#Load labelled package
library(labelled)

#Define the value labels for the codes
labels <- c("Option 1", "Option 2", 
            "Option 3", "Option 4", 
            "Option 5")
df_data$code <- labelled(df_data$code, 
                         labels = labels)


#Define the value labels for the codes
labels <- c("Option 1", "Option 2", 
            "Option 3", "Option 4", 
            "Option 5")
df_data$code <- factor(df_data$code, 
                       levels = 1:5, 
                       labels = labels)

In this approach, you can assign values to numeric or character variables

 

Group variables

Stata logo

/* Value labels for the codes */
collapse (mean) var1, by(var2, var3)

list var2 var3 var1

Equivalent in R

RStudio logo

Show / hide code
#Load dplyr package
library(dplyr)

#We group the data by variables 2 and 3, 
#calculate the mean of variable 1
df_data_group <- df_data %>%
  group_by(var2, var3) %>%
  summarise(mean_var1 = mean(var1))

 

Merge two datasets

Stata logo

/* Exact match on the key variable(s). */
use dataset1.dta, clear
merge 1:1 id using dataset2.dta
/* Each observation in the first dataset */
merge 1:m id using dataset2.dta
/* Each observation in the second dataset */
merge m:1 id using dataset2.dta
/* Each observation many-to-many merge */
merge m:1 id using dataset2.dta

Remember that we need to have our data1 in the same path

Equivalent in R

RStudio logo

Show / hide code
#Load dplyr package
library(dplyr)

#1:1 using inner join
df_merged_inner <- inner_join(df_1, df_1, 
                              by = "id1")

#1:m using left join
df_merged_left <- left_join(df_1, df_1, 
                            by = "id1")

#m:1 using right join
df_merged_right <- right_join(df_1, df_1, 
                              by = "id1")

#m:m using full join
df_merged_full <- full_join(df_1, df_1, 
                            by = c("id1", 
                                   "id2"))
#You can merge with more tha one variable

Create sequence/ids

Stata logo

/* Sequence of numbers from 1 to n */
gen seq = _n

/* Sequence with the total number of 
observations */

gen total_seq = _N

Equivalent in R

RStudio logo

Show / hide code
v_seq <- seq(from = 1, to = 10, by = 1)

#Load dplyr package
library(dplyr)

df_final <- df %>%
  mutate(id = 1:n())

Drop variables

Stata logo

/* Drop variable */
drop var1 var2

/* Drop all variable and keep */
drop _all
keep var1 var2 var3

Equivalent in R

RStudio logo

Show / hide code
#Load dplyr package
library(dplyr)

#Drop some variables
df_new_data <- df_old %>% select(-var1, -var2)

#Keep specific variables
df_new_data <- df_old %>% select(var1, var2)

Frequencies/Tabulation

Stata logo

/* Cross-tabulate 3 variables*/
tabulate var1 var2 var3

/* Cross-tabulate with row and 
column percentages */
tabulate var1 var2, row col

Equivalent in R

RStudio logo

Show / hide code
#Cross-tabulate 3 variables
table(var1, var2, var3)

# The ~ symbol separates 
# the row and column variables
xtabs(~var1+var2, data = dataset)

 

Statistics


Summary

Stata logo

/* Summary of variables of interest*/
summarize var1 var2 var3

Equivalent in R

RStudio logo

Show / hide code
#Summary for entire data
summary(dataset)

#Summary for a variable
summary(dataset$variable)

Linear Model

Stata logo

/* Fit a linear model */
regress mpg weight length foreign

/* Print summary */
estimates table

Equivalent in R

RStudio logo

Show / hide code
#Fit a linear model
model_lm <- lm(mpg ~ wt + qsec + vs, 
               data = df_my_data)

#Print summary
summary(model_lm)

 

Logistic Model

Stata logo

/* Fit a logistic regression model */
logit foreign weight length

/* Print summary */
estimates table

Equivalent in R

RStudio logo

Show / hide code
#Fit a logistic regression model
model_log <- glm(vs ~ wt + qsec, 
                 data = df_my_data, 
                 family = binomial())

#Print summary
summary(model)

 

Poisson Model

Stata logo

/* Fit a Poisson regression model */
poisson accidents weight length foreign

/* Print summary */
estimates table

Equivalent in R

RStudio logo

Show / hide code
#Fit a Poisson regression model
model_poiss <- glm(am ~ wt + hp, 
                   data = df_my_data, 
                   family = poisson())

#Print summary
summary(model)

 

Plots


Scatter plot

Stata logo

/* Create scatter plot */
scatter price mpg

Equivalent in R

RStudio logo

Show / hide code
# Load data and ggplot2
library(ggplot2)
data(df_my_data)

#Create scatter plot
ggplot(df_my_data, aes(x = mpg, y = wt)) + 
  geom_point() + #Here is the scatter
  xlab("Miles per gallon") +
  ylab("Weight")

 

Line plot

Stata logo

/* Create Line plot */
twoway line le_w le_y, xlabel(1950(10)2010)

Equivalent in R

RStudio logo

Show / hide code
# Load data and ggplot2
library(ggplot2)
data(df_my_data)

#Create line plot
ggplot(df_my_data, aes(x = year, y = pop)) + 
  geom_line() + #Here is the line
  xlab("Year") +
  ylab("Population")

 

Bar plot

Stata logo

/* Create a Bar plot */
graph bar (count) rep78, over(foreign)

Equivalent in R

RStudio logo

Show / hide code
# Load data and ggplot2
library(ggplot2)
data(df_my_data)

#Create bar chart
ggplot(data = df_my_data, 
       aes(x = factor(cyl))) +
  geom_bar() + #Here is the bar
  xlab("Number of cylinders") +
  ylab("Count")

 

I also want to provide a good cheat sheet that contains more information about the differences between Stata and R.

Cheat Sheet: Stata to R


Stata to R cheatsheet — function and command mapping (page 1)

Stata to R cheatsheet — function and command mapping (page 2)

 

RStata

Believe it or not, R has a package called RStata. This package Stata interface allows the user to execute Stata commands (both inline and from a .do file) from R. I leave herethe package. Looks interesting; however, I personally prefer to use a language as expected.

Conclusion


In conclusion, the decision to transition from one to the other should be made based on the user’s specific needs and preferences. This notebook shows you the differences between Stata and R. At the end, use the language you prefer, but in my own experience, R was a life changer.

 

References


Note

Jutkowitz, E., Pizzi, L. T., Shewmaker, P., Alarid‐Escudero, F., Epstein‐Lubow, G., Prioli, K. M., … & Gitlin, L. N. (2023). Cost effectiveness of non‐drug interventions that reduce nursing home admissions for people living with dementia. Alzheimer’s & Dementia.