Three tips to process large datasets in Stata or R

The increasing availability of large-scale public datasets is a goldmine for many researchers and data analysts. For example, great potential resides in data from Wikipedia (~300 GB per month), OpenStreetMap (~70 GB), and Reddit (~600 GB). However, getting such large datasets ready for analysis is often difficult. Stata, for example, refuses file inputs that are larger than the available RAM in your computer. Of course, we might use cloud computing services such as those from Amazon and Google, but this requires a research budget, setting up a customized environment, and a constant Internet connection.

In this blog post, I want to share three best practices for dealing with large datasets and getting them into statistical software like Stata.

1. Work with CSV files

Datasets come in different shapes: some are JSON files, some are XML, and there are many more formats. While statistical software lets you import files in various formats, I always recommend converting them into CSV (comma-separated values) files first. The reason is that CSV is probably the leanest file format, as it does without the (potentially duplicated) meta information that JSON or XML files carry, such as field names repeated in every record. Converting to CSV can considerably reduce the size of the input files. There are various converters available, and I will share some of mine in upcoming posts.
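As a rough illustration of such a conversion (not one of the converters mentioned above), here is a minimal Python sketch. It assumes the input is newline-delimited JSON, i.e., one object per line, which is only one of several layouts you may encounter. The function name `jsonl_to_csv` and the field names in the usage note are hypothetical.

```python
import csv
import json


def jsonl_to_csv(src_path, dst_path, fields):
    """Copy the selected fields of each JSON record into a CSV file.

    Assumes one JSON object per line. Records are read one at a time,
    so the whole file never has to fit into memory.
    Returns the number of records written.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fields)
        writer.writeheader()
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip empty lines
            record = json.loads(line)
            # Keep only the requested fields; missing ones become empty cells.
            writer.writerow({name: record.get(name, "") for name in fields})
            written += 1
    return written
```

A call such as `jsonl_to_csv("comments.jsonl", "comments.csv", ["id", "score"])` would keep just those two columns, so the field names are stored once in the header instead of once per record.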

2. Split the input files

One way to deal with large datasets is to cut them into chunks and then process each chunk in a batch. When working with CSV files, there is a little tool called the Free Huge CSV File Splitter, which does its job perfectly fine for me. For batch processing all files in a directory using Stata, the following code helps:

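**create a placeholder dataset so that the first append has a file to use
clear
set obs 1
gen x=1
save "output.dta", replace
cd "folder"
local commits : dir . files "*.csv"
foreach file of local commits {
    import delimited "`file'", clear

    **do all the processing here

    append using "..\output.dta"
    save "..\output.dta", replace
}

**remove the placeholder row and its marker variable
drop if x==1
drop x
save "..\output.dta", replace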

The code loops over all files in the folder called “folder”, processes them, and eventually writes them into one output file.
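If you prefer a scripted alternative to a GUI splitter, the splitting step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the CSV has a single header row, and each chunk should repeat that header so it can be imported on its own. The function name `split_csv` and the `chunk_0001.csv` naming pattern are hypothetical.

```python
import csv
import os


def split_csv(src_path, out_dir, rows_per_chunk):
    """Split a CSV file into chunks of at most rows_per_chunk data rows.

    Every chunk starts with a copy of the header row.
    Returns the list of chunk paths in the order they were written.
    """
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    writer = None
    try:
        with open(src_path, newline="", encoding="utf-8") as src:
            reader = csv.reader(src)
            header = next(reader, None)
            if header is None:
                return paths  # empty input file, nothing to split
            for i, row in enumerate(reader):
                if i % rows_per_chunk == 0:
                    # Start a new chunk file and repeat the header.
                    if out is not None:
                        out.close()
                    name = f"chunk_{len(paths) + 1:04d}.csv"
                    path = os.path.join(out_dir, name)
                    out = open(path, "w", newline="", encoding="utf-8")
                    writer = csv.writer(out)
                    writer.writerow(header)
                    paths.append(path)
                writer.writerow(row)
    finally:
        if out is not None:
            out.close()
    return paths
```

Because the input is streamed row by row, the file being split can be much larger than the available RAM.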

3. Get rid of strings as much as possible

String processing is among the most computation-intensive operations. Try to get rid of string data as much as possible, even before importing the data into Stata. Many datasets include hash codes or control strings that are completely unnecessary for your analysis but blow up the size of the dataset. Before importing files into Stata, I use EmEditor to take a first look at the structure of the dataset. I then drop the unnecessary string columns and only then import the file into Stata.
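Dropping columns can also be scripted. The following Python sketch is one possible way to do it, not the EmEditor workflow described above. It assumes a CSV with a header row; the function name `drop_columns` and the column names in the usage note are hypothetical.

```python
import csv


def drop_columns(src_path, dst_path, columns_to_drop):
    """Copy a CSV file while leaving out the named columns.

    Rows are streamed one by one, so file size is not limited by RAM.
    Returns the list of column names that were kept.
    """
    unwanted = set(columns_to_drop)
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            return []  # empty input file
        keep = [i for i, name in enumerate(header) if name not in unwanted]
        writer.writerow([header[i] for i in keep])
        for row in reader:
            # Rows shorter than the header are padded with empty cells.
            writer.writerow([row[i] if i < len(row) else "" for i in keep])
    return [header[i] for i in keep]
```

For example, `drop_columns("raw.csv", "lean.csv", ["commit_hash", "signature"])` would write a copy without those two string columns, which can then be imported into Stata.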

How to do a placebo simulation in difference-in-differences designs (part 1)

The 2004 article “How much should we trust differences-in-differences estimates?” by Marianne Bertrand, Esther Duflo, and Sendhil Mullainathan (published in the QJE) outlines several tests that can be used to assess the robustness of difference-in-differences estimates given concerns about false positives.

One recommendation is to run a placebo simulation. In a first step, the treatment indicator is randomly assigned to observations in the dataset. In a second step, the regressions are run again in order to compare the main estimates with those from the placebo regressions.
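To make the first step concrete, here is a small Python sketch of the random assignment only; the regression step is not shown. It assumes the panel is a list of unit-year records and that the placebo treatment is assigned at the unit level, so that a unit keeps the same placebo status in every year. The function name `assign_placebo` and the `placebo` key are hypothetical.

```python
import random


def assign_placebo(rows, id_key, group_size, seed):
    """Tag a random set of units as placebo-treated.

    rows: list of dicts, one per unit-year observation.
    id_key: name of the unit identifier in each dict.
    group_size: number of units to put into the placebo treatment group.
    seed: seed for the random number generator (for reproducibility).
    Returns new dicts with an added 'placebo' key (1 or 0).
    """
    units = sorted({row[id_key] for row in rows})
    if group_size > len(units):
        raise ValueError("group_size exceeds the number of units")
    rng = random.Random(seed)
    treated = set(rng.sample(units, group_size))
    # Every observation of a unit gets the same placebo status.
    return [dict(row, placebo=int(row[id_key] in treated)) for row in rows]
```

Repeating this with a fresh draw for each run and re-estimating the model on the `placebo` indicator yields the distribution of placebo estimates against which the main estimate is compared.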

I have written a little Stata script that runs such a placebo simulation and compiles an Excel spreadsheet which gives the placebo coefficient estimates along with the confidence interval bounds.

Here’s that script. It assumes that a panel dataset is in memory whose observations take the form of unit-years (e.g., firm-years). The only thing you need to adjust for your purposes is the set of parameters at the top.

global project_folder = `"C:\Users\path to project"'
global depvar = "dependent variable"
global treatment = "treatment binary"
global post = "time binary which is 1 for observations after the treatment"
global idvar = "unit identifier variable (e.g., id)"
global timevar = "time identifier variable (e.g., years)"
global controls = "list of control variables (e.g., age)"
global seed = "110" //sets the seed for reproducible random number generation
global treatment_groupsize = "number of observations in the treatment group (e.g., 100)"
global numruns = "#runs of the simulation (e.g., 60)"

**set excel headers (the path is quoted because it may contain spaces)
putexcel set "$project_folder", replace
putexcel A1=("DV Coefficient")
putexcel B1=("DV Lower CI")
putexcel C1=("DV Upper CI")
local cellcounter = 3
set seed $seed

*estimate "true" regression
xtset $idvar $timevar
xtreg $depvar i.$treatment##i.$post $controls $timevar, fe robust
putexcel A2=(_b[1.$treatment#1.$post])
putexcel B2=(_b[1.$treatment#1.$post] - invttail(e(df_r),0.025)*_se[1.$treatment#1.$post])
putexcel C2=(_b[1.$treatment#1.$post] + invttail(e(df_r),0.025)*_se[1.$treatment#1.$post])

forvalues i=1/$numruns {
    **randomly tag units (randomtag is available from SSC)
    **awardm is not one of the parameters above; replace the if-condition with one that fits your data
    randomtag if $timevar == awardm-4, count($treatment_groupsize) gen(r)
    bys $idvar: egen placebo = max(r)
    drop r
    tab placebo
    capture xtreg $depvar i.placebo##i.$post $controls $timevar, fe robust
    if _rc!=0 {
        display "Error on run `i'"
    }
    else {
        **write the results only if the regression ran without error
        putexcel A`cellcounter'=(_b[1.placebo#1.$post])
        putexcel B`cellcounter'=(_b[1.placebo#1.$post] - invttail(e(df_r),0.025)*_se[1.placebo#1.$post])
        putexcel C`cellcounter'=(_b[1.placebo#1.$post] + invttail(e(df_r),0.025)*_se[1.placebo#1.$post])
        estimates store result`i'
    }
    drop placebo
    local cellcounter=`cellcounter'+1
}

In one of the next blog posts, I will show how to use this generated spreadsheet for plots of the placebo confidence intervals or simple tabulation summaries for your papers.

How to make clean difference-in-differences graphs in Stata

Difference-in-differences designs seem to be everywhere now, but some of the papers I read don’t seem to leverage one of their key strengths: visualizing what is going on in the data.

I tend to use the following graph style. It plots the dependent variable over time, here from April to October. The treatment and control groups get different line patterns and colors. Instead of a bulky legend, I label the groups right next to their lines. The treatment time is marked by two vertical bars which separate the group lines. Instead of a complete grid, the graph relies only on a vertical grid to ease eyeballing the changes in the dependent variable.


Now here is the code for the graph in Stata.

**setup: fill the blanks
global dv = ""
global timevariable = "" //the code below assumes this variable is named m
global graphtitle = "A clean graph"
global line1 = "Treatment"
global line2 = "Control"
global ytitle = "Mean of dependent variable"

**collapse the data into an aggregated time series
collapse (mean) y = $dv (semean) se_y = $dv, by(m treatment)
sort $timevariable

**bounds of the 95% confidence interval
gen yu = y + 1.96*se_y
gen yl = y - 1.96*se_y

label define m 1 "April" 2 "May" 3 "June" 4 "July" ///
               5 "August" 6 "September" 7 "October"
label values m m

**the y-axis range (3.25 to 4.25) and the text positions are specific to the example data; adjust them to yours
twoway (scatter y m if m<=2 & treatment==1, msymbol(S)) ///
       (rcap yu yl m if m==3 & treatment==1) (line y m if m>=3 & treatment==1) ///
       (scatter y m if m<=2 & treatment==0, msymbol(S)) ///
       (rcap yu yl m if m==3 & treatment==0) (line y m if m>=3 & treatment==0) ///
       (function y=3.25, range(2.10 2.12) recast(area) color(gs12) base(4.25)) ///
       (function y=3.25, range(2.88 2.90) recast(area) color(gs12) base(4.25)) ///
       , ///
       graphregion(margin(large)) ///
       ylabel(3.25(.25)4.25) ///
       title("$graphtitle") ///
       yscale(titlegap(*16)) ///
       xlabel(1(1)7, valuelabel) xtitle(" ") ///
       text(4.3 6.8 "$line1") ///
       text(3.7 6.8 "$line2") ///
       graphregion(color(white)) bgcolor(white) ///
       ytitle("$ytitle") legend(off) scheme(s2mono) ///
       saving("fig\clean_plot", replace)

gr combine "fig\clean_plot.gph", ///
    iscale(.7) xsize(6)

graph export "fig\clean_plot.png", replace width(1600) height(800)
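If you want to check the plotted numbers outside Stata, the collapse step and the interval bounds can be reproduced with a short Python sketch. It computes, for each month and group, the mean and the bounds mean ± 1.96 × standard error, where the standard error is the sample standard deviation divided by the square root of the number of observations. The function name `group_means_with_ci` and the input layout (tuples of month, treatment, value) are assumptions made for illustration.

```python
import math
import statistics
from collections import defaultdict


def group_means_with_ci(rows, z=1.96):
    """Compute the mean and an approximate 95% interval per (month, treatment) cell.

    rows: iterable of (month, treatment, value) tuples.
    Returns a dict mapping (month, treatment) to (mean, lower, upper).
    Cells with a single observation have no standard error, so their
    bounds are returned as None.
    """
    cells = defaultdict(list)
    for month, treatment, value in rows:
        cells[(month, treatment)].append(value)

    result = {}
    for key, values in cells.items():
        mean = statistics.mean(values)
        if len(values) > 1:
            # Standard error of the mean: sample sd / sqrt(n)
            se = statistics.stdev(values) / math.sqrt(len(values))
            result[key] = (mean, mean - z * se, mean + z * se)
        else:
            result[key] = (mean, None, None)
    return result
```

The lower and upper values correspond to `yl` and `yu` in the Stata code.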