Ftools – One solution when Stata takes hours upon hours to complete

Annoyed that Stata took another two hours to complete your command, only to find out that you missed something and need to rerun? And another two hours are gone…

Experiences like this are typical when working with larger datasets. Commands such as collapse, merge, and egen can take forever, or at least it feels that way.

One way to reduce the frustration is to use the awesome ftools package by Sergio Correia. Ftools is a reimplementation of some of the most popular Stata data processing commands.

Currently, the following commands are available in a revised implementation:

  • egen group (now fegen group)
  • collapse (now fcollapse)
  • merge (now join)
  • levelsof (now flevelsof)

The time savings can be immense. In my experience, merges and collapses with ftools took only roughly a third of the time they take with the regular Stata commands, which is an immense productivity improvement!

Now, how do you get ftools? It is available via SSC, so type the following command into your Stata command window and hit enter:

ssc install ftools
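
Once installed, the revised commands are close to drop-in replacements for the originals. Here is a minimal sketch of the syntax, assuming a made-up dataset with the variables firm, year, and sales:

* fegen replaces egen for group()
fegen firm_id = group(firm)

* fcollapse replaces collapse
fcollapse (sum) sales, by(firm year)

* flevelsof replaces levelsof
flevelsof year, local(years)

Only join deviates from the familiar merge syntax, so have a look at help join after installing.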


Three tips to process large datasets in Stata or R

The increasing availability of large-scale public datasets is a goldmine for many researchers and data analysts. For example, great potential resides in data from Wikipedia (~300 GB per month), OpenStreetMap (~70 GB), and Reddit (~600 GB). However, getting such large datasets ready for analysis is often difficult. Stata, for example, refuses to load files that are larger than the available RAM in your computer. Of course, we might use computing services such as Amazon and Google, but this requires a research budget, setting up a customized environment, and a constant Internet connection.

In this blog post, I want to share three best practices for dealing with large datasets and getting them into statistical software like Stata.

1. Work with CSV files

Datasets come in different shapes: some are JSON, some are XML, and there are many more formats. While statistical software allows you to import files in various formats, I always recommend converting them to CSV (comma-separated values) first. CSV is probably the leanest file format, as it does without the various (and potentially duplicate) meta information that JSON or XML files carry. Converting to CSV can considerably reduce the size of the input files. There are various converters available, and I will share some of mine in upcoming posts.

2. Split the input files

One way to deal with large datasets is to cut them into chunks and then process each chunk in a batch. When working with CSV files, there is a little tool called the Free Huge CSV File Splitter, which does the job perfectly fine for me. To batch-process all files in a directory with Stata, the following code helps:


clear *
set obs 1
gen byte x = 1
save "output.dta", replace
cd "folder"
local commits : dir . files "*.csv"
foreach file of local commits {

    import delimited "`file'", clear

    ** do all the processing here

    append using "../output.dta"
    save "../output.dta", replace
}

drop if x == 1   // remove the placeholder observation used to seed the output file
drop x
save "../output.dta", replace

The code loops over all CSV files in the folder called “folder”, processes each one, and appends the results into a single output file.

3. Get rid of strings as much as possible

String processing is among the most computation-intensive operations. Try to avoid string data as much as possible, ideally even before importing it into Stata. Many datasets include hash codes or control strings that may be completely unnecessary for your analysis but blow up the size of your dataset. Before importing files into Stata, I use EmEditor to take a first look at the structure of the dataset, drop the unnecessary string columns, and only then import the data into Stata.
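
For the string columns that do make it into Stata, the same idea applies after import. A minimal sketch, assuming a chunk file and the variable names hash and country, which are made up for illustration:

* import one of the CSV chunks (hypothetical file name)
import delimited "chunk1.csv", clear

* drop string columns that are never used in the analysis
drop hash

* turn a repeated string category into a labeled numeric variable
encode country, generate(country_id)
drop country

* shrink all variables to their smallest possible storage type
compress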