How to scrape the data behind interactive web graphs

Sometimes we are interested in obtaining the data behind web graphs like the ones here (e.g., produced with highcharts.js or a similar library). Sometimes the data points can be read off by eye, but there are also cases where we need hundreds or thousands of such graphs, or where the data is so fine-grained that reading it off is impossible. In such cases, we want an automatic procedure that scrapes these graphs. Unfortunately, such charts are tricky to scrape because the data is loaded dynamically in the background.

One trick to obtain the data is to inspect the website using your browser’s built-in developer tools. For example, in Chrome:

  1. Open the website which contains the graph.
  2. Right-click somewhere on the website and press “Inspect”.
  3. In the new window, proceed to the “Network” tab. This tab provides an overview of network transactions between your computer and the website.
  4. Look out for files with a “.json” ending; these are usually the ones which contain the graph data.
  5. Click on such a file and open the “Headers” tab. We need the location of the file on the web server (the request URL), which should be listed in the general information.
  6. Now we can pull the data into Python and work with the data right away using:
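import requests

# replace the placeholder with the file location found in step 5
url = "http://pathToJSONfile"
# download the file and parse the JSON into Python lists and dictionaries
x = requests.get(url).json()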
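
What comes back in x depends entirely on the website, so the next step is usually to flatten it into rows. Here is a minimal sketch of that step. It assumes a Highcharts-style layout with a "series" list whose entries have a "name" and a "data" list. That layout, the function names and the example payload are all made up for illustration, so check them against the file you actually found.

```python
import csv
import io


def series_to_rows(payload):
    """Flatten an assumed {"series": [{"name": ..., "data": [...]}]} payload
    into (series, x, y) rows."""
    rows = []
    for series in payload.get("series", []):
        name = series.get("name", "")
        for index, point in enumerate(series.get("data", [])):
            if isinstance(point, dict):
                # point given as {"x": ..., "y": ...}
                rows.append((name, point.get("x"), point.get("y")))
            elif isinstance(point, (list, tuple)):
                # point given as an [x, y] pair
                rows.append((name, point[0], point[1]))
            else:
                # bare y value: use its position in the list as x
                rows.append((name, index, point))
    return rows


def rows_to_csv(rows):
    """Return the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["series", "x", "y"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up payload standing in for x = requests.get(url).json()
example = {
    "series": [
        {"name": "Treatment", "data": [[1, 3.5], [2, 3.7]]},
        {"name": "Control", "data": [{"x": 1, "y": 3.4}]},
    ]
}
example_rows = series_to_rows(example)
print(rows_to_csv(example_rows))
```

For hundreds of graphs, the same two functions could be called in a loop over the file locations, writing one CSV per graph.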

 


How to make clean difference-in-differences graphs in Stata

Difference-in-differences designs seem to be everywhere now, but some of the papers I read don’t seem to leverage one of their key strengths: visualizing what is going on in the data.

I tend to use the following graph style. It plots the dependent variable over time, here from April to October. The treatment and control groups get different line patterns and colors. Instead of a bulky legend, I label each group right next to its line. The treatment time is marked by two vertical bars which separate the pre-treatment and post-treatment parts of the group lines. Instead of a complete grid, the graph relies only on a vertical grid to ease eyeballing the changes in the dependent variable.

[Figure: did-rating]

Now here is the code for the graph in Stata.


** setup: fill in the blanks
* dependent variable
global dv = ""
* time variable (coded 1-7 in this example)
global timevariable = ""
global graphtitle = "A clean graph"
global line1 = "Treatment"
global line2 = "Control"
global ytitle = "Mean of dependent variable"
***

** collapse the data into an aggregated time series:
** mean and standard error of the mean per period and group
collapse (mean) y = $dv (semean) se_y = $dv, by($timevariable treatment)
sort $timevariable

** upper and lower bounds of the 95% confidence interval
gen yu = y + 1.96*se_y
gen yl = y - 1.96*se_y

** value labels for the time axis
label define months 1 "April" 2 "May" 3 "July" 4 "August" ///
                    5 "September" 6 "October" 7 "November"
label values $timevariable months

** pre-treatment points, post-treatment lines with error bars, and two
** gray bars marking the treatment time; the axis ranges and text
** positions fit this example and need adjusting for other data
twoway (scatter y $timevariable if $timevariable<=2 & treatment==1, msymbol(S)) ///
       (rcap yu yl $timevariable if $timevariable==3 & treatment==1) ///
       (line y $timevariable if $timevariable>=3 & treatment==1) ///
       (scatter y $timevariable if $timevariable<=2 & treatment==0, msymbol(S)) ///
       (rcap yu yl $timevariable if $timevariable==3 & treatment==0) ///
       (line y $timevariable if $timevariable>=3 & treatment==0) ///
       (function y=3.25, range(2.10 2.12) recast(area) color(gs12) base(4.25)) ///
       (function y=3.25, range(2.88 2.90) recast(area) color(gs12) base(4.25)) ///
       , ///
       graphregion(margin(large)) ///
       ylabel(3.25(.25)4.25) ///
       title("$graphtitle") ///
       yscale(titlegap(*16)) ///
       xlabel(1(1)7, valuelabel) xtitle(" ") ///
       text(4.3 6.8 "$line1") ///
       text(3.7 6.8 "$line2") ///
       graphregion(color(white)) bgcolor(white) ///
       ytitle("$ytitle") legend(off) scheme(s2mono) ///
       saving("fig/clean_plot", replace)

** rescale the saved graph and export it as a PNG
gr combine "fig/clean_plot.gph", ///
       iscale(.7) xsize(6)
graph export "fig/clean_plot.png", replace width(1600) height(800)
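
For readers who want to check the numbers outside Stata, here is a rough Python sketch of what the collapse step and the two gen lines compute: the mean and the standard error of the mean per time period and group, plus the bounds at 1.96 standard errors around the mean. The function name and the example records are invented for illustration; only the formulas mirror the Stata code.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def collapse_mean_se(records):
    """Group (time, treatment, value) records and summarize each cell."""
    cells = defaultdict(list)
    for time, treatment, value in records:
        cells[(time, treatment)].append(value)

    summary = {}
    for key, values in sorted(cells.items()):
        y = mean(values)
        # standard error of the mean: sample sd (n - 1 denominator) / sqrt(n);
        # undefined for a single observation
        if len(values) > 1:
            se_y = stdev(values) / sqrt(len(values))
        else:
            se_y = float("nan")
        summary[key] = {
            "y": y,
            "se_y": se_y,
            "yl": y - 1.96 * se_y,  # lower bound, as in gen yl
            "yu": y + 1.96 * se_y,  # upper bound, as in gen yu
        }
    return summary


# Made-up records: (time, treatment, value)
records = [(1, 1, 4.0), (1, 1, 3.0), (1, 1, 5.0), (1, 0, 3.5), (1, 0, 3.5)]
summary = collapse_mean_se(records)
for key, stats in summary.items():
    print(key, stats)
```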