Three hacks for parsing JSON in Python

Transforming data into one coherent format that can be used in statistics software such as Stata or R is a core task in data science.

One of the most popular data formats is JSON (JavaScript Object Notation). For example, data retrieved from APIs or exported from NoSQL databases (e.g., MongoDB) typically comes as JSON.

At first sight, JSON looks a bit chaotic, like spaghetti. Here’s an example taken from data on software repositories. The snippet gives the name of a software repository (repo_name), along with a list of the programming languages used in the repository and the precise number of bytes written in each.

{"repo_name":"vilhelms/vilhelms.github.io","language":[{"name":"CSS","bytes":"10281"},{"name":"Ruby","bytes":"3086"}]}

How can we extract the data we need from JSON code? How can we understand the underlying structure? In this post I want to share three hacks for getting a better handle on it using Python.

1. Visualize, Visualize, Visualize
To get a handle on the spaghetti structure of JSON, use some tools to visualize it. Once you understand the structure, you can more easily decide what you need and how to extract it. To make it pretty, there are two main tools I use:
a) JSON-Formatter: a little webpage where you can paste your JSON spaghetti and get back a visualized tree structure of the data. Works particularly well for longer JSON documents.
b) print(json.dumps(parsed_json, indent=4, sort_keys=False)): a one-liner you can run in Python to print the JSON nicely indented on the console (see the snippet below).
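
Here is a minimal sketch of option b), applied to the repository example above (json_string is just an illustrative variable name):

import json

# the raw JSON string from the repository example above
json_string = '{"repo_name":"vilhelms/vilhelms.github.io","language":[{"name":"CSS","bytes":"10281"},{"name":"Ruby","bytes":"3086"}]}'

parsed_json = json.loads(json_string)  # parse the string into a Python dict
print(json.dumps(parsed_json, indent=4, sort_keys=False))  # pretty-print to the console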

See, our above code has become more readable:
{
    "repo_name": "vilhelms/vilhelms.github.io",
    "language": [
        {
            "name": "CSS",
            "bytes": "10281"
        },
        {
            "name": "Ruby",
            "bytes": "3086"
        }
    ]
}

2. Handle JSON like an array
Here’s my standard procedure to access JSON data. Let’s continue with the above example. Say we want to extract the programming languages used in the project. Then we would first read in the JSON record. Second, we would iterate over all language elements and write each one into a variable.

import json

json_dta = json.loads(l)  # parse one JSON record (here, the string l) into a Python dict
for lang_element in json_dta["language"]:  # iterate over the language elements
    lang = lang_element["name"]  # write each language name into a variable
    # do something with the data

3. Always double-check if an element exists
I would highly recommend double-checking whether a JSON element exists before trying to access it. If you try to access an element that does not exist, Python will raise a KeyError and your script may stop altogether. This is quite likely to happen with large JSON files or files that are malformed. Here’s a simple check that makes our above code failsafe:

import json

json_dta = json.loads(l)  # parse one JSON record into a Python dict
if "language" in json_dta:  # only proceed if the "language" key exists
    for lang_element in json_dta["language"]:  # iterate over the language elements
        lang = lang_element["name"]  # write each language name into a variable
        # do something with the data
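
If you only need a safe default when the key is missing, the dictionary’s get() method is an alternative; a minimal sketch (the empty-list default is my own choice, and l is the same JSON string as above):

import json

json_dta = json.loads(l)  # parse the JSON record, as above
# get() returns the default (here an empty list) when the key is absent,
# so the loop simply does nothing instead of raising a KeyError
for lang_element in json_dta.get("language", []):
    lang = lang_element["name"]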

3 Essential Python Tricks for Lean Code

Python is one of the most important programming languages, especially for data scientists.

Sometimes I find myself going through hundreds of lines of code for my projects. So I spent some hours researching how to trim the massive code and make my coding leaner overall. Here are three tricks I learned.

1. Lean conditional statements

Conditional statements can be really clumsy:

if a == 0:
    print("0")
else:
    print("not 0")

This works, but it costs several lines of code. There’s a leaner way to write conditional statements:

print("0") if a ==0 else print("not 0")

2. Simple String-cutting

I work a lot with time-series data. Some of the dates come as Unix timestamps, which look like this:

date = "1553197926UTC"

Converting the number itself into a date would not be a problem, but the rest of the string, the ‘UTC’ part, needs to be removed before we can do anything with the timestamp. Python’s slicing offers a straightforward way to get rid of parts of a string (here the trailing three characters):

date = "1553197926UTC"
date = date[:-3]
>>> 1553197926
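
For completeness, here is a minimal sketch of the conversion step mentioned above, using the standard datetime module (the variable names are my own):

from datetime import datetime, timezone

date = "1553197926UTC"
timestamp = int(date[:-3])  # strip the trailing "UTC" and convert to an integer
dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)  # Unix seconds to a datetime
print(dt)  # 2019-03-21 19:52:06+00:00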

3. Convert a Nested Array into One Array

Sometimes we get a nested array, especially when dealing with JSON responses from APIs:

array = [[1, 2], [3, 4], [5, 6]]

If we want to transform the nested array into one array, here’s a little trick that does it:

import itertools
list(itertools.chain.from_iterable(array))
>>> [1, 2, 3, 4, 5, 6]
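
If you prefer to avoid the extra import, the same flattening can be done with a plain nested list comprehension; an equivalent sketch:

array = [[1, 2], [3, 4], [5, 6]]
[item for sublist in array for item in sublist]  # flatten one level of nesting
>>> [1, 2, 3, 4, 5, 6]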

 

3 little hacks for parsing web content with Python and Beautiful Soup

Over the past two weeks, I made great progress in collecting data for a new research project of mine. I had to deal with substantial amounts of web content and parse it in order to use it for some analyses. I typically rely on Python and its library Beautiful Soup for such jobs, and the more I use it, the more I appreciate the little things. Here are my top three new hacks:

1. Getting rid of HTML tags

I had to extract raw text from web content I scraped. The content I wanted was hidden in a complete mess of HTML tags like this:

</span></div><br><div class="comment">
<span class="commtext c00">&gt; &quot;the models are 100% explainable&quot;<p>In my experience this is largely illusory. People think they understand what the model is saying but forget that everything is based on assuming the model is a correct description of reality.<p>

Getting the “real” text out of it can be tricky. One way is to use regular expressions, but this can become unmanageable given the variety of HTML tags.

Here comes a little hack: use BeautifulSoup’s built-in text extraction function.


from bs4 import BeautifulSoup

soup = BeautifulSoup(webcontent, "html.parser")  # webcontent holds the scraped HTML string

comment = soup.get_text()  # strips all tags and returns only the text

 

2. No clue what you’re looking for? Prettify your output first

Before I extract anything, I have a look at the web content. Beautiful Soup helps you get through the code salad with a function called prettify(), which indents the HTML so it becomes readable:


from bs4 import BeautifulSoup

soup = BeautifulSoup(textstring, "html.parser")  # textstring holds the raw HTML

print(soup.prettify())  # nicely indented HTML

 

where textstring is the raw HTML string you want to inspect.

3. Extracting URLs from <a> tags

Sometimes you find a link wrapped in an <a> tag and want to extract its URL.

Here’s the code:


from bs4 import BeautifulSoup

soup = BeautifulSoup(textstring, "html.parser")  # textstring holds the raw HTML

a = soup.find_all("a")  # collect all <a> tags on the page

url = a[1]["href"].lower().strip()  # take the second link's href and clean it up
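
If you want all the URLs on a page rather than a single one, a minimal sketch is to loop over every <a> tag and skip those without an href attribute:

from bs4 import BeautifulSoup

soup = BeautifulSoup(textstring, "html.parser")
# collect the href of every <a> tag that actually has one
urls = [tag["href"].strip() for tag in soup.find_all("a") if tag.has_attr("href")]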

How to scrape the data behind interactive web graphs

Sometimes we are interested in obtaining the data behind interactive web graphs (e.g., graphs produced with highcharts.js or related libraries). Sometimes the data points can be read off by eyeballing, but there are also cases where we need hundreds or thousands of such graphs, or where the data is so fine-grained that it is impossible to simply spot it. In such cases we want an automatic procedure that scrapes these graphs. Unfortunately, such charts are tricky to scrape, because the data is loaded dynamically in the background.

One trick to obtain the data is to inspect the website using your browser’s built-in developer tools. For example, in Chrome:

  1. Open the website which contains the graph.
  2. Right-click somewhere on the website and press “Inspect”.
  3. In the new window, proceed to the “Network” tab. This tab provides an overview of network transactions between your computer and the website.
  4. Look out for files with a “.json” ending; these are the ones which contain the graph data.
  5. Inspect the file by clicking on the “Headers” tab. We need the location of the file on the web server, which should be listed in the general information.
  6. Now we can pull the data into Python and work with it right away:

import requests

url = "http://pathToJSONfile"  # the file location found in the "Headers" tab
x = requests.get(url).json()   # download and parse the JSON in one step
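
What x looks like depends entirely on the chart. As a purely hypothetical example, if the endpoint returns a list of [timestamp, value] pairs (a common layout for time-series charts), the two columns could be separated like this:

# hypothetical: assumes x is a list of [timestamp, value] pairs
timestamps = [point[0] for point in x]
values = [point[1] for point in x]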