Transforming data into one coherent format that can be used in statistics software such as Stata or R is a core task in data science.
One of the most popular data formats is JSON (JavaScript Object Notation). Data retrieved from APIs or exported from NoSQL databases (e.g., MongoDB), for example, typically comes as JSON.
At first sight, JSON looks a bit chaotic, like spaghetti. Here’s an example taken from data on software repositories: the record gives the name of a software repository (repo_name) along with a list of the programming languages used in the repository and the precise number of bytes written in each.
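A record of this shape might look as follows (a hypothetical example: the field names repo_name and language come from the description above, while the repository name and byte counts are made up for illustration):

```python
import json

# Hypothetical repository record; values are invented for illustration.
l = '''{"repo_name": "octocat/hello-world",
        "language": [{"name": "Python", "bytes": 5422},
                     {"name": "C", "bytes": 901}]}'''

parsed_json = json.loads(l)  # parse the JSON string into a Python dict
```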
How can we extract the data we need from JSON code? How can we understand the underlying structure? In this post I share three hacks for getting a better handle on it using Python.
1. Visualize, Visualize, Visualize
To get a handle on the spaghetti structure of JSON, visualize it. Once you understand the structure, you can decide more easily what you need and how to extract it. There are two main tools I use to make it pretty:
a) JSON-Formatter: a little webpage where you can paste your JSON spaghetti and get back a visualized tree structure of the data. It works particularly well for longer JSON sets.
b) print(json.dumps(parsed_json, indent=4, sort_keys=False)): run this command in Python to print the parsed JSON to the console with readable indentation.
See, our code from above has become much more readable:
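As a sketch, with a hypothetical repository record of the shape described earlier, json.dumps produces nicely indented output:

```python
import json

# Hypothetical record; field names as described above, values made up.
l = '{"repo_name": "octocat/hello-world", "language": [{"name": "Python", "bytes": 5422}]}'
parsed_json = json.loads(l)

# indent=4 adds four-space indentation; sort_keys=False keeps the original key order
print(json.dumps(parsed_json, indent=4, sort_keys=False))
# {
#     "repo_name": "octocat/hello-world",
#     "language": [
#         {
#             "name": "Python",
#             "bytes": 5422
#         }
#     ]
# }
```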
2. Handle JSON like an array
Here’s my standard procedure for accessing JSON data. Let’s continue with the example above. Say we want to extract the programming languages used in the project. First, we read the entire JSON record into Python. Second, we iterate over all language elements and write each one into a variable:
```python
import json

json_dta = json.loads(l)  # load the JSON record into a Python dict
for lang_element in json_dta["language"]:  # iterate over language elements
    lang = lang_element["name"]  # write each element into a variable
    # do something with the data
```
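Put together, a runnable sketch might look like this. It assumes the records arrive one JSON object per line (as in the JSON Lines format commonly used for such exports); here a single made-up record is inlined where, in practice, the lines would come from a file:

```python
import json

# In practice these lines would come from a file; here one invented
# record stands in for the real data.
lines = ['{"repo_name": "octocat/hello-world", "language": [{"name": "Python", "bytes": 5422}]}']

languages = []
for l in lines:
    json_dta = json.loads(l)  # load one JSON record into a Python dict
    for lang_element in json_dta["language"]:  # iterate over language elements
        languages.append(lang_element["name"])  # collect each language name

print(languages)  # ['Python']
```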
3. Always double-check if an element exists
I highly recommend double-checking whether a JSON element exists before trying to access it. If you try to access an element that does not exist, Python raises a KeyError and stops the entire process. This is especially likely with large or malformed JSON files. Here’s a simple check that makes our code from above failsafe:
```python
import json

json_dta = json.loads(l)  # load the JSON record into a Python dict
if "language" in json_dta:  # only proceed if the element exists
    for lang_element in json_dta["language"]:  # iterate over language elements
        lang = lang_element["name"]  # write each element into a variable
        # do something with the data
```
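An equivalent guard, not from the original post but a common Python idiom, uses dict.get with an empty list as the default, so records without a language field are simply skipped:

```python
import json

# A made-up record that lacks the "language" field entirely.
json_dta = json.loads('{"repo_name": "octocat/hello-world"}')

# .get returns [] when the key is missing, so the loop body never runs
for lang_element in json_dta.get("language", []):
    lang = lang_element["name"]
    # do something with the data
```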