All posts by Nikhil VJ

Using OpenRefine to iron out spelling differences in data

OpenRefine is a powerful open source tool for working with messy data, made for non-coders. And just to stress on something I don’t want readers to miss,

made for non-coders.

It has a ton of features and I have not had the chance to use and learn the vast majority of them. But I learned one thing so I’m going to share that particular implementation: reconciling spelling differences in data.

Background

The data that I worked on: a few worksheets having bus stop names from BRT & Non BRT Route Details & Bus Stop LatLong.xls from Pune Open Data Portal.

I’ve written a detailed log of the whole (incomplete) exercise here, but I’ll expand on just the OpenRefine part. Apologies if any of this gets confusing.. just see the screenshots!

Steps

1. All the names that need reconciling are brought under one column. Other columns are kept to track serial numbers, etc. If two or more data tables are involved then a combined table is created carrying columns from all the tables, and one more column is made to disambiguate which sheet the row is from.

2. A copy column is made of this names column in the same table. Just in case we screw up and need to revert. Always make backups!

3. Save the file and load it up on OpenRefine. (assuming that you have downloaded, installed /extracted and started up OpenRefine and it is loaded on a tab in your browser)

4. Now we see the table laid out. There are drop-down arrow buttons on each column heading leading to menu of actions to do on that column. Locate the Names column you want to work on and open its menu.

5. First things first: clean up the cells by trimming leading and trailing whitespaces.openrefine trim spaces.png

clustering screenshot
note: names looking same are the reason why step 5 is important

6. Open the column’s menu again and choose Text facet > Clustering on the target column. It should open up something like below. (click the screenshot to see full size)

7. Consider first listing as example. ‘Pune Vidyapeeth Gate’ has 3 different spellings spanning 63 rows. If we click on one of the spellings, it appears under “New Cell Value”. We can even type a whole new spelling. Select the checkbox under “Merge?” and press the big button at bottom “Merge Selected & Re-Cluster”.

8. What happens in the background: The rest of the table stays as-is. But these 63 cells are over-written with the single new value.

9. So in two clicks we have done what normally would be a very long process in excel of filtering, sorting, copy-pasting and restoring to original order.

10. Here is a saved diff check on diffchecker.com that shows on left my original list and on right the normalized list I got through OpenRefine.

11. Do explore the various other options on the page. For bringing similar-spelling words together, certain linguistic algorithms are used (apologies but I haven’t bothered digging deeper into them), and there are multiple options, seen at the top of the last screenshot (“Method”, “Keying function”). Depending on your data one way may work better than the other. You can do one run of clustering / normalizing and then choose another algorithm that might cluster in another way. There will be false positives too (ie names that are supposed to be different only but the program thinks they should be same) and you can ignore them by keeping their “Merge?” checkbox off.

false positives and funny names screenshot
Bekar Company :P!

12. One more thing: This program works just as well on other languages. And it is particularly useful in catching all the cases where there’s confusion between choti ee maatra and badi ee maatra:

marathi names clustering
It does marathi names too!

Bus yaar

So that’s it for this walkthrough. Gratitude to DMer Aravind from Bangalore for insisting I check out this tool when I was saying “it’s ok I’m doing it in excel”. Have you used OpenRefine too? What was your application? Do share here or drop me an email on nikhil.js [at] gmail.com.

Links

PS: No, I’m not going to explain why in the world I was having a table of 40,000+ rows of bus stop names. It’s too traumatizing to share, please understand 😛

A tool for composing transit schedules data in static GTFS standard

Over the last few months I went deep-dive into a project with WRI (World Resources Institute) and Kochi Metro in Kerala (KMRL) to convert their scheduling data to the global standard static GTFS format.

The first phase of the project was about just the data conversion. I wrote a python program that took in KMRL’s data files and some configuration files, and created a static GTFS feed as output. There were many more complexities than I can share here, and Shine David from KMRL was a crucial enabler by being the inside man sharing all necessary info and clarifications.

On 17 March this year, Kochi Metro Rail Ltd became India’s first transit agency to publish static GTFS feed of their system as open data.

See the KMRL open data portal and some news coverage: 1, 2, 3, 4.

See it visualized on a global GTFS feeds map called TRAVIC.
(zoom in to kochi and press fast forward. Can adjust time of day.)

Phase 2 of the project aimed higher : we started work on a program with a graphical user interface that would automate several manual processes and help KMRL update their data as the network grows, publish updated feeds on their own without having to rely on any external entity, and very importantly for their case, integrate bus and ferry routes of Kochi in the near future to build towards a unified public transport dataset and facilitate integrated ticketing. As we progressed into this we realised the potential this can have if we generalise it so that any transit agency can use it.

So, here’s launching..

https://github.com/WRI-Cities/static-GTFS-manager

Did I mention we have open sourced the whole thing? Big Kudos to WRI and especially Vishal who co-ordinated the whole project, for being proactive and pro-open-source with this.

The program runs in the browser (actually, please use Chrome or Chromium and no mobile!) as a website with a server backend created by a Python 3 program. It manages the data in a portable internal database and publishes fresh GTFS feeds whenever wanted.

To play around with a live demo version of the program online, contact nikhil on nikhil.js [at] gmail.com

Note: while it’s compatible to publish this program on a free heroku account, it is currently not designed for multi-user use. That’s kind of not in the basic requirements, as end user is just a transport agency’s internal team. (With your participation we can change that.)

So, why I am sharing about this here: Apart from obviously sharing cool stuff,

With this it’s possible to design any transport system’s static GTFS feed from scratch, or edit an older feed you have lying around and bring it up to date.

Invitation for Collaboration

There is more that can be done with enhancements and integrations, and there are still some limitations that need to be resolved. I’m documenting all I know in the issues section. So I’m reaching out for inviting collaborations on the coding and beta testing front. One motive behind open sourcing is that the community can achieve far more with this project than what any private individual or group can. There’s also scope to integrate many other GTFS innovations happening. Please visit the github repo and engage!

Lastly, big shout-out to DMers Srinivas Kodali from Hyderabad chapter for connecting and lots of guiding, and to Devdatta Tengshe from Pune chapter for helping me learn asynchronous server setup in Python in a lightning fast way (with a working example for dummies!)

Quick links:

static-GTFS-manager
https://developers.google.com/transit/gtfs/reference/

Guide on Digitizing Static Maps

I was recently invited to Nagpur by a group called Center for Peoples Collective, to brainstorm doing for Nagpur the kind of things I’ve done in Pune for budget data processing/viz and mapping. We found that they didn’t have any digital data (ie, shapefile, kml etc) of Nagpur’s electoral wards, but they did have some high-res images released by Nagpur Municipal Corporation (NMC) with the boundaries marked. So I walked them through a process that I’ve worked out, which uses free online services and doesn’t need any software or advanced skills to do. I’m sharing that process here.
Continue reading Guide on Digitizing Static Maps