All posts by Nikhil VJ

Open Data Kit: for mobile-based surveys and data collection

TL;DR : Wanna survey without the paperwork headache? Check out ODK when you get the time.

A] ODK-1

ODK, or Open Data Kit, has been an important tool in making mobile-based survey data collection with complicated forms possible. Its XLSForm structure, where the form is created in Excel (!), is arguably more feature-packed than most online survey tools, and sits somewhere between coding and non-coding. It has companion websites that take care of server-side things and let folks get started with on-field data collection at zero overhead cost. Being open source and entrepreneur-friendly, a lot of enterprise services in this sector are actually running on the ODK base, customized to suit specific needs.
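To give a flavour of XLSForm: each question is a row in a "survey" sheet, and answer lists go in a "choices" sheet. A minimal illustrative sketch (the question names and the yes_no list are made up for this example; the column headings are from the XLSForm standard):

```
survey sheet:
type              | name       | label
text              | resp_name  | Respondent's name
integer           | hh_size    | How many people live in the household?
select_one yes_no | has_toilet | Does the household have a toilet?
geopoint          | location   | Record the GPS location

choices sheet:
list_name | name | label
yes_no    | yes  | Yes
yes_no    | no   | No
```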

My experience with ODK:

1. I have guided one team member each at two Pune NGOs, CEE and Grammangal, through just the setup and getting started, a couple of years back, and I have seen them deploy and take this forward far beyond my expectations. I would recommend other NGOs who want to do this to consult them for a peer-to-peer exchange : they’ll speak your language! (Marathi? Even more so!)
2. I have also briefly taught it to some interns from a social sciences college (major CV value-add bro!),
3. and in a crash course workshop for some governance Fellows on practical mapping/survey techniques.
4. But as of now I haven’t myself participated in a complete survey exercise using ODK.

The ODK community has a cool discussion forum where people can post queries. The majority of users, and consequently deployers, of ODK are from a social sciences background rather than a coding background, so the forum is quite helpful and newbie-friendly.

B] Feedback

One regular piece of feedback on the ODK tool has been about its one-way nature. If you have one project with 12 people on-field sending in data, it’s a hub-and-spoke model: the 12 people can’t easily check what the others have posted. They can’t even retain on their phones the data they themselves collected once they have submitted it to “HQ”. So it’s difficult for team members to keep track of things and stay in the loop, especially if they’re in remote areas or on the move and can’t get online on a laptop to inspect the collected data, or in cases where the field agent is comfortable operating the phone app (multilingual, btw) but not a laptop. (Did I mention you can operate ODK in zero network coverage and sync things with HQ at the end of the day / outing, when you get back in range or climb up the roof or tie your phone to your pet eagle or something?)

C] ODK-2

The community of developers has responded and has now launched ODK-2. The original ODK is still operational as ODK-1, while ODK-2 brings in a lot more ways to achieve synchronisation and customization. I haven’t gone through the whole thing yet, but it seems they’ve resolved the feedback I mentioned earlier.
Take a look at the tour here:

So it seems to me like this platform is going to enable survey exercises where the hierarchy is considerably flattened, and all the on-field folks can see what their colleagues are collecting and co-ordinate better. It seems like a good platform to run something like a citizens’ tree-mapping or large-area survey exercise on.


screenshot from ODK-2 tour

Happy exploring 🙂

Disclaimer : I’m only rapid-sharing what I just stumbled upon, so I won’t be able to answer detailed queries right now. For that, check out and join the forum. And see here the newsletter that sparked this post.

PS: Does your org use ODK? Then get yourselves listed on the users map; scroll to the bottom of that newsletter to know how.

(this is a cross-post from an email. All opinions are personal. Author: Nikhil VJ, DataMeet Pune chapter)

Using OpenRefine to iron out spelling differences in data

OpenRefine is a powerful open source tool for working with messy data, made for non-coders. And just to stress something I don’t want readers to miss:

made for non-coders.

It has a ton of features, and I haven’t had the chance to use or learn the vast majority of them. But I did learn one thing, so I’m going to share that particular application: reconciling spelling differences in data.


The data that I worked on: a few worksheets of bus stop names from “BRT & Non BRT Route Details & Bus Stop LatLong.xls” on the Pune Open Data Portal.

I’ve written a detailed log of the whole (incomplete) exercise here, but I’ll expand on just the OpenRefine part. Apologies if any of this gets confusing; just see the screenshots!


1. All the names that need reconciling are brought under one column. Other columns are kept to track serial numbers, etc. If two or more data tables are involved, then a combined table is created carrying columns from all the tables, and one more column is added to disambiguate which sheet each row came from.

2. A copy of this names column is made in the same table, just in case we screw up and need to revert. Always make backups!
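Steps 1 and 2 can be sketched in pandas, for anyone comfortable with a bit of Python. This is just a rough sketch with made-up stop names and column names standing in for the real worksheets:

```python
import pandas as pd

# Illustrative stand-ins for the worksheets; the real file had sheets of bus stop names.
sheets = {
    "BRT": pd.DataFrame({"stop_name": ["Pune Vidyapeeth Gate", "Sangamwadi"]}),
    "NonBRT": pd.DataFrame({"stop_name": ["Pune Vidyapith Gate ", "Shivajinagar"]}),
}

frames = []
for sheet_name, df in sheets.items():
    df = df.copy()
    df["source_sheet"] = sheet_name  # step 1: track which sheet each row came from
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
# step 2: keep a backup copy of the names column before any cleanup
combined["stop_name_original"] = combined["stop_name"]
```

Save `combined` out to a CSV and that is the file you load into OpenRefine.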

3. Save the file and load it into OpenRefine. (Assuming you have downloaded, installed/extracted and started OpenRefine, and it is open in a tab in your browser.)

4. Now we see the table laid out. There are drop-down arrow buttons on each column heading, leading to a menu of actions for that column. Locate the names column you want to work on and open its menu.

5. First things first: clean up the cells by trimming leading and trailing whitespace.

(screenshot: trimming spaces in OpenRefine)

clustering screenshot
note: names that look the same are why step 5 is important

6. Open the column’s menu again and choose Text facet > Clustering on the target column. It should open up something like below. (Click the screenshot to see full size.)

7. Consider the first listing as an example. ‘Pune Vidyapeeth Gate’ has 3 different spellings spanning 63 rows. If we click on one of the spellings, it appears under “New Cell Value”. We can even type a whole new spelling. Select the checkbox under “Merge?” and press the big “Merge Selected & Re-Cluster” button at the bottom.

8. What happens in the background: The rest of the table stays as-is. But these 63 cells are over-written with the single new value.

9. So in two clicks we have done what would normally be a very long process in Excel of filtering, sorting, copy-pasting and restoring the original order.

10. Here is a saved diff check that shows my original list on the left and the normalized list I got through OpenRefine on the right.

11. Do explore the various other options on the page. To bring similar-spelling words together, certain linguistic algorithms are used (apologies, but I haven’t dug deeper into them), and there are multiple options, seen at the top of the last screenshot (“Method”, “Keying function”). Depending on your data, one may work better than another. You can do one run of clustering/normalizing and then choose another algorithm that might cluster in another way. There will be false positives too (i.e. names that really are supposed to be different, but the program thinks they should be the same); you can skip those by leaving their “Merge?” checkbox off.
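For the curious: OpenRefine’s default “fingerprint” keying function is roughly this — trim, lowercase, drop punctuation, fold accents, then sort the unique words so that reorderings and stray spaces collapse to the same key. A rough sketch (simplified from what OpenRefine actually does, so treat it as an approximation):

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(name: str) -> str:
    """Approximate sketch of OpenRefine's 'fingerprint' keying function."""
    s = name.strip().lower()
    s = s.translate(str.maketrans("", "", string.punctuation))
    # fold accented characters towards plain ASCII where possible
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    tokens = sorted(set(s.split()))  # unique words, order-independent
    return " ".join(tokens)

# Names that produce the same key land in the same cluster.
names = ["Pune Vidyapeeth Gate", "pune vidyapeeth  gate", "Gate, Pune Vidyapeeth"]
clusters = defaultdict(list)
for n in names:
    clusters[fingerprint(n)].append(n)

print(dict(clusters))  # all three variants collapse into one cluster
```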

false positives and funny names screenshot
Bekar Company :P!

12. One more thing: this program works just as well on other languages. And it is particularly useful in catching all the cases where there’s confusion between choti ee maatra and badi ee maatra:

marathi names clustering
It does Marathi names too!

Bus yaar

So that’s it for this walkthrough. Gratitude to DMer Aravind from Bangalore for insisting I check out this tool when I was saying “it’s ok, I’m doing it in Excel”. Have you used OpenRefine too? What was your application? Do share here or drop me an email at nikhil.js [at]


PS: No, I’m not going to explain why in the world I had a table of 40,000+ rows of bus stop names. It’s too traumatizing to share, please understand 😛

A tool for composing transit schedules data in static GTFS standard

Over the last few months I went deep-dive into a project with WRI (World Resources Institute) and Kochi Metro Rail Ltd (KMRL) in Kerala, to convert their scheduling data to the global-standard static GTFS format.

The first phase of the project was just the data conversion. I wrote a Python program that took in KMRL’s data files and some configuration files, and created a static GTFS feed as output. There were many more complexities than I can share here, and Shine David from KMRL was a crucial enabler, being the inside man sharing all necessary info and clarifications.

On 17 March this year, Kochi Metro Rail Ltd became India’s first transit agency to publish static GTFS feed of their system as open data.

See the KMRL open data portal and some news coverage: 1, 2, 3, 4.

See it visualized on a global GTFS feeds map called TRAVIC.
(Zoom in to Kochi and press fast-forward. You can adjust the time of day.)

Phase 2 of the project aimed higher: we started work on a program with a graphical user interface that would automate several manual processes and help KMRL update their data as the network grows, publish updated feeds on their own without relying on any external entity, and, very importantly for their case, integrate Kochi’s bus and ferry routes in the near future to build towards a unified public transport dataset and facilitate integrated ticketing. As we progressed, we realised the potential this could have if we generalised it so that any transit agency could use it.

So, here’s launching..

Did I mention we have open-sourced the whole thing? Big kudos to WRI, and especially Vishal who co-ordinated the whole project, for being proactive and pro-open-source with this.

The program runs in the browser (actually, please use Chrome or Chromium, and no mobile!) as a website with a server backend created by a Python 3 program. It manages the data in a portable internal database and publishes fresh GTFS feeds whenever wanted.

To play around with a live demo version of the program online, contact nikhil on nikhil.js [at]

Note: while this program can be deployed on a free Heroku account, it is currently not designed for multi-user use. That’s not really in the basic requirements, as the end user is just a transport agency’s internal team. (With your participation we can change that.)

So, why am I sharing this here? Apart from obviously sharing cool stuff:

With this it’s possible to design any transport system’s static GTFS feed from scratch, or edit an older feed you have lying around and bring it up to date.
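For those new to the format: a static GTFS feed is just a zip of CSV text files with standard names and columns, which is what makes it so tractable to generate programmatically. A minimal illustrative sketch (stop IDs and coordinates are approximate, and a real feed needs more files, e.g. trips and stop times):

```python
import csv
import io
import zipfile

# Minimal illustrative content for three of the required GTFS files.
files = {
    "agency.txt": [
        ["agency_id", "agency_name", "agency_url", "agency_timezone"],
        ["KMRL", "Kochi Metro Rail Ltd", "https://kochimetro.org", "Asia/Kolkata"],
    ],
    "stops.txt": [
        ["stop_id", "stop_name", "stop_lat", "stop_lon"],
        ["ALVA", "Aluva", "10.110", "76.349"],  # approximate coordinates
    ],
    "routes.txt": [
        ["route_id", "route_short_name", "route_long_name", "route_type"],
        ["LINE1", "1", "Aluva - Palarivattom", "1"],  # route_type 1 = metro/subway
    ],
}

# A GTFS feed is simply these CSVs zipped together.
with zipfile.ZipFile("feed.zip", "w") as z:
    for fname, rows in files.items():
        buf = io.StringIO()
        csv.writer(buf).writerows(rows)
        z.writestr(fname, buf.getvalue())
```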

Invitation for Collaboration

There is more that can be done with enhancements and integrations, and there are still some limitations that need to be resolved. I’m documenting all I know in the issues section. So I’m reaching out to invite collaboration on the coding and beta-testing fronts. One motive behind open-sourcing is that the community can achieve far more with this project than any private individual or group can. There’s also scope to integrate many other GTFS innovations happening. Please visit the github repo and engage!

Lastly, a big shout-out to DMers Srinivas Kodali from the Hyderabad chapter for connecting and lots of guidance, and Devdatta Tengshe from the Pune chapter for helping me learn asynchronous server setup in Python lightning fast (with a working example for dummies!).

Quick links:


Guide on Digitizing Static Maps

I was recently invited to Nagpur by a group called Center for Peoples Collective, to brainstorm doing for Nagpur the kind of things I’ve done in Pune for budget data processing/viz and mapping. We found that they didn’t have any digital data (i.e. shapefile, KML, etc.) of Nagpur’s electoral wards, but they did have some high-res images released by the Nagpur Municipal Corporation (NMC) with the boundaries marked. So I walked them through a process I’ve worked out, which uses free online services and doesn’t need any software or advanced skills. I’m sharing that process here.
Continue reading Guide on Digitizing Static Maps