Using OpenRefine to iron out spelling differences in data

OpenRefine is a powerful open source tool for working with messy data, made for non-coders. And just to stress something I don’t want readers to miss,

made for non-coders.

It has a ton of features, and I haven’t had the chance to use or learn the vast majority of them. But I did learn one thing, so I’m going to share that particular use case: reconciling spelling differences in data.

Background

The data I worked on: a few worksheets of bus stop names from BRT & Non BRT Route Details & Bus Stop LatLong.xls, from the Pune Open Data Portal.

I’ve written a detailed log of the whole (incomplete) exercise here, but I’ll expand on just the OpenRefine part. Apologies if any of this gets confusing; just see the screenshots!

Steps

1. All the names that need reconciling are brought under one column. Other columns are kept to track serial numbers, etc. If two or more data tables are involved, a combined table is created carrying columns from all the tables, plus one more column recording which sheet each row is from.

2. A copy of this names column is made in the same table, just in case we screw up and need to revert. Always make backups!

3. Save the file and load it up in OpenRefine. (This assumes you have downloaded, installed/extracted and started OpenRefine, and that it is open in a tab in your browser.)

4. Now we see the table laid out. There are drop-down arrow buttons on each column heading, leading to a menu of actions for that column. Locate the names column you want to work on and open its menu.

5. First things first: clean up the cells by trimming leading and trailing whitespace (in the column menu: Edit cells > Common transforms > Trim leading and trailing whitespace).

openrefine trim spaces screenshot

clustering screenshot
note: names that look the same are why step 5 is important

6. Open the column’s menu again and choose Text facet > Clustering on the target column. It should open up something like the screenshot below. (Click the screenshot to see it full size.)

7. Consider the first listing as an example. ‘Pune Vidyapeeth Gate’ has 3 different spellings spanning 63 rows. If we click on one of the spellings, it appears under “New Cell Value”. We can even type a whole new spelling. Select the checkbox under “Merge?” and press the big “Merge Selected & Re-Cluster” button at the bottom.

8. What happens in the background: the rest of the table stays as-is, but these 63 cells are overwritten with the single new value.

9. So in two clicks we have done what would normally be a very long process in Excel of filtering, sorting, copy-pasting and restoring the original order.

10. Here is a saved diff check on diffchecker.com that shows my original list on the left and the normalized list I got through OpenRefine on the right.

11. Do explore the various other options on the page. To bring similarly spelled words together, OpenRefine uses various string-matching algorithms (apologies, I haven’t dug deeper into them), and there are multiple options, seen at the top of the last screenshot (“Method”, “Keying function”). Depending on your data, one may work better than another. You can do one run of clustering / normalizing and then choose another algorithm that might cluster differently. There will be false positives too (i.e. names that are supposed to be different but that the program thinks should be the same); you can ignore these by leaving their “Merge?” checkbox unticked.

false positives and funny names screenshot
Bekar Company :P!

12. One more thing: this program works just as well in other languages. It is particularly useful for catching all the cases where there’s confusion between the choti ee maatra and the badi ee maatra:

marathi names clustering
It does Marathi names too!
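For the curious, the “Keying function” mentioned in step 11 is worth a peek. OpenRefine’s default “fingerprint” method can be sketched roughly like this in Python (an approximation of the idea, not OpenRefine’s actual code): trim, lowercase, strip punctuation, then sort and de-duplicate the word tokens, so that names differing only in case, spacing, punctuation or word order collide onto the same key and form a cluster.

```python
import re
from collections import defaultdict

def fingerprint(name):
    """Approximate OpenRefine-style 'fingerprint' key for a string."""
    cleaned = re.sub(r"[^\w\s]", "", name.strip().lower())
    return " ".join(sorted(set(cleaned.split())))

def cluster(names):
    """Group names sharing a fingerprint; groups of 2+ are merge candidates."""
    groups = defaultdict(list)
    for n in names:
        groups[fingerprint(n)].append(n)
    return [group for group in groups.values() if len(group) > 1]

# All three variants land in one cluster:
cluster(["Pune Vidyapeeth Gate", "pune vidyapeeth gate.", "Gate, Pune Vidyapeeth"])
```

The real implementation also normalises accented characters and handles more edge cases, and the other methods (n-gram fingerprints, phonetic keys, nearest-neighbour distances) trade precision for recall in different ways.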

Bus yaar

So that’s it for this walkthrough. Gratitude to DMer Aravind from Bangalore for insisting I check out this tool when I was saying “it’s ok, I’m doing it in Excel”. Have you used OpenRefine too? What was your application? Do share here or drop me an email at nikhil.js [at] gmail.com.

Links

PS: No, I’m not going to explain why in the world I was having a table of 40,000+ rows of bus stop names. It’s too traumatizing to share, please understand 😛

A tool for composing transit schedules data in static GTFS standard

Over the last few months I took a deep dive into a project with WRI (World Resources Institute) and Kochi Metro in Kerala (KMRL) to convert their scheduling data to the global-standard static GTFS format.

The first phase of the project was about just the data conversion. I wrote a Python program that took in KMRL’s data files and some configuration files, and created a static GTFS feed as output. There were many more complexities than I can share here, and Shine David from KMRL was a crucial enabler, being the inside man who shared all the necessary info and clarifications.
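To give a flavour of the output side (the input parsing was KMRL-specific, so this is only a generic sketch with placeholder data, not the actual converter): a static GTFS feed is just a zip of CSV text files with standard names, so the final writing step boils down to something like:

```python
import csv
import io
import zipfile

def write_gtfs(path, tables):
    """Write a static GTFS zip.

    tables maps a GTFS file name ('agency.txt', 'stops.txt', 'routes.txt',
    'trips.txt', 'stop_times.txt', ...) to a list of row dicts whose keys
    are the standard GTFS field names.
    """
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as z:
        for filename, rows in tables.items():
            buf = io.StringIO()
            writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
            z.writestr(filename, buf.getvalue())

# Minimal illustrative call (placeholder values, not KMRL data):
write_gtfs("feed.zip", {
    "stops.txt": [{"stop_id": "S1", "stop_name": "Sample Station",
                   "stop_lat": "10.0", "stop_lon": "76.0"}],
})
```

A valid feed needs several required files and fields beyond this; see the GTFS reference linked at the end of the post.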

On 17 March this year, Kochi Metro Rail Ltd became India’s first transit agency to publish a static GTFS feed of its system as open data.

See the KMRL open data portal and some news coverage: 1, 2, 3, 4.

See it visualized on a global GTFS feeds map called TRAVIC.
(Zoom in to Kochi and press fast-forward. You can adjust the time of day.)

Phase 2 of the project aimed higher: we started work on a program with a graphical user interface that would automate several manual processes and help KMRL update their data as the network grows, publish updated feeds on their own without having to rely on any external entity, and, very importantly for their case, integrate the bus and ferry routes of Kochi in the near future to build towards a unified public transport dataset and facilitate integrated ticketing. As we progressed, we realised the potential this could have if we generalised it so that any transit agency could use it.

So, here’s launching..

https://github.com/WRI-Cities/static-GTFS-manager

Did I mention we have open sourced the whole thing? Big kudos to WRI, and especially to Vishal who co-ordinated the whole project, for being proactive and pro-open-source with this.

The program runs in the browser (actually, please use Chrome or Chromium, and no mobile!) as a website with a server backend created by a Python 3 program. It manages the data in a portable internal database and publishes fresh GTFS feeds whenever needed.

To play around with a live demo version of the program online, contact nikhil on nikhil.js [at] gmail.com

Note: while the program can be deployed on a free Heroku account, it is currently not designed for multi-user use. That’s not really in the basic requirements, as the end user is just a transport agency’s internal team. (With your participation we can change that.)

So, why am I sharing this here? Apart from obviously sharing cool stuff: with this it’s possible to design any transport system’s static GTFS feed from scratch, or to edit an older feed you have lying around and bring it up to date.

Invitation for Collaboration

There is more that can be done with enhancements and integrations, and there are still some limitations that need to be resolved. I’m documenting all I know in the issues section. So I’m reaching out to invite collaboration on the coding and beta-testing fronts. One motive behind open sourcing is that the community can achieve far more with this project than any private individual or group can. There’s also scope to integrate many other GTFS innovations happening elsewhere. Please visit the GitHub repo and engage!

Lastly, a big shout-out to DMers: Srinivas Kodali from the Hyderabad chapter for connecting and lots of guiding, and Devdatta Tengshe from the Pune chapter for helping me learn asynchronous server setup in Python in a lightning-fast way (with a working example for dummies!).

Quick links:

static-GTFS-manager
https://developers.google.com/transit/gtfs/reference/

Data Diaries: What I learned

As some of you might know, I’ve recently moved back to the US, and after taking a break I wanted to share some of my thoughts on the past 7 years of open data in India. These are just some of the big lessons I’ve learned and observations that I think are important.

Data needs advocates from every sector

Historically, the biggest voices government hears about data are corporations selling products or statisticians acting as gatekeepers. Now that data is a part of everybody’s life in ways that are unseen, data literacy is necessary for everyone, and data needs advocates from every walk of life. What I experienced with DataMeet was that broad data ideas with inputs from experts across sectors can be very powerful. When you advocate for the data itself, and for how it needs to be accessible to everyone, you can offer solutions and perspectives that statisticians and for-profit companies can’t: ideas that are new because they are in the best interest of the whole. That’s why we are invited to the table: even though it doesn’t make political or economic sense (yet) to listen to us, ours is a different perspective that is helpful to know.

This is why every sector (education, environment, journalism, all actors) has to integrate a data advocacy component into their work. Issues of collection, management, and access affect your work, and when you go to talk to governments about the issues you want to improve, creating better data and making it easier to get should automatically be a part of it. The attitude of “I got the data I need so I’m good” doesn’t make it any easier the next time you need data, or the next time you are upset about the quality of the data being used to create policy.

Building ecosystems is more important than projects

In 2011, when I started to work on water data, it became clear that there was no techie/data ecosystem for non-profits to tap into for advice and talent. There were individuals, but no larger culture of tech/data for public good. This hadn’t been the case in the US, so when I was at India Water Portal I wanted to spend time finding it, because it’s really important for success. I was basically told by several people that it wasn’t possible in India: that people don’t really volunteer or share in the way the West does, and that it would be difficult to achieve.

With open data growing quickly into an international fad, with lots of funding from places like the Open Gov Partnership and Omidyar, I knew open data projects were going to happen. But they would be in silos, and they would largely not be successful. A culture that asks and demands data, and then has the means to use it, is not something created from funded projects. It comes from connecting people who have the same issues and demonstrating the demand.

DataMeet has largely been a successful community but not a great organization. This is my fault; a lot of my decisions were guided by those early issues. It was important to have a group of people demonstrating demand, need, and solutions who weren’t paid to be advocates, but who were interested in the problem and found a safe space to work on it. That is how you change culture, and that is why I meet people who say they believe in open data because of DataMeet. That would not have happened as much if we had just done projects.

You can’t fundamentally improve governance by having access to data.

It is what we work toward as a movement, but it just doesn’t really work that way, because bad governance is not caused by a lack of information or of data utilization. Accountability can’t happen without information or data, and good governance can’t happen without accountability. But a lot of the work spent getting the government to collect and better use data is often not useful, mostly because of a lack of understanding of the root cause of the issue. I found that budget problems, understaffing, overstressed firefighting, corruption, interest groups, and just plain apathy are more to blame than the lack of information. This is something that civil society has to relearn all the time. Not to say data can’t help with these things, but if your plan is to give the government data and expect it to solve a problem, you are wasting time. Instead you should be using that data to create accountability structures that the government has to answer to, or using it to support accountability influences that are already in use.

You gotta collect data

Funding that doesn’t include data collection, cleaning, and processing costs is pointless. Data collection is expensive but necessary. In a context like India’s, where it is clear that the government will not reach the data collection levels that are necessary, you have to look at data collection as a required investment. India’s large, established civil society and social sector is one of its strongest assets, and they collect tons of data, but not consistently. A lot of projects I encountered were based on the Western model of the data being there: even if not accessible, it is complete somewhere. NOPE. They count on the data existing and don’t bother to think about the problems of collection, clean-up, processing, and distribution. You have to collect data, and do it consistently; it has to become integrated into your mission.

Data is a pretty good indicator of how big a gap exists between two people trying to communicate.

Every single data-related conversation goes like this: “The data says this, but I know from experience that….” Two people will have different values, and communicating a value by saying “I think you should track xyz also, because it’s an important part of the story” can be a very productive way to work out differences. That is why open data methodology is so important. It also becomes a strong way for diverse interests to communicate, and that is always a good thing.

Data is a common

In places that still don’t have the best infrastructure, where institutions and official channels aren’t the most consistent, the best thing you can do is make information open and free. It will force issues out, create bigger incentives for solutions, and make those solutions cheaper. Openness can be a substitute for money if there is an ecosystem to support the work.

You can collect lots of data, but keeping it to yourself gets society nowhere.

A lot of people in India are wasting a lot of time doing the same thing over and over again. If I had 5 rupees for every person I spoke to who said they had already processed a shapefile that we had just done, or had worked with some other hard-to-clean dataset, I could buy the Taj Mahal. Data issues in the country are decades old, but not sharing work causes stunting. Momentum is created by rapid information sharing and solutions; proprietary systems and data hoarding don’t create it. The common societal platforms making their way around India’s civil society and private company meeting rooms won’t do it either. You can’t design a locked-in platform with every use in mind; it’s why non-open portals have generally had such limited success. If you have solved a hard problem and make the solution open, you save future generations from having to literally recreate the wheel you just made. How much more brainpower can you dedicate to the same problems? Let people be productive on new problems that haven’t been solved yet.

The data people in government are unsung heroes.

Whenever I met an actual worker at the NIC or BHUVAN or any of the data/tech departments, they were very smart, very aware of the problems, and generally excited about the idea of DataMeet and that we could potentially help them solve a problem. It was not uncommon, when in a meeting with people from a government tech project, for them to ask me to lobby another ministry to improve the data they have to process. While I wish I had that kind of influence, it made me appreciate that the government is filled with people trying their best with the restrictions they have. The government has “good bones”, as they say, and with better accountability it could get to a better place.

I don’t think I covered everything, but I’m very grateful for my time working on these issues in India. I feel like I was able to achieve something, even though there is so much more to do. Having met all these people who are dedicated to solving hard problems with others and never give up will inspire me for a long time.


Making A Football Data Viz With D3 and Reveal.js

This is a write-up on how I made a slideshow for the Under-17 World Cup.

The U-17 World Cup is the first-ever FIFA tournament to be hosted by India. Like many of you, I’ve seen plenty of men’s World Cups, but never a U-17 one. To try and understand how the U-17 tournament might differ from the ‘senior’ version, I compared data from the last U-17 World Cup, held in Chile in 2015, and the last men’s World Cup, held in Brazil in 2014.

The data was taken from Technical Study Group reports that are published by FIFA after every tournament. (The Technical Study Group is a mixture of ex-players, managers and officials associated with the game. You can read more about the group here.)

In particular, I used the reports for the 2014 World Cup and the 2015 U-17 World Cup. The data was taken pretty much as is, and thankfully didn’t have to be processed much. An example of the data available in the report can be seen in the image below. It shows how the 171 goals in the 2014 World Cup came about.

A look at some of the data in the report

The main takeaway from the comparison with the men’s World Cup is that the U-17 World Cup might see more goals and fewer 0-0 draws on average. The flipside is that there could be more cards and penalties too. For more details, check the slideshow.

BE LESS INTIMIDATING FOR READERS

I know that using just one World Cup each to represent men’s and U-17 football may not be particularly rigorous. We could also have used data from the previous three or four World Cups in each age format. But I was scared that if I did that, the data story would become denser and more intimidating for readers. I wanted to make this easy to follow and understand, which is why I simplified things this way.

A card from the slideshow

Another thing I did to make this easier to digest was to stick to one main point per card (see image above). The main point is in the headline; below it you get a few lines of text showing exactly how the main point was arrived at. The figures that have been calculated and compared are set in bold. Below that is an animated graphic that visually reinforces the main point of the slide.

The data story tries to simulate a card format, one that you can just flick through on a mobile. I used the slideshow library reveal.js to make the cards. But I suspect there is a more standard, established method that mobile developers use to create a card format; I’ll have to look into this further.

The animations were done with D3.js, with help from a lot of examples on stackoverflow and bl.ocks.org. If you’re new to D3 and want to know how these animations were done, here’s more info.

ANIMATING THE BAR CHART

The D3 ‘transitions’, or animations, in this slideshow are basically all the same. There’s (a) an initial state where there’s nothing to see, (b) a final state where the graphic looks the way you want, and (c) a transition from the initial state to the final state over a duration specified in milliseconds.
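Under the hood, a transition just interpolates each changing attribute from its initial value to its final value over the duration, frame by frame (D3 also applies an easing function by default). Stripped of D3, the idea looks like this sketch:

```javascript
// Linear interpolation between an initial and a final value; t runs 0..1.
const interpolate = (start, end, t) => start + (end - start) * t;

// A bar's height going from 0 px (initial state) to 120 px (final state)
// over a 500 ms transition, sampled at a given elapsed time:
const heightAt = ms => interpolate(0, 120, Math.min(ms / 500, 1));
```

In the actual slideshow, `.transition().duration(500)` does this bookkeeping for every attribute set after the `.transition()` call.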

A snippet of code for animating the bars

For example, in the code snippet for the bar animation above, you see two attributes of the bars changing during the transition: the ‘height’ and ‘y’ attributes, over a duration of 500 milliseconds. You can see another example of this animation at bl.ocks.org here.

ANIMATING THE STACKED BAR CHART

This animation was done in a way similar to the one above. The chart is called a ‘normalised stack chart’ and the code for this was taken from the bl.ocks.org example here.

The nice thing about this chart is that you don’t have to calculate the percentages beforehand. You just feed in the raw data (see image below) and you get the final percentages visualised in the graphic.
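The per-row arithmetic behind that normalisation is just dividing each value by the row total (in current D3 versions the `d3.stackOffsetExpand` offset does this for you; the linked example handles it internally):

```javascript
// Convert one row of raw counts into fractions of the row total.
const toFractions = row => {
  const total = row.reduce((sum, v) => sum + v, 0);
  return row.map(v => v / total);
};

// e.g. raw goal counts [3, 1] become fractions 0.75 and 0.25
```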

The raw data on goals gets converted to percentages

ANIMATING THE LINE CHART

The transition here isn’t very sophisticated. The two lines and the data points on them are simply set to appear 300 milliseconds and 800 milliseconds, respectively, after the card appears on screen (see the code snippet below).

A snippet of code for changing the opacity of the line

A cooler line animation would have been ‘unrolling’ the line, as seen in this bl.ocks.org example. Maybe next time!

ANIMATING THE PIE CHART

I won’t pretend to understand the code used here. I basically just adapted this example from bl.ocks.org and played around with the parameters till it came out the way I wanted. The example is by Mike Bostock, the creator of D3.js, and in it he explains his code line by line (see image below). Do look at it if you want to fully understand how this pie chart animation works.

Commented code from Bostock

ANIMATING THE ISOTYPE CHART

Yup, this chart is called an isotype chart. This animation is another one where the transition uses delays. If you look at the gif, you can see the three cards on the left side being filled one after the other.

Some of the code used in animating this isotype chart

They all start off with an opacity of 0, which makes them invisible (or transparent, technically). What the animation does is make each of the cards visible by changing its opacity to 1 (see image above). This is done after different delays: 200 milliseconds for the bottom card, 400 for the card in the middle, and 600 for the card on top.
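The staggering logic itself is tiny. A sketch (in the real code this is done with D3 delays on the opacity change; the 200/400/600 ms values are the ones mentioned above):

```javascript
// Delay (ms) before each card is revealed, bottom to top.
const delays = { bottom: 200, middle: 400, top: 600 };

// Opacity of a card ms milliseconds after the slide appears:
// 0 (invisible) until its delay has passed, then 1 (fully visible).
const opacityAt = (card, ms) => (ms >= delays[card] ? 1 : 0);
```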

FINAL WORD

If you’ve never worked with D3 before, I hope this write-up encourages you to give it a shot. You can look at all the code for the slideshow in the github repo here. All comments and feedback are welcome! 🙂

COVER IMAGE CREDIT: Made in Inkscape with this picture from Flickr

Ethical Reporting of Data and Security Issues

Who is this document for?

In this era, where we have a mobile application and a website for everything, and even with more software engineers, data leaks and security issues are a common phenomenon. This document will hopefully help anyone who finds a leak to report it ethically, without causing too much harm.

What is the need for this document?
In India we have recently seen a lot of leaks, from government websites to Jio to Zomato to medical testing data. More engineers are willing to share these leaks on Twitter, and more news organisations are covering them. But in most of these cases we have observed that care has not been taken to protect user information and privacy. Hence we decided to write this document.
Warning: this is not a comprehensive guide, but it covers what we think are the best practices to follow at this time. Please also make sure you talk to a lawyer.

First things first

Determining the criticality of an issue
A data leak or a security issue can have varying levels of criticality depending on whom it damages. Some leaks can cause loss of revenue for an organisation (e.g. the food-ordering service Foodpanda in India lost a lot of revenue because of a bug). Other leaks can result in an invasion of privacy and end up affecting people’s personal lives, as with the pathology clinic in Mumbai where the medical records of patients were leaked. Some leaks can have financial consequences for the data subjects, such as leaks involving financial information, passwords to accounts tied to transaction ability, etc.
The more critical a leak, the more caution you have to exercise in reporting it, whether on social media or otherwise. So how does one determine the criticality? Here are some simple questions to guide you:
  1. Does this affect more than one person?
  2. Does reporting this damage lives in the long term, and is the damage irrevocable? E.g. the leak of health records of HIV patients from a pathology lab or a hospital could mark the people involved for their entire lives. This information, coupled with social stigma, could affect their prospects of a healthy life or their professional lives.
  3. How long will it take for the organisation to close the leak?
  4. In case there is no clear tangible impact, could this have an impact in combination with other public data?
Ideally, it is recommended that one does not publicize a leak without first informing the affected organisation and following up with them to close it.
Consequences to consider when you report a security issue
Reporting a leak in India is always a tricky situation; it could lead to criminal proceedings against you. We advise that you always get legal advice before you report anything, especially if the organisation has not defined a bug reporting program. The Information Technology Act, 2000, for example, is one of the laws that defines what is considered a crime with respect to actions online.
For example, Section 43 of the Act makes it an offence to gain unauthorised access to a system, or to download from or damage it.
Remember that reporting a leak sometimes takes extensive follow-ups. It needs perseverance and patience. Sometimes reporting a leak requires you to gain skills which are not technology-oriented, e.g. networking to connect with a decision maker who could help plug the issue, or writing skills to share the issue online. One has to be prepared to learn, and to get more support when not equipped with the right skills.
We highly recommend that you do a personal risk assessment before you move forward.
Risk assessment for a reporter
Warning: this is only a guideline; we recommend you talk to your lawyers, security professionals and your support network for a better assessment.
  1. Do I have support? By which we mean:
    • financial support, in case of consequences
    • personal networks to support me
    • legal support systems
  2. Who else is at risk if I am in trouble?
  3. What are the chances that the organisation I am reporting against will act to prosecute me?
  4. Do I have the mental ability / support to handle the stress for that period of time?
  5. Do I have the technological support to help me protect myself and my loved ones?
Calculating the risk of a vulnerability
Security professionals around the world use standard methodologies to calculate the risk of a vulnerability. One such tool which can help you calculate the risk is this:
It is highly recommended that you do the risk assessment of a vulnerability, to help you strategise or even decide whether to continue working on it.

Effectively document your findings

Documenting your story
When you chance upon a leak and confirm it by double-checking, it is highly recommended that you make elaborate notes on how you discovered it. This documentation should be done as soon as possible, ideally right after you are confident it is a leak, because this is something various stakeholders will ask you about repeatedly, and having accurate details will help support your case. You can use secure note-taking platforms like Etherpad / Riseup Pad to protect your privacy if you are making notes online. It is recommended that you store the notes offline in an encrypted format. Also take screenshots with timestamps (though this could be a double-edged sword if you are liable to prosecution because of the leak).

Documenting the Bug

Elaborate documentation of the leak itself helps get it fixed faster. Engineers always find it useful to have more information.

  • Include steps showing how to replicate the bug, and mention preconditions, e.g. accessing a particular page with a particular browser, or accessing it from a certain phone.
  • Include screenshots if possible.

Effective process to share your findings 

Reaching out to concerned officials / Organisations

The right way to report a leak is to reach out to the organisation where the leak is and write to them about your findings before you go public. (Of course this has consequences; we certainly don’t recommend this for Snowden- or Manning-type leaks, and assessing your risk is very important before you do this.) This can be done in many ways:

  • Bug bounty programs: most technology orgs have a bug bounty program and sometimes offer rewards for reporting leaks. This is the easiest and most rewarding way to reach out to orgs.
  • Organisation public issue trackers: some open organisations do not have a bug bounty program but have public bug repositories, linked either to their code repos or to their websites. This is another way to report security leaks.
  • Community outreach coordinators: in the absence of bug bounty programs or issue trackers, some organisations have outreach coordinators who act as developer and business liaisons for the organisation. For any critical leak it is highly advisable to talk to them; this will ensure the closure of leaks with minimal lead time.
  • Public / private mail IDs: while some prefer anonymous reporting, personal reporting sometimes helps build confidence and obtains quick results. If you wish to stay anonymous, it might be best to report to the public email IDs available on the website. On the other hand, if you have a certain degree of confidence in the intent of the organisation, it might be best to use your personal network to reach out to them and talk them through it.

In case of leaks on government websites: every country has a specific process for reporting leaks on government websites. In India specifically, CERT India is responsible for the security of government websites, and it is best to report to them. The other organisations one can reach out to are:

  1. CERT India
  2. MEITY
  3. The government body that is affected

Talking about the Security Issue in Public

While it is very easy to talk about a security issue in public, it is also considered an honor by many, and sometimes a necessity to report. We recommend the following actions if you plan to do so.

As a general rule, avoid talking about a leak or an issue before it has been fixed.

Initiating the removal of sensitive data

Make sure that at least the sensitive data is removed before you share an issue, if you can’t get the complete issue fixed. By sensitive data we mean personally identifiable information. It is also recommended that you clean the data of secondary identifiers (identifiers which, when coupled with other information, can still make data personally identifiable). If you are not sure about identifiers, we recommend that you talk to organisations which have been working on open data for a while (a list of some organisations is below).
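As an illustration of what that cleaning could look like in practice (the field names below are hypothetical, and which fields count as direct or secondary identifiers depends entirely on your dataset):

```python
import hashlib

DIRECT = {"name", "phone", "email", "aadhaar_number"}   # drop outright
SECONDARY = {"pincode", "date_of_birth"}                # pseudonymise

def redact(record):
    """Remove direct identifiers and one-way-hash secondary identifiers
    before sharing any sample of leaked data."""
    clean = {}
    for field, value in record.items():
        if field in DIRECT:
            continue  # never share direct identifiers
        if field in SECONDARY:
            clean[field] = hashlib.sha256(str(value).encode()).hexdigest()[:10]
        else:
            clean[field] = value
    return clean
```

Hashing is not bulletproof; low-entropy fields like pincodes can be brute-forced, which is one more reason to consult people who have worked with open data.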

Building a campaign

While sporadic tweeting or sharing sometimes helps, it is always recommended that you build a plan for talking about the leak.

  • Decide on your objectives for sharing the leak: what would you like to achieve?
  • Identify people who are working in this field and could help you amplify your voice.
  • Describe the effects of the leak and its origin as accurately as possible.
  • It might be best to leave out details of how to reproduce the leak if you think they could harm more people.

Using Screenshots

Sometimes a screenshot amplifies your report and its impact. We recommend using screenshots rather than sharing the methodology of replication in public. One has to be cautious while sharing screenshots, though: make sure you block out any personally identifiable information, or information that could cause damage to lives or property.

Talking to Press

Before talking to the press, please be clear about your intentions for doing so. Again, we recommend this only after the issues have been fixed. But sometimes it is important to talk in order to help close the issues, and we understand that. So we have put together a set of things that will make this conversation ethical and effective.

Disclosure:

  • Make sure to disclose your intent in reporting leaks.
  • Disclose any funding you have received to do this work.
  • Avoid sharing a detailed description of the leak if it has not been closed yet.
  • Make sure not to share sensitive data, either through your screenshots or otherwise.

 

List of organizations to ask for support

Online Security and Whistle Blower Protection

Open Data Questions 

Legal Support ( India ) 

ALF  

Research Methodologies 

Security Methodology and Advice

Further Reading

This guide was written by Chinmayi S K with contributions and feedback from Thejesh GN , Chris Kubeca , Amber Sinha and Nisha Thompson

This post is released under the Creative Commons Share Alike 4.0 License

If you have any feedback or comments, please feel free to write to us through this contact form and we will get back to you as soon as possible.

Survey of India Nakshe Portal

The Survey of India has launched a map sharing portal called Nakshe.  This is a great first step for the SOI who have not exactly been the most open with their maps.

“In Nakshe portal, user can see the list and meta data of all Open Series map(OSM) district wise released by Survey of India in compliance with National Map Policy – 2005. These maps are available for free download once the user login to the site using his/her Aadhar number. ”

While we applaud this initiative, we hope they make it even better and more useful to a wider population. We have submitted a letter with recommendations for the portal to the SOI; you can see the letter below.

We hope to get some feedback from people who have used the portal to get maps. We are happy to keep sending them feedback in hopes they will continue to improve the portal.

Home for All our Maps

Over the years the DataMeet community has created and cleaned lots of maps and made them available on GitHub. One of the biggest issues we had was visibility: the larger community couldn’t find the maps using Google, or couldn’t figure out how to download or use them. Basically, we lacked documentation. Happy to say we have started working on it.

The home of all the projects will be

http://projects.datameet.org/maps/

From there you will be able to find links to the others; this is the link you can use to share in general. More links below.

Most documentation has a description of the map, fields, format, license, references and a quick view of how the map looks. For example, check the Kerala village map page.

There is a little bit of work left in documenting the Municipality maps; I am working on them. Otherwise the documentation is in a usable state. Please add your comments or issues on GitHub or respond here. Each page has a link to its issues page on GitHub.

In future I will try to add some example usage, links to useful examples and tutorials, and also build our reference page.

Thanks to Medha and Ataulla for helping to document these projects.

A few days back I also wrote about Community Created Free and Open Maps of India; let me know if I have missed any projects and I will add them.

Map links

On GitHub they remain the same. We mainly have three map repos.

How to Make an Election Interactive

So I created an interactive for Wionews.com (embedded below) on the assembly elections taking place in five states. This write-up goes into how I did the interactive and the motivations behind it.


The Interactive is embedded below. Click on Start to begin.


The interactive looks at three things:

  • where each party won in the last assembly election in 2012 in each of the five states, visualised with a map.
  • where each party won in the last Lok Sabha (LS) election in 2014, if the LS seats were broken up into assembly seats. This was also done with a map.
  • the share of seats won by each major party in previous assembly elections, done with a line chart.

I got all my data from the Election commission website and the Datameet repositories, specifically the repositories with the assembly constituency shapefiles and historical assembly election results.

Now these files have a lot of information in them, but since I was making this interactive specifically for mobile screens and there wouldn’t be much space to play with, I made a decision to focus just on which party won where.

As mundane as that may seem, there are still some interesting things you get to see. For example, from the break-up of the 2014 Lok Sabha results, you find out where the Aam Aadmi Party has gained influence in Punjab since the last assembly elections in 2012, when they weren’t around.

The interactive page on the AAP in Punjab, 2014

ANALYSING THE DATA

While I got the 2012 election results directly from the election commission’s files, the breakdown of the 2014 Lok Sabha results by assembly seat needed a little more work with some data analysis in python (see code below) and manual cross-checking with other election commission files.

Some of the python code used to break down the 2014 LS results by assembly seat. You can see all of it here.
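For readers curious what that breakdown involves, the core of it is: sum candidate-level votes by assembly segment and party, then take the top party in each segment. This is a hedged sketch, not the author’s actual script (which is linked above), and the field names are assumptions:

```python
from collections import defaultdict

def winners_by_segment(rows):
    """rows: iterable of dicts with (assumed) keys 'ac_name', 'party', 'votes'.
    Returns the winning party per assembly segment."""
    # Total up votes per (assembly constituency, party) pair.
    votes = defaultdict(int)
    for r in rows:
        votes[(r["ac_name"], r["party"])] += int(r["votes"])
    # Keep the party with the highest total in each constituency.
    best = {}
    for (ac, party), v in votes.items():
        if ac not in best or v > best[ac][1]:
            best[ac] = (party, v)
    return {ac: party for ac, (party, _) in best.items()}
```

The manual cross-checking mentioned above is still needed because real election-commission files have inconsistent names and occasional gaps that a naive aggregation like this cannot catch.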

For calculating the percentages of seats won by major parties in the past, I had to do some analysis in python of Datameet’s assembly election results file.

Some of the python code used to calculate historical seat shares of parties. You can see all of it here.
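The seat-share calculation itself is simple enough to sketch: count winners per party in each election year and divide by the total seats that year. Again, this is an illustrative sketch with assumed field names, not the actual analysis code:

```python
from collections import Counter

def seat_shares(winners):
    """winners: one dict per seat won, with (assumed) keys 'year' and 'party'.
    Returns {year: {party: percentage of seats}}."""
    shares = {}
    for year in {w["year"] for w in winners}:
        # Seats won per party in this election year.
        seats = Counter(w["party"] for w in winners if w["year"] == year)
        total = sum(seats.values())
        shares[year] = {p: round(100 * n / total, 1) for p, n in seats.items()}
    return shares
```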

PUTTING IT ALL ONTO A MAP

The next thing to do was put the data of which party won where onto an assembly seat map for each state.

To get the assembly seat maps, I downloaded the assembly constituency shapefile from the datameet repository and used the software QGIS to create five separate shapefiles for each of the states. (Shapefiles are what geographers and cartographers use to make maps.)

A screenshot of the QGIS software separating the India shapefile into separate ones for the states.

The next task is to make sure the assembly constituency names in the shapefiles match the constituency names in the election results. For example, in the shapefile, one constituency in Uttar Pradesh is spelt as Bishwavnathganj while in the election results, it’s spelt as Vishwanathganj. These spellings need to be made consistent for the map to work properly.

I did this with the OpenRefine software which has a lot of inbuilt tools to detect and correct these kinds of inconsistencies.

The purist way would have been to do all this with code, but I’ve been using OpenRefine, a graphical tool, for a while now and it’s just easier for me this way. Please don’t judge me! (Using graphical tools such as OpenRefine and QGIS make it harder for others to reproduce your exact results and is less transparent, which is why purists look down on a workflow that is not entirely in code.)
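For those in the purist camp, a rough code equivalent of what OpenRefine’s clustering does here can be written with Python’s standard library: fuzzy-match each shapefile name against the election-result names. The example spellings come from the post; the `cutoff` value is an assumption you would tune, and the resulting mapping still needs manual review:

```python
import difflib

def match_names(shapefile_names, result_names, cutoff=0.75):
    """For each shapefile name, find the closest election-result name,
    or None if nothing scores above the similarity cutoff."""
    mapping = {}
    for name in shapefile_names:
        close = difflib.get_close_matches(name, result_names, n=1, cutoff=cutoff)
        mapping[name] = close[0] if close else None
    return mapping
```

For example, `match_names(["Bishwavnathganj"], ["Vishwanathganj"])` pairs the two spellings of the Uttar Pradesh constituency mentioned above.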

After the data was cleaned, I merged or ‘joined’ the 2012 and 2014 election results with the shapefile in QGIS. I then converted the shapefile into the geojson format, which is easier to visualise with javascript libraries such as D3.js.

I then chose the biggest three or four political parties in the 2012 assembly and 2014 LS election results for each state, and created icons for them using the tool Inkscape. This can be done by tracing the party symbols available in various election commission documents.

Some of the party icons designed for the interactive

HOW IT’S ALL VISUALISED

The way the interactive works is: if you click on the icon for a party, it downloads the geojson file which, to put it crudely, has the boundaries of the assembly seats and the name of the party that won each seat.

The interactive map showing the NPF in Manipur in 2014

You then get a map with the seats belonging to that party coloured in yellow. And each time you click on a different party icon, a new map is generated. (If I’ve understood the process wrong, do let me know in the comments!)
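Structurally, that geojson is just a FeatureCollection whose per-feature properties name the winning party. This tiny hand-made example shows the shape (the winner property key mirrors the joined column that appears in the D3 code; the seat names are made up):

```python
import json

# A two-seat stand-in for the real file the interactive downloads.
doc = json.loads("""{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"AC_NAME": "Seat A",
                    "uttarakhand_2012_2012_1": "Bharatiya Janta Party"},
     "geometry": null},
    {"type": "Feature",
     "properties": {"AC_NAME": "Seat B",
                    "uttarakhand_2012_2012_1": "Indian National Congress"},
     "geometry": null}
  ]
}""")

# Winning party per seat, pulled from the feature properties.
winners = {f["properties"]["AC_NAME"]: f["properties"]["uttarakhand_2012_2012_1"]
           for f in doc["features"]}
```

In the real file each `geometry` holds the seat boundary polygons that D3 draws as paths.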

Here’s some of the d3 code used:

    map2
        .append("svg:image")  //put an image onto the canvas
        .attr("xlink:href","../exp_buttons/bharatiya_janta_party_75.png")  //take the image from the exp_buttons folder
        .attr('height', '75')
        .attr('width', '75')
        .attr('class','shadow partyButton')
        .attr('id','bjpButton')
        .attr("x", 30)             
        .attr("y", 0)    
        .on("click", function(){
            map
              .append("svg:g")         //create the map
              .style("fill","#4f504f")  //fill the map with this black color
              .selectAll("path")
              .data(json.features)
              .enter()
              .append("path")
                  .attr("d", pathx)
                  .style("stroke", "#fdd928")  //create yellow borders
                  .style("opacity","1")
                  .style("stroke-width", "1")
                  .style("fill",colorParty);      //colorparty is determined by the function below

		 //fill the seats with yellow if they were won by the “Bharatiya Janta Party”
		//and if they were won by someone else, make them black
					                
                function colorParty(d) {
                    if (d.properties.uttarakhand_2012_2012_1 == "Bharatiya Janta Party") {
                        return "#fdd928";
                    } else {
                        return "#4f504f";
                    }
                }
              });

I won’t go into the nitty gritty of how the line chart works, but essentially every time you click on one of these icons, it changes the opacity of the line representing that party to 1, making it visible, while the opacity of every other line is reduced to 0, making them invisible.

The historical performance of the MGP in Goa.

Here’s some of the relevant d3 code:

svg
	.append("svg:image")                                                             //this tells D3 to put an image onto the canvas
	.attr("xlink:href","../exp_buttons/bharatiya_janta_party_75.png")   //and this will be the bjp image located in the exp_buttons folder
	.attr('height', '75')
	.attr('width', '75')
	.attr('class','shadow partyButton')       //this is what gives a button the shadow, attributes derived from css 
	.attr('id','bjpButton')			     
	.attr("x", 0)             
	.attr("y", height + margin.top + 20)    
	.on("click", function(){
			d3.selectAll(".line:not(.bjpLine)").style("opacity", "0");  //make all other lines invisible
			d3.selectAll(".bjpLine").style("opacity", "1");                   //make the BJP line visible
			d3.select(this).classed({'shadow': false});		//remove the drop shadow from the BJP button 
											//so that people know it’s active
			d3.selectAll('.partyButton:not(#bjpButton)').classed({'shadow': true});  //this puts a drop shadow onto other buttons
													   //in case they were active
			
			});

I then put everything into a repository on Github and used Github pages to ‘serve’ the interactive to users.

Now I haven’t gone into the complexity of much of what’s been done. For example, if you see those party symbols and the tiny little shadows under them (they’re called drop shadows), it took me at least two days to make that happen.

It took two days to get these drop shadows!

MOTIVATIONS BEHIND THE INTERACTIVE

As for the design, I wanted something that people would just click/swipe through, that they wouldn’t have to scroll through, and also limit the data on display, giving only as much as someone can absorb at a glance.

My larger goal was to try and start doing data journalism that’s friendlier and more approachable than the stuff I’ve been doing in the past such as this blogpost on the Jharkhand elections.

I actually read a lot on user interface design, after which I made sure that the icons people tap on their screen are large enough for their thumbs, placed the icons in the lower half of the screen so that thumbs wouldn’t have to travel as much to tap on them, and adopted flat design with just a few drop shadows and not too many of what are called skeuomorphic effects.

Another goal was to allow readers to get to the information they’re most interested in without having to wade through paras of text by just tapping on various options.

The sets of options available to the user while in the interactive

I hacked a lot of D3.js examples on bl.ocks.org and stackoverflow.com to arrive at the final interactive. I’m still some way away from writing d3 code from scratch, but I hope to get there soon.

Because I’m not a designer, web developer, data scientist or statistician, I may have violated lots of best practices in those fields. So if you happen to come across some noobie mistake, do let me know in the comments. I’m here to learn, thanks! 🙂


Shijith Kunhitty is a data journalist at WION and former deputy editor of IndiaSpend. He is an alumnus of Washington University, St. Louis and Hindu College, Delhi.

Demonetisation with Srinivasan Ramani

Srinivasan Ramani is a Deputy National Editor who works with data at The Hindu. He has been a long-time member of the DataMeet community. This week I caught up with him to talk about the demonetisation move by the Government of India.


Show Notes

Crossposted.

DataMeet is a community of Data Science and Open Data enthusiasts.