Crowdsourcing Data: Strategies, Stories, Tools.

I was a facilitator at the Techcamp Bangalore where I introduced several strategies for crowdsourcing data to a broad group of non-profit organisations. My idea was to walk the participants through two stories that I’m personally part of – Akshara Foundation and the Humanitarian OpenStreetMap Team – by asking four key questions: what data to collect, from whom to collect it, how to collect it, and how to verify it. We had some interesting conversations at the event, but most participants were keen to learn about more tools. I put together a small slide deck and a repository for collecting tools and reading material.

Data and Maps workshop for IT for Change, Bangalore.

A few weeks back, I ran a three-day workshop for the developers at IT for Change in Bangalore. They have been quite involved in OpenStreetMap, mobilising graduate students in various towns of Karnataka to map public infrastructure. Apart from this spatial data, they have collected demographics about these towns on several aspects. The workshop was intended to give them complete coverage of geospatial technology, data and representation. The outline and code are available on GitHub.

Conversations at the Mozilla Summit 2013.

I was privileged to attend the Mozilla Summit this year in Santa Clara. Over the period of three days, I had some interesting conversations with people from different technology backgrounds, and this post is more of a self-reference for me to come back to.

Robert Kaiser and I had an interesting conversation about his maps application for FirefoxOS. His app uses just HTML Canvas to visualise the tiles on the client side and also takes care of all the interactions – no third-party libraries.
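Any client that draws OSM-style tiles itself needs the standard slippy-map tile arithmetic to figure out which tiles cover the viewport. Here is a minimal sketch of that maths in Python (this is the textbook formula, not Robert’s code):

```python
import math

def deg2tile(lat_deg, lon_deg, zoom):
    """Convert WGS84 coordinates to slippy-map tile indices (x, y)."""
    lat_rad = math.radians(lat_deg)
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# Bangalore at zoom 12: these are the tile indices a client would
# fetch and then draw onto the canvas.
print(deg2tile(12.97, 77.59, 12))
```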

I grabbed breakfast with Mark Giffin one morning and we started talking about rendering Indic languages and making them print-ready. At Akshara, we use several techniques like PhantomJS to achieve this. Mark suggested that we should consider DITA, which provides comprehensive solutions for typesetting.
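For context, the PhantomJS approach boils down to rendering an HTML page in a headless browser and printing it to PDF. A sketch of how that could be driven from Python, assuming the phantomjs binary and its bundled rasterize.js example script are on the path (an illustration, not Akshara’s actual pipeline):

```python
import subprocess

# Render an HTML report containing Indic text to a print-ready A4 PDF
# using PhantomJS's example rasterize.js script.
subprocess.run(
    ["phantomjs", "rasterize.js", "report.html", "report.pdf", "A4"],
    check=True,
)
```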

Amir Aharoni of the Wikimedia Foundation joined our discussion and suggested Firefox as part of the solution, pointing out that, from his experience working with the language team at Wikimedia, Firefox renders Indic languages very well. That’s most of what we are doing at Akshara right now, but there are local dialects which need more work.

I have known James Hughman for a couple of years now, since his visit to India for Droidcon. He joined Mozilla recently, and I was excited that I would get to see him at the summit. We spoke about the books we are reading, the new DRM policies and so much more.

Toby Elliot introduced the new location services that Mozilla is building. I had a chat with him about how we can use OpenStreetMap data and probably help improve the infrastructure. There’s a very exciting email thread going on between us right now to figure out how we can get this going.

Bill Walker was curious about the new maps project that we are doing in Congo. His brother, an archaeologist, does a lot of mapping and has been considering building platforms for collaborative mapping. We shared and talked about some of the existing systems and how we can adapt them for custom use cases.

I spoke with more people than the above, but these are definitely the conversations that will continue and probably make way for more posts!

The Reader’s Digest Great World Atlas of 1961.

I acquired the Reader’s Digest Great World Atlas of 1961, first edition, yesterday at a very old bookstore in Bangalore.

[Image: atlas cover]

It’s an amazing addition to my collection of maps. Interestingly, I couldn’t find any information about the atlas on the Internet.

The atlas was ‘planned’ under the direction of the famous geographer Frank Debenham. With the involvement of the British and Foreign Bible Society, the British Broadcasting Corporation, the FAO, the WHO, the Information Service of India and numerous other organisations and individuals, the atlas is laid out in four sections.

Paradise is somewhere in the far east. Jerusalem is the center of all nations and countries, and the world itself is a flat disk surrounded by oceans of water. So the monks, map-makers of the Middle Ages, saw the world they lived in.

Jerusalem was considered to be the center of the world, while the geographic center was first calculated in 1864, revised in 1973 by Andrew J. Woods, and again in 2003. The atlas attempts to correct these notions by collecting the sum of knowledge from the explorations and scientific discoveries of its time.

The first section, called the Face of the World, portrays some fascinating relief maps like the ones below. Structurally and geographically, they are the most detailed I have seen in any old representation.

[Images: relief maps]

The atlas employs various projections, like the Conic, Lambert’s Azimuthal Equal Area, Bonne and the Van der Grinten. The following map of the oceans is better depicted in the Van der Grinten projection.

[Image: map of the oceans in the Van der Grinten projection]
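If you want a feel for how differently these projections treat the same place, here is a small sketch using the pyproj library (my own illustration, nothing to do with the atlas; the projection parameters are arbitrary choices) that projects one point under each of them:

```python
from pyproj import Transformer

# Proj strings for (roughly) the projection families the atlas uses.
projections = {
    "Van der Grinten": "+proj=vandg +ellps=WGS84",
    "Lambert Azimuthal Equal Area": "+proj=laea +lat_0=0 +lon_0=0 +ellps=WGS84",
    "Bonne": "+proj=bonne +lat_1=45 +ellps=WGS84",
    "Equidistant Conic": "+proj=eqdc +lat_1=20 +lat_2=60 +ellps=WGS84",
}

lon, lat = 77.59, 12.97  # Bangalore
for name, proj in projections.items():
    transformer = Transformer.from_crs("EPSG:4326", proj, always_xy=True)
    x, y = transformer.transform(lon, lat)
    print(f"{name:30s} x={x:13.0f}m y={y:13.0f}m")
```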

And here is an interesting illustration of continental drift.

[Image: continental drift illustration]

There is more to the atlas than this. I hope to post more when I find time to read through this amazing record of history.


The Last 30 Days.

I can’t believe that I’m sitting in my Bangalore home writing this post after what happened in the last 30 days. I don’t have words to thank all those amazing people who took care of me over these days and brought me back bouncing. Not quite there yet, but I will be in a while. But I’m alive, for that matter.

I was between Italy and Germany from June 23 to July 9. We had an amazing time at the Info Activism Camp, and later in Berlin with Kaustubh and Rome with Tin. It was fantastic. Towards the end of the trip I was quite tired from a sunstroke and an irregular fever. On my flight back, the fever decided to press its case and did the trick: 109 degrees Fahrenheit with rigors. I arrived in Bangalore the next morning and went straight to a hospital.

From there until last week, I have been to 4 hospitals, consulted 9 doctors, and been subjected to 7 blood diagnoses, 4 different radio-imaging scans, 3 antibiotics and a lot of stress. This was no fun. Not to any extent. I’ve cried, and I’ve seen my mum crying at the same time. I was struck by an unidentifiable fever. I’ve lost weight and hair, and for whatever reasons my heart is heavy and life is rough.

It took a while to identify that I was suffering from a precursor of Enteric Fever. I’ve recovered now, though hopes weren’t too high in my mind. Time heals and patience counts.

I want to thank Rahul – for coming over to check on me while I was down in Bangalore, staying over without sleep, taking care of me and taking me to another hospital the next day. I want to thank my mum and dad; I’ll easily run out of words here. The pain I suffered is nothing compared to what they went through. My aunts and brothers – for sending me food and supporting mum whenever she was alone in the hospital. I want to thank Gautam – for taking care of everything so that I could stay away from work as long as I wanted, checking on me and sending me one of my favorite books when I was getting bored. Francesca and Ashima – for talking to me when I wanted to talk. Riju, Shashank and Ayesha – for letting me know that they miss me and that I need to be all right soon.

And thank you everyone – your prayers and wishes helped me through.

Designing a New Map Portal for Karnataka Learning Partnership.

Wrote a rather detailed post about the new maps for Karnataka Learning Partnership on the geohackers.in blog.

The map is an important part of our project, action and process because it serves as the pivot point of navigation. I will quickly talk about the data and tools before we discuss the design aspects.

We have a fairly large dataset of schools in Karnataka – the name of each school, its location, the number of girls and boys, and so on – in a database. Fortunately, the data was clean and properly stored in a PostgreSQL database with PostGIS extensions. Most of my task was to modify the API to serve GeoJSON to the client using the ST_AsGeoJSON function and export the data.
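The core of that pattern looks something like the sketch below – query PostGIS, let ST_AsGeoJSON serialise each geometry, and wrap the rows into a FeatureCollection. The table and column names here are hypothetical, not KLP’s actual schema:

```python
import json
import psycopg2

conn = psycopg2.connect("dbname=klp")
cur = conn.cursor()
# ST_AsGeoJSON turns each PostGIS geometry into a GeoJSON string.
cur.execute("""
    SELECT name, num_girls, num_boys, ST_AsGeoJSON(location)
    FROM schools
""")

features = [
    {
        "type": "Feature",
        "geometry": json.loads(geom),
        "properties": {"name": name, "girls": girls, "boys": boys},
    }
    for name, girls, boys, geom in cur.fetchall()
]

# The API endpoint would return this to the map client.
collection = {"type": "FeatureCollection", "features": features}
print(json.dumps(collection)[:200])
```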


Indic Wikipedia: Visualizing Basic Parameters.

Riju (Sumandro) and I are working with the Centre for Internet and Society to understand how the Indic Wikipedia community is growing. Today, we published the first set of visualizations and a blog post about why, what and how we did this. Cross-posting from the CIS website.

Introduction

Understanding how the Indic, or Indian language, Wikipedia projects are growing is something that we have been interested in for quite some time. We were delighted to come across this opportunity from the Centre for Internet and Society (CIS) and the Wikimedia Foundation. We divided our analyses into three focus areas: (1) basic parameters, (2) geographic patterns of edits, and (3) exploring the topics that receive the greatest number of edits. The existing infographics and data visualisations that we found about Indic Wikipedias mostly engaged with the first area, and also emphasised yearly aggregates. We thought a more granular, that is monthly, understanding and a focus on the geographic and thematic spread of the edits would be very helpful to further appreciate the activities.

We began by collecting data about the following basic parameters:

  1. Number of Editors
  2. Number of Articles
  3. Page Views
  4. Number of Active Editors
  5. Number of New Articles
  6. Number of New Editors
  7. Edit Size

Acquiring the data

We explored the MediaWiki API, ToolServer and the Wikimedia Statistics Portal. These are the main ways of obtaining data about Wikipedia in general. Depending on the use case, such as the quantity of data required or the need for customised/selective data scraping, any one or more of these methods of data gathering can be chosen.

The API had limitations in terms of how much data you can access, and it is meant to be used to access actual Wikipedia entries. We, however, were looking for metadata about the entries/articles (such as when they were first created, when and how many times they were edited, etc.) and not the actual entries/articles, that is, the actual contents of the Indic Wikipedias. ToolServer is an excellent way of running custom scripts, although this takes for granted that the ToolServer user has substantial command over the back-end infrastructure and processes that Wikipedia runs on. We wrote a few scrapers to extract metadata about Indic Wikipedia projects from the ToolServer, but not exactly being experts in the Wikipedia back-end systems, we found scraping from ToolServer rather time- and effort-intensive. The statistics portal is a well-organised and accessible place for collecting data for analyses. However, we came across several missing parameters and projects; that is, the statistics portal did not have all the parameters and Wikipedia projects we were interested in.

In our search for Indic Wikipedia datasets so far, we realised that the Wikimedia Analytics Team (WAT) puts a lot of effort into writing scripts and collecting various data at different levels. Wikimedia developer Yuvi Panda and the Access to Knowledge team at CIS, aware of our difficulty in obtaining the data, also pointed us towards the WAT. While we were already scraping data on some of the parameters, we approached the WAT, whose prompt and very supportive response much accelerated our work. The fantastic Wikimedia developers, especially Evan Rosen (a big ‘thank you’ to him), shared the needed data, which we cleaned up and archived in the GitHub repository for the project.
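To give a flavour of the API route: the MediaWiki API exposes current sitewide totals for each project, which is useful for sanity checks even though it cannot give the historical monthly series we needed. A minimal sketch (not our actual scraper):

```python
import json
import urllib.request

# A sample of Indic Wikipedia language codes.
LANGS = ["hi", "bn", "ta", "te", "kn", "ml"]

for lang in LANGS:
    url = (f"https://{lang}.wikipedia.org/w/api.php"
           "?action=query&meta=siteinfo&siprop=statistics&format=json")
    req = urllib.request.Request(url, headers={"User-Agent": "indic-stats-sketch"})
    with urllib.request.urlopen(req) as resp:
        stats = json.load(resp)["query"]["statistics"]
    # Current totals only; no per-month history from this endpoint.
    print(lang, stats["articles"], stats["users"], stats["activeusers"])
```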

We obtained data for the period from January 2001 to December 2012. It appears that the Indic Wikipedia projects began their activities around 2005. A big part of cleaning the data involved identifying when each of the projects started and dropping the data from before that point. There are 20 Indic Wikipedia projects, with 4,98,964 articles, 5,689 editors and over 3,35,49,102 readers.

Deciding upon chart types

We spent quite some time discussing different methods of visualising the data. The major difficulty is that there are too many entities to be plotted. As each language must be plotted as a separate entity — point, line, circle, etc. — the chart has a tendency to become cluttered and illegible. Even if we take only one variable — say New Editors — there will still be 20 points or lines to be plotted. Hence, using any of the conventional charts becomes difficult. For example, if we chose a line chart with New Editors on the Y-axis and months on the X-axis, there will be 20 lines each of a different colour, representing different languages. Also, the five-six year monthly timeline translates into 60-72 temporal data points.

We have adopted two strategies, and related chart types, to address this difficulty.

Firstly, we used a monthly calendar-like heatmap chart that limits the temporal spread of data to one year for each section of the chart and uses a positionally uniform set of columns for each language, so as to make reading the chart easier. Limiting each chart section to 12 months allows the user to focus on more granular movements of the variable concerned, say the number of New Editors per month. Representing each language as a fixed column, rather than as an upwards-and-downwards moving line as in a line chart, makes it easier for the user to follow movements in each language (where movement is shown by the intensity of colour, as is characteristic of heatmaps) without the need for a separate coloured entity — point, line, circle — for each language.
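The idea is easy to prototype. The sketch below uses matplotlib and synthetic numbers purely to illustrate the months-by-languages grid; the published charts were built for the browser, not with matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

languages = ["hi", "bn", "ta", "te", "kn", "ml"]
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Fake "New Editors per month" counts: one row per month, one column
# per language; colour intensity carries the value.
rng = np.random.default_rng(0)
new_editors = rng.integers(0, 120, size=(len(months), len(languages)))

fig, ax = plt.subplots()
im = ax.imshow(new_editors, aspect="auto", cmap="YlOrRd")
ax.set_xticks(range(len(languages)))
ax.set_xticklabels(languages)
ax.set_yticks(range(len(months)))
ax.set_yticklabels(months)
fig.colorbar(im, label="New Editors per month")
plt.savefig("calendar-heatmap.png")
```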

Secondly, we used a motion chart, as made famous by Dr. Hans Rosling, which removes time from the X- and Y-axes of the chart and uses animated transitions to represent temporal change. The motion chart has the unique ability to handle as many as five variables in an organised manner, using the following visual elements: X-axis, Y-axis, Z-axis (animated temporal transitions), size of bubbles, and colour of bubbles. It is, however, recommended that the represented variables be limited to a maximum of four for easier legibility. In our case, we have used the X- and Y-axes to plot various related variables (which can be selected by the user) such as New Editors and New Articles, the Z-axis to represent time, and the colour of the bubbles to represent a third optional variable (also selectable by the user). Since the different Indian language Wikipedia projects often take a wide range of values for most variables, using the size of the bubble to represent any of those variables is best avoided. Further, the motion chart gives the user a lot of control to explore the various projects and variables according to their interest, and especially to compare particular projects and variables to each other.

Discussing the chart types with the Access to Knowledge team, we decided to use simpler line charts — emphasising single Indic Wikipedia projects — on the language-specific pages that we will be creating next.

Calendar charts

[Image: calendar chart]

We visualised three parameters using the calendar heatmap strategy: (1) New Articles, (2) New Editors, (3) Active Editors.

The New Articles calendar shows new articles posted on each Indic Wikipedia for every month since 2004. It was interesting to note the low number of new articles in 2012 for all the languages. The first language to post a large number of new articles was Bengali; Hindi picked up around the same time with fewer articles. Except for Urdu and Nepali, every other language has dropped in the number of new articles. However, we should remember that a lower number of new articles does not necessarily indicate low overall activity in the project concerned.

Like the new articles, we wanted to explore the patterns in the number of new editors across all of the Indic Wikipedia projects. As you run through the New Editors calendar chart, it is evident that there is consistent growth in the editor base for a few projects like Hindi, Marathi, Bengali, Telugu, Tamil, Kannada and Malayalam. If one steps back and compares this with the New Articles chart, something stands out: in some of the projects there is growth in the number of editors, but not many new articles are posted. We are very keen to understand why this has happened.

Looking at the Active Editors calendar, Tamil started with 2 active editors in January 2004 and, with a few ups and downs, grew to about 115 active editors in December 2012. Malayalam started slow in late 2004 with 2 editors and grew to 155 active editors in December 2012. We are sure viewers will be able to find more patterns by studying the charts closely and comparatively.

Motion chart

[Image: motion chart]

We developed a motion chart comparing five variables: (1) Active Editors (> 5 edits per month), (2) New Editors, (3) Total Editors, (4) New Articles, and (5) Total Articles. When the visualisation is opened, Total Editors is plotted on the X-axis, Total Articles on the Y-axis, the colour of the bubbles indicates Active Editors (blue is low and red is high), and the sizes of the bubbles are kept the same for easier comparison.

The user can click on the drop-down menus at the X- and Y-axes, and next to the size and colour variables, to make them represent different variables.

We chose to configure the X- and Y-axes to show the data in logarithmic scales and not in linear scales. Since most projects experience small increments over time and there exists a wide difference between the most and the least popular/active projects, the logarithmic scale is better suited to represent the changes in the given data. The user has the option to select linear scale at the end of both X- and Y-axes (click on “Log”).

As evident in the visualisation, the Newari project and the Hindi-Malayalam project cluster show very interesting contrasting dynamics — while both achieve similar Total Articles numbers, the latter is much more editor-heavy. This suggests a smaller but more active editor community for the Newari project.

Please click on the image of the motion chart above to open the interactive version in a separate window. The code can be accessed at the project repository on GitHub.


Data and Maps workshop at OpenDataCamp Bangalore.

Kaustubh and I ran a six-hour workshop on Data and Maps at OpenDataCamp Bangalore 2013. Find everything related to the workshop here.

The fight for Guerilla Open Access

Yesterday, we met with the sad demise of one of the most brilliant Internet activists, Aaron Swartz. I cried and spent the whole of last night reading what I could find online about Aaron. He was a hero. And when I look at the Remember Aaron Swartz website, my heart sinks. I came across this piece by Aaron, written in 2008 while he was in Italy, calling everyone to fight for Open Access:

Information is power. But like all power, there are those who want to keep it for themselves. The world’s entire scientific and cultural heritage, published over centuries in books and journals, is increasingly being digitized and locked up by a handful of private corporations. Want to read the papers featuring the most famous results of the sciences? You’ll need to send enormous amounts to publishers like Reed Elsevier.

There are those struggling to change this. The Open Access Movement has fought valiantly to ensure that scientists do not sign their copyrights away but instead ensure their work is published on the Internet, under terms that allow anyone to access it. But even under the best scenarios, their work will only apply to things published in the future. Everything up until now will have been lost.

That is too high a price to pay. Forcing academics to pay money to read the work of their colleagues? Scanning entire libraries but only allowing the folks at Google to read them? Providing scientific articles to those at elite universities in the First World, but not to children in the Global South? It’s outrageous and unacceptable.

“I agree,” many say, “but what can we do? The companies hold the copyrights, they make enormous amounts of money by charging for access, and it’s perfectly legal — there’s nothing we can do to stop them.” But there is something we can do, something that’s already being done: we can fight back.

Those with access to these resources — students, librarians, scientists — you have been given a privilege. You get to feed at this banquet of knowledge while the rest of the world is locked out. But you need not — indeed, morally, you cannot — keep this privilege for yourselves. You have a duty to share it with the world. And you have: trading passwords with colleagues, filling download requests for friends.

Meanwhile, those who have been locked out are not standing idly by. You have been sneaking through holes and climbing over fences, liberating the information locked up by the publishers and sharing them with your friends.

But all of this action goes on in the dark, hidden underground. It’s called stealing or piracy, as if sharing a wealth of knowledge were the moral equivalent of plundering a ship and murdering its crew. But sharing isn’t immoral — it’s a moral imperative. Only those blinded by greed would refuse to let a friend make a copy.

Large corporations, of course, are blinded by greed. The laws under which they operate require it — their shareholders would revolt at anything less. And the politicians they have bought off back them, passing laws giving them the exclusive power to decide who can make copies.

There is no justice in following unjust laws. It’s time to come into the light and, in the grand tradition of civil disobedience, declare our opposition to this private theft of public culture.

We need to take information, wherever it is stored, make our copies and share them with the world. We need to take stuff that’s out of copyright and add it to the archive. We need to buy secret databases and put them on the Web. We need to download scientific journals and upload them to file sharing networks. We need to fight for Guerilla Open Access.

With enough of us, around the world, we’ll not just send a strong message opposing the privatization of knowledge — we’ll make it a thing of the past. Will you join us?

Aaron Swartz
July 2008, Eremo, Italy

Among other things, I write code and talk about Open Data. I dream of an open world, where we don’t let politicians and bureaucrats lock up information. Using open technologies is a way of getting there. Learn them. Advocate them. Contribute back to them, for a better world. We don’t have Aaron to light the way anymore, but I’m sure his thoughts and ideas, like the one above, will always guide us when it matters.

Visualizing SSLC results over seven years.

I spent most of the last month building a dashboard for the SSLC results at the Karnataka Learning Partnership. We released it in beta yesterday, and here’s the blog post I wrote detailing how we went about it.

Patterns in examination results are something we are always interested in at the Karnataka Learning Partnership. After the design jam in June 2012 – where we tried to understand the SSLC data, its content and structure, and visualized the performance of Government and Private schools in contrast to each other – we decided to take a step deeper and find patterns from the past seven years. The results of this effort are what you find here, in beta.

The Karnataka Secondary Education Examination Board shared the data for the last seven years as a combination of several Microsoft Access databases. It came with very little metadata, and Megha did all the hard work of making sense of it and pulling it into a PostgreSQL database. Inconsistencies are everywhere, and the quality always depends on how you handle each exception in isolation. Among other things, we decided to look at three aspects of the data to begin with – performance of Government and Private schools, performance in Mathematics, Kannada and English, and performance of each gender. All three across seven years (from 2004-2005 to 2010-2011) for each district in Karnataka.

One of the important pieces of data wrangling that we did this time was to aggregate the data at the district level. The raw data came at the educational district level and, unfortunately, we did not have geographic boundary shapes for this classification. What we had instead were the geographic boundaries at the political level. We massaged the shapefiles, geocoded the data and converted it to GeoJSON in QGIS. We wrote a bunch of Python scripts to perform the aggregation and generate the JSON (JavaScript Object Notation) required for the visualization. Every bit of code that we wrote for this project is on GitHub.
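The aggregation step was conceptually simple. Here is a stripped-down sketch of it; the field names and the educational-to-political district mapping below are made up for illustration, and the real scripts live in the GitHub repository:

```python
import csv
import json
from collections import defaultdict

# Hypothetical mapping from educational districts to political districts.
ED_TO_DISTRICT = {"Madhugiri": "Tumkur", "Chikkodi": "Belgaum"}

totals = defaultdict(lambda: {"appeared": 0, "passed": 0})

with open("sslc_results.csv") as f:  # hypothetical per-school input
    for row in csv.DictReader(f):
        district = ED_TO_DISTRICT.get(row["edu_district"], row["edu_district"])
        totals[district]["appeared"] += int(row["appeared"])
        totals[district]["passed"] += int(row["passed"])

# Emit the per-district JSON the visualization consumes.
out = {d: dict(v, pass_pct=round(100.0 * v["passed"] / v["appeared"], 2))
       for d, v in totals.items()}
print(json.dumps(out, indent=2))
```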

A dashboard of this sort is something we had never attempted, and honestly it took a while for us to get around it. I had tried D3.js sometime last year and found it amazing. D3 stands for Data-Driven Documents; it is a brilliant JavaScript library for making infographics on the web, driven completely by the data. What makes D3.js awesome for me is that everything is SVG (Scalable Vector Graphics), and there are barely any limits to the representation and interaction that you can bring into the browser with it. I’ve had good experiences with Twitter’s Bootstrap for quickly designing and staying consistent on page layout and aesthetics. There are some issues when you work with D3.js and Bootstrap together, especially around the way Bootstrap manages events. The best approach is to trust D3.js for interaction and use Bootstrap’s scaffolding and layout features.

We found a few interesting facts from this exercise. As you may guess, private schools perform better than government schools consistently. Western districts like Udupi, Uttar Kannada and Belgaum perform better than the rest of the state. North Karnataka, especially Bidar, performs terribly across the last seven years – something we are very curious to understand. Bangalore Rural performs better than Bangalore Urban, and government schools do much better, and are more comparable to private schools, in Bangalore Rural than in Bangalore Urban. Private schools lead in all three subjects across the last seven years. Girls perform way better than boys in private schools, consistently across seven years in every district. Boys do better in Bangalore Urban, while girls dominate in Bangalore Rural.

This research will continue as we churn out a few more aspects of the data while the dashboard gets out of beta.