Indic Wikipedia: Visualizing Basic Parameters.

Riju (Sumandro) and I are working with the Centre for Internet and Society to understand how the Indic Wikipedia community is growing. Today, we published the first set of visualizations and a blog post about why, what and how we did this. This post is cross-posted from the CIS website.

Introduction

Understanding how the Indic, or Indian language, Wikipedia projects are growing is something that we have been interested in for quite some time. We were delighted to come across this opportunity from the Centre for Internet and Society (CIS) and the Wikimedia Foundation. We divided our analyses into three focus areas: (1) basic parameters, (2) geographic patterns of edits, and (3) the topics that receive the greatest number of edits. The existing infographics and data visualisations that we found about Indic Wikipedias mostly engaged with the first area, and emphasised yearly aggregates. We thought that a more granular (that is, monthly) understanding, together with a focus on the geographic and thematic spread of the edits, would be very helpful for appreciating the activities further.

We began by collecting data about the following basic parameters:

  1. Number of Editors
  2. Number of Articles
  3. Page Views
  4. Number of Active Editors
  5. Number of New Articles
  6. Number of New Editors
  7. Edit Size

Acquiring the data

We explored the MediaWiki API, ToolServer and the Wikimedia Statistics Portal. These are some of the several ways of obtaining data about Wikipedia in general. Depending on the use case, such as the quantity of data required or the need for customised or selective data scraping, one or more of these methods can be chosen.

The API limits how much data you can access, and it is meant for accessing actual Wikipedia entries. We, however, were looking for metadata about the articles (such as when an article was first created, and when and how many times it was edited) and not the actual contents of the Indic Wikipedias.

ToolServer is an excellent way of running custom scripts. It does, however, take for granted that the user has substantial command over the back-end infrastructure and processes that Wikipedia runs on. We wrote a few scrapers to extract metadata about Indic Wikipedia projects from ToolServer but, not being experts in the Wikipedia back-end systems, we found scraping from ToolServer rather time- and effort-intensive.

The statistics portal is a well-organised and accessible place for collecting data for analyses. However, it did not have all the parameters and Wikipedia projects we were interested in.

In our search for Indic Wikipedia datasets, we realised that the Wikimedia Analytics Team (WAT) puts a lot of effort into writing scripts and collecting various data at different levels. Wikimedia developer Yuvi Panda and the Access to Knowledge team at CIS, aware of our difficulty in obtaining the data, also pointed us towards the WAT. While we were already scraping data on some of the parameters, we approached the WAT, whose prompt and very supportive response greatly accelerated our work. The fantastic Wikimedia developers, especially Evan Rosen (a big ‘thank you’ to him), shared the needed data, which we cleaned up and archived at the Github repository for the project.

We obtained data for the period from January 2001 to December 2012. It appears that the Indic Wikipedia projects began their activities around 2005. A big part of cleaning the data involved identifying when each of the projects started and dropping the data from before that point. There are 20 Indic Wikipedia projects with 4,98,964 articles, 5,689 editors and over 3,35,49,102 readers.
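
The trimming step can be pictured with a small Python sketch. This is only an illustration and not the project's actual cleaning script: the function name `trim_before_start`, the placeholder project codes and the numbers are all made up, and it assumes that a project "starts" at its first month with a non-zero value.

```python
from typing import Dict, List, Tuple

# Hypothetical monthly series: (month, value) pairs for one project.
MonthlySeries = List[Tuple[str, int]]


def trim_before_start(series: MonthlySeries) -> MonthlySeries:
    """Drop the leading months that have no activity (value == 0).

    Assumption: a project "starts" at its first non-zero month.
    Zero months after the start are kept, since they are real data.
    """
    for index, (_, value) in enumerate(series):
        if value > 0:
            return series[index:]
    return []  # the project never recorded any activity


# Made-up numbers for two placeholder projects, for illustration only.
raw: Dict[str, MonthlySeries] = {
    "xx": [("2004-11", 0), ("2004-12", 0), ("2005-01", 3),
           ("2005-02", 0), ("2005-03", 7)],
    "yy": [("2004-11", 0), ("2004-12", 0)],
}

cleaned = {lang: trim_before_start(series) for lang, series in raw.items()}
for lang, series in cleaned.items():
    print(lang, series)
```

Only the empty months before the first activity are dropped; a quiet month in the middle of a project's life stays in the data.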

Deciding upon chart types

We spent quite some time discussing different methods of visualising the data. The major difficulty is that there are too many entities to be plotted. As each language must be plotted as a separate entity (point, line, circle, etc.), the chart tends to become cluttered and illegible. Even if we take only one variable, say New Editors, there will still be 20 points or lines to be plotted. Hence, using any of the conventional charts becomes difficult. For example, if we choose a line chart with New Editors on the Y-axis and months on the X-axis, there will be 20 lines, each of a different colour, representing the different languages. Also, the five-to-six-year monthly timeline translates into 60-72 temporal data points.

We have adopted two strategies, and related chart types, to address this difficulty.

Firstly, we used a monthly calendar-like heatmap chart that limits the temporal spread of data to one year for each section of the chart and uses a positionally uniform set of columns for each language, so as to make reading the chart easier. Limiting each chart section to 12 months allows the user to focus on more granular movements of the variable concerned, say the number of New Editors per month. Representing each language in its own column, and not by an upwards-and-downwards moving line as in a line chart, makes it easier for the user to follow movements in each language (where movement is shown by the intensity of colour, as is characteristic of heatmaps) without the need for a separately coloured entity (point, line, circle) for each language.
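
The layout of such a calendar heatmap can be sketched in a few lines of Python. This is a rough illustration of the idea (one grid per year, one row per month, one fixed column per language) and not the code behind our charts: the function name `calendar_grids`, the three-language subset and the numbers are assumptions made for the example.

```python
from collections import defaultdict

# Illustrative subset of languages; the order fixes each language's column.
LANGUAGES = ["bn", "hi", "ta"]

# Made-up records: (language, "YYYY-MM", new editors in that month).
records = [
    ("bn", "2011-01", 4), ("hi", "2011-01", 9), ("ta", "2011-02", 6),
    ("bn", "2012-01", 7), ("hi", "2012-03", 12),
]


def calendar_grids(rows, languages):
    """Return {year: grid}, where each grid has 12 rows (months) and one
    column per language. A cell is None when there is no data for it."""
    column = {lang: i for i, lang in enumerate(languages)}
    grids = defaultdict(lambda: [[None] * len(languages) for _ in range(12)])
    for lang, month, value in rows:
        year, month_number = month.split("-")
        grids[int(year)][int(month_number) - 1][column[lang]] = value
    return dict(grids)


grids = calendar_grids(records, LANGUAGES)
for year in sorted(grids):
    print(year, " ".join(LANGUAGES))
    for month_row in grids[year]:
        print("    ", " ".join("." if v is None else str(v) for v in month_row))
```

In a real heatmap each cell value would be mapped to a colour intensity; here the numbers are simply printed so that the fixed position of each language is visible.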

Secondly, we used a motion chart, as made famous by Dr. Hans Rosling, that removes the temporal axis from the X- and Y-axes of the chart and uses animated transitions to represent temporal change. The motion chart has the unique ability to handle as many as five variables in an organised manner, using the following visual elements: X-axis, Y-axis, Z-axis (animated temporal transitions), size of bubbles, and colour of bubbles. It is, however, recommended that the represented variables be limited to a maximum of four for easier legibility. In our case, we have used the X- and Y-axes to plot various related variables (which can be selected by the user) such as New Editors and New Articles, the Z-axis to represent time, and the colour of the bubbles to represent a third, optional variable (which can also be selected by the user). Since different Indian language Wikipedia projects often take a wide range of values for most variables, using the size of the bubble to represent any of those variables is best avoided. Further, the motion chart gives the user a lot of controls to explore the various projects and variables according to their interest, and especially to compare particular projects and variables with each other.

Discussing the chart types with the Access to Knowledge team, we decided to use simpler line charts, each emphasising a single Indic Wikipedia project, on the language-specific pages that we will be creating next.

Calendar charts

We visualised three parameters using the calendar heatmap strategy: (1) New Articles, (2) New Editors, (3) Active Editors.

The New Articles Calendar shows the new articles posted on each Indic Wikipedia for every month since 2004. It was interesting to note the small number of new articles in 2012 across all the languages. Bengali is the first language to record the highest number of new articles. Hindi picks up around the same time, with fewer articles. Every language except Urdu and Nepali saw a drop in the number of new articles. However, we should remember that a lower number of new articles does not necessarily indicate low overall activity in the project concerned.

As with new articles, we wanted to explore the patterns in the number of new editors across all of the Indic Wikipedia projects. As you run through the new editors calendar chart, it is evident that there is consistent growth in the editor base for a few projects such as Hindi, Marathi, Bengali, Telugu, Tamil, Kannada and Malayalam. If one takes a step back and compares this with the new articles chart, something puzzling emerges: in some of the projects, the number of editors has grown but not many new articles have been posted. We are very keen to understand why this has happened.

If we look at the active editors calendar, Tamil started with 2 active editors in January 2004 and, with a few ups and downs, grew to about 115 active editors in December 2012. Malayalam started slowly in late 2004 with 2 editors and grew to 155 active editors in December 2012. We are sure that viewers will find more patterns by studying the charts closely and comparatively.

Motion chart

We developed a motion chart comparing five variables: (1) Active Editors (> 5 edits per month), (2) New Editors, (3) Total Editors, (4) New Articles, and (5) Total Articles. When the visualisation is opened, Total Editors is plotted on the X-axis, Total Articles is plotted on the Y-axis, the colour of the bubbles indicates Active Editors (blue is low and red is high), and the sizes of the bubbles are kept the same for easier comparison.

The user can click on the drop-down menus at the X- and Y-axes, and next to the size and colour variables, to make them represent different variables.

We chose to configure the X- and Y-axes to show the data in logarithmic scales and not in linear scales. Since most projects experience small increments over time and there exists a wide difference between the most and the least popular/active projects, the logarithmic scale is better suited to represent the changes in the given data. The user has the option to select linear scale at the end of both X- and Y-axes (click on “Log”).
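
To see why the logarithmic scale helps, here is a small Python sketch. It is only an illustration with made-up numbers, not the code behind the motion chart: the function name `axis_position`, the three placeholder projects and the axis range of 100 to 100,000 are assumptions chosen for the example.

```python
import math

# Made-up article counts for three placeholder projects.
totals = {"small": 150, "medium": 12_000, "large": 100_000}


def axis_position(value, lowest, highest, log_scale):
    """Position of `value` on an axis that runs from 0.0 (at `lowest`)
    to 1.0 (at `highest`), on either a linear or a base-10 log scale."""
    if log_scale:
        value, lowest, highest = (math.log10(v) for v in (value, lowest, highest))
    return (value - lowest) / (highest - lowest)


lowest, highest = 100, 100_000  # assumed axis range
for name, total in totals.items():
    linear = axis_position(total, lowest, highest, log_scale=False)
    logarithmic = axis_position(total, lowest, highest, log_scale=True)
    print(f"{name:>6}: linear {linear:.3f}, log {logarithmic:.3f}")
```

On the linear axis the smaller projects are squeezed against the origin, while on the logarithmic axis they are spread out, so their month-to-month movements remain visible next to the largest projects.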

As evident in the visualisation, the Newari project and the Hindi-Malayalam project cluster show very interesting contrasting dynamics — while both achieve similar Total Articles numbers, the latter is much more editor-heavy. This suggests a smaller but more active editor community for the Newari project.

Please click on the image of the motion chart below to open the interactive version in a separate window. The code can be accessed at the project repository on Github.

 

The fight for Guerilla Open Access

Yesterday, we learned of the sad demise of one of the most brilliant Internet activists, Aaron Swartz. I cried, and spent the whole of last night reading everything about Aaron that I could find online. He was a hero. And when I look at the Remember Aaron Swartz website, my heart sinks. I came across this piece, written by Aaron in 2008 while he was in Italy, calling on everyone to fight for Open Access:

Information is power. But like all power, there are those who want to keep it for themselves. The world’s entire scientific and cultural heritage, published over centuries in books and journals, is increasingly being digitized and locked up by a handful of private corporations. Want to read the papers featuring the most famous results of the sciences? You’ll need to send enormous amounts to publishers like Reed Elsevier.

There are those struggling to change this. The Open Access Movement has fought valiantly to ensure that scientists do not sign their copyrights away but instead ensure their work is published on the Internet, under terms that allow anyone to access it. But even under the best scenarios, their work will only apply to things published in the future. Everything up until now will have been lost.

That is too high a price to pay. Forcing academics to pay money to read the work of their colleagues? Scanning entire libraries but only allowing the folks at Google to read them? Providing scientific articles to those at elite universities in the First World, but not to children in the Global South? It’s outrageous and unacceptable.

“I agree,” many say, “but what can we do? The companies hold the copyrights, they make enormous amounts of money by charging for access, and it’s perfectly legal — there’s nothing we can do to stop them.” But there is something we can, something that’s already being done: we can fight back.

Those with access to these resources — students, librarians, scientists — you have been given a privilege. You get to feed at this banquet of knowledge while the rest of the world is locked out. But you need not — indeed, morally, you cannot — keep this privilege for yourselves. You have a duty to share it with the world. And you have: trading passwords with colleagues, filling download requests for friends.

Meanwhile, those who have been locked out are not standing idly by. You have been sneaking through holes and climbing over fences, liberating the information locked up by the publishers and sharing them with your friends.

But all of this action goes on in the dark, hidden underground. It’s called stealing or piracy, as if sharing a wealth of knowledge were the moral equivalent of plundering a ship and murdering its crew. But sharing isn’t immoral — it’s a moral imperative. Only those blinded by greed would refuse to let a friend make a copy.

Large corporations, of course, are blinded by greed. The laws under which they operate require it — their shareholders would revolt at anything less. And the politicians they have bought off back them, passing laws giving them the exclusive power to decide who can make copies.

There is no justice in following unjust laws. It’s time to come into the light and, in the grand tradition of civil disobedience, declare our opposition to this private theft of public culture.

We need to take information, wherever it is stored, make our copies and share them with the world. We need to take stuff that’s out of copyright and add it to the archive. We need to buy secret databases and put them on the Web. We need to download scientific journals and upload them to file sharing networks. We need to fight for Guerilla Open Access.

With enough of us, around the world, we’ll not just send a strong message opposing the privatization of knowledge — we’ll make it a thing of the past. Will you join us?

Aaron Swartz
July 2008, Eremo, Italy

Among other things, I write code and talk about Open Data. I dream of an open world, where we don’t let politicians and bureaucrats lock up information. Using open technologies is a way of doing this. Learn it. Advocate it. Contribute back to it, for a better world. We no longer have Aaron to light the way, but I’m sure his thoughts and ideas, like the ones above, will always stand by us when we need them.

ICS Ubuntu, Simplified.

Our campus has good WiFi Internet access. Unfortunately, one of the Ubuntu Lucid installations, on a Dell Inspiron, did not detect the WiFi controller. The last resort was to get a wired Ethernet connection to the Internet in order to activate the driver. We asked Google and found what is perhaps the best guide to ICS (Internet Connection Sharing) with Ubuntu here.

I will briefly go through the steps we took so that anybody can get this done quickly.

Our situation is different from the one described in the above document:

Server connected to Internet via wlan0.

Client connected to server via eth0.

Make sure that no changes have been made to the network configuration since both computers were turned on. If there have been any, restart both of them.

Server side configuration.

  • Connect the computer to the wireless network (wlan0) and make sure Internet access is available.

Start a terminal and do the following to configure your network card and NAT.

  • sudo ifconfig eth0 192.168.0.1
  • sudo iptables -A FORWARD -i eth0 -o wlan0 -s 192.168.0.0/24 -m conntrack --ctstate NEW -j ACCEPT
  • sudo iptables -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
  • sudo iptables -A POSTROUTING -t nat -j MASQUERADE

These settings will be cleared when you reboot the system. If you want to set it up permanently then:

  • sudo iptables-save | sudo tee /etc/iptables.sav

Edit /etc/rc.local and add the following lines before the “exit 0” line:

  • iptables-restore < /etc/iptables.sav

Enable routing:

  • sudo sh -c "echo 1 > /proc/sys/net/ipv4/ip_forward"

Edit /etc/sysctl.conf and add these lines:

  • net.ipv4.conf.default.forwarding=1
  • net.ipv4.conf.all.forwarding=1

Client setup:

Disable networking

  • sudo /etc/init.d/networking stop

Give static IP address.

  • sudo ifconfig eth0 192.168.0.100

Configure routing.

  • sudo route add default gw 192.168.0.1

Configure DNS Servers.

  • Open /etc/resolv.conf on the server and copy its contents into /etc/resolv.conf on the client.

Open /etc/dhcp3/dhclient.conf and add:

  • prepend domain-name-servers 208.67.222.222,208.67.220.220;

Restart Networking:

  • sudo /etc/init.d/networking restart

You are good to go!

Note: If you are behind a proxy, make sure to apply the proxy settings on the client as well.
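
If the client still cannot reach the Internet, it is worth double-checking that the addresses used in the steps above agree with each other. The following Python sketch does that with the standard `ipaddress` module. It is just a sanity check on the numbers and does not touch the network configuration; the function name `check_addresses` is made up for this example.

```python
import ipaddress

# Addresses taken from the steps above.
lan = ipaddress.ip_network("192.168.0.0/24")    # source range in the FORWARD rule
server = ipaddress.ip_address("192.168.0.1")    # eth0 on the server, the client's gateway
client = ipaddress.ip_address("192.168.0.100")  # eth0 on the client


def check_addresses(network, gateway, host):
    """Return a list of problems; an empty list means the addresses agree."""
    problems = []
    if gateway not in network:
        problems.append(f"gateway {gateway} is outside {network}")
    if host not in network:
        problems.append(f"client {host} is outside {network}")
    if gateway == host:
        problems.append("client and gateway share the same address")
    return problems


print(check_addresses(lan, server, client) or "addresses look consistent")
```

If you pick a different address range for eth0, change all three values together so that the client, the gateway and the iptables rule stay in the same subnet.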