Saturday, December 10, 2016

#Wikidata - Sembiyan Mahadevi - is it a title or is she a queen?

Queen Sembiyan Mahadevi was the spouse of Gandaraditya; her son was Uttama Chola. Many of the Chola queens who followed her used "Sembiyan Mahadevi" as a title. This is what the English article tells us.

To really accept that it was a title, a source would help. It would be cool to have a list of all the people who used the title, and it would be good to separate the person from the title in separate articles. The Tamil article seems to be more substantial, but I do not read Tamil and Google Translate does not help me sufficiently to understand what it says.

Queen Sembiyan Mahadevi matters not only because she is important in the Chola dynasty but also because of the relevance she has in Tamil culture. Her father was a Mazhavarayar chieftain but Wikipedia does not know about them. 

When Wikidata knows about Indian nobility, its dates and connections, it becomes a helpful resource. Once her father has a name and it is clear what is meant by a "Mazhavarayar chieftain", slowly but surely it becomes clear who ruled where and who were contemporaries. It would be cool if Wikidata allowed for a query that shows a "monarch" together with fellow monarchs in neighbouring countries.
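The kind of query I have in mind can be sketched in a few lines of Python. This is only a rough sketch under assumptions: the `Reign` record, the `contemporaries` function, the table of neighbouring realms and every name and year in it are made up for illustration, and none of it is actual Wikidata data or an existing Wikidata feature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reign:
    monarch: str
    realm: str
    start: int  # first year of the reign
    end: int    # last year of the reign


def contemporaries(target, reigns, neighbours):
    """Return the reigns in neighbouring realms that overlap the target reign."""
    nearby = neighbours.get(target.realm, set())
    return [
        r for r in reigns
        if r.realm in nearby
        # Two periods overlap when each one starts before the other ends.
        and r.start <= target.end and target.start <= r.end
    ]


# Hypothetical placeholder data, not historical facts.
reigns = [
    Reign("Monarch A", "Realm X", 950, 970),
    Reign("Monarch B", "Realm Y", 960, 985),
    Reign("Monarch C", "Realm Y", 986, 1000),
    Reign("Monarch D", "Realm Z", 940, 955),
]
neighbours = {"Realm X": {"Realm Y"}}

for r in contemporaries(reigns[0], reigns, neighbours):
    print(r.monarch, r.realm, r.start, r.end)
```

The sketch only works when reigns have dates and realms have known neighbours, which is exactly the data that is still missing.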

Thursday, December 08, 2016

Was Cezhiyan Cendan a Pandyan king?

There is no way for me to find out if Cezhiyan Cendan was a Pandyan king or not. The only source I can find is a blog saying so. The problem is that texts in Wikipedia make me doubt. The text in the article for Maravarman Avani Culamani states that he is succeeded by his son Jayantavarman.

One fun fact is that templates do not have sources. They are, however, what I base information on when I add it to Wikidata. The other interesting point is that the dates given overlap to the extent that they are not reliable.
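Spotting such overlaps is something a script could do. What follows is a minimal sketch, assuming a list of (name, start year, end year) entries; the function name `overlapping_successions` and the kings and years below are hypothetical placeholders, not the dates from the template.

```python
def overlapping_successions(reigns):
    """Return pairs of consecutive rulers whose reigns overlap.

    `reigns` is a list of (name, start_year, end_year) tuples. A successor
    who starts in the same year the predecessor ends is not flagged.
    """
    ordered = sorted(reigns, key=lambda r: r[1])  # order by start year
    return [
        (a[0], b[0])
        for a, b in zip(ordered, ordered[1:])
        if b[1] < a[2]  # successor starts before predecessor ends
    ]


# Hypothetical placeholder data.
template_dates = [
    ("King A", 590, 620),
    ("King B", 610, 645),
    ("King C", 645, 670),
]

for earlier, later in overlapping_successions(template_dates):
    print(f"{later} starts before {earlier} ends")
```

A list like this does not say which date is wrong, only where somebody who knows the subject should look.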

So this is where we get into a problem. When information is good enough for a Wikipedia, is it good enough for Wikidata? More important is the question: how do we curate information like this in a way that helps us all?

Wednesday, December 07, 2016

A Pandya King did not rule #India

The Pandyan Kingdom existed for some fourteen centuries. For many of its kings not much is known; a template contains much of what is known about them, which is not much.

Arguably, having this information in Wikidata serves a purpose. The information can be curated by people who know about the Pandyan kings and there are several things that they could do.
  • Some of the names of kings seem to be incorrect, certainly inconsistent.
  • The names of these kings can be added in the original language.
  • Dates may be added for the period each of these kings reigned.
  • The data can be used in one of the other Wikipedias that are relevant in India.
One funny fact is that for all these kings it is impossible to have been a citizen of India. They were citizens of the Pandyan kingdom. Many such facts were added by bot, and they reflect factoids that exist in Wikipedias. It is just wrong.

Tuesday, December 06, 2016

#Research to help #Wikipedia do better

It is one thing to bemoan everything that is problematic with research; it is another to do better. For research on Wikipedia to be published, it has to be about "English" OR it has to be linked to English OR publication is not the end goal.

At the Dutch Wikimedia Conference Professor de Rijke gave the keynote speech. He spoke about the kind of research he is into and he spoke about "Wikipedia" research performed at the University of Amsterdam. He challenged his audience to cooperate and his challenge resulted in me formulating ten proposals for research. The point of these proposals is that I hope they provide more worthwhile insight while including a link to “English” so that they can be published.
  1. Previous research studied how long it took for a subject to appear in English Wikipedia after it was first mentioned in the news / social media. The new question would be: how long does it take for the same subject to appear in any Wikipedia, how long does it take and to what extent does it happen for those articles to get corresponding articles in other Wikipedias, and how long does it take for the English Wikipedia to take notice?
  2. In the search engine for Wikidata we use the description to help differentiate between homonyms. There are two approaches to a description. Many existing descriptions are not helpful and hardly any items have descriptions in all of the 280 languages. There are however automatically generated descriptions. The question is: what do people like more, the automated descriptions or the existing descriptions? Is there a real difference for people who use Wikidata in English as well?
  3. Many people know their languages; this is obviously true for readers of Wikipedia. For the regulars there is a “Babel” template that allows them to indicate what languages they know. For the others, geo-location is used for some purposes to make a guess. Do people find it useful to have it indicated in search requests that articles exist in the languages they know? Does it make a difference that a quality indicator is set for those other texts on the same subject?
  4. Many people make spelling errors when they search for a subject or when they create a wiki link to another subject. Google famously suggests what people may be looking for. We can expand the search and include items from Wikidata (40% increase in reach) but we can also use Google or any other search engine to help people get to the sum of all knowledge. We can ask people to answer some questions after they are done. Are people willing to do this and how does it expand the range of subjects that we know about? Are people willing to curate this information so that we can expand Wikidata and at least recognise the subjects we have no articles about?
  5. When we show the traffic for the articles people edited in the last month, we gain an insight into what people actually read. We also congratulate people on the work they did and show appreciation. Does this kind of stimulus stimulate more articles? How do you stimulate for subjects that people hardly read (e.g. Indian nobility)? Do you compare with existing articles in the same category?
  6. There have been several Wikipedias that include bot generated texts. It is a famously divisive issue in the Wikipedia community. There has been no research done on this. With Wikidata there is an alternative way to exploit the underlying data. When the data is included in Wikidata, it is possible to generate text on the fly. This text may be cached for performance reasons but there are two main advantages: both the script and the data can be updated. The question is: does it serve a purpose for our readers? Will editors update the data or the script to improve results or will they use the text as a template for new articles? Will it take the heat out of the argument about generated texts? How will it affect projects that were not part of the existing controversy and does it work for them?
  7. Wikidata does not allow for the dating of its labels. It follows that it is not easily understood what the relation is between Jakarta and Batavia. How are such issues generally stored as data and what alternatives exist for Wikidata? How does it improve the usefulness of Wikidata as a general topic resource?
  8. Wikidata now includes data from sources like Swiss-Prot. What are the benefits to both parties? Does it make for people editing this data at Wikidata and what is the quality of such edits? Does it get noticed by Swiss-Prot and is there a cooperation happening? How is this organised and to what extent does “the community” interfere with the notions of academia? Do such communications exist or are these groups doing “their own thing”?
  9. What is the effect on the ultra small Wikipedias when generated texts are available based on available labels? Does it mean more interest in creating the templates for articles and work on labelling? What does it mean when such generated articles are available to search engines?
  10. At this time many articles in the English Wikipedia are written by university students. The result is positive on many levels but the question is: is what they write understood by Wikipedia readers? When students write their articles, it is mostly based on literature. It is well known that the bias in scientific papers is huge. Negative results are not published and many results from studies are ignored. The question would be: is sufficient weight given to debunking studies or are they put aside with an argument of a “neutral point of view”? This would make sense when students are graded on what they write given accepted fact at the university.
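Proposals 2 and 6 both lean on text that is generated from data. A minimal sketch of the idea follows, assuming an item is a plain dictionary and a "script" is a per-language template string. The function `describe`, the field names and the example item are hypothetical and do not reflect how Wikidata or any existing description generator works.

```python
def describe(item, templates, language="en"):
    """Generate a short description for an item, or None when that is not possible.

    `item` holds the data, `templates` holds one template string per language.
    Both can be updated independently of each other.
    """
    template = templates.get(language)
    if template is None:
        return None  # no script for this language yet
    try:
        return template.format(**item)
    except KeyError:
        return None  # the data lacks a field the template needs


# Hypothetical placeholder data and template.
item = {"label": "Example Person", "role": "queen", "realm": "Example Kingdom"}
templates = {"en": "{label}: {role} of {realm}"}

print(describe(item, templates))

# Updating the data changes the generated text without touching the script.
item["realm"] = "Another Kingdom"
print(describe(item, templates))
```

Because nothing is stored as finished prose, a correction to the data or an improvement to the template shows up the next time the text is generated.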

Saturday, November 26, 2016

The problem with #science explained with #Wikipedia

It is a recurring theme. People study a subject and reality is different. The science is flawless, the results are impressive and indeed important strides are made forward. The study of heart disease is a great example; many studies resulted in an improved life expectancy for men. Particularly white men. The Dutch Hartstichting is raising funds for new research because of this existing bias in research. For women in the Netherlands, heart disease is the number one killer because heart disease is different in women; it was not noticed before because heart disease in women was not studied.

Wikipedia as it is commonly known in research has the same problem. It is not Wikipedia as we know it; it is English Wikipedia. My contributions to Wikipedia have not been to English Wikipedia; they went to the Dutch Wikipedia, and I will not be noticed as one of the most prolific contributors to Wikimedia projects because my contributions to "Wikipedia" are hardly significant.

As I blogged before, scientific papers do not get published when they do not involve English Wikipedia. The consequence is that when people quote research, their quotes include this bias and strictly speaking what they say is not necessarily true when you consider Wikipedia as a whole. The problem with biased research is that the policies of the WMF are based on the known "facts".

Nothing new so far. We all know it when we are honest. So what can we do to remove some of the bias? The first thing is to devalue any and all research that is English Wikipedia only. It covers less than half of what we do. The second thing is to evaluate research for its algorithms. When both the algorithms and the data are available, it is possible to run the algorithm on a more inclusive data set and check the validity. With the quality of Wikidata data as a source on all the Wikipedias improving, such an approach is increasingly feasible. The last thing is for the Wikimedia Foundation itself to address this bias. With English Wikipedia being less than 50% of its traffic and workflow, it would be good if a similar percentage of its efforts were focused on the bigger half of what we all do.

So what is the harm? We expect all Wikipedians largely to do what "Wikipedians" do. However, we are not all English Wikipedians. The needs other people have are not discussed, not taken seriously. We have seen wonderful examples of potential functionality showcased but it is not taken further, not taken into production because it does not fit the preconceived ideas of what we do; it is not part of the road map. The projects in Wikidata are not about Wikidata but about how to make us all into one big data glob, and USING the data is only seen in relation to Wikipedia articles. We do not know how much Wikidata is used; some studies are done but they are in relation to "Wikipedia" and that is not relevant to me. We find that Wikisource gains more and more content that may be valuable to our readers but we do not market this data because we never did marketing for Wikipedia. There are several websites that only do this in a way that could be much improved if we took Wikisource seriously.

It hurts us to only consider English Wikipedia and this bias in research and policy is more damaging than the bias that is considered by the English Wikipedians.

Wednesday, November 23, 2016

#Bias in #research

Actually, it starts with something else. You need to publish, so you have to select a subject to study that will be of interest to the publisher.

As a consequence hardly any research is done about the other Wikipedias. I have been informed by a reliable source that it has to be English or it will not be published.

Now Wikimedia Foundation, how about that? Is there any research done on Wikipedia or is all the research biased in this way?

Tuesday, November 01, 2016

#Wikidata year 4; What Gupta year is that?

Wikidata is celebrating its fourth birthday. It is celebrated with some mighty fine gifts. It is a time to reflect on what has gone before and what is ahead of us. Obviously there are challenges we face and my gift is a set of queries / questions I do not know how to address. I focus on the Gupta empire because it currently has my interest.

During the era of the Gupta empire there was a "Gupta year". An article refers to it and my first question is: what date would the birthdate of Wikidata be in Gupta years?
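A back-of-the-envelope answer could look like the sketch below. It rests on two assumptions that are not from this post: that the Gupta era is counted from roughly 319 CE (accounts differ by about a year) and that Wikidata went live on 29 October 2012. It also ignores that the Gupta new year does not fall on 1 January, so the result can be off by one. The function name is made up for the example.

```python
from datetime import date

# Assumption: the Gupta era is counted from about 319 CE. Sources differ by
# roughly a year, so treat this constant as adjustable, not as settled fact.
ASSUMED_GUPTA_EPOCH_CE = 319


def approximate_gupta_year(day, epoch_ce=ASSUMED_GUPTA_EPOCH_CE):
    """Return a rough Gupta year for a Gregorian date.

    Only the calendar year is used, so the answer may be off by one
    depending on where the Gupta new year falls.
    """
    return day.year - epoch_ce


# Assumption: launch date of Wikidata.
wikidata_birthday = date(2012, 10, 29)
print(approximate_gupta_year(wikidata_birthday))
```

A real answer needs the epoch and the start of the year as data, preferably with a source, which is precisely what I would like Wikidata to be able to express.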

Obviously there are many maps including the Gupta empire. Can I have them sorted by date please? What other countries border the Gupta empire? Who were its rulers and how does the map change over time?

To get answers is nice but for me it is important that the algorithms involved are relevant to any country old and new, and relevant to timelines old and new. When we can express dates in the "Year Gupta", we can check if dates in Wikidata are indeed Julian or maybe Gregorian.

When we have continuity in maps over time, we will know of what country, what culture, a location such as a city or the land of a tribe is a part.

Wikidata live long and prosper :)