Incorporate Social Indicators Feedback in Original Paper #20

Open
notconfusing opened this issue May 14, 2015 · 8 comments

@notconfusing
Owner

For all to read, but particularly those working on the future end paper, @Masssly and @hargup: these were the critiques of our original paper at the Revise and Resubmit stage:

COMMENTS FOR THE AUTHOR:

Reviewer #1: With all its merits as an innovative research project, the "WIGI" index article suffers from two main critical flaws that undermine the interesting attempt to explore the gender inequality as represented in Wikipedia's biographies: over-generalizations and over-statements. Thus I suggest the author(s) to revise them accordingly.

First, the author(s) must specify clearly the scope and the unit of analysis of the research, otherwise the generalizations made, implicitly or explicitly in the article, will continue to undermine the validity of the main arguments and confuse the readers. Since the scope of research is Wikipedia biographies that are coded in Wikidata project, it must be pointed out the gaps between the actual and the expected scope of research. The actual scope is the biographies that are coded (and thus represented) in Wikidata, which may or may not be the same as the scope covered by the Wikipedia biographies. Also, it must be pointed out that the scope of Wikipedia biographies may be partial as well when considering the totality of available world's biographies. Such clarifications will enhance the analytical strengths and limitations of the approach. Instead of arguing for "an academic index allowing comparative study of gender inequality through space and time", the authors should use more adequate and moderate arguments such as "a set of indicators that allow the measurement and monitoring the representative inequality of gender in biographies across countries and throughout time". The key terms are representation and biographies. Without these specification key terms, over-generalizations are made.

Second, while the author(s) comparative strategies (including using Inglehart-Welzel cultural clusters) serve very well in reducing the complexity of the datasets for interesting insights, the discussions and conclusions made tend to over-state what the findings may imply. For instance, the found relatively high ranking of the Confucian and South Asian clusters were described by the author(s) as "surprising", and then the author(s) proceed to propose the Celebrity Hypothesis (7.11) to explain such a surprise. What the author(s) should do and must do is instead use this opportunity to discuss the methodological issue that gender representative inequality may be quantitatively reduced by more presence of celebrity in biographies, but such phenomena may conceal, rather than reveal the actual social gender inequality for cross-cultural comparisons. In other words, the author(s) should continue to advance the discussions on the found patterns to show that different aggregation/comparison strategies may reveal/conceal certain aspects of inequality. Attention must be paid to specify which aspects and how.

The two main points above lead to a critical question on the use of the term "index". The author(s) may want to refer the following sources:

  • Robert J. Rossi; Kevin J. Gilmartin (1980). The handbook of social indicators: sources, characteristics, and analysis. Garland STPM Press. p. 175. ISBN 978-0-8240-7135-6.
  • Fanchette, Serge (1974). "Social Indicators: Problems of Methodology and Selection". In Social Indicators: Problems of Definition and Selection. Paris: UNESCO Press.

and realize that an "index" refers to a specific "weighted combination of two or more indicators".

There is little discussions on how the so-called WIGI index may be weighted combinations of some kind. What the paper has already achieved successfully is to derive a set of gender representation inequality indicators in various aggregated time and country categories, which in turn provide interesting findings about the Wikipedia/Wikidata themselves. However, these outcomes are not yet weighted in combinations to derive an "index" about societies in general. Though I did not mean to say that an index is necessary for the article to succeed, it is important to manage the readers' expectation adequately. The authors' may want to give some thoughts about what this paper has already achieved and what they want to invite the readers of "Social Indicators" to contribute the further development of the work presented here. I personally think that my suggested description "a set of indicators that allow the measurement and monitoring the representative inequality of gender in biographies across countries and throughout time" is more adequate. The author(s) can further discuss, as they have done partially in the paper, the limitations and possible extensions of the current research efforts. For instance, the low percentage of country information (23.47%) in Table 1 seriously undermines any further attempts to conduct cross-country comparisons (not to mention the aggregated outcomes based on nine world cultures consistently), including the ranking in Table 2. The author(s) may want to suggest a few research strategies in the future to tackle such a major data problem.

This work has pioneered work in measuring and monitoring biographies in Wikipedia/Wikidata projects. A successful revision will require the author(s) to better defend their contribution in showing the representation inequality, instead of the overstated claim on "worldwide longitudinal gender inequality trends". Given the fact that many of the data points' country information are missing, the author(s) should acknowledge such a fact and try to turn this point of weakness into a point of strength in signaling the incomplete area for Wikipedia's improvement in the area. In other words, instead of suggesting the Wikipedia/Wikidata outcomes can be "indicative" or "reflective" of social phenomena, the author(s) should focus on the "representative" nature of Wikipedia/Wikidata in documenting social outcomes. The proposed celebrity hypothesis also highlights the "representative" dimension. In other words, instead of claiming that Wikipedia/Wikidata may reveal and/or reflect the world's gender inequality, it may be more defensible that the proposed efforts help to identify the representative inequality across different cultural, linguistic and social categories as represented in Wikipedia/Wikidata projects.

@notconfusing notconfusing added this to the Phase D milestone May 14, 2015
@piokon
Collaborator

piokon commented May 14, 2015

Max and I already have our replies to the reviewer, but before we share
them, we would appreciate it if any interested readers shared their views
with us (as we don't want to influence you with our thoughts).

The paper is at
https://docs.google.com/document/d/1RbXH0hBp5Y_HqXUcpSUZ4d3c5Y_AhNKEmIhGdV9FF4U/edit

  • let us know if anyone needs additional permissions to leave comments
    inside

Piotr Konieczny, PhD
http://hanyang.academia.edu/PiotrKonieczny
http://scholar.google.com/citations?user=gdV8_AEAAAAJ
http://en.wikipedia.org/wiki/User:Piotrus


@hargup
Collaborator

hargup commented May 14, 2015

@piokon @notconfusing So as not to influence other project members, I have sent my feedback by mail.

@Masssly

Masssly commented May 15, 2015

@piokon @notconfusing
Sent my comments by mail

@piokon
Collaborator

piokon commented May 20, 2015

I think at this point we can share your analysis with the group, so
everyone can read it? What do you think?


@piokon
Collaborator

piokon commented May 20, 2015

I don't believe I ever got them, would you mind sending them here?

(Perhaps I am confused about something, but this is a listserv, right?)


@Masssly

Masssly commented May 21, 2015

Inclusion of a section header 6.3 that clearly spells out the scope of the study:

Scope of the study
The scope of the study is Wikipedia biographies that are coded in Wikidata. Wikidata is a project that aims to gather data from all Wikipedias in a single location in a form that can both be read by humans and be processed by machines. Its essence is a document oriented database that provides a common source of certain structured data types (for example, birth and death dates) which can be used by Wikipedia and sister Wikimedia projects.

The unit measured is the ratio of female Wikipedia biographies to total Wikipedia biographies, set against the background of these female personalities, that is, their place of birth and time of birth. When a biography does not contain a place of birth, citizenship will be used in its stead if available. In addition, time of birth in a broad sense will refer to a specific timeframe within which a person was born, not necessarily the exact date....still expanding!
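
A minimal sketch of the unit of analysis described above, assuming each biography is a plain record with the hypothetical fields `gender`, `place_of_birth`, `citizenship`, and `birth_year` (illustrative names, not the actual Wikidata properties or the project's code). It falls back to citizenship when place of birth is missing and uses decades as one possible reading of "timeframe":

```python
from collections import defaultdict

def female_ratio_by_group(biographies, bucket_years=10):
    """Return {(country, period_start): female share} for usable records."""
    totals = defaultdict(int)
    females = defaultdict(int)
    for bio in biographies:
        # Prefer place of birth; fall back to citizenship when it is absent.
        country = bio.get("place_of_birth") or bio.get("citizenship")
        year = bio.get("birth_year")
        if country is None or year is None:
            continue  # no usable country or time information
        # Bucket the birth year into a timeframe (assumed here: decades).
        period = (year // bucket_years) * bucket_years
        key = (country, period)
        totals[key] += 1
        if bio.get("gender") == "female":
            females[key] += 1
    return {key: females[key] / totals[key] for key in totals}

# Made-up sample records, for illustration only.
sample = [
    {"gender": "female", "place_of_birth": "Ghana", "birth_year": 1985},
    {"gender": "male", "place_of_birth": None, "citizenship": "Ghana", "birth_year": 1981},
    {"gender": "male", "place_of_birth": "India", "birth_year": 1990},
    {"gender": "female", "birth_year": 1970},  # skipped: no country at all
]
ratios = female_ratio_by_group(sample)
```

Records with neither place of birth nor citizenship are dropped rather than guessed, which is why the share of biographies carrying country information matters for any cross-country comparison.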

Though it has already been mentioned briefly, I think the systemic biases of Wikipedia need to be made clear to the reader in the ways they can potentially influence “biography” articles on Wikipedia. Also to be included in the scope of the study:

Wikipedia suffers from general systemic biases, and this directly influences the nature or space of biographies that editors choose to create or improve, as well as the quality of such biography articles. A 2005 University of Würzburg Wikipedia user survey [1] found that “the common characteristics of ‘average Wikipedians’ inevitably color the content of Wikipedia”. The study defined the average Wikipedian on the English Wikipedia, for example, as
(1) a male,
(2) technically inclined,
(3) formally educated,
(4) an English speaker (native or non-native),
(5) aged 15–49,
(6) from a majority-Christian country,
(7) from a developed nation,
(8) from the Northern Hemisphere, and
(9) likely employed as a white-collar worker or enrolled as a student rather than employed as a blue-collar worker.
Cohen (2011) and the WMF survey report (2011) found women to be underrepresented on Wikipedia. Lam et al. (2011) went further, suggesting that the under-representation of women contributors could have a detrimental effect on content coverage.

Access to the internet, which is a major requirement for editing Wikipedia and creating/improving biography articles, tends to be at the disposal of people in developed nations. Nelson, Anne ("Wikipedia Taps College ‘Ambassadors’ to Broaden Editor Base") notes that eighty percent of Wikipedia page views and 83% of global edits come from the Global North. Most countries in the Southern Hemisphere have disproportionately less access to information technology, which easily translates into a technical inability to contribute to Wikipedia.[2][3][4][5]

Indo-European languages, most notably those of Anglophone countries, dominate Wikipedia contributions. The majority of the world's population lives in the Northern Hemisphere, which is mostly Anglophone. No wonder that, of the over 35 million articles across the different language editions of Wikipedia, nearly 45% belong to only 8 Indo-European languages (English, Swedish, Dutch, German, French, Russian, Italian, Spanish), with the English Wikipedia, which also happens to be the largest, making up 13.9%.[6]

Among the biases on Wikipedia are the unavailability of sources in some languages and the high cost of accessing quality sources, for example from journals. Because reliable sources are required by Wikipedia policy, topics are limited in their contents by the sources available to editors. This is a particularly acute problem for biographies of living persons. Sources published in a medium that is both widely available and familiar to editors, such as a news website, are more likely to be used than those from esoteric or foreign-language publications, regardless of their reliability.[7] ...There is a long list of other biases I have gathered!

Because of the tendentious nature of Wikipedia contributions and the disproportionate distribution of Wikipedia articles, the study expects biography articles on Wikipedia to be influenced by the above-mentioned circumstances of Wikipedia editors, and expects these effects to spill over onto the units of measurement used for the analysis.

Furthermore, the Wikidata project is in its initial development stages and, as of August 2013, the database was 106.6 gigabytes in size with seventeen (17) million statements created.[8] The data imported for the analysis is therefore not representative of all biographies presently contained in all Wikipedias. In addition, only 28.34 percent of biography articles as of October 2014 contained the item “country”. This renders any cross-country inferences based on the data inconclusive for now.

In response to:

Instead of arguing for "an academic index...the authors should use more adequate and moderate arguments such as "a set of indicators that allow the measurement and monitoring the representative inequality of gender in biographies across countries and throughout time".

Since we’re not only producing a paper but also an automated statistical presentation of gender in articles by certain categories, and in heeding the reviewer's advice to use moderate language, I think it is crucial that we state the main aim of the paper as “to develop a set of indicators to be used as a tool for measuring and monitoring the representative inequality of gender in biographies across countries and throughout time”.

In response to:
...the author(s) found relatively high ranking of the Confucian and South Asian clusters were described by the author(s) as "surprising", and then the author(s) proceed to propose the Celebrity Hypothesis (7.11) to explain such a surprise. and What the author(s) should do and must do is...continue to advance the discussions on the found patterns to show that different aggregation/comparison strategies may reveal/conceal certain aspects of inequality.

In relation to the celebrity hypothesis, we can produce another heatmap that excludes celebrities altogether (per the tested celebrity terms) to see whether their absence has an effect on the number of female biographies. If the heated areas are reduced, that corroborates the celebrity hypothesis somewhat, though not yet conclusively.
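
As a rough sketch of that check, assuming each biography carries a list of occupation labels: the term set `CELEBRITY_TERMS`, the field names, and the sample records below are hypothetical stand-ins, not the terms or data actually tested in the paper. The idea is simply to compare the female share before and after dropping records that match any celebrity term:

```python
# Hypothetical celebrity terms; the paper's tested terms may differ.
CELEBRITY_TERMS = {"actor", "singer", "model"}

def female_share(biographies):
    """Share of biographies marked female, or None for an empty list."""
    total = len(biographies)
    if total == 0:
        return None
    return sum(1 for b in biographies if b["gender"] == "female") / total

def without_celebrities(biographies, terms=CELEBRITY_TERMS):
    """Keep only biographies whose occupations match none of the terms."""
    return [b for b in biographies if not terms & set(b.get("occupations", []))]

# Made-up sample, for illustration only.
sample = [
    {"gender": "female", "occupations": ["singer"]},
    {"gender": "female", "occupations": ["politician"]},
    {"gender": "male", "occupations": ["actor"]},
    {"gender": "male", "occupations": ["scientist"]},
    {"gender": "male", "occupations": ["writer"]},
]
share_all = female_share(sample)
share_filtered = female_share(without_celebrities(sample))
```

Running the same comparison per cultural cluster would give the inputs for the second heatmap; a drop in the female share after filtering is consistent with the hypothesis but does not establish it.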

Moving forward, I believe the key terms used for the celebrity tests are direct translations of the same terms in English. These professions, though common among Western Europeans, may not be easily associated with females in other demographics where their roles or types of jobs differ, or may be perceived significantly differently. For example, a “model” in the Islamic/African/Confucian cluster may not hold the same weight or celebrity status as a model in Western Europe. I can investigate further what defines a celebrity in the different cultural clusters, and who fits the description best in each, so that we do not have to use the same generic English terms.

Furthermore, the notion that a large positive percentage in Fig. 10 (Difference in female ratio by language-unique and language-many articles by language of Wikipedia) indicates a “focus to write more female-oriented local hero articles” can be viewed from other perspectives, and I think that should also be mentioned/explored in the study. The thrust of European political power, commerce, and culture, though present, is less felt in Confucian societies than in Islamic and African societies, where its presence is immense. In societies where this exchange (or rather the handing down) of culture is more prominent, we would expect that “local heroes” may not be so local after all. We can find out whether biographies from non-European languages that end up having interwiki links are mostly translated into English or other European languages. If that is the case, it might explain why the Confucian cluster dominates the chart with language-unique biographies, weakening the strong assertion that this is the result of a “focus to write more female-oriented local hero articles”.

Also, given the period of birth or death and characteristics of biography articles such as occupation or cultural cluster, it would be interesting to perform multiclass LR analysis on the data to see if we can unearth a trend in the ratio of mean article size by gender for the Top 25 Wikipedias by language. I’m referring to Fig. 11.

Lastly, it would be interesting to replicate all the computations and graphs on a different but similar set of clusters to see if there are variations or observable differences in results. Globe cultural clusters based on the effects of globalization are a compelling candidate. Its clusters are Anglo, Confucian Asia, Eastern Europe, Germanic Europe, Latin America, Latin Europe, Middle East, Nordic Europe, Southern Asia, Sub-Sahara Africa.[9]

In response to:
There is little discussions on how the so-called WIGI index may be weighted combinations of some kind. What the paper has already achieved successfully is...However, these outcomes are not yet weighted in combinations to derive an "index" about societies in general.

Indexes are meant to summarize and rank specific observations. Figure 6 is a subset of Fig. 7, since the latter adds a third dimension of culture to the existing gender ratios against time. We can further investigate how gender ratios behave against time, culture, and occupation. The next set of graphs can be combined to produce a plot of the magnitude of the difference between language-unique and language-many female biographies by culture (as a percentage) against mean article size. If we assign the same weight to both indicators, it is possible to formulate an index that encapsulates all the variables, looking at how they relate to each other.
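
One way to read "same weight to both indicators", in line with the reviewer's definition of an index as a weighted combination of two or more indicators, is sketched below. It assumes min-max scaling so the two indicators are comparable before they are averaged; the scaling choice, the function names, the cluster labels, and all values are assumptions made for illustration, not results from the paper:

```python
def min_max(values):
    """Rescale {group: value} to the 0..1 range (all zeros if constant)."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {group: 0.0 for group in values}
    return {group: (v - lo) / (hi - lo) for group, v in values.items()}

def composite_index(indicators, weights):
    """Weighted average of rescaled indicators, per group.

    indicators: {indicator_name: {group: value}}
    weights:    {indicator_name: weight}
    """
    total_weight = sum(weights.values())
    scaled = {name: min_max(vals) for name, vals in indicators.items()}
    # Only groups present in every indicator can be scored.
    groups = set.intersection(*(set(vals) for vals in scaled.values()))
    return {
        g: sum(weights[name] * scaled[name][g] for name in scaled) / total_weight
        for g in groups
    }

# Made-up values for three placeholder clusters.
indicators = {
    "unique_vs_many_gap_pct": {"A": 10, "B": 30, "C": 20},
    "article_size_ratio": {"A": 0.5, "B": 1.0, "C": 1.5},
}
weights = {"unique_vs_many_gap_pct": 0.5, "article_size_ratio": 0.5}
index = composite_index(indicators, weights)
```

With equal weights, clusters B and C tie in this toy example even though they differ on each indicator, which illustrates the reviewer's point that the aggregation strategy can conceal as well as reveal differences.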

References

[1] http://www.psychologie.uni-wuerzburg.de/ao/research/wikipedia.php?lang=en
[2] Mossberger, Karen (2009). "Toward digital citizenship: addressing inequality in the information age". In Chadwick, Andrew. Routledge handbook of Internet politics. Taylor & Francis. ISBN 9780415429146.
[3] Cavanagh, Allison (2007). Sociology in the age of the Internet. McGraw-Hill International. p. 65. ISBN 9780335217250.
[4] Chen, Wenhong & Wellman, Barry (2005). "Minding the Cyber-Gap: the Internet and Social Inequality". In Romero, Mary & Margolis, Eric. The Blackwell companion to social inequalities. Wiley-Blackwell. ISBN 9780631231547.
[5] Norris, Pippa (2001). "Social inequality". Digital divide: civic engagement, information poverty, and the Internet worldwide. Cambridge University Press. ISBN 9780521002233.
[6] Mark Graham. "Wikipedia's known unknowns". The Guardian.co.uk.
[7] https://en.wikipedia.org/w/index.php?title=Wikipedia:Systemic_bias&oldid=662207541
[8] https://upload.wikimedia.org/wikipedia/commons/8/85/State_Of_Wikidata.pdf
[9] https://en.wikipedia.org/wiki/Global_Leadership#Globe_cultural_clusters


@piokon
Collaborator

piokon commented May 22, 2015

Mohammed, that's very interesting. I think the scope of the study should
be in the introduction, rather than in the middle. I'll see about
incorporating your scope, and discussion of biases, into the article.

Regarding your analysis of the Confucian hypothesis, I think it may be
better if you try to add it to our draft yourself, and see where it
fits. Regarding testing it further, all of them are interesting ideas,
but they are not something I can do (I don't code). Seeing as our word
count is going up, I wonder if we shouldn't consider removing the
celebrity hypothesis entirely, and writing another paper dedicated to it
in the future. It is very interesting, but as we have all agreed, the
paper is introducing the WIGI hypothesis, and the celebrity hypothesis
is going beyond it, trying to interpret it and therefore adding extra
complexity to the paper that may not be needed. Thoughts?

Piotr Konieczny, PhD
http://hanyang.academia.edu/PiotrKonieczny
http://scholar.google.com/citations?user=gdV8_AEAAAAJ
http://en.wikipedia.org/wiki/User:Piotrus

On 5/21/2015 20:40, Mohammed Sadat Abdulai wrote:

Inclusion of a section header 6.3 that clearly spells out the scope of
the study:

Scope of the study
The scope of the study is Wikipedia biographies that are coded in
Wikidata. Wikidata is a project that aims to gather data from all
Wikipedias in a single location in a form that can both be read by
humans and be processed by machines. Its essence is a document
oriented database that provides a common source of certain structured
data types (for example, birth and death dates) which can be used by
Wikipedia and sister Wikimedia projects.

The units that would be measured are the ratio of female Wikipedia
biographies to total Wikipedia biographies against the background of
these female personalities, that is, their place of birth and time of
birth. When a biography does not contain place of birth, citizenship
will be used in its stead if that is available. In addition, time of
birth in a broad sense will refer to a specific timeframe within which
a person is born but not necessarily the exact date....still expanding!

Though it has already been mentioned briefly. I think the systematic
biases of Wikipedia need to be made clear to the reader in the ways it
can potentially influence “biography” articles on Wikipedia. Also to
be included in the scope of the study:

Wikipedia suffers from general systematic biases and this directly
influences the nature or space of biographies that editors would
choose to create or improve as well as the quality of such biography
articles. A 2005 University of Würzburg Wikipedia User survey [1],
found that “the common characteristics of ‘average Wikipedians’
inevitably color the content of Wikipedia”. The study defined the
average Wikipedian on the English Wikipedia for example as
(1) a male,
(2) technically inclined,
(3) formally educated,
(4) an English speaker (native or non-native),
(5) aged 15–49,
(6) from a majority-Christian country,
(7) from a developed nation,
(8) from the Northern Hemisphere, and
(9) likely employed as a white-collar worker or enrolled as a student
rather than employed as a blue-collar worker.
(Cohen, 2011) and (WMF survey Report, 2011) found women to be
underrepresented on Wikipedia. (Lam et al, 2011) went further to
suggest that the under-representation of Women contributors could have
a detrimental effect on content coverage.

Access to internet, which is a major requirement to editing Wikipedia
and creating/improving biography articles tend to be at the disposal
of people in developed nations. (Nelson, Anne. "Wikipedia Taps College
‘Ambassadors’ to Broaden Editor Base") notes that Eighty percent of
Wikipedia page views and 83% of global edits come from the Global
North. Most countries in the Southern hemisphere have
disproportionately less access to information technology which easily
translates to technical inability to contribute to Wikipedia.[2][3][4][5]

Indo-European languages most notably from Anglophones countries
dominate Wikipedia contributions. The majority of the world's
population lives in the Northern Hemisphere, which is mostly
Anglophone. No wonder of the over 35 million different language
editions of Wikipedia, nearly 45% of them belong to only 8
Indo-European languages (English, Swedish, Dutch, German, French,
Russian, Italian, Spanish) with the English Wikipedia which also
happens to be the largest, making up 13.9%.[6]

Among the biases on Wikipedia are the unavailability of sources in
some languages and the high cost of accessing quality sources such as
journals. Because reliable sources are required by
Wikipedia policy, topics are limited in their contents by the sources
available to editors. This is a particularly acute problem for
biographies of living persons. Sources published in a medium that is
both widely available and familiar to editors, such as a news website,
are more likely to be used than those from esoteric or
foreign-language publications, regardless of their reliability.[7]
...There is a long list of other biases I have gathered!

Because of the tendentious nature of Wikipedia contributions and the
disproportionate distribution of Wikipedia articles, the study expects
biography articles on Wikipedia to be influenced by the
above-mentioned circumstances of Wikipedia editors, and expects these
effects to spill over onto the units of measurement used for the
analysis.

Furthermore, the Wikidata project is in its initial development stages;
as of August 2013, the database was 106.6 gigabytes in size, with
seventeen (17) million statements created.[8] The data imported for
the analysis is therefore not representative of all biographies
presently contained in all Wikipedias. In addition, only 28.34 percent
of biography articles as of October 2014 contained the item “country”.
This leaves any cross-country inferences based on the data
inconclusive for now.

In response to:

Instead of arguing for "an academic index...the authors should use
more adequate and moderate arguments such as "a set of indicators that
allow the measurement and monitoring the representative inequality of
gender in biographies across countries and throughout time".

Since we’re producing not only a paper but also an automated
statistical presentation of gender in articles by certain categories,
and in heeding the reviewer's advice to use moderate language, I
think it is crucial that we state the main argument of the paper
as “to develop a set of indicators to be used as a tool for
measuring and monitoring the representative inequality of gender in
biographies across countries and throughout time”.

In response to:

...the author(s) found relatively high ranking of the Confucian and
South Asian clusters were described by the author(s) as "surprising",
and then the author(s) proceed to propose the Celebrity Hypothesis
(7.11) to explain such a surprise. and What the author(s) should do
and must do is...continue to advance the discussions on the found
patterns to show that different aggregation/comparison strategies may
reveal/conceal certain aspects of inequality.

In relation to the celebrity hypothesis, we can produce another
heatmap that excludes celebrities altogether (per the tested celebrity
terms) from the data, to see whether their absence has an effect on
the number of female biographies. If the heated areas are reduced, that
corroborates the celebrity hypothesis to some degree, though not yet
conclusively.
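
The comparison behind such a heatmap could be sketched roughly as
below. This is only an illustration under assumptions: the record
fields (`cluster`, `gender`, `occupation`), the term list, and the toy
data are all hypothetical and are not the tested celebrity terms or
the actual WIGI data.

```python
from collections import defaultdict

# Hypothetical celebrity terms; the terms actually tested in the paper may differ.
CELEBRITY_TERMS = frozenset({"actor", "model", "singer"})


def female_ratio_by_cluster(biographies, exclude_terms=frozenset()):
    """Return {cluster: share of female biographies}, skipping any
    biography whose occupation is in exclude_terms."""
    totals = defaultdict(int)
    females = defaultdict(int)
    for bio in biographies:
        if bio["occupation"] in exclude_terms:
            continue
        totals[bio["cluster"]] += 1
        if bio["gender"] == "female":
            females[bio["cluster"]] += 1
    return {cluster: females[cluster] / totals[cluster] for cluster in totals}


# Toy records, invented for illustration only.
sample = [
    {"cluster": "Confucian", "gender": "female", "occupation": "singer"},
    {"cluster": "Confucian", "gender": "female", "occupation": "politician"},
    {"cluster": "Confucian", "gender": "male", "occupation": "politician"},
    {"cluster": "Confucian", "gender": "male", "occupation": "writer"},
    {"cluster": "Western Europe", "gender": "female", "occupation": "writer"},
    {"cluster": "Western Europe", "gender": "male", "occupation": "actor"},
]

with_celebs = female_ratio_by_cluster(sample)
without_celebs = female_ratio_by_cluster(sample, CELEBRITY_TERMS)

# Change in female share per cluster once celebrities are removed.
change = {c: without_celebs[c] - with_celebs[c] for c in without_celebs}
```

A drop in a cluster's female share after the exclusion would be
consistent with the celebrity hypothesis for that cluster; the two
ratio tables could then feed the two heatmaps.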

Moving forward, I believe the key terms used for the
celebrity tests are direct translations of the same terms in English.
These professions, though common among western Europeans, may either
not be readily associated with females in other demographics, where
their roles and types of jobs differ, or may be perceived
significantly differently. For example, a “model” in the
Islamic/African/Confucian cluster may not hold the same weight or
celebrity status as a model in Western Europe. I can investigate
further what defines a celebrity in the different cultural
clusters, and what it means (who fits the description best) in each
cluster, so that we do not have to use the same generic English
terms.

Furthermore, the notion that a large positive percentage in
Fig. 10 (Difference in female ratio by language-unique and
language-many articles by language of Wikipedia) indicates a “focus to
write more female-oriented local hero articles” can be viewed from
other perspectives, and I think those should also be mentioned and
explored in the study. The thrust of European political power,
commerce, and culture, though present, is less felt in Confucian
societies than in Islamic and African societies, where its presence is
immense. In such
societies where the exchange (or rather the handing down) of culture is
more prominent, we would expect that “local heroes” may not be so local
after all. We can find out whether biographies from non-European
languages that end up having inter-Wiki links are mostly translated
into English or other European languages. If that is the case, it
might explain why
the Confucian cluster dominates the chart with language-unique
biographies, weakening the strong assertion that this is the result of
a “focus to write more female-oriented local hero articles”.

Also, given the period of birth or death, and the characteristics of
biography articles such as occupation or cultural cluster, it would be
interesting to perform multiclass LR analysis on the data to see if we
can unearth a trend in the ratio of mean article size by gender for
the Top 25 Wikipedias by language. I’m referring to Fig. 11.

Lastly, it would be interesting to replicate all the computations and
graphs on a different but similar set of clusters to see if there are
variations or observable differences in results. The Globe cultural
clusters, based on the effects of globalization, are a compelling
candidate. The clusters are Anglo, Confucian Asia, Eastern Europe,
Germanic Europe, Latin America, Latin Europe, Middle East, Nordic
Europe, Southern Asia, Sub-Sahara Africa.[9]

In response to:
There is little discussions on how the so-called WIGI index may be
weighted combinations of some kind. What the paper has already
achieved successfully is...However, these outcomes are not yet
weighted in combinations to derive an "index" about societies in general.

Indexes are meant to summarize and rank specific observations. Figure
6 is a subset of Fig. 7, since the latter adds a third dimension of
culture to the existing gender ratios against time. We can further
investigate how gender ratios behave against time, culture, and
occupation. The next set of graphs can be combined to produce a plot
of the percentage magnitude of the difference between
language-unique and language-many female biographies by culture,
against mean article size. If we assign the same weight to both
indicators, it is possible to formulate an index that encapsulates all
the variables and shows how they relate to each other.
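
One possible way to combine two indicators with equal weights is
sketched below. It is an assumption-laden illustration, not the WIGI
method: min-max rescaling is my own choice of normalization, and the
cluster labels and numbers are invented.

```python
def min_max(values):
    """Rescale a {key: number} mapping to the range [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    # If all values are equal there is no spread to rescale; use 0.0.
    return {k: (v - lo) / span if span else 0.0 for k, v in values.items()}


def equal_weight_index(indicator_a, indicator_b):
    """Average two rescaled indicators with the same weight (0.5 each),
    for the keys present in both."""
    a, b = min_max(indicator_a), min_max(indicator_b)
    return {k: 0.5 * a[k] + 0.5 * b[k] for k in a.keys() & b.keys()}


# Invented numbers for three hypothetical clusters.
unique_vs_many_diff = {"cluster_a": 10.0, "cluster_b": 0.0, "cluster_c": 5.0}
mean_size_ratio = {"cluster_a": 1.0, "cluster_b": 0.8, "cluster_c": 0.8}

index = equal_weight_index(unique_vs_many_diff, mean_size_ratio)
ranking = sorted(index, key=index.get, reverse=True)
```

Rescaling first keeps an indicator measured in percentage points from
swamping one measured as a ratio; other normalizations or unequal
weights would give a different ranking, which is itself worth
discussing in the paper.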

References

[1] http://www.psychologie.uni-wuerzburg.de/ao/research/wikipedia.php?lang=en

[2] Mossberger, Karen (2009). "Toward digital citizenship: addressing
inequality in the information age". In Chadwick, Andrew. Routledge
handbook of Internet politics. Taylor & Francis. ISBN 9780415429146.

[3] Cavanagh, Allison (2007). Sociology in the age of the Internet.
McGraw-Hill International. p. 65. ISBN 9780335217250.

[4] Chen, Wenhong & Wellman, Barry (2005). "Minding the Cyber-Gap: the
Internet and Social Inequality". In Romero, Mary & Margolis, Eric. The
Blackwell companion to social inequalities. Wiley-Blackwell. ISBN
9780631231547.

[5] Norris, Pippa (2001). "Social inequality". Digital divide: civic
engagement, information poverty, and the Internet worldwide. Cambridge
University Press. ISBN 9780521002233.

[6] Graham, Mark. "Wikipedia's known unknowns". The Guardian.co.uk.

[7] https://en.wikipedia.org/w/index.php?title=Wikipedia:Systemic_bias&oldid=662207541

[8] https://upload.wikimedia.org/wikipedia/commons/8/85/State_Of_Wikidata.pdf

[9] https://en.wikipedia.org/wiki/Global_Leadership#Globe_cultural_clusters


@Masssly

Masssly commented May 24, 2015

@piokon I am interested in exploring the Celebrity Hypothesis even further, and I'd also prefer that we deal with it rigorously in another paper. In the meantime, it may not be necessary to remove it entirely from the present paper; we can instead trim it down to a brief mention.
I will develop the Confucian hypothesis (in the coming days) and present it here, then we'll see if it's interesting enough to be incorporated.

@notconfusing Will time and resources allow us to do further tests?
