img folder: images
src folder: source code
Today one of the greatest collections of information is Wikipedia, consisting of more than 52 000 000 articles in 309 different languages. Each articles has some keywords within the text, which point to other articles with more information on these specific subjects. As such, the whole of Wikipedia (for a certain language) can be represented by a directed graph, with nodes representing the articles and arcs pointing to other articles.
The goal of this project will be to explicitly construct the network of articles for a certain language.
Once the network is constructed, it can be analyzed by studying different metrics, such as the distribution of vertex degrees, the average path length, the clustering coefficient,… (be creative!). Some interesting question could be:
- What do you expect for the distribution of vertex degrees and what do you actually find?
- Is the network completely connected or are there disconnected components which are not reachable from certain clusters of articles?
- Are there clear clusters of articles for certain subject/categories (e.g. biology, music, astronomy,…)? Are there quantitative differences between these clusters
- How many clicks does one on average need to reach a certain article starting from a random other article? I.e. does the network obey the small-world property (like Watts-Strogatz graphs)? What do you expect intuitively?
- Are there quantitative differences between the networks constructed for different languages (e.g. West-Vlaams and Hakka)?
As mentioned above the goal of this project is to create a network of Wikipedia articles based on the hyperlinks from one article to another. The languages that I decided to work on are the following:
- Romansh Wikipedia 4544 articles
- West Flemish Wikipedia 11321 articles
- Dutch Low Saxon Wikipedia 11077 articles
In order to scrap all Wikipedia articles, Python in combination with BeautifulSoup was used. The script accesses each Wikipedia article and searches for links to other Wikipedia articles. The drawback of this code is that it is slow and inefficient for large Wikipedias and its speed is based on the internet connection of the user. On the other hand it works pretty well for the size of Wikipedias that we want to work on.
The script first scans.all the articles from the “all articles” page and saves the title at a list (called a_links). Then it creates a NxN matrix with all values equal to zero (where N the number of articles). In the end, scans again all the articles and if a link is found, makes the specific cell that corresponds to these articles equal to 1. Of course links to images, navigation links and links to images have to be excluded.
How nodes are created:
Code that creates nodes between articles that have a link to each other:
import networkx as nx
G = nx.DiGraph() #graph constructor
for u in range(len(a_links)): #scan the matrix table
for v in range(len(a_links)):
if matrix[u][v] == 1: #if the is a non zero cell
G.add_edge(a_links[u],a_links[v]) #make an edge between the articles
Romansh | West Flemish | Dutch Low Saxon | |
---|---|---|---|
Number of Nodes | 4544 | 11338 | 11077 |
Number of Edges | 90986 | 253886 | 246389 |
Edges per node on average | 20,02 | 22,39 | 22,24 |
These are the nodes that have no link to another webpage. In the following image are some isolated articles from Romansh Wikipedia.
Romansh | West Flemish | Dutch Low Saxon |
---|---|---|
80 | 60 | 15 |
An example is the AMOLED article. As we can see there are no links to another webpage
Another example: Rakel
These are the articles with the most links from other articles.
Degree centrality of Romansh Wikipedia
We can see that the article with the most ingoing links is the Svizra (Switzerland)
Degree centrality of West Flemish Wikipedia
We can see that the article with the most ingoing links is the Bevolkenge (Population)
This is obvious because the West Flemish Wikipedia has many articles about cities and places all of which have a link to the 'Bevolkenge' article:
Here we can see the incoming links to 'Bevolkenge' articles
Degree centrality of Dutch Low Saxon Wikipedia
We can see that the article with the most ingoing links is the Nederland (Netherland)
Some articles with most nodes
WEST-FLEMISH | ROMANSCH | DUTCH LOW SAXON | |||
---|---|---|---|---|---|
West-Vloandern | 0,1514 | Chantuns da la Svizra | 0,1737 | Netherlands | 0,21 |
Belgie | 0,1427 | Svizra | 0,383 | Duutsland | 0,098 |
Nederlands | 0,08957 | Italia | 0,111 | Engels | 0,058 |
Frans | 0,0878 | Germania | 0,1052 | Duuts | 0,0574 |
Duutsland | 0,0732 | Frantsche | 0,0959 | Europa | 0,052 |
Familie (Biologie) | 0,0732 | Grunnegs | 0,1367 | ||
Animalia | 0,0621 | Stellingwarfs | 0,1269 |
Romansh | West Flemish | Low Saxon | |
---|---|---|---|
Density | 0,0044 | 0,0022 | 0,0024 |
Average Clustering | 0,447 | 0,2209 | 0,2431 |
Average Shortest path length | 3,471 | 3,048 | 2,963 |
The Romansh Wikipedia has the highest average shortest path length coefficient. We can assume that this has to do with the fact that the Romansh Wikipedia has many Geography articles. This makes the path from a Geography to a non-Geography article a lot longer.
In the following image there are some examples of the shortest paths between random nodes:
-
Romansh wiki has many geography articles and is focused around Switzerland
-
Flemish wiki has a wide variety of articles (geography, animals, etc) and many articles about Belgium and Flanders
-
Dutch Low-Saxon has also articles about geography and many articles connected to Dutch language and dialects
-
In all Wikipedia articles with max degree are geographically closely related to the country that speaks the language
Degree rank plot
Degree Histogram : Number of nodes with a specific degree
**Degree Histogram : **(Why dutch low saxon has so many nodes with degree = 30?)
There are so many articles with a node degree = 30 because the Dutch Low-Saxon Wikipedia has many articles for all years/numbers from 100 B.C. till 400 A.D. and these articles are very connected to each other (many links to and from the article to other same articles)
Entire Romansh Wikipedia graph
Neighboors of the node (Nederlaand) for Dutch Low Saxon
Neighboors of the node (Svizra) for Romansh
-
There are many similarities at the content the three Wikipedia but Dutch and flemish Wikipedia have a wider variety of articles
-
The two larger Wikipedia (dutch & flemish) have smaller clustering coefficient and smaller average shortest path length
-
The degree rank plot and the degree histogram is similar on all three wikipedia