Skip to content

Latest commit

 

History

History
47 lines (36 loc) · 1.68 KB

README.md

File metadata and controls

47 lines (36 loc) · 1.68 KB

EGC_2020

EGC 2020 Challenge: 20 years of history for which future?

The goal of this challenge is to take stock at the evolution of the EGC community over the past 20 years and try to predict the future. The principle is to apply techniques of knowledge discovery and data mining to explain the structure and evolution.

Dataset

The data set consists of 1200 titles and abstracts from the articles published at the EGC conference between 2004 and 2018.
Fields:

  • years
  • title
  • abstract
  • authors

Pipeline

  • filter_extreme
  • tf-idf
  • LDA (Coherence Score)
  • K-Means (Silhouette scores)

Cluster (Topics) evolution / Time

  ScreenShot

Our system deducted a sharp increase in articles related to the social network analysis over the past years 20 (Label 1).
On the other hand, rule-based algorithms seem to have declined drastically (Label 6).

Evaluation (Hyper-parameters defined in the Jupiter-notebook)

The pipeline used in this project doesn't seem to find a lot of structure for one cluster (Label 9), sadly this cluster represents ~30% of our training data (Silhouette plot below).

Silhouette plot for 10 clusters

  ScreenShot

There is still room for improvement.