Google Summer of Code (GSoC) 2021: Cuneiform Digital Library Initiative (CDLI) ideas list


On this page you will find project ideas for applications to the Google Summer of Code 2021. We encourage creativity; applicants are welcome to propose their own ideas in the CDLI realm.

See our issues tracker for tasks: https://gitlab.com/cdli/framework/issues
See our codebase here: https://gitlab.com/cdli/framework

To join the community, visit here.

If the main mentor on your preferred project idea is not a technical person but a domain specialist, a technical co-mentor, or mentors of the other running projects, will support you and evaluate your code.

About the Cuneiform Digital Library Initiative

The Cuneiform Digital Library Initiative (CDLI) is driven by the mission to enable collection, preservation and accessibility of information—image files, textual annotation, and metadata—concerning all ancient Near Eastern artifacts inscribed with cuneiform. With over 334,000 artifacts in our catalogue, we house information about approximately two-thirds of all sources from cuneiform collections around the world. Our data are publicly available at https://cdli.ucla.edu, and our audience comprises primarily scholars, students, museum staff, and informal learners.

Over its long history, CDLI has become integral to the fabric of the Assyriological discipline itself. It is used as a tertiary source, a data hub and a research data repository. Based on Google Analytics reports, CDLI's website is visited on average by 3,000 monthly users in 10,000 sessions and 100,000 pageviews; 78% of these users are recurring visitors. The majority of users access CDLI collections and associated tools seeking information about a specific text or group of texts; insofar as these are available to us, CDLI has authoritative records about where the physical document is currently located, when and where it was originally created and deposited in ancient times, what is inscribed on the artifact and where it has been published. Search results display artifact images and associated linguistic annotations when available.

CDLI is a collaboration of developers, language scientists, machine learning engineers and cuneiform specialists who are creating a software infrastructure to process and analyze curated data. To this effect, we are actively developing two projects: Framework Update and Machine Translation and Automated Analysis of Cuneiform Languages. As part of these endeavors, we are building a natural language processing platform to empower specialists of ancient languages to undertake translation of Sumerian language texts, thus enabling data-driven study of the languages, culture, history, economy and politics of ancient Near Eastern civilizations. In this platform we are focusing on data normalization using Linked Open Data to foster best practices in data exchange, standardization and integration with other projects in digital humanities and computational philology.

The CDLI offers catalogue and text data that are downloadable from our GitHub data repository, image files that can be harvested online or obtained on demand (including higher-resolution images, for research purposes only), and textual annotations that are currently being prepared by the MTAAC research team.

List of potential project ideas

Description of potential projects

Transliterations editor and API

Mentor(s): Ilya Khait

JTF is a new JSON-based format for transliterations that aims to make cuneiform textual data easily accessible for processing and modification. It comes with a NodeJS API, jtf-lib, which provides an ATF format parser and converter, a CRUD interface and a module for sign list operations, jtf-signlist. A React web application, uqnu, is being developed to import transliterations from files and to validate, edit and export them.

For motivation and further details, check this IDCS blog post.
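
As a rough illustration of what a JSON-based transliteration record enables, here is a minimal Python sketch. The field names below are invented for illustration only; the actual JTF schema is defined by jtf-lib:

    import json

    # Hypothetical JTF-like record; the real schema is defined by jtf-lib.
    jtf_doc = json.loads("""
    {
      "id": "P123456",
      "surfaces": [
        {"name": "obverse",
         "lines": [
           {"number": "1", "content": ["ensi2", "lagasz{ki}-ke4"]},
           {"number": "2", "content": ["e2", "mu-na-du3"]}
         ]}
      ]
    }
    """)

    # CRUD-style read: list every line of the transliteration.
    for surface in jtf_doc["surfaces"]:
        for line in surface["lines"]:
            print(surface["name"], line["number"], " ".join(line["content"]))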

The task is to integrate this infrastructure into CDLI's framework and allow crowdsourcing and individual work on texts.

Outcomes:

  • Framework integration:
    • The JTF API and web application run in a framework docker container.
    • The web application is accessible via a framework URL.
  • CDLI database:
    • Stores JTF data.
    • Has a version control system that efficiently stores changes to transliterations (one possible storage scheme is sketched after this list).
  • Framework API integration:
    • JTF API integration with the framework's public API.
    • JTF output function in the API.
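
One possible storage scheme for transliteration versioning, sketched under the assumption that revisions are kept as diffs against the previous text (a sketch only, not a mandated design):

    import difflib, hashlib

    def make_revision(old_text, new_text):
        # Store only the unified diff between revisions, keyed by content
        # hashes, instead of a full copy of every version of the text.
        diff = "".join(difflib.unified_diff(
            old_text.splitlines(keepends=True),
            new_text.splitlines(keepends=True),
            fromfile="previous", tofile="current"))
        return {"parent": hashlib.sha1(old_text.encode()).hexdigest(),
                "id": hashlib.sha1(new_text.encode()).hexdigest(),
                "diff": diff}

    print(make_revision("1. ensi2\n", "1. ensi2 lagasz{ki}-ke4\n")["diff"])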

Bonus:

  • CDLI Crowdsourcing functions:
    • Users can add or modify transliteration data and send it for approval directly from the application.
    • Admins (CDLI authorisation required) can use the application to check and approve changes.
  • Application development:
    • Plugins and add-ons, e.g. to annotate linguistic features or images, to compare and highlight changes, etc.
    • Standalone desktop / mobile version for different platforms.

Skills required:

  • CakePHP
  • Docker
  • NodeJS
  • Webpack
  • React & Redux

GitHub links:

Aggregating components into a phrase-based MT system - Apertium

Mentor(s): Rachit Bansal, supported by Christian Chiarcos

Current statistical and neural models face sparsity issues. For low-resource languages, symbolic machine translation is a viable alternative, and with the annotations and annotation tools for Sumerian developed in recent years by CDLI/MTAAC, this now becomes a realistic possibility.

Proposal: Develop a Sumerian-English language pair of the Apertium machine translation platform (https://www.apertium.org/).

Apertium is relatively widely used for this purpose; it is free software, released under the terms of the GNU General Public License.
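
Once a language pair is installed, Apertium is invoked as a pipeline that reads text on standard input. A minimal Python sketch of how the framework could call it; the sum-eng pair named here is the goal of this project and does not exist yet (substitute any installed pair, e.g. eng-spa, to test the call):

    import subprocess

    def translate(text, pair="sum-eng"):
        # Pipe text through an installed Apertium translation pipeline.
        result = subprocess.run(["apertium", pair], input=text,
                                capture_output=True, text=True, check=True)
        return result.stdout.strip()

    # print(translate("ensi2 lagasz{ki}-ke4 e2 mu-na-du3"))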

Starting points:

For this project, it is possible to reach out to the Apertium community as well (also active in GSoC):

Requirements:

  • background in NLP or linguistics
  • programming in Python, Java or Perl
  • XML

Dependency parser for Sumerian

Mentor(s): Christian Chiarcos, Max Ionov

As a result of the MTAAC project, we have morphological annotations and syntactic dependencies (and baseline parsers ["pre-annotators"] for them), but morphological analysis is performed independently of (and before) syntactic parsing. However, case markers are sensitive to phrase structure, so both should be predicted jointly.

Proposal: Create a dependency parser that performs syntactic and morphological annotation in parallel. This involves two main tasks: Data preparation and machine learning.

Requirements:

  • programming in Python or Java

Detailed description: The goal is to develop a parser that performs joint morphological and syntactic parsing. Because characters in the transliteration do not directly correspond to morphemes, morphological annotations are added as empty nodes in the syntactic parse. This has the advantage of supporting the annotation of texts where morphology is systematically omitted.

Steps:

  • Take the current MTAAC dependencies and transform them to phrase structures. Create an NP node for every noun that has a dependent and a CL node for every verb that has a dependent. Mark the head with the relation HD, and link all dependents to the newly created phrase by means of their dependency relation. (A minimal conversion sketch follows these steps.)

    Example:

      # tr.en: ruler of Lagasz built his E-dura
      1       ...-x-x _[_]    u       u       7       dep     _
      2       ensi2   ensi2[ruler]    N       N       7       ERG     _
      3       lagasz{ki}-ke4  lagasz{ki}[1]   SN      SN.GEN.ERG      2       GEN     _
      4       e2      e2[house]       N       N       6       nmod       _
      5       ansze   _       N       N       6       nmod       _
      6       dur9{ur3}-ka-ni e2-{ansze}du24-ur3[1]   TN      TN.3-SG-H-POSS.ABS      7       ABS     _
      7       mu-na-du3       du3[build]      V       VEN.3-SG-H.DAT.3-SG-H-A.V.3-SG-P        0       root    _
    

    Expected result:

      # tr.en: ruler of Lagasz built his E-dura
      (CL
        (u-dep       ...-x-x)
        (NP-ERG
          (N-HD      ensi2)
          (SN-GEN    lagasz{ki}-ke4) )
        (NP-ABS
          (N-nmod    e2)
          (N-nmod    ansze)
          (TN-HD     dur9{ur3}-ka-ni) )
        (V-HD        mu-na-du3) )
    
  • Integrate the morphosyntactic analysis into the tree structure. Every .-separated element of the morphological analysis should become a separate node in the phrase node that represents the original token. The word itself is treated like an opaque string and positioned in accordance with its POS tag in the morphological analysis and marked as the head. All other morphology nodes receive the empty string (*).

    Expected result:

      # tr.en: ruler of Lagasz built his E-dura
      (CL
        (u-dep          ...-x-x)
        (NP-ERG
          (N-HD          ensi2)
          (SN-GEN                           # phrase = original word 
            (SN-HD       lagasz{ki}-ke4)    # insert token to the position of its POS
            (GEN         *) 
            (ERG         *) ) )             # segmentation follows original annotation 
        (NP-ABS
          (N-nmod
            (N-HD         e2) )              # no additional morphology given
          (N-nmod    
            (N-HD         ansze) )           # no additional morphology given
          (TN-HD
            (TN-HD        dur9{ur3}-ka-ni)   # insert token into the position of its POS
            (3-SG-H       *)
            (POSS         *)
            (ABS          *) ) )
        (V-HD
          (VEN            *)
          (3-SG-H         *)
          (DAT            *) 
          (3-SG-H-A       *)
          (V-HD           mu-na-du3)          # insert token into the position of its POS
          (3-SG-P         *) ) )
    

    Remark: This transformation should strictly mirror the original format. A more correct analysis would be to attach the ERG node to NP-ERG, but such linguistically informed interpretation is beyond the scope of this task.

  • Convert to a dependency representation, with HD-marked nodes as heads and terminal nodes as tokens

    Expected result

      1  ...-x-x         u        16 dep
      2  ensi2           N        16 ERG
      3  lagasz{ki}-ke4  SN       2  GEN
      4  *               GEN      3  _
      5  *               ERG      3  _
      6  e2              N        8  nmod
      7  ansze           N        8  nmod
      8  dur9{ur3}-ka-ni TN       16 ABS
      9  *               3-SG-H   8  _
      10 *               POSS     8  _
      11 *               ABS      8  _
      12 *               VEN      16 _
      13 *               3-SG-H   16 _
      14 *               DAT      16 _ 
      15 *               3-SG-H-A 16 _
      16 mu-na-du3       V        0  root
      17 *               3-SG-P   16 _
    
  • Replicate (and possibly improve) a state-of-the-art approach to dependency parsing with empty categories (see https://www.aclweb.org/anthology/K17-1035/) on the resulting parses.
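
A minimal Python sketch of the first transformation step, run on the example sentence above (illustrative only; a full implementation would also insert the morpheme nodes of the second step):

    # Step 1 sketch: dependencies -> bracketed phrase structure.
    # Rows use the CoNLL-like columns of the example above:
    # (id, form, pos, morph, head, deprel).
    rows = [
        (1, "...-x-x",         "u",  "u",                  7, "dep"),
        (2, "ensi2",           "N",  "N",                  7, "ERG"),
        (3, "lagasz{ki}-ke4",  "SN", "SN.GEN.ERG",         2, "GEN"),
        (4, "e2",              "N",  "N",                  6, "nmod"),
        (5, "ansze",           "N",  "N",                  6, "nmod"),
        (6, "dur9{ur3}-ka-ni", "TN", "TN.3-SG-H-POSS.ABS", 7, "ABS"),
        (7, "mu-na-du3",       "V",  "VEN.3-SG-H.DAT.3-SG-H-A.V.3-SG-P", 0, "root"),
    ]

    children = {}                       # head id -> dependent rows
    for r in rows:
        children.setdefault(r[4], []).append(r)

    def phrase(row):
        tid, form, pos, _, _, rel = row
        deps = children.get(tid, [])
        if not deps:                    # leaf: keep POS and incoming relation
            return f"({pos}-{rel} {form})"
        label = "CL" if pos == "V" else f"NP-{rel}"   # CL for verbs, NP for nouns
        head = f"({pos}-HD {form})"     # mark the head token with HD
        parts = ([phrase(d) for d in deps if d[0] < tid] + [head] +
                 [phrase(d) for d in deps if d[0] > tid])
        return "(" + label + " " + " ".join(parts) + ")"

    print(phrase(children[0][0]))       # root clause, matching the expected result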

Update on available training and test data

We now provide training, development and test data for the GSoC 2021 task for conjoint morphological and syntactic parsing under https://github.com/cdli-gh/conjoint-parsing. Note that the expanded data (with one morpheme per line) is slightly different from the format described above:

  • Instead of creating empty nodes, we repeat the full word for every morpheme.
  • We include placeholders for all morphemes ("slots") that are possible for a given part of speech, including those that are empty. These are marked by the entry in their part-of-speech column (N1..N6 for nouns and proper nouns, V1..V16 for verbs); the morphological head carries the original part of speech (in place of N1 and V12) and the syntactic dependencies. (A sketch of this expansion follows the list.)
  • The motivation for providing the data in this way as well is that we can experiment with standard dependency parsers (in addition to those that support empty category parsing). Again, this setting is slightly different from the one sketched above, as off-the-shelf parsers can be applied directly.
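
A Python sketch of the expansion, under the assumption of a simple slot table; the authoritative format and slot assignments are documented in the conjoint-parsing repository, and the GEN/ERG slot numbers below are illustrative:

    # Expand one token into the one-morpheme-per-line slot format:
    # nouns get slots N1..N6, verbs V1..V16; the slot holding the word
    # itself (N1 / V12) carries the original part of speech instead.
    def expand(form, pos, filled):            # filled: {slot number: label}
        prefix, n_slots, head_slot = \
            ("V", 16, 12) if pos == "V" else ("N", 6, 1)
        rows = []
        for i in range(1, n_slots + 1):
            tag = pos if i == head_slot else filled.get(i, f"{prefix}{i}")
            rows.append((form, tag))          # the full word is repeated
        return rows

    # Hypothetical slot assignment for lagasz{ki}-ke4 (SN.GEN.ERG):
    for form, tag in expand("lagasz{ki}-ke4", "SN", {5: "GEN", 6: "ERG"}):
        print(form, tag)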

Proposals can start either from the original task description above (and work along these lines, including expanding the data) or from the data and instructions in this GitHub directory (e.g., also using its expanded version). In either case, the base data for three subcorpora is now under data/classical and the associated Readme contains some instructions on evaluation and on how this data was compiled. Feel free to reach out to us via the Slack channel #annotation.

Journals Open Review Workflow and Integration

Mentor(s): Nisheal John

Topics: Full Stack, PHP, Docker & Integrations, Publishing, Reviewing, Journals.

This project concerns the CDLI journals workflow before we are ready to publish an article. Open Journal Systems (OJS) should be implemented in our stack as a standalone app or integrated into the framework.

The student should understand the overall workings of the journal and contribute to the open issues concerning the journals dashboard.

The workflow should handle:

  • Submission of articles for the journals
  • All steps of the (open) review (author and reviewers known to each other) process
  • Document and publish an “article history”, listing the original submission date (with possible access to the submitted version), review date, major revisions, first publication date, and any later revisions. (See for example https://www.springer.com/gp/livingreviews?countryChanged=true)
  • The needed Docker implementation for the project.

For examples of a true open-access, open-review journal, see https://www.solid-earth.net/5/425/2014/se-5-425-2014.html and https://www.bmj.com/content/372/bmj.m4903 (which also has “responses” and a long peer-review process).

Additionally, further integration with the framework is required:

  • Reviewers should be linked with their author profile
  • Integrating endorsement of reviewers with published articles
  • Integrating (some) review comments and final reviewer comments with published articles entries

Also look into

  • DOI, ORCID, and Publons integration
  • Citation index (metrics)

Our schedule for moving to Open Peer Reviews

  • Make the first submission available
  • Then either make anonymised reviews public but not the names of the reviewers, or make the names of the reviewers public but not the reviews
  • Make a full trail of peer review public

Do we also want public comments from logged-in individuals? See https://arxiv.org/
The software we intend to use: https://pkp.sfu.ca/ojs/ojs_download/

Skills

  • Full-stack PHP & JavaScript.
  • Understanding of Docker, Linux.
  • Understanding of article publishing & review.

An initial contribution to the project and framework is required; please feel free to ping the mentors via the #journals Slack channel.

Digital Library Management

Mentor(s): Jacob L. Dahl
Topics: Full stack, PHP, image management

This project is about preparing a dashboard through which an admin can view the visual assets of the digital library for each artifact, but also add, edit and delete images, using our archival images as a source of better-quality masters from which to prepare their web counterparts. Access to images should also be managed there (some images are not public).

Objectives

Primary

  • Develop a dashboard and a series of workflows so admins can manage the digital library (dl)
  • Populate the tables with information from the current dl
  • Keep the system flexible so we can create an extension for crowdsourcing of images in the future

Secondary

  • Assist in setting up an automated grab of archival and raw images from the VM to the archival server. These images will have been uploaded through the images manager or the MinIO interface.
  • Set up granular access to images at the image level (instead of at the artifact level)

Operations handled by the images manager

  • Create web versions of images for an artifact using the archival version (see the sketch after this list)
  • Change the type of an image (main, detail, envelope, lineart, lineart detail, etc.)
  • Associate existing images with a different artifact (when an error was made)
  • Associate images that are currently associated with no artifact with the correct artifact
  • Overview of assets for one artifact
  • Overview of assets for all artifacts (stats, also per collection)
  • Granular access for images management
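
A minimal Python sketch of the first operation, deriving a web JPEG from an archival TIFF with Pillow; the paths and the pixel bound are illustrative, not CDLI policy:

    from PIL import Image   # Pillow

    def make_web_version(archival_tiff, web_jpeg, max_px=1200):
        with Image.open(archival_tiff) as im:
            im = im.convert("RGB")            # JPEG cannot store alpha/16-bit
            im.thumbnail((max_px, max_px))    # downscale, keeping aspect ratio
            im.save(web_jpeg, "JPEG", quality=85, optimize=True)

    make_web_version("archival/P123456.tif", "web/P123456.jpg")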

Skills

  • Familiarity with TIFF / JPG and vector handling libraries
  • Familiarity with managing larger sets of files
  • Understanding of Docker, Linux
  • PHP / CakePHP / HTML / CSS
  • Able to communicate complex ideas to non-specialists

Discovery search and advanced search features

Mentor(s): Vedant Wakalkar
Skills: CakePHP, Elasticsearch, JS, HTML.

Objectives

Main objectives

  • Enable inscription search with sign value permutation
  • Add the "IDs" search field to both Simple & Advanced Search.
  • "Keywords" search in Simple (default) & Advanced Search
  • Filter search results by RTI, image, transliteration and 3D data availability.

Secondary objectives

  • Improve backend core

    1. Port Elasticsearch to the CakePHP Elasticsearch plugin
    2. On successful implementation of the plugin, pagination can be updated to use \Cake\Datasource\Paginator.
    3. Port requests to Elasticsearch from cURL to HttpClient.
  • Search inscriptions with sign value permutation (see the sketch after this list)

    1. Provide a switch for the user to enable the feature
    2. Restrict the sign values list to values with no associated periods and to those with the period(s) selected by the user
    3. On add or edit, save a sanitized version of an inscription as a list of sign names without word boundaries
    4. Sanitize search input to remove word boundaries and convert it to sign names, to search against the sign name list in inscriptions
    5. In search results, match sign names with sign values and highlight sign values in the ATF display
  • Input flexibility enhancement

    1. Convert inscription search input to C-ATF
    2. Convert metadata fields search input to UTF-8
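
A Python sketch of steps 3-4 of the sign value permutation feature: both the stored inscription and the search input are normalised to sign names without word boundaries, so matching happens at the sign level. The value-to-name table here is a tiny stand-in for the real CDLI sign list:

    SIGNLIST = {"du3": "KAK", "e2": "E2"}      # hypothetical sign list entries

    def to_sign_names(inscription):
        signs = []
        for word in inscription.split():       # word boundaries are discarded
            for value in word.split("-"):      # split the signs within a word
                signs.append(SIGNLIST.get(value, value.upper()))
        return signs

    stored = to_sign_names("mu-na-du3 e2")     # saved on add/edit (step 3)
    query  = to_sign_names("du3 e2")           # sanitized input (step 4)
    print(any(stored[i:i + len(query)] == query
              for i in range(len(stored))))    # True: sign-level match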

Reference Links

(Refer to the comments for more clarity)

Design challenge

Mentor(s): Samarth Sharma, Amaan Iqbal

We have implemented parts of our new design (see the Figma link below), but our full stack developers have prepared modules whose implemented design diverges from it. We need to re-unify the design and simplify the interface overall. You can decide on your own objectives for this project, keeping in mind that our goal this summer is getting as close as possible to a beta version of the framework. Think about how to use your 175 hours wisely for maximum enhancements. The public-facing interface has to be as accessible (WCAG AA) as possible, and fallbacks need to be designed for those without JavaScript. In the administrative pages, JS can be used freely to increase functionality.

Objectives

  • Plan and implement the admin dashboard (https://gitlab.com/cdli/framework/-/issues/75)
    • Not all logged-in users have access to the admin dashboard
    • Not all users with dashboard access can see and use all admin functionalities
  • Merge the admin and public templates
    • We currently have menu discrepancy problems
  • Unify the presentation of all pages
    • View / Edit / Index pages for all entities should be similar
    • Remove the related actions block and add a local sub-menu instead
  • Enhance the display of a single artifact & expanded search results
  • Design the add and edit artifact pages (front end only, back end bonus)

Skills

  • Pro at HTML and CSS
  • Bootstrap
  • Accessibility
  • Javascript

On Zeplin: https://zpl.io/2GGKPwm
On Figma: https://www.figma.com/file/LA1138Ao8EqY1YcG5NLKhiZy/CDLI?node-id=0%3A1
Issues: https://gitlab.com/cdli/framework/-/issues?label_name%5B%5D=dev%3Afront+end

Try these issues while preparing your proposal:
https://gitlab.com/cdli/framework/-/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=good%20first%20issue&label_name[]=dev%3Afront%20end

Ancient compositions scores management

Mentor(s): Christie Carr
Technical mentor: Émilie Pagé-Perron

Transliterations of cuneiform texts that are witnesses to literary compositions have encoded line numbers for the associated composition. Based on those line numbers, it is possible to generate "scores", or lists of the same line of a composition as it appears across the different documents that attest it.

The current system can be found here: https://cdli.ucla.edu/tools/scores/partitur-index.html

This task entails preparing a main index page of compositions and a single-composition view page which will display the score for that particular composition and the associated translation. This requires writing a parser which will assemble composite scores based on individual texts that are marked with the associated line numbers of the composite version of the text (see the sketch below).
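
A minimal Python sketch of the score assembly, assuming witness lines have already been parsed into (text id, composite line number, transliteration) tuples; the input data here is invented for illustration:

    from collections import defaultdict

    witnesses = [
        ("P111111", "1", "ensi2 lagasz{ki}-ke4"),
        ("P222222", "1", "ensi2 lagasz{ki}"),
        ("P222222", "2", "e2 mu-na-du3"),
    ]

    score = defaultdict(list)                 # composite line -> witness lines
    for pid, comp_line, text in witnesses:
        score[comp_line].append((pid, text))

    for comp_line in sorted(score, key=int):  # one block per composition line
        print(f"l. {comp_line}")
        for pid, text in score[comp_line]:
            print(f"  {pid}: {text}")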

Description of the challenge in technical terms: https://gitlab.com/cdli/framework/issues/147

Skills & Knowledge

  • Full stack development
  • CakePHP conventions and practice
  • Interest in ancient world literature

Metadata and linguistic data processing and curation

Mentor(s): Émilie Pagé-Perron

Description

This project consists of finalizing the Cuneiform Digital Library data migration to the new framework. From flat data to complex relational data, the task entails programmatically mapping the data from the previous model to the new one, and ensuring we can convert the dataset currently in use once we are ready to move to the new framework.

In the new framework, we are using a version of the data that was converted partially automatically and partially manually. There is currently no way to update all the data from the flat database if needed, although the work does not start from scratch.

Here you can find the Python scripts that were used to migrate the data in the first place: https://gitlab.com/cdli/framework/-/tree/phoenix/feature/dbscripts/dev/db_scripts

The two databases to work from are not public; please ask on Slack for the download links.
For a CSV dump of the current public database, and to have a look at the data, see: https://github.com/cdli-gh/data

The first step would be to reverse engineer the new database and start looking at the model. The mentor will be available from now until the proposal submission deadline to answer questions.

Objectives

  • Prepare conversion scripts in PHP or Python
  • Prepare clean up rules (using regex and more complex processing)
  • Prepare mapping lists
  • Clean and convert all the data
  • Reuse our API to prepare bulk import/export in the "new" flat format
  • Check that indexes & FKs are all set and that the DB is compliant with standards
  • Bonus: prepare add and edit forms for artifacts
  • Bonus: prepare a bulk edit form for artifacts

Skills

  • MySQL & database modelling
  • Python
  • PHP / CakePHP
  • Familiarity with data types
  • String manipulation pro
  • Regex pro

Mapping

  • Prospective students can look at this outdated map https://docs.google.com/spreadsheets/d/1HL5hXRGd6GqpXch6NiTvvNZ2x7R638tqR_kCaP0Jsy8/edit?usp=sharing to get a better understanding of the data
  • Some entities in the new database will have to remain the same, as they have been curated/cleaned; e.g. collections, authors, names. In those cases, the link between entities and artifacts must be updated based on matching the entity name in the new db against the text field in the old database (see the sketch after this list).
  • The main entities (artifacts and inscriptions) should be overwritten or updated in the new framework.
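
A Python sketch of the entity re-linking described above, matching the old flat text field against curated entity names in the new database; table contents and normalisation rules here are illustrative:

    import re

    new_collections = {"British Museum, London, UK": 42}   # name -> entity id

    def normalise(name):
        return re.sub(r"\s+", " ", name.strip().lower())

    lookup = {normalise(k): v for k, v in new_collections.items()}

    old_rows = [("P123456", "British Museum,  London, UK"),
                ("P654321", "unknown private collection")]
    links, unmatched = [], []
    for artifact_id, text_field in old_rows:
        entity_id = lookup.get(normalise(text_field))
        if entity_id:
            links.append((artifact_id, entity_id))       # ready for the join table
        else:
            unmatched.append((artifact_id, text_field))  # left for manual review

    print(links, unmatched)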

Data Visualization

Mentor(s): Erica Scarpa

Description

The project aims at designing interactive visualizations based on the data available from the Cuneiform Digital Library. The data refer to ancient cuneiform tablets and their formal characteristics, such as their origin, chronology, or typology; the high number of documents (300,000+), however, prevents a human-friendly approach to the data itself. The intersection of variables such as chronology and provenience, for example, would provide invaluable insight into trends and long-term phenomena which could otherwise hardly be appreciated. This kind of intersection can be displayed, for example, with a hierarchical part-to-whole data visualization (a sunburst): such a visualization would offer a new perspective on the quantities and typologies of documents that were produced in specific historical periods and geographical areas. Similarly, a project can also tackle the complexity of the epigraphical record, showing the quantitative distribution of different textual typologies (poems, administrative documents, literary compositions) in different historical periods and geographical areas: this objective can be achieved with a dot plot showing how many and which textual typologies are attested in different places at different times.

Likewise, a bubble map can be designed to visualize not only the original geographical distribution of the ancient documents from a quantitative perspective, but also the distribution in modern times of the same documents across different museums and collections. Using JavaScript libraries such as d3.js or chart.js, the goal of the project is to design and deploy a visualization capable of transforming non-visual and abstract data into a dynamic representation providing insight into specific historical questions. The project shall thus consist of three phases:

  1. project design: purpose and objective of the visualization; definition of an appropriate chart type (e.g. sunburst, dot plot, bubble chart);
  2. data preparation: once purpose and objective have been defined, the data must be cleaned and prepared;
  3. creation of the data viz: from design to interactive aspects, the final step aims at preparing a completely functional data visualization.

Skills

  • JavaScript
  • HTML
  • SVG
  • CSS
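
As a concrete starting point for phase 2 (data preparation), here is a minimal Python sketch that aggregates artifact counts by period and provenience into the nested JSON a d3.js sunburst consumes; the catalogue field names used here are illustrative stand-ins:

    import json
    from collections import Counter

    artifacts = [                                   # stand-in catalogue rows
        {"period": "Ur III", "provenience": "Girsu"},
        {"period": "Ur III", "provenience": "Umma"},
        {"period": "Old Babylonian", "provenience": "Nippur"},
    ]

    counts = Counter((a["period"], a["provenience"]) for a in artifacts)
    by_period = {}
    for (period, prov), n in sorted(counts.items()):
        node = by_period.setdefault(period, {"name": period, "children": []})
        node["children"].append({"name": prov, "value": n})
    root = {"name": "artifacts", "children": list(by_period.values())}

    print(json.dumps(root, indent=2))   # feed to d3.hierarchy / d3.partition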

3D Viewer Integration

Mentor(s): Tim Collins, Sandra Woolley

https://github.com/virtualcuneiform

Description:

Skills & Knowledge:

  • JavaScript
  • Three.js
  • PHP
  • Usability testing