GSoC 2024 Projects

Jusong Yu edited this page Jan 31, 2024 · 19 revisions

Getting started with AiiDA

AiiDA is a Python framework for managing computational science workflows, with roots in computational materials science. It helps researchers manage large numbers of simulations (10k, 100k, 1M, ...) and complex workflows involving multiple executables. At the same time, it records the provenance of the entire simulation pipeline with the aim of making it fully reproducible.

AiiDA is used in research projects at universities, research institutes and companies (see SciPy 2020 talk, SciPy 2022 talk, publications, and testimonials).

To be considered as a GSoC student, we ask you to make a small pull request to aiida-core or to any active repository in the aiidateam and aiidalab organizations. This could be a simple bug fix, a documentation improvement, etc. For aiida-core, see e.g. the GitHub issues by label.

Say hi on our GSOC 2023 discussions page. (TODO: -> discourse)

Why work on AiiDA?

  • Help accelerate the transition to open (computational) science
  • Help fix the reproducibility crisis. Computational science is a good place to start.
  • Work with a team of computational scientists (mostly physics backgrounds) who are passionate about both science and coding.
    We have an active Slack workspace & biweekly developer meetings.

A background in materials science is not needed, but a basic interest in materials science topics will make things easier for you.

Project 1 - Explore the AiiDA node graph in the browser

level intermediate

Expected Size 350h

AiiDA automatically stores entities in its database and links them, forming a directed graph. This directed graph tracks the provenance of all data produced by calculations or returned by workflows. This project plans to provide a more intuitive, interactive tool for browsing AiiDA graphs in the browser. We can use an open-source node graph library (e.g. Rete, react-flow or similar) or build one from scratch. The node graph viewer will communicate with AiiDA through the REST API.

The current AiiDA Provenance Browser (e.g. the explore website) represents data nodes with circles, calculation nodes with squares and workflow nodes with diamonds. The user can get little information from these nodes. Moreover, when the user selects a new node, the browser redirects to a new page, which breaks the smooth transition from one node to another. In the new implementation, we will create a new node component with a preview that shows the basic information of the node (e.g. label, type, value). We also want to update only the nodes, rather than the whole page, when a new node is selected, so that the user can move smoothly along the provenance graph.
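To make the idea of a node preview more concrete, here is a minimal Python sketch of the data such a component might display. It is only an illustration: the `NodePreview` class, the `make_preview` helper and the shape of the input record are assumptions made for this example, not part of AiiDA or its REST API. The shapes reuse the convention of the current browser, and the `node_type` strings follow the prefix pattern AiiDA uses for its node classes. The viewer itself would most likely implement this logic in JavaScript.

```python
from dataclasses import dataclass

# Shape convention of the current provenance browser:
# data -> circle, calculation -> square, workflow -> diamond.
SHAPES = {
    "data": "circle",
    "process.calculation": "square",
    "process.workflow": "diamond",
}


@dataclass
class NodePreview:
    """Hypothetical container for what a node component could show."""

    uuid: str
    label: str
    node_type: str
    shape: str


def shape_for(node_type: str) -> str:
    """Pick a shape from the prefix of the node type string."""
    for prefix, shape in SHAPES.items():
        if node_type.startswith(prefix):
            return shape
    return "circle"  # assumption: unknown types fall back to the data shape


def make_preview(record: dict) -> NodePreview:
    """Build a preview from a node record (record layout is assumed)."""
    node_type = record.get("node_type", "")
    # Fall back to the class name when the node has no label.
    fallback_label = node_type.rstrip(".").split(".")[-1]
    return NodePreview(
        uuid=record["uuid"],
        label=record.get("label") or fallback_label,
        node_type=node_type,
        shape=shape_for(node_type),
    )


example = {
    "uuid": "hypothetical-uuid-1",
    "label": "",
    "node_type": "process.calculation.calcjob.CalcJobNode.",
}
preview = make_preview(example)
```

Keeping the preview as a small plain record means the same structure could be filled from whatever the REST API returns and handed to the rendering library unchanged.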

Expected outcomes

An AiiDA node graph viewer that:

  • allows the user to explore the AiiDA provenance dynamically, e.g. forward and backward along the provenance graph.
  • shows the input and output nodes of a selected node.
  • allows the user to preview a node.
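The navigation behaviour listed above can be sketched as a small in-memory model: links are cached on the client, and selecting a node only recomputes its inputs and outputs instead of loading a new page. This is a rough Python illustration under stated assumptions. The `ProvenanceView` class and the node identifiers are hypothetical, and in practice the links would be fetched from the REST API and the logic written in JavaScript.

```python
class ProvenanceView:
    """Hypothetical client-side cache of provenance links."""

    def __init__(self):
        self._incoming = {}  # target -> set of source nodes
        self._outgoing = {}  # source -> set of target nodes
        self.selected = None

    def add_link(self, source: str, target: str) -> None:
        """Record a directed link from source to target."""
        self._outgoing.setdefault(source, set()).add(target)
        self._incoming.setdefault(target, set()).add(source)

    def select(self, node: str) -> dict:
        """Select a node and return what the view should display."""
        self.selected = node
        return {
            "selected": node,
            "inputs": sorted(self._incoming.get(node, ())),
            "outputs": sorted(self._outgoing.get(node, ())),
        }


view = ProvenanceView()
view.add_link("structure", "calc")
view.add_link("parameters", "calc")
view.add_link("calc", "energy")

at_calc = view.select("calc")      # look at the calculation
at_energy = view.select("energy")  # step forward to one of its outputs
```

Stepping backward works the same way: selecting one of the `inputs` returned by `select` moves the view one link up the graph.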

Skills

Python, REST API, HTML, JavaScript, React.

Mentors

Project 2 - Training an LLM to generate queries from natural language prompts

level advanced

Expected Size 350h

One of the most powerful aspects of using AiiDA to run your workflows is that the automatically generated provenance can be used to flexibly query for the data that the user is interested in. However, the QueryBuilder tool designed for this purpose can be somewhat challenging to learn, and writing queries can be time-consuming even for experienced users.

This project aims to train a large language model (LLM) that lets users obtain the query they need by describing the desired data in a few sentences. Existing LLMs such as ChatGPT already do a fair job at this, but the produced code is still often incorrect or outdated. Generating a diverse dataset of query prompts and corresponding Python code, and using this data to train a dedicated LLM, will hopefully make the produced queries more accurate, creating a powerful tool for users to extract the results they are interested in.

Expected outcomes

At the end of the project, we aim to have a lightweight tool that can generate a correct QueryBuilder instance from a user prompt. This will require:

  • A database that maps natural language prompts to the corresponding queries, which can be easily and incrementally expanded.
  • An LLM trained on this database that converts prompts into a QueryBuilder instance.
  • A user-friendly interface that can be installed locally in the form of a Python package, and optionally an online tool integrated with the Materials Cloud.
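As one possible starting point for the prompt-to-query database, the pairs could be stored as JSON Lines, which makes incremental additions cheap. The sketch below is an assumption, not a decided design. The helper names (`add_pair`, `load_pairs`, `is_valid_python`) and the record fields are hypothetical, and the example query string is only illustrative. The check with `ast.parse` confirms that the code is syntactically valid Python, not that the query is correct.

```python
import ast
import io
import json


def is_valid_python(code: str) -> bool:
    """Return True if the code parses; this checks syntax only."""
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    return True


def add_pair(stream, prompt: str, query_code: str) -> bool:
    """Append one prompt/query pair as a JSON line.

    Pairs whose code does not parse are rejected.
    """
    if not is_valid_python(query_code):
        return False
    stream.write(json.dumps({"prompt": prompt, "query": query_code}) + "\n")
    return True


def load_pairs(stream) -> list:
    """Read all pairs back, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


# An in-memory stream stands in for a file on disk.
dataset = io.StringIO()
accepted = add_pair(
    dataset,
    "Find all finished calculation jobs.",
    "qb = QueryBuilder()\n"
    "qb.append(CalcJobNode, filters={'attributes.process_state': 'finished'})",
)
rejected = add_pair(dataset, "Broken example.", "qb = QueryBuilder(.append(")

dataset.seek(0)
pairs = load_pairs(dataset)
```

A stricter validator could also run each stored query against a small test profile, which would catch queries that parse but use outdated QueryBuilder arguments.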

Skills

We expect you to be familiar with object-oriented programming in Python. You need to have experience in natural language processing and know how to train a model from scratch.

Note

This project poses an exciting challenge for both students and mentors. While the AiiDA team may not have extensive experience with LLMs, we eagerly anticipate students bringing their knowledge to the table. We are ready to provide expertise for the actual queries and more, making this collaboration a dynamic and enriching opportunity.

Mentors

Mentorship

The mentors for GSOC 2024 are

Please use the GSOC 2024 discussion thread [TODO: use discourse!] to say hi and ask any questions you may have.