adding mkdocs-caption dependency
Sahil590 committed Aug 12, 2024
1 parent 26850b0 commit f03e342
Showing 5 changed files with 190 additions and 20 deletions.
30 changes: 12 additions & 18 deletions docs/posts/pydata_london_2024.md
@@ -25,19 +25,17 @@ With Artificial Intelligence, in particular Large Language Models (LLMs) being a

## Open source software and community building

-![Keynote by Tania Allard at PyData London 2024](images/pydata_london_2024/tania_keynote.png){: style="height:300px;width:auto"}
-<!-- markdownlint-disable-next-line MD036 -->
-*Source: Keynote by Tania Allard at PyData London 2024*
+![Keynote by Tania Allard at PyData London 2024](images/pydata_london_2024/tania_keynote.png)
+Figure: *Source: Keynote by Tania Allard at PyData London 2024*

A few talks and sessions at PyData touched on Open Source Software. Tania Allard presented a [keynote](https://www.youtube.com/watch?v=9AuuhrQDv0E&list=PLGVZCDnMOq0rrhYTNedKKuJ9716fEaAdK&index=29) on “[The art of building and sustaining successful OSS tools and infrastructure](https://speakerdeck.com/trallard/2024-pydata-lndn)” discussing factors that contribute to an open-source project’s success and sustainability. She also touched upon how to empower developers, users, and maintainers in a sustainable way and not at the expense of the open-source ecosystem.

Cheuk Ting Ho led an interesting unconference-style discussion on “How to define open source AI”. Participants walked through the Open Source Initiative’s conversation [explaining the concept of data information](https://discuss.opensource.org/t/explaining-the-concept-of-data-information/401) and shared their opinions and questions on it. It was a healthy discussion on Data Information as defined in the draft [Open Source AI](https://opensource.org/blog/open-source-ai-definition-weekly-update-june-17) definition: “Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data.”

There was also another lightning talk presented on [open source science](https://www.opensource.science/), a NumFOCUS initiative connecting scientists and OSS developers.

-![Deb Nicholson’s talk at PyData London 2024](images/pydata_london_2024/deb_open_source_leadership.png){: style="height:300px;width:auto"}
-<!-- markdownlint-disable-next-line MD036 -->
-*Source: Deb Nicholson’s talk at PyData London 2024*
+![Deb Nicholson’s talk at PyData London 2024](images/pydata_london_2024/deb_open_source_leadership.png)
+Figure: *Source: Deb Nicholson’s talk at PyData London 2024*

Deb Nicholson presented her thoughts on “[Open source leadership: what to give away and what to bring in](https://www.youtube.com/watch?v=qqZP7OBTL70&list=PLGVZCDnMOq0rrhYTNedKKuJ9716fEaAdK&index=48)”. She provided guidance on steps that open source leaders can take to balance the tasks they take on. For example, maintainers or dedicated resources can be responsible for the project’s admin work, whereas tasks that have a relaxed timeline and are more enjoyable might be delegated to volunteers. Work that needs constant attention (say, 30-40 hours per week) or involves security risks should be done by dedicated staff. As open source projects evolve, they should look for strategies and action plans to reallocate their work in a sustainable way.

@@ -49,9 +47,8 @@ Due to the nuance and relative complexity of each application, many ML developer

For streaming and aggregation, attendees were spoilt for choice for low-latency data streaming solutions, with [Bytewax](https://bytewax.io/), [Hopsworks](https://www.hopsworks.ai/) and [CSP](https://docs.cloudera.com/csp-ce/latest/index.html) all presenting solutions. These projects provide reusable building blocks for integrating with various real-time (and offline, for model training) data sources, performing efficient and customisable preprocessing, and presenting to the final data sink as aggregated time-synchronised dataframes.

-![Dask documentation](images/pydata_london_2024/dask_document.png){: style="height:400px;width:auto"}
-<!-- markdownlint-disable-next-line MD036 -->
-*Source: [Dask documentation](https://docs.dask.org/en/stable/?wvideo=l9sgt2saht)*
+![Dask documentation](images/pydata_london_2024/dask_document.png)
+Figure: *Source: [Dask documentation](https://docs.dask.org/en/stable/?wvideo=l9sgt2saht)*

As for parallel processing and analysis, the [Dask](https://www.dask.org/) team showcased their dataframe library, which is similar to pandas but can offload processing to multiple nodes on, say, an HPC cluster. With significant performance improvements and a dataframe-like interface, combined with the real-time streaming options above, real-time data analysis is seemingly easier than ever.

@@ -63,27 +60,24 @@ Crafting, calibrating and evaluating models for now-casting and forecasting as w

For Bayesian enthusiasts, a talk and a hackathon by two of the core developers of [PyMC5](https://www.pymc.io/welcome.html), Chris Fonnesbeck and Thomas Wiecki, were useful and illustrated how [Bayesian computing](https://www.youtube.com/watch?v=99Rmi_CjqME&list=PLGVZCDnMOq0rrhYTNedKKuJ9716fEaAdK&index=12) can be facilitated within the Python framework. There are plenty of well-documented Jupyter notebooks on their [website](https://www.pymc.io/projects/docs/en/stable/learn/core_notebooks/pymc_overview.html) with [examples](https://www.pymc.io/projects/examples/en/latest/gallery.html) covering regression, model selection, factor analysis and reliability statistics.

-![Figure obtained from PyMC documentation](images/pydata_london_2024/pymc_plot.png){: style="height:300px;width:auto"}
-<!-- markdownlint-disable-next-line MD036 -->
-*Source: Figure obtained from [PyMC documentation](https://www.pymc.io/projects/examples/en/latest/introductory/api_quickstart.html)*
+![Figure obtained from PyMC documentation](images/pydata_london_2024/pymc_plot.png)
+Figure: *Source: Figure obtained from [PyMC documentation](https://www.pymc.io/projects/examples/en/latest/introductory/api_quickstart.html)*

Furthermore, there was a talk on [Synthetic Data in Financial Time Series](https://www.youtube.com/watch?v=VXbRP2a0ABg&list=PLGVZCDnMOq0rrhYTNedKKuJ9716fEaAdK&index=39), where [Generative Adversarial Networks (GANs)](https://en.wikipedia.org/wiki/Generative_adversarial_network) were applied to model the evolution of price time series for two types of crude oil. Various commonly used open-access financial datasets were mentioned. This was followed by an introduction to a machine-learning architecture based on TensorFlow that spans generator, discriminator, encoder and recovery networks. The network was then trained to generate statistically accurate time series, which is useful when data availability, privacy or ethical considerations are a concern.

## The Python ecosystem

Putting the 'Py' in PyData, some talks covered more general aspects of the Python ecosystem. In a humorous talk, Quazi Nafiul Islam gave an overview of the [evolutionary saga of Python packaging](https://youtu.be/95pi4210XAM?si=dY-6IBxAfZCuDojD), from the origins of [Eggs](https://python101.pythonlibrary.org/chapter38_eggs.html) to sophisticated modern tools such as [poetry](https://python-poetry.org/), [PDM](https://pdm-project.org/en/latest/) and [UV](https://astral.sh/blog/uv). He discussed some of the challenges particular to Python packaging, including combining source code and binaries and ensuring cross-platform compatibility. We certainly came away appreciating the progress that has been made with modern tools, and feeling lucky to be Python developers now rather than 20 years ago!

-![Quazi Nafiul Islam’s talk at PyData London 2024](images/pydata_london_2024/quazi_talk.png){: style="height:300px;width:auto"}
-<!-- markdownlint-disable-next-line MD036 -->
-*Source: Quazi Nafiul Islam’s talk at PyData London 2024*
+![Quazi Nafiul Islam’s talk at PyData London 2024](images/pydata_london_2024/quazi_talk.png)
+Figure: *Source: Quazi Nafiul Islam’s talk at PyData London 2024*

Particularly exciting to members of the RSE team, Andy Fundinger gave an [overview of the Python package 'hypothesis'](https://youtu.be/NL7-eNPr_oI?si=WI7II3v5mt7Wz-b4), a [package](https://hypothesis.readthedocs.io/en/latest/) for property-based testing: developers state properties that their functions should satisfy, and the package automatically generates test inputs covering a wide range of values and edge cases. We're excited to implement this tool in current and future projects to increase the robustness of our software.
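
To give a flavour of the idea, here is a minimal hand-rolled sketch of a property check using only the standard library. It is not the Hypothesis API: the functions `run_length_encode`, `run_length_decode` and `check_round_trip` are hypothetical names made up for this illustration. Hypothesis replaces the hand-written random generator below with declarative strategies and a `@given` decorator, and additionally shrinks failing inputs to a minimal counterexample.

```python
import random


def run_length_encode(text: str) -> list[tuple[str, int]]:
    """Collapse runs of repeated characters into (char, count) pairs."""
    encoded: list[tuple[str, int]] = []
    for char in text:
        if encoded and encoded[-1][0] == char:
            # Same character as the previous run: extend that run.
            encoded[-1] = (char, encoded[-1][1] + 1)
        else:
            encoded.append((char, 1))
    return encoded


def run_length_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand (char, count) pairs back into the original string."""
    return "".join(char * count for char, count in pairs)


def check_round_trip(trials: int = 500, seed: int = 0) -> int:
    """Property: decoding an encoded string returns the original string.

    Inputs are generated randomly (illustrative stand-in for a
    property-based testing library). Returns the number of trials run.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randint(0, 20)
        # A small alphabet makes repeated runs (the interesting case) likely.
        text = "".join(rng.choice("ab ") for _ in range(length))
        assert run_length_decode(run_length_encode(text)) == text, text
    return trials
```

Calling `check_round_trip()` exercises the round-trip property on 500 generated strings, including the empty string when the sampled length is zero, rather than on a handful of hand-picked cases.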

## Honourable mentions

-![John Sandall’s talk at PyData London 2024](images/pydata_london_2024/john_talk.png){: style="height:300px;width:auto"}
-<!-- markdownlint-disable-next-line MD036 -->
-*Source: John Sandall’s talk at PyData London 2024*
+![John Sandall’s talk at PyData London 2024](images/pydata_london_2024/john_talk.png)
+Figure: *Source: John Sandall’s talk at PyData London 2024*

There were some unique talks that discussed less standard topics in the ML and Data Science space, including a talk by John Sandall about [creating a folk music recommendation system](https://www.youtube.com/watch?v=kifvWDrld2s) using an online sheet music database. The presenter played the violin to demonstrate the types of folk songs the clustering method found in each category, and used LLMs to algorithmically name the clusters.

4 changes: 4 additions & 0 deletions docs/stylesheets/extra.css
@@ -1,3 +1,7 @@
:root {
  --md-primary-fg-color: #0000CD;
}
+.md-content__inner img {
+  box-sizing: content-box;
+  width: 80%;
+}
4 changes: 3 additions & 1 deletion mkdocs.yml
@@ -28,5 +28,7 @@ plugins:
categories:
- categories
- tags
- caption:
additional_identifier: []
markdown_extensions:
- attr_list
- md_in_html