Dataset chapter #328

Merged
merged 7 commits into from
Sep 25, 2024

Conversation

suvayu
Member

@suvayu suvayu commented May 29, 2024

  • Remove trivial database section from Python guide

  • Add a chapter on datasets

  • I followed the CONTRIBUTING guidelines.

Below, describe what this Pull Request adds:

This PR removes the database section from the Python guide (as
discussed in #316) and introduces a new chapter on handling datasets.
The chapter discusses local databases and other data processing
libraries, along with their respective trade-offs.

suvayu and others added 2 commits May 29, 2024 13:22
discuss trade-offs between:
- local databases like SQLite & DuckDB
- data processing libraries like Pandas, Vaex, & Polars

Co-authored-by: Flavio Hafner
@suvayu suvayu requested a review from egpbos May 29, 2024 16:12
@maltelueken
Member

Nice! I have also used DuckDB in combination with dplyr in R, so I might add something about using databases in R to the R language guide.

@suvayu
Member Author

suvayu commented May 30, 2024

Hi @maltelueken, that would be amazing! This also addresses the last point in the DuckDB part about combining with other tools. We were also lacking R experience, so we couldn't comment on R libraries.

@egpbos egpbos requested a review from Morrizzzzz June 19, 2024 08:29
@egpbos egpbos mentioned this pull request Jun 23, 2024
@bouweandela
Member

@Morrizzzzz Would you be interested and have time to review this?

best_practices/datasets.md (outdated, resolved)
@recap
Member

recap commented Sep 5, 2024

The chapter could be more about data engineering, i.e., how to use these tools, or best practices for ETL pipelines.

@egpbos
Member

egpbos commented Sep 6, 2024

@recap do you have some resources to link to on data engineering and/or ETL pipelines? Sounds like a nice addition (for a new PR). We should try to restrict it to techniques/concepts we actually (can) use in projects. I think you have done some of that, no?

@egpbos
Member

egpbos commented Sep 6, 2024

Also, @recap, your suggested additions sound good, but did you also review what was already in the PR and whether it makes sense? Then we can merge this PR as it is now and do your additions in a follow-up PR (or quickly add them to this PR if you want; I think @suvayu is on holiday anyway).

best_practices/datasets.md (outdated, resolved)
best_practices/datasets.md (outdated, resolved)
@egpbos
Member

egpbos commented Sep 25, 2024

Thank you so much @suvayu & @f-hafner for taking this initiative and @recap for the great review and additions.

... One final thing before merging is to add it to the sidebar menu, though :) I'll do that right now...

@egpbos egpbos self-requested a review September 25, 2024 08:18
Member

@egpbos egpbos left a comment

Awesome addition, thanks all!

@egpbos egpbos merged commit e98e815 into NLeSC:main Sep 25, 2024
1 check passed
5 participants