
Census 2020 example #430 (Open)

Azaya89 wants to merge 38 commits into main.
Conversation

@Azaya89 (Collaborator) commented Oct 17, 2024

Created a new example using the 2020 US census dataset.

@Azaya89 requested a review from hoxbro on October 17, 2024 00:34
@maximlt (Contributor) commented Oct 17, 2024

> I suspect it is due to #429 but I'm not sure how to resolve it.

You need to re-create the conda environment locally following the contributing guide.

> The test file added is a 0.1% sample of the full dataset but it is still about 8MB in size. I don't know if that is too large and should be reduced further.

It's still way too large. You should aim for the minimum dataset size possible, it's fine if it's just a few KB as long as it contains data that is representative of the whole dataset. For instance, if the code expects some data category, then it should be in the sample dataset to let the notebook run entirely.
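One way to build such a representative sample (a minimal sketch on toy data; the `race` column name and the 1% rate are illustrative, not taken from this PR) is to sample within each category, so that every value the notebook relies on survives:

```python
import pandas as pd

# Toy stand-in for the full census dataframe.
df = pd.DataFrame({
    "race": ["w", "b", "h", "a", "o"] * 2000,
    "x": range(10_000),
})

# Stratified sampling: take a small fraction of each category so that
# every category present in the full data is represented in the sample.
sample = df.groupby("race").sample(frac=0.01, random_state=0)

print(len(sample))  # 1% of 10,000 rows -> 100
print(set(sample["race"]) == set(df["race"]))  # every category survives
```

`GroupBy.sample` also preserves the original dtypes, so a categorical encoding in the full dataset carries over to the sample file.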

@maximlt (Contributor) left a review comment:

Is there an absolute need to rename the original census project census_one? Without doing anything else, this is going to break all the links to its web page and deployment.

I would also not call the new one census_two but census2020.

@Azaya89 (Collaborator, Author) commented Oct 17, 2024

> Is there an absolute need to rename the original census project census_one? Without doing anything else, this is going to break all the links to its web page and deployment.
>
> I would also not call the new one census_two but census2020.

I imagine renaming the original from census to something else makes sense, seeing as there is now more than one census notebook in the examples gallery (and possibly more in the future). However, when I tried renaming them to census2010 and census2020, the doit validate step emitted a warning that only lowercase characters and underscores are allowed in the naming. I wasn't sure ignoring that warning was a good idea, which is why I renamed both to the current names.

@maximlt (Contributor) commented Oct 17, 2024

> However, I tried renaming both to census2010 and census2020 but the doit validate step emits a warning that only lowercase characters and underscores are allowed in the naming

Sounds like a bug in the validation code, something like census2020 should be allowed.

@Azaya89 (Collaborator, Author) commented Oct 17, 2024

> You need to re-create the conda environment locally following the contributing guide.

Done. Thanks

> It's still way too large. You should aim for the minimum dataset size possible, it's fine if it's just a few KB as long as it contains data that is representative of the whole dataset. For instance, if the code expects some data category, then it should be in the sample dataset to let the notebook run entirely.

Reduced it to <1MB now.

@maximlt (Contributor) commented Oct 21, 2024

Replying to your comment elsewhere:

> Thank you. I'm still in favor of renaming the first one to census2010 though.

If you intend to rename it, then redirect links have to be set up.

Alternatively, we could just:

  • Change the title property in the project YAML to Census 2010
  • Change the notebook top-level heading to Census 2010

@Azaya89 (Collaborator, Author) commented Oct 21, 2024

> Alternatively, we could just:
>
>   • Change the title property in the project YAML to Census 2010
>   • Change the notebook top-level heading to Census 2010

I already did these in this PR. Would that be enough to differentiate both examples eventually?

@maximlt (Contributor) commented Oct 21, 2024

> Would that be enough to differentiate both examples eventually?

I think so?

@Azaya89 (Collaborator, Author) commented Oct 21, 2024

> I think so?

OK, I will revert the other renaming then.

census2020/census2020.ipynb — outdated review thread (resolved)
@maximlt (Contributor) commented Dec 5, 2024

@Azaya89 just to let you know that I won't have the bandwidth to review this before the end of the year, in case it's urgent.

@Azaya89 (Collaborator, Author) commented Dec 5, 2024

> @Azaya89 just to let you know that I won't have the bandwidth to review this before the end of the year, in case it's urgent.

No problem. I can't say it's urgent :)

@maximlt (Contributor) commented Dec 8, 2024

@jbednar a thought I had after your comment in #477 (comment): wouldn't it be more appropriate to extend the original census example instead of adding a new one? I haven't yet run Census 2020 (this PR), but from a quick look at the code it doesn't look so different, so I have the impression that maintaining the two notebooks with the same environment should be possible.

@jbednar (Contributor) commented Dec 9, 2024

I think there are three things to consider when deciding whether examples should be grouped into a single directory:

  1. Are the examples thematically linked, such that you'd reasonably want to learn about one once you click on the other? (✔️ )
  2. Do they share essentially the same environment, such that it makes sense to set up a single environment for both? (✔️ )
  3. Can they use the same datasets? If different, does adding a new notebook add only minimal file or download size? (❌ ).

Here census2010 and census2020 clearly meet the first two criteria, but for 3, they have large and equal file sizes, and most people are only likely to want one or the other. So unless we have a way to download the right data per notebook rather than per example (which we might well end up with as we move away from anaconda_project), then combining 2010 and 2020 would cause most users to require double the disk space, which seems problematic.

@maximlt (Contributor) commented Dec 9, 2024

> they have large and equal file sizes

I launched these two ZIP downloads and was surprised by the difference in file sizes, with 1.3GB for census2010 and 4GB for census2020.

  • We need to check why the files differ so much: are there more columns in census2020? If so, are they used? Or is the data stored differently (dtype, compression)?

@jbednar I'm asking this question as it is quite clear, to me at least, that examples are meant to be kept up to date, i.e. this site is not a blog. We often hear that we have too many APIs (most recently during the last steering committee meeting on Friday), so we should try to expose our users to best practices only. That means this growing number of examples needs to be maintained; this year we have updated about half of them, and it wasn't trivial. So I'm going to challenge adding Census 2020 if it's just a slight variation of Census 2010 with a slightly different dataset. To illustrate: Census 2010 (which was priority 4 on our NF SDG list; we almost got to it) has a bug in the legend and might be able to use hvPlot directly.

[screenshot]

@Azaya89 (Collaborator, Author) commented Dec 9, 2024

> • We need to check why the files differ so much: are there more columns in census2020? If so, are they used? Or is the data stored differently (dtype, compression)?

I can point out here that the 2010 dataset contains ~306 million rows while the 2020 dataset contains over 334 million. That's an additional ~28 million rows. I'm sure that alone is not the reason for the large difference in file size, but I think it also plays a role?

@maximlt (Contributor) commented Dec 9, 2024

> • We need to check why the files differ so much: are there more columns in census2020? If so, are they used? Or is the data stored differently (dtype, compression)?
>
> I can point out here that the 2010 dataset contains ~306 million rows while the 2020 dataset contains over 334 million. That's an additional ~28 million rows. I'm sure that alone is not the reason for the large difference in file size, but I think it also plays a role?

That's not enough to make it 3x bigger.
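A quick back-of-the-envelope check with the numbers quoted above supports this:

```python
# Approximate row counts and reported ZIP sizes from the discussion.
rows_2010 = 306e6
rows_2020 = 334e6
row_ratio = rows_2020 / rows_2010
print(f"row-count ratio: {row_ratio:.2f}x")   # ~1.09x

size_ratio = 4.0 / 1.3  # GB, census2020 vs census2010
print(f"file-size ratio: {size_ratio:.2f}x")  # ~3.08x
```

A ~9% increase in rows cannot by itself explain a ~3x increase in file size, so something else (dtype, compression, columns) must account for most of the difference.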

@Azaya89 (Collaborator, Author) commented Dec 9, 2024

> That's not enough to make it 3x bigger.

Sure!

Edit: I think I have found the issue: the 2010 dataset's 'x' and 'y' columns are Float32 while the 2020 dataset's are Float64. I will update the 2020 dataset to use Float32 and have it re-uploaded.
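A minimal sketch of that kind of downcast (a toy dataframe standing in for the census data; the actual read/write paths aren't shown in this PR):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the census dataframe; the real one has ~334M rows.
df = pd.DataFrame({
    "x": np.random.uniform(-1.4e7, -7e6, 1000),   # Web Mercator easting
    "y": np.random.uniform(2.9e6, 6.3e6, 1000),   # Web Mercator northing
})

before = df.memory_usage(deep=True).sum()
df = df.astype({"x": "float32", "y": "float32"})
after = df.memory_usage(deep=True).sum()

# Float32 halves the per-column footprint of the coordinates.
print(before / after)
# df.to_parquet("census2020.parq")  # then re-upload
```

Float32 still gives sub-metre precision at Web Mercator coordinate magnitudes, which is far finer than the census block resolution.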

@jbednar (Contributor) commented Dec 9, 2024

Right; we explicitly chose 32 bits (and would use 16 bits if that were an easy option) to reduce file size.

Census 2020 is a special case, because people expect the data to be up to date. The code should not differ from 2010's, so the extra maintenance should not be prohibitive.

@Azaya89 (Collaborator, Author) commented Dec 10, 2024

The new dataset has been uploaded now. It's about 2.15 GB on disk.

@jbednar (Contributor) commented Dec 10, 2024

Is the race field a categorical? Category should be much smaller than string.
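For reference, a toy sketch of the saving a categorical dtype typically gives over plain strings (the values below are illustrative, not the actual census categories):

```python
import pandas as pd

# Toy 'race' column: a handful of repeated string values.
race = pd.Series(["white", "black", "hispanic", "mixed"] * 25_000)

as_string = race.memory_usage(deep=True)
as_category = race.astype("category").memory_usage(deep=True)

# Each row shrinks to a small integer code plus one shared copy
# of each distinct label, instead of a full Python string per row.
print(as_string / as_category)  # typically an order of magnitude or more
```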

@maximlt (Contributor) commented Dec 10, 2024

So I downloaded the two ZIP files (census 2010 and 2020) and compared them.

  • File sizes (zipped): old 1.44GB vs new 1.62GB
  • Parquet files in the folder: old 36 vs new 141
  • Race values: [screenshot]
  • Schemas: [screenshot]
  • Compression: both use SNAPPY

So while the file sizes are now closer, there are still a few differences:

  • way more partitions for the new one; what is the impact on performance?
  • I'm not sure why the x,y values aren't displayed as Float32 in the new one (read with DuckDB)
  • race is not stored as categorical in either of them, I'm not sure why? see messages below
  • race values are different: shortened to one letter in the old file, complete in the new one

I'd also note that this example installs fastparquet. I think that, in the end, pyarrow is the engine actually used, as it's installed (see the lockfile) and has become the default engine in Dask (https://docs.dask.org/en/stable/changelog.html#fastparquet-engine-deprecated). Worth checking, though.


I would still prefer if we had a single census example; the above shows that it's not hard for two similar examples to diverge enough to cause maintenance pain. How about updating the original census to use the 2020 dataset and dropping the 2010 one entirely? I'm pretty sure the text in the original example still applies to the new dataset.

@Azaya89 (Collaborator, Author) commented Dec 10, 2024

> Is the race field a categorical? Category should be much smaller than string

Yeah (before I changed the 'x' and 'y' types)
[screenshot]

@maximlt (Contributor) commented Dec 10, 2024

> Yeah (before I changed the 'x' and 'y' types)

OK, then it's just that I don't know how to find out whether a column is categorical using DuckDB :) Interestingly, the only place where I can see there might be some categories is in the old file. So yes, clearly my approach was wrong, at least for finding out whether a column is categorical or not.
[screenshot]

@maximlt (Contributor) commented Dec 10, 2024

@Azaya89 I see you closed #178. Tbh I'm not sure it should be closed. This issue is about implementing spatial indexing with spatialpandas for census. You wrote you updated the example in #459 but the changes this PR contains were unrelated to spatial indexing. @jbednar shouldn't we try to re-add spatial indexing to the census example? It seems like it could greatly benefit from it?

@jbednar (Contributor) commented Dec 10, 2024

Yes, very much so! Spatial indexing is very important.

BTW, I've had a hard time being sure when it is in use and when not. I seem to recall there being some app we made that showed how many partitions were used in the current view, which would be really valuable for showing the effect of spatial indexing.

@jbednar (Contributor) commented Dec 11, 2024

> How about updating the original census to use the 2020 dataset and dropping the 2010 one entirely? Pretty sure the text in the original example still applies to the new dataset.

I haven't seen the actual detailed results, but if you're right that the same conclusions apply, that might be a reasonable approach. I'd still want a separate notebook looking at the differences between the two, but that would be a bonus add-on that could load the older data and massage the newer data to match the older. If someone could render the old notebook with the new data and point me to a copy of both I'd be happy to make that call.

@hoxbro (Contributor) commented Dec 13, 2024

I propose having the 2010 census as a historical (static) document, with an admonition referring to the newer census, 2020.

Because the examples are not versioned, we lose some valuable knowledge by always pushing to the latest and greatest API we have decided to recommend. Census 2020 uses hvPlot and Panel, but I find it interesting that you can see how our tools developed in the current Census 2010, going from "pure" Datashader to using it along with HoloViews.

@jbednar (Contributor) commented Dec 14, 2024

> Because the examples are not versioned, we lose some valuable knowledge by always pushing to the latest and greatest API we have decided to recommend.

Well, not versioned apart from Git, that is! :-)

@maximlt (Contributor) left a review comment:

I discussed this example briefly with Simon this week and realized that you all (Jim, Simon and Isaiah) agreed it'd be a worthwhile example to add, so I was a little embarrassed about getting in your way with my recent messages, even more so as I've been aware this was in preparation for a while. Though I have to say the timing was unfortunate: adding a new example while we were trying to wrap up the NF SDG work and I was reviewing the new examples from Demetris meant I had no time to dive into this example before now.

I gave the whole thing a bit more thought and reviewed the example, and, in fact, I have to say I am not so embarrassed about my view 🙃

Any new example comes with some non-negligible maintenance cost, even if it doesn't look like it:

  • Except for a couple of examples when we moved from pyviz.org to holoviz.org, examples are neither archived nor removed; they're all exposed equally and are meant to be maintained over a long time. This year we have updated ~15 examples, some of them very old (from before I knew any Python!)
  • It can have deployments which have to be monitored and fixed when something goes wrong; Jean-Luc and I are basically unofficially on-call for this
  • We already know that all the examples will have to be migrated from anaconda-project to a new tool (TBD), and that they will all have to be re-deployed on a new platform. While we have some testing in place, our tests are rather simple and only cover "does the notebook run without error?", which is often not enough to fully test an example. As such, during these migrations the examples will have to be run manually, one by one. I'm well aware of that, as I had to do exactly that at the end of last year, when we had to re-deploy all the examples and migrate to the new, stricter validation put in place to ensure the notebook/site/deployment were in sync. This easily made the podium of my most miserable professional experiences, and I'm already sad for the person who will have to take care of it (even more so if it's me!).
  • Fix potential bugs, e.g. an example not working on Windows, or bugs in the 3k-line dodo.py, the 9 GitHub workflows, or the custom Sphinx extensions that could affect this particular example.

Of course, I'm not against adding new examples, and I think Isaiah's FIFA example, Demetris' neuroscience ones, and the planned Grabcut and Stable Diffusion are all great additions to the gallery, as they all add something new. In its own right there's nothing wrong with this example, but I have a hard time seeing what it adds to the gallery:

  • The Geospatial domain is already overly represented
  • The approach used is already shown in other examples
  • The dataset itself is simply an updated version of a dataset already used in another example (original census)

On the last point, I realized that updating the dataset is a task we didn't integrate with @droumis in the Example Modernization Checklist that was guiding Isaiah and Jason while updating the examples (https://github.com/holoviz-topics/examples/wiki/Example-Modernization-Checklist-(2024)). I don't recall if it was an explicit decision or not. There are many examples that could be updated to use a newer dataset (too many to list them here!), I'd be concerned with the precedent Census 2020 would set.

To conclude:

  • I'm in favor of having a single census example
  • I have no preference for the dataset used (2010, 2020, 2010 and 2020)
  • It should follow our best API practices:
    • I have some questions about some of the choices made in this example (see review comments)
    • Note that I think that, in the original Census example, using only Datashader is still worthwhile to demonstrate, as being able to easily generate a static image of a large dataset is a common need

(Note that the site cannot be built: this notebook, when turned into HTML, is 170MB, which exceeds GitHub's file size limit of 100MB.)

x_range, y_range = CITIES['USA']

df.hvplot.points('x', 'y', rasterize=True, cnorm='linear', xlim=x_range, ylim=y_range, cmap=gray)
Contributor:

I think cmap accepts colorcet cmaps referenced as strings, e.g. cmap='gray'. Can you try that? And if it works, update your hvPlot reference notebook on cmap/color_key/color to also include that?

Collaborator (Author):

It worked. Thanks

def plot_city(longitude_range, latitude_range, w=plot_width, h=plot_height, bgcolor='black'):
    """Plots a map of the pop density aggregated by race within specified longitude and latitude ranges"""
    return df.hvplot.points('x', 'y', rasterize=True, dynspread=True, cnorm='eq_hist',
Contributor:

Instead of instantiating an hv.Tiles element from an xyz layer and overlaying it with the points plot, you can directly set tiles=xyz.<> in the hvplot call. This is documented at https://hvplot.holoviz.org/user_guide/Geographic_Data.html (it was implemented this year, motivated by one of the examples updated as part of this year's NF SDG work).

@Azaya89 (Collaborator, Author) commented Dec 16, 2024:

Thanks. This worked.

Edit: On second thought, it does not work well, because the labels are now under the data points, so they don't show well when zoomed in.

Contributor:

Oops! Plus, people often want both (background) tiles and (foreground) labels in the same plot. I guess that could be handled with a new option label_tiles for tiles that are meant to go on top, but at some point we have to stop adding more options to hvPlot. :-) Up to @maximlt.

%%time
display(plot_city(*CITIES['New York City']))
Contributor:

Why time these ones and not the ones before too? I'm not asking you to do it, I would just like to understand why it's like this :)

Collaborator (Author):

My initial thinking was that since the plots are now being called from a pre-defined function, it makes sense to time how long they take to run; but thinking about it further, there is no reason why the previous plots couldn't also be timed. I guess I could time them as well :)

Contributor:

Something to think about, maybe: serving the app with panel serve ... executes the notebook content every time a visitor visits the page, and calling display(plot(...)) forces the HoloViews plot to be generated, making the app slower to serve. In the streaming example of Demetris I mentioned if not pn.state.served: ..., which could be used here inside a function to only execute display() when the app is not served.

@Azaya89 (Collaborator, Author) commented Dec 16, 2024:

> In the streaming example of Demetris I mentioned if not pn.state.served: ... which could be used here inside a function to only execute display() when the app is not served.

Can you share a link to this example so I can see what it looks like?

Contributor:

In these examples I've generally put timing information on any cell that takes a surprising amount of time. Sometimes the surprise is "how can Datashader be so fast", and other times it's answering "how long does that dataset take to load".


Collaborator (Author):

> Something to think about maybe. Serving the app with panel serve ... executes the notebook content every time a visitor visits the page. Calling display(plot(...)) forces the HoloViews plot to be generated, making the app longer to be served.
>
> In these examples I've generally put timing information on any cell that takes a surprising amount of time. Sometimes the surprise is "how can Datashader be so fast", and other times it's answering "how long does that dataset take to load".

Looking at both these comments, I am now reconsidering whether to time the plots at all. The plots actually display quickly (<2.5 s for all of them), so while it would have been nice to showcase how fast that is, I don't think it's worth the extra overhead of making the app slow when served.

    """Plots a map of the pop density aggregated by race within specified longitude and latitude ranges"""
    return df.hvplot.points('x', 'y', rasterize=True, dynspread=True, cnorm='eq_hist',
                            xlim=longitude_range, ylim=latitude_range,
                            cmap=color_key, bgcolor=bgcolor, colorbar=False, hover=False,
Contributor:

I'm wondering, why is hover set to False? And when hover=False, is there an advantage of using rasterize=True instead of datashade=True? cc @jbednar

Contributor:

I assume both hover and colorbar should be True here, because indeed, without at least one of those, I can't think of a reason to use rasterize.

Contributor:

Yes, true, I forgot about the colorbar.

Though I sort of understand why Isaiah would have disabled them, as they're both hard to parse.

[screenshot]

Collaborator (Author):

> Though I sort of understand why Isaiah would have disabled them, as they're both hard to parse.

Yes, that was exactly why.

@jbednar (Contributor) commented Dec 17, 2024:

  • The hover information is useful, but I agree it should be formatted better. I think @mattpap indicated to me long ago that this approach was a stopgap, if I'm remembering correctly, and that we might be able to format that as a set of key-value pairs like x and y are instead. I think that's an issue to file separately.
  • The colorbar is not just hard to parse, but incorrect: the brightness level indicates increasing population, but the colorbar suggests that it indicates decreasing population. That's either a bug in this notebook or another issue to file.
  • It would be great if we could easily select SI units for the tick formatting, so that 5.000e+5 would show up as 500K, but I don't know if anyone's implemented something like that.

@Azaya89 (Collaborator, Author) commented Dec 18, 2024:

> • The colorbar is not just hard to parse, but incorrect: the brightness level indicates increasing population, but the colorbar suggests that it indicates decreasing population. That's either a bug in this notebook or another issue to file.

The colorbar only becomes incorrect when I add the color_key options to the plot. The previous plots using uniform colors showed more accurate, even if still hard-to-parse, colorbars. I don't know if it's a bug, but it was definitely inaccurate/confusing, which informed my decision to remove it entirely.

> • It would be great if we could easily select SI units for the tick formatting, so that 5.000e+5 would show up as 500K, but I don't know if anyone's implemented something like that.

I think this is possible in Bokeh using a NumeralTickFormatter, but I don't know if it is simple to do in hvPlot. I could go some levels lower to implement that in these plots, but I feel it's just introducing unnecessary complexity to the code.
See also https://docs.bokeh.org/en/latest/docs/first_steps/first_steps_4.html#formatting-axis-ticks
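For what it's worth, the desired '500k'-style rendering is simple to express in plain Python; the open question is only how to attach it to the colorbar without custom hooks (in Bokeh itself, NumeralTickFormatter(format='0 a') is the usual route). A hypothetical helper showing the target formatting:

```python
def si_format(value: float) -> str:
    """Format a tick value with an SI-style suffix, e.g. 500000 -> '500k'."""
    for divisor, suffix in ((1e9, "G"), (1e6, "M"), (1e3, "k")):
        if abs(value) >= divisor:
            scaled = value / divisor
            # '%g' drops a trailing '.0', so 500000 renders as '500k',
            # not '500.0k'.
            return f"{scaled:g}{suffix}"
    return f"{value:g}"

print(si_format(5e5))    # 500k
print(si_format(1.5e6))  # 1.5M
```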

Contributor:

  • For the colorbar colors, it's not a question of degree of accuracy; the colorbar is simply wrong. It is not actually reporting how the data maps onto brightness or value; it's reversed. Unless there's some obvious mistake in the example code, please file a bug report rather than working around the issue.
  • The colorbar ticks are not very useful right now, so if it is possible to provide a simple option like .opts(colorbar_tick_format='0 a') and get ticks like '500k', it would be great to do so. If we have to jump through tons of hoops to get there, like defining custom plot hooks, then I agree with you that doing so is too complex, and then I agree with Maxime that we should disable the colorbar as not being very useful.

Comment:

> The hover information is useful, but I agree it should be formatted better. I think @mattpap indicated to me long ago that this approach was a stopgap, if I'm remembering correctly, and that we might be able to format that as a set of key-value pairs like x and y are instead. I think that's an issue to file separately.

I recall there was a discussion about this, probably in a meeting, but I don't think any conclusions were ever recorded (in an issue or so). It would be best to start such an issue/discussion with an example dataset (shape specifically) and the expected output shape.

Contributor:

@mattpap, I agree a separate issue would be helpful here. What I'd personally like to see in such an issue is that instead of what the hover shows in the screenshot above, it shows:

lat: 30.2460° N
lon: 87.7008° W
Black: 157
Hispanic/Other: 3
Mixed: 12
Native American: 2
White: 91

(where I've omitted anything with zero count, since the more you zoom in the more sparsely the categories are populated and the less likely that most categories will need to be listed.)

Can you maybe take a look at this PR's code and set up a similar example in pure Bokeh? This one depends on the entire HoloViz stack and is pretty unwieldy.

dashboard.main.append(background)
dashboard.main.append(pop_map)

dashboard.servable();
Contributor:

When updating one of the widgets the map size briefly changes from the full window width to the original width. I recorded a GIF but it's too heavy for sharing on Github.

Collaborator (Author):

Is it similar to what I talked about here in point 2?

#430 (comment)

Contributor:

Yes, sorry I missed/forgot about this message!

Collaborator (Author):

Resolved

        self._update_plot()

    @param.depends('map_tile', 'map_alpha', 'city', watch=True)
Contributor:

By re-creating the plot every time map_alpha is updated in the app, we have an app that is slow at updating the alpha compared to, e.g., NYC Taxi (https://examples.holoviz.org/gallery/nyc_taxi/dashboard.html). Not sure this is the kind of approach we want to promote? cc @jbednar

@jbednar (Contributor) commented Dec 16, 2024:

Agreed. The datashader_dashboard code shows how to avoid that update hit for alpha (e.g. using .apply.opts) and for tiles (which would be done as a separate callback, not folded into a single one). I'm only seeing these notes here; I'm not sure what this app is, as the original census doesn't have an app.

Contributor:

For plots that are quick to render, I don't mind a Panel-based approach that just re-renders the plot when a widget is updated, as this API (Panel's) is a bit easier to apprehend. In this case, we're rendering a large dataset and should indeed be a bit more careful when updating the plot, relying on HoloViews' API more.

" def plot_view(self):\n",
" return self.plot\n",
" \n",
" def __panel__(self, **kwargs):\n",
Contributor (@maximlt, Dec 15, 2024):

I think you could remove plot_view in favor of the following (to be tested):

```python
@param.depends('plot')
def __panel__(self):
    return self.plot
```

Collaborator Author:


Done

"\n",
"dashboard = pn.template.BootstrapTemplate(\n",
" title=\"US population density map\",\n",
" sidebar = [widgets],\n",
Contributor:

You can instantiate the template with main=[background, pop_map], instead of the two calls further down to append.

Collaborator Author:

Thanks, I was following what was in the docs.

"You can also zoom in and out or pan across the map to explore regions of interest in more detail.\n",
"\"\"\"\n",
" \n",
"background = pn.Accordion(('Background', pn.pane.Markdown(intro_text, renderer_options={'breaks': False})\n",
Contributor:

We need to improve Panel to be able to simplify this really ugly code :)

"source": [
"pop_map = USPopulation()\n",
"\n",
"widgets = pn.WidgetBox(\n",
Contributor:

Since the widgets aren't customized in this app, instead of WidgetBox(pop_map.param.<>, ...), you could instead call pn.Param(pop_map) and set the precedence value of each Parameter in the class to tweak how they're supposed to be displayed (see https://panel.holoviz.org/how_to/param/examples/precedence.html and other pages in the docs)

@jbednar commented Dec 16, 2024:

> - There are many examples that could be updated to use a newer dataset (too many to list them here!); I'd be concerned with the precedent Census 2020 would set.
> - I'm in favor of having a single census example.
> - I have no preference for the dataset used (2010, 2020, 2010 and 2020).

I think the census example is in a different category; it's so iconic that it's nearly a logo for Datashader. We wouldn't take the time to update e.g. the gull_tracking example if it had new data, because it wouldn't change anything. But at least in the USA, people want to explore the actual data involved, e.g. to see how segregated their own neighborhood is. Having that information be 14 years out of date makes it much less effective in reaching people's actual interest. I don't think updating census will set any sort of precedent that would apply to other examples.

Also note that originally I was proposing that census 2020 be very basic compared to the 2010, as I haven't seen whether the data is suitable for doing all the sort of analyses we did for 2010. But if all the same analyses can be done and have qualitatively similar conclusions, then I'm happy for 2020 to replace 2010 for most purposes. Still, 2010 has been so widely promoted and referred to that it may be best to keep it at that same address. Anyway, I haven't seen any of the 2020 results myself, so I can't say anything about them!

@maximlt commented Dec 16, 2024:

> Anyway, I haven't seen any of the 2020 results myself, so I can't say anything about them!

FWIW the example is pretty easy to run :) I also recommend running it directly instead of waiting for the dev site to be built as all the plots require a live kernel for full interactivity.

Interestingly, the original census example is the most visited of all over the last 6 months (for reference, that's about 10% of the pageviews of the Tabulator component). I'm not sure where the visitors come from, the highest referrer being a redirect link.

[image: pageview statistics screenshot]

@jbednar commented Dec 17, 2024:

> Interestingly, the original census example is the most visited of all over the last 6 months

Some fraction of those are me! I show those results to people all the time. But not 170 times a month, no. Anyway, yes, this is a popular example, and we should keep having that link work.

> FWIW the example is pretty easy to run :)

I'm sure. I'm afraid to run the example until it's ready, because I either get stuck in some conda environment hell, or it works, and I notice a ton of stuff I want to update, and then I'm late for my next meeting or my family is asking where dinner is. :-) But if it's ready, I'll check it out.

@maximlt commented Dec 17, 2024:

> Anyway, yes, this is a popular example, and we should keep having that link work.

Oh yeah keeping the link working is a must (I've tried to carefully set redirects to any page we've removed), the question is what content it should display.

> I either get stuck in some conda environment hell

It worked perfectly for me! I can't help you with your other problems 🙃

@jbednar commented Dec 18, 2024:

> FWIW the example is pretty easy to run :)

Ok, I did get it to run. I didn't have any dependency hell this time, but I did have nasty git merge conflicts to deal with thanks to the unpleasant interactions Jupyter has with revision control, for all the random notebooks in examples that I've run and happened to modify as I use them. I've dealt with all that, and now I can see that the 2020 data is not ready to replace 2010.

Here's Chicago in 2010:

[image: Chicago population map, 2010 census]

and in 2020:

[image: Chicago population map, 2020 census]

I don't know if any of the extra detail in 2020 is meaningful, but there are definitely patterns with the racial coloring that look entirely like artifacts rather than meaningful data. E.g. each little block seems to be sorted by race, with Asians living in the more Northern portion of the block, which surely isn't really true:

[image: close-up of city blocks showing points apparently sorted by race within each block]

I would guess that we have bugs in how the individual points are being synthesized from the block data, e.g. maybe the data is sorted by race and then has locations generated in order? In any case the 2020 data is definitely not ready to replace the 2010 data.
