-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Census 2020 example #430
base: main
Are you sure you want to change the base?
Census 2020 example #430
Conversation
You need to re-create the conda environment locally following the contributing guide.
It's still way too large. You should aim for the minimum dataset size possible, it's fine if it's just a few KB as long as it contains data that is representative of the whole dataset. For instance, if the code expects some data category, then it should be in the sample dataset to let the notebook run entirely. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an absolute need to rename the original census
project census_one
? Without doing anything else, this is going to break all the links to its web page and deployment.
I would also not call the new one census_two
but census2020
.
I imagine renaming the original from |
Sounds like a bug in the validation code, something like |
Done. Thanks
Reduced it to <1MB now. |
Replying to your comment elsewhere:
If you intend to rename it, then redirect links have to be set up:
Alternatively, we could just:
|
I already did these in this PR. Would that be enough to differentiate both examples eventually? |
I think so? |
OK. I will revert the other renaming then |
@Azaya89 just to let you know that I won't have the bandwidth to review this before the end of the year, in case it's urgent. |
No problem. I can't say it's urgent :) |
@jbednar a thought I had after your comment in #477 (comment), wouldn't it be more appropriate to extend the original |
I think there are three things to consider when deciding whether examples should be grouped into a single directory:
Here census2010 and census2020 clearly meet the first two criteria, but for 3, they have large and equal file sizes, and most people are only likely to want one or the other. So unless we have a way to download the right data per notebook rather than per example (which we might well end up with as we move away from anaconda_project), then combining 2010 and 2020 would cause most users to require double the disk space, which seems problematic. |
I launched these two ZIP downloads and was surprised by the difference in file sizes, with 1.3GB for census2010 and 4GB for census2020.
@jbednar I'm asking this question as it is quite clear, to me at least, that examples are meant to be kept up-to-date, i.e. this site is not a blog. We often hear that we have too many APIs (last was during the last steering committee meeting on Friday), we should try to expose our users to the best practices only. It means that this growing number of examples needs to be maintained; this year, we have updated about half of them and it wasn't trivial. So I'm going to challenge adding census 2020 if it's just a slight variation of census 2010 with a slightly different dataset? To illustrate this, census 2010 (which was priority 4 on our NF SDG list, we almost got to it) has a bug in the legend and might be able to use hvPlot directly. |
I can point out here that the 2010 dataset contains ~306 million rows while the 2020 dataset contains over 334 million. That's an additional ~28 million rows. I'm sure that alone is not the reason for the large difference in file size but I think it also play a role? |
That's not enough to make it 3x bigger. |
Sure! Edit: I think I have found the issue: The 2010 dataset 'x' and 'y' columns are in Float32 dtype while the 2020 dataset are in Float64 dtype. I will update the 2020 dataset to use Float32 and have it re-uploaded. |
Right; we explicitly chose 32 bits (and would use 16 bits if that were an easy option) to reduce file size. Census 2020 is a special case of people expecting the data to be more up to date. The code should not be different from 2010, so the extra maintenance should not be prohibitive. |
The new dataset has been uploaded now. It's about 2.15 GB on disk. |
Is the race field a categorical? Category should be much smaller than string |
So I downloaded the two ZIP files (census 2010 and 2020) and compared them. File sizes (zipped): old 1.44GB vs new 1.62GB So while the file sizes are now closer, there are still a few differences:
I'd also note that this example installs I would still prefer if we had a single census example, the above shows that it's not hard for two similar examples to diverge enough to cause maintenance pain. How about updating the original census to use the 2020 dataset and dropping entirely the 2010 one? Pretty sure the text in the text in the original example still applies to the new dataset. |
@Azaya89 I see you closed #178. Tbh I'm not sure it should be closed. This issue is about implementing spatial indexing with spatialpandas for census. You wrote you updated the example in #459 but the changes this PR contains were unrelated to spatial indexing. @jbednar shouldn't we try to re-add spatial indexing to the census example? It seems like it could greatly benefit from it? |
Yes, very much so! Spatial indexing is very important. BTW, I've had a hard time being sure when it is in use and when not. I seem to recall there being some app we made that showed how many partitions were used in the current view, which would be really valuable for showing the effect of spatial indexing. |
I haven't seen the actual detailed results, but if you're right that the same conclusions apply, that might be a reasonable approach. I'd still want a separate notebook looking at the differences between the two, but that would be a bonus add-on that could load the older data and massage the newer data to match the older. If someone could render the old notebook with the new data and point me to a copy of both I'd be happy to make that call. |
I propose having the 2010 census as a historical (static) document, with an admonition referring to the newer census, 2020. Because the examples are not versioned, we lose some valuable knowledge by always pushing to the latest and greatest API we have decided to recommend. Census 2020 uses |
Well, not versioned apart from Git, that is! :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discussed briefly with Simon about this example this week and realized that you all (Jim, Simon and Isaiah) agreed it'd be a worthwhile example to add, so I was a little embarrassed getting into your way with my recent messages. Even more so as I've been aware this was in preparation for a while too. Though I have to say the timing of adding a new example while we were trying to wrap up the NF SDG work and I was reviewing the new examples from Demetris was unfortunate, I had no time to dive into this example more before now.
I gave the whole thing a bit more thought and reviewed the example, and, in fact, I have to say I am not so embarrassed about my view 🙃
Any new example comes with some non-negligible maintenance cost, even if it doesn't look like so:
- Except a couple examples when we moved from pyviz.org to holoviz.org, examples are neither archived nor removed, they're all exposed equally and are meant to be maintained over a long time. We have updated this year ~15 examples, some of them being very old (before I knew any Python!)
- It can have deployments which have to be monitored and fixed when something goes wrong; Jean-Luc and I are basically unofficially on-call for this
- We already know that all the examples will have to be migrated from
anaconda-project
to a new tool (TBD). And, we also know they will all have to be re-deployed on a new platform. While we have some testing in place, our tests are rather simple and only cover "does the notebook run without error?" which is often not enough to fully test an example. As such, during these migrations, the examples will have to be run manually one by one. I'm well aware of that as I've had to do exactly that end of last year when we had to re-deploy all the examples and had to migrate to the new and stricter validation put in place to ensure the notebook/site/deployment were in sync. This easily made it to the podium of my most miserable professional experiences, and I'm already sad for the person who will have to take care of that (even more so if it's me!). - Fix potential bugs, e.g. example not working on Windows, bugs in the 3k lines dodo.py or the 9 Github workflows or the custom Sphinx extensions that could affect this particular example.
Of course, I'm not against adding new examples and I think Isaiah's FIFA example, Demetris' neuroscience ones, and the planned Grabcut and Stable Diffusion, are all great additions to the gallery as they all add something new. In its own right there's nothing wrong with this example, but I have a hard time seeing what it adds to the gallery:
- The Geospatial domain is already overly represented
- The approach used is already shown in other examples
- The dataset itself is simply an updated version of a dataset already used in another example (original census)
On the last point, I realized that updating the dataset is a task we didn't integrate with @droumis in the Example Modernization Checklist that was guiding Isaiah and Jason while updating the examples (https://github.com/holoviz-topics/examples/wiki/Example-Modernization-Checklist-(2024)). I don't recall if it was an explicit decision or not. There are many examples that could be updated to use a newer dataset (too many to list them here!), I'd be concerned with the precedent Census 2020 would set.
To conclude:
- I'm in favor of having a single census example
- I have no preference for the dataset used (2010, 2020, 2010 and 2020)
- It should follow our best API practices:
- I have some questions about some of the choices made in this example (see review comments)
- Note that I think that, in the original Census example, the usage of Datashader only is still worthwhile to demonstrate as being able to easily generate a static image of a large dataset is a common need
(Note that the site cannot be built as this notebook when turned into HTML is too heavy - 170MB - exceeds Github's file size limit of 100MB)
census2020/census2020.ipynb
Outdated
"\n", | ||
"x_range, y_range = CITIES['USA']\n", | ||
"\n", | ||
"df.hvplot.points('x', 'y', rasterize=True, cnorm='linear', xlim=x_range, ylim=y_range, cmap=gray)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think cmap
accepts colorcet cmaps referenced as strings, i.g. cmap='gray'
. Can you try that? And if it works, update your hvplot reference notebook on cmap/color_key/color to also include that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It worked. Thanks
census2020/census2020.ipynb
Outdated
"\n", | ||
"def plot_city(longitude_range, latitude_range, w=plot_width, h=plot_height, bgcolor='black'):\n", | ||
" \"\"\"Plots a map of the pop density aggregated by race within specified longitude and latitude ranges\"\"\"\n", | ||
" return df.hvplot.points('x', 'y', rasterize=True, dynspread=True, cnorm='eq_hist',\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of instantiating an hv.Tiles
element from an xyz
layer, and overlaying it with the points plot, you can directly set tiles=xyz.<>
in the hvplot call. This is documented there https://hvplot.holoviz.org/user_guide/Geographic_Data.html (it was implemented this year, motivated by one the examples that was updated part of the NF SDG work of this year).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. This worked.
Edit: On second thoughts, it does not work well because the labels are now under the data points so they don't show well when zoomed in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops! Plus people often want both (background) tiles and (foreground) labels in the same plot. I guess that could be handled using a new option label_tiles to handle tiles that are meant to go on top, but at some point we have to stop adding more options to hvPlot. :-) Up to @maximlt .
census2020/census2020.ipynb
Outdated
"outputs": [], | ||
"source": [ | ||
"%%time\n", | ||
"display(plot_city(*CITIES['New York City']))" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why timing these ones and not the ones before too? I'm not asking you to do it, I would just like to understand why it's like this :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial thinking then was that since the plots are now being called from a pre-defined function, it makes sense to time how long it takes to run but thinking about it further, there is no reason why the previous plots cannot also be timed. I guess I could also time them as well :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something to think about mabe. Serving the app with panel serve ...
executes the notebook content every time a visitor visits the page. Calling display(plot(...))
forces the HoloViews plot to be generated, making the app longer to be served. In the streaming example of Demetris I mentioned it not pn.state.served: ...
which could be used here inside a function to only execute display()
when the app is not served.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the streaming example of Demetris I mentioned
if not pn.state.served: ...
which could be used here inside a function to only executedisplay()
when the app is not served.
Can you share a link to this example so I can see what it looks like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In these examples I've generally put timing information on any cell that takes a surprising amount of time. Sometimes the surprise is "how can Datashader be so fast", and other times it's answering "how long does that dataset take to load".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the streaming state bit: https://github.com/holoviz-topics/examples/pull/469/files#r1880352906
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something to think about mabe. Serving the app with panel serve ... executes the notebook content every time a visitor visits the page. Calling display(plot(...)) forces the HoloViews plot to be generated, making the app longer to be served.
In these examples I've generally put timing information on any cell that takes a surprising amount of time. Sometimes the surprise is "how can Datashader be so fast", and other times it's answering "how long does that dataset take to load".
Looking at both these comments, I am now reconsidering timing the plots at all. The plots actually display in good time (<2.5 secs for all of them), so while it would have been nice to showcase how fast it is, I don't think it's worth the extra overhead of making the app slow when served.
census2020/census2020.ipynb
Outdated
" \"\"\"Plots a map of the pop density aggregated by race within specified longitude and latitude ranges\"\"\"\n", | ||
" return df.hvplot.points('x', 'y', rasterize=True, dynspread=True, cnorm='eq_hist',\n", | ||
" xlim=longitude_range, ylim=latitude_range,\n", | ||
" cmap=color_key, bgcolor=bgcolor, colorbar=False, hover=False,\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering, why is hover
set to False
? And when hover=False
, is there an advantage of using rasterize=True
instead of datashade=True
? cc @jbednar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume both hover
and colorbar
should be True
here, because indeed, without at least one of those, I can't think of a reason to use rasterize
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though I sort of understand why Isaiah would have disabled them, as they're both hard to parse.
Yes, that was exactly why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The hover information is useful, but I agree it should be formatted better. I think @mattpap indicated to me long ago that this approach was a stopgap, if I'm remembering correctly, and that we might be able to format that as a set of key value pairs like x and y are instead. I think that's an issue to file separately.
- The colorbar is not just hard to parse, but incorrect -- the brightness level indicates increasing population, but the colorbar suggest that it indicates decreasing population. That's either a bug in this notebook or another issue to file.
- It would be great if we could easily select SI units for the tick formatting, so that 5.000e+5 would show up as 500K, but I don't know if anyone's implemented something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The colorbar is not just hard to parse, but incorrect -- the brightness level indicates increasing population, but the colorbar suggest that it indicates decreasing population. That's either a bug in this notebook or another issue to file.
The colorbar only becomes incorrect when I add the color_key options to the plot. The previous plots using uniform colors showed more accurate, even if still hard to parse, colorbars. I don't know if it's a bug or what but it was definitely inaccurate/confusing which informed my decision to remove it entirely.
- It would be great if we could easily select SI units for the tick formatting, so that 5.000e+5 would show up as 500K, but I don't know if anyone's implemented something like that.
I think this is possible in bokeh using a NumeralTickFormatter but I don't know if it is simple to do in hvPlot. I could go some levels lower to implement that in these plots but I fell it's just introducing unnecessary complexity to the code.
See also https://docs.bokeh.org/en/latest/docs/first_steps/first_steps_4.html#formatting-axis-ticks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- For the colorbar colors, there's not a question of degree of accuracy; the colorbar is simply completely wrong. It is not actually reporting how the data is mapping onto brightness or value; it's reversed. Unless there's some obvious mistake in the example code, please file a bug report rather than working around the issue.
- The colorbar ticks are not very useful right now, so if it is possible to provide a simple option like
.opts(colorbar_tick_format=‘0 a’)
and get ticks like '500k', it would be great to do so. If we have to jump through tons of hoops to get there, like defining custom plot hooks, then I agree with you that doing so is too complex, and then I agree with Maxime that we should disable the colorbar as not being very useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hover information is useful, but I agree it should be formatted better. I think @mattpap indicated to me long ago that this approach was a stopgap, if I'm remembering correctly, and that we might be able to format that as a set of key value pairs like x and y are instead. I think that's an issue to file separately.
I recall there was a discussion about this probably in a meeting, but don't think any conclusions were ever recorded (in an issue or so). It would be best to start such issue/discussion with an example dataset (shape specifically) and expected output shape.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattpap , I agree a separate issue would be helpful here. What I'd personally like to see in such an issue is that instead of what the hover shows in the screenshot above, it shows:
lat: 30.2460° N
lon: 87.7008° W
Black: 157
Hispanic/Other: 3
Mixed: 12
Native American: 2
White: 91
(where I've omitted anything with zero count, since the more you zoom in the more sparsely the categories are populated and the less likely that most categories will need to be listed.)
Can you maybe take a look at this PR's code and set up a similar example in pure Bokeh? This one depends on the entire HoloViz stack, and is pretty unwieldy.
"dashboard.main.append(background)\n", | ||
"dashboard.main.append(pop_map)\n", | ||
"\n", | ||
"dashboard.servable();" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When updating one of the widgets the map size briefly changes from the full window width to the original width. I recorded a GIF but it's too heavy for sharing on Github.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it similar to what I talked about here in point 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sorry I missed/forgot about this message!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved
census2020/census2020.ipynb
Outdated
" \n", | ||
" self._update_plot()\n", | ||
" \n", | ||
" @param.depends('map_tile', 'map_alpha', 'city', watch=True)\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By re-creating the plot everytime map_alpha
is updated in the app, we have an app that is slow at updating the alpha compared to e.g. NYC Taxi (https://examples.holoviz.org/gallery/nyc_taxi/dashboard.html). Not sure this is the kind of approach we want to promote? cc @jbednar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. The datashader_dashboard
code shows how to avoid that update hit for alpha (e.g. using .apply.opts) and for tiles (which would be done as a separate callback, not folded into a single one). I'm only seeing these notes here; I'm not sure what this app is, as the original census doesn't have an app.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For plots that are quick to render, I don't mind a Panel-based approach that just re-renders the plot when a widget is updated, as this API (Panel's) is a bit easier to apprehend. In this case, we're rendering a large dataset and should indeed be a bit more careful when updating the plot, relying on HoloViews' API more.
" def plot_view(self):\n", | ||
" return self.plot\n", | ||
" \n", | ||
" def __panel__(self, **kwargs):\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could remove plot_view
in favor of the following (to be tested):
@param.depends('plot')
def __panel__(self):
return self.plot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
"\n", | ||
"dashboard = pn.template.BootstrapTemplate(\n", | ||
" title=\"US population density map\",\n", | ||
" sidebar = [widgets],\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can instantiate the template with main=[background, pop_map]
, instead of the two calls further down to append
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I was following what was in the docs
census2020/census2020.ipynb
Outdated
"You can also zoom in and out or pan across the map to explore regions of interest in more detail.\n", | ||
"\"\"\"\n", | ||
" \n", | ||
"background = pn.Accordion(('Background', pn.pane.Markdown(intro_text, renderer_options={'breaks': False})\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to improve Panel to be able to simplify this really ugly code :)
census2020/census2020.ipynb
Outdated
"source": [ | ||
"pop_map = USPopulation()\n", | ||
"\n", | ||
"widgets = pn.WidgetBox(\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the widgets aren't customized in this app, instead of WidgetBox(pop_map.param.<>, ...)
, you could instead call pn.Param(pop_map) and set the precedence
value of each Parameter in the class to tweak how they're supposed to be displayed (see https://panel.holoviz.org/how_to/param/examples/precedence.html and other pages in the docs)
I think the census example is in a different category; it's so iconic that it's nearly a logo for Datashader. We wouldn't take the time to update e.g. the gull_tracking example if it had new data, because it wouldn't change anything. But at least in the USA, people want to explore the actual data involved, e.g. to see how segregated their own neighborhood is. Having that information be 14 years out of date makes it much less effective in reaching people's actual interest. I don't think updating census will set any sort of precedent that would apply to other examples. Also note that originally I was proposing that census 2020 be very basic compared to the 2010, as I haven't seen whether the data is suitable for doing all the sort of analyses we did for 2010. But if all the same analyses can be done and have qualitatively similar conclusions, then I'm happy for 2020 to replace 2010 for most purposes. Still, 2010 has been so widely promoted and referred to that it may be best to keep it at that same address. Anyway, I haven't seen any of the 2020 results myself, so I can't say anything about them! |
FWIW the example is pretty easy to run :) I also recommend running it directly instead of waiting for the dev site to be built as all the plots require a live kernel for full interactivity. Interestingly, the original census example is the most visited of all over the last 6 months (for reference, that's about 10% of the pageviews of the Tabulator component). I'm not sure where the visitors come from, the highest referrer being a redirect link. |
Some fraction of those are me! I show those results to people all the time. But not 170 times a month, no. Anyway, yes, this is a popular example, and we should keep having that link work.
I'm sure. I'm afraid to run the example until it's ready, because I either get stuck in some conda environment hell, or it works, and I notice a ton of stuff I want to update, and then I'm late for my next meeting or my family is asking where dinner is. :-) But if it's ready, I'll check it out. |
Oh yeah keeping the link working is a must (I've tried to carefully set redirects to any page we've removed), the question is what content it should display.
It worked perfectly for me! I can't help you with your other problems 🙃 |
Created a new example using the 2020 US census dataset.