diff --git a/examples/estimation_examples/BPZ_lite_demo.ipynb b/examples/estimation_examples/BPZ_lite_demo.ipynb index 78d1e50..ae5698f 100644 --- a/examples/estimation_examples/BPZ_lite_demo.ipynb +++ b/examples/estimation_examples/BPZ_lite_demo.ipynb @@ -9,7 +9,7 @@ "\n", "**Authors:** Sam Schmidt\n", "\n", - "**Last Successfully Run:** Sep 20, 2024" + "**Last Successfully Run:** Oct 01, 2024" ] }, { @@ -34,7 +34,7 @@ "import numpy as np\n", "import pandas as pd\n", "import desc_bpz\n", - "from rail.core.data import TableHandle\n", + "from rail.core.data import TableHandle, ModelHandle\n", "from rail.core.stage import RailStage\n", "from rail.utils.path_utils import RAILDIR\n", "from rail.estimation.algos.bpz_lite import BPZliteInformer, BPZliteEstimator" @@ -299,10 +299,97 @@ "If you compare the areas of color space with low todds you can see that it corresponds to portions of color space where multiple best-fit SED types lie very close in color, e.g. areas where Sbc, Scd, and Im galaxies all have similar g-r and r-i colors." ] }, + { + "cell_type": "code", + "execution_count": null, + "id": "2d74fe87-580d-4dcf-bd8f-0bfa411b8697", + "metadata": {}, + "outputs": [], + "source": [] + }, { "cell_type": "markdown", "id": "e80d59ce-ca73-4bf3-9f3d-682c4dbdcb3e", "metadata": {}, + "source": [ + "## Running with the Polletta et al 2007 and BC03 SED template set\n", + "\n", + "We have also included a second SED template set, the one used to compute the COSMOS 30-band photo-z's (See Ilbert et al 2007, Dahlen et al 2013 for more information) within rail. This set consists of 31 total SEDs, the Elliptical and Spiral galaxy SEDs are empirical SEDs from Polletta et al 2007, with additional blue templates generated from Bruzual and Charlot (Bruzual & Charlot 2003) models. These SEDs can be accessed via a path relative to `RAILDIR` as done below. In order to tell `rail_bpz` to use these SEDs, we need to set the `spectra_file` configuration parameter, which will point to an ascii file containing a list of the SED names in the same directory that contains the SEDs. That list file can be found relative to RAILDIR at: `rail/examples_data/estimation_data/data/SED/COSMOS_seds.list`. We have also created a version of the HDFN prior used above corresponding to these templates, which we can specify with the `model` configuration parameter. The prior can be found relative to `RAILDIR` at `rail/examples_data/estimation_data/data/COSMOS31_HDFN_prior.pkl`.\n", + "\n", + "Let's set up a run with this template set:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a24d7b4c-fd6e-4171-8ba9-6be8c53fc046", + "metadata": {}, + "outputs": [], + "source": [ + "cosmospriorfile = os.path.join(RAILDIR, \"rail/examples_data/estimation_data/data/COSMOS31_HDFN_prior.pkl\")\n", + "cosmosprior = DS.read_file(\"cosmos_prior\", ModelHandle, cosmospriorfile)\n", + "sedfile = os.path.join(RAILDIR, \"rail/examples_data/estimation_data/data/SED/COSMOS_seds.list\")\n", + "cosmos_dict = dict(hdf5_groupname=\"photometry\", output=\"bpz_results_COSMOS_SEDs.hdf5\",\n", + " spectra_file=sedfile, bands=lsst_bands, err_bands=lsst_errs,\n", + " filter_list=lsst_filts, prior_band=\"mag_i_lsst\", no_prior=False)\n", + "run_newseds = BPZliteEstimator.make_stage(name=\"bpz_newseds\", model=cosmosprior, **cosmos_dict)" + ] + }, + { + "cell_type": "markdown", + "id": "3d656458-9cca-4ac3-84c1-131f79162c25", + "metadata": {}, + "source": [ + "Let's run the estimate stage, it should take almost four times as long, as we are now using 31 templates rather than 8, and the code should scale linearly in runtime with the number or SEDs. However, if this is the first timet that you are running the code, additional time will be taken in generating the \"AB\" files used to store the model fluxes, and so the relative runtimes may be skewed by that extra, one time computation (the AB files are computed once the first time that they are used and stored for future runs)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "05c54049-ecd6-481f-9bb0-5aeed3545d65", + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "newsed_result = run_newseds.estimate(test_data)" + ] + }, + { + "cell_type": "markdown", + "id": "b0976cd3-469c-4dd9-9351-f93c50325809", + "metadata": {}, + "source": [ + "A run on a single processor on a Mac took 15.7 seconds for me, relative to 5.9 seconds for the CWWSB template set (not counting time used generating the AB files), so not far off of the rough factor of four predicted.\n", + "\n", + "Let's plot the point estimate vs true redshift for a quick comparison to the CWWSB plot above:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d8f81add-04f5-4640-88d3-712bc089658b", + "metadata": {}, + "outputs": [], + "source": [ + "plt.figure(figsize=(8,8))\n", + "plt.scatter(sz, newsed_result().ancil['zmode'].flatten(), s=2, c='k', label='default prior mode')\n", + "plt.plot([0,3], [0,3], 'b--')\n", + "plt.xlabel(\"redshift\")\n", + "plt.ylabel(\"Polletta/BC03 templates photo-z mode\")" + ] + }, + { + "cell_type": "markdown", + "id": "54a2295d-f467-4e7c-ab24-18eb222ad367", + "metadata": {}, + "source": [ + "We see some differences in scatter, bias, and outliers, as we would expect for different SEDs, as we are comparing our photometry to a different set of synthetic magnitudes. We will compare more quantitative stats later in the notebook." + ] + }, + { + "cell_type": "markdown", + "id": "73af7df4-01be-4f66-90ea-aba31d4cca69", + "metadata": {}, "source": [ "## BPZliteInformer: training a custom prior\n", "\n", @@ -620,6 +707,7 @@ "outputs": [], "source": [ "hdfn_sigma_eval = PointSigmaIQR(default_result.ancil['zmode'].flatten(), sz)\n", + "newsed_sigma_eval = PointSigmaIQR(newsed_result().ancil['zmode'].flatten(), sz)\n", "rerun_sigma_eval = PointSigmaIQR(rerun_res.ancil['zmode'].flatten(), sz)" ] }, @@ -631,6 +719,7 @@ "outputs": [], "source": [ "hdfn_sigma = hdfn_sigma_eval.evaluate()\n", + "newsed_sigma = newsed_sigma_eval.evaluate()\n", "rerun_sigma = rerun_sigma_eval.evaluate()" ] }, @@ -641,7 +730,7 @@ "metadata": {}, "outputs": [], "source": [ - "print(\"hdfn sigma: %.4f \\ncustom prior sigma: %.4f\" % (hdfn_sigma, rerun_sigma))" + "print(f\"hdfn sigma: {hdfn_sigma:.4f} \\nnewsed sigma: {newsed_sigma:.4f}\\ncustom prior sigma: {rerun_sigma:.4f}\")" ] }, { @@ -652,6 +741,7 @@ "outputs": [], "source": [ "hdfn_bias_eval = PointBias(default_result.ancil['zmode'].flatten(), sz)\n", + "newsed_bias_eval = PointBias(newsed_result().ancil['zmode'].flatten(), sz)\n", "rerun_bias_eval = PointBias(rerun_res.ancil['zmode'].flatten(), sz)" ] }, @@ -663,8 +753,9 @@ "outputs": [], "source": [ "hdfn_bias = hdfn_bias_eval.evaluate()\n", + "newsed_bias = newsed_bias_eval.evaluate()\n", "rerun_bias = rerun_bias_eval.evaluate()\n", - "print(\"hdfn bias: %.4f \\ncustom prior bias: %.4f\" % (hdfn_bias, rerun_bias))" + "print(f\"hdfn bias: {hdfn_bias:.4f}\\nnewsed bias: {newsed_bias:.4f}\\ncustom prior bias: {rerun_bias:.4f}\")" ] }, { @@ -672,7 +763,7 @@ "id": "c2d16539-73da-4e19-9965-c7ea468b08de", "metadata": {}, "source": [ - "We see very minor reductions, and overall similar behavior. Again, the prior should not affect high S/N observations very much. From our plot it looks like the outlier fraction may be the metric most affected by the prior, let's check this:" + "We see very minor reductions, and overall similar behavior, between the HDFN and custom prior, but a much smaller scatter in the larger COSMOS SED templates set. These SEDs are slightly newer than the CWWSB, and we are comparing against 31 rather than 8, which allows for better matches to photometry in some cases, so this is not unexpected, and in some sense we are trading extra compute-time for a reduction in scatter by increasing the number of SED templates. Comparing the HDFN prior to the custom prior, we note that the prior should not affect high S/N observations very much, and as we are in a fairly high S/N regime it is appropriate that we do not see much difference. From our plot it looks like the outlier fraction may be the metric most affected by the prior, let's check this:" ] }, { @@ -683,6 +774,7 @@ "outputs": [], "source": [ "hdfn_outlier_eval = PointOutlierRate(default_result.ancil['zmode'].flatten(), sz)\n", + "newsed_outlier_eval = PointOutlierRate(newsed_result().ancil['zmode'].flatten(), sz)\n", "rerun_outlier_eval = PointOutlierRate(rerun_res.ancil['zmode'].flatten(), sz)" ] }, @@ -694,8 +786,9 @@ "outputs": [], "source": [ "hdfn_outlier = hdfn_outlier_eval.evaluate()\n", + "newsed_outlier = newsed_outlier_eval.evaluate()\n", "rerun_outlier = rerun_outlier_eval.evaluate()\n", - "print(\"hdfn outlier rate: %.4f \\ncustom prior outlier rate: %.4f\" % (hdfn_outlier, rerun_outlier))" + "print(f\"hdfn outlier rate: {hdfn_outlier:.4f}\\nnewsed outlier rate: {newsed_outlier:.4f}\\ncustom prior outlier rate: {rerun_outlier:.4f}\")" ] }, { @@ -703,7 +796,7 @@ "id": "771fda8f-f89b-4d98-aedd-cfcde34389f4", "metadata": {}, "source": [ - "Not a dramatic effect, but a definite reduction in the number of outliers. This outlier rate is defined in terms of PointSigmaIQR, and thus varies depending on said sigma, and is thus harder to directly compare. For a direct comparison, let's compute the fraction of galaxies that have a delta(zmode - specz) larger than 0.15*(1+z), i.e. those with abs(zmode - specz) / (1 + specz) > 0.15:" + "Not a dramatic effect, but a definite reduction in the number of outliers between the HDFN and custom prior, and a slight increase in outliers for the COSMOS SEDs. This outlier rate is defined in terms of PointSigmaIQR, and thus varies depending on said sigma, and is thus harder to directly compare. For a direct comparison, let's compute the fraction of galaxies that have a delta(zmode - specz) larger than 0.15*(1+z), i.e. those with abs(zmode - specz) / (1 + specz) > 0.15:" ] }, { @@ -715,12 +808,15 @@ "source": [ "from rail.evaluation.metrics.pointestimates import PointStatsEz\n", "hdfn_ez_eval = PointStatsEz(default_result.ancil['zmode'].flatten(), sz)\n", + "newsed_ez_eval = PointStatsEz(newsed_result().ancil['zmode'].flatten(), sz)\n", "rerun_ez_eval = PointStatsEz(rerun_res.ancil['zmode'].flatten(), sz)\n", "hdfn_ez = hdfn_ez_eval.evaluate()\n", + "newsed_ez = newsed_ez_eval.evaluate()\n", "rerun_ez = rerun_ez_eval.evaluate()\n", "hdfn_outlier_frac = (np.sum((np.abs(hdfn_ez) > 0.15))) / len(sz)\n", + "newsed_outlier_frac = (np.sum((np.abs(newsed_ez) > 0.15))) / len(sz)\n", "rerun_outlier_frac = (np.sum((np.abs(rerun_ez) > 0.15))) / len(sz)\n", - "print(\"HDFN catastrophic outlier frac is: %.4f\\ncustom prior catastrophic oulier frac is: %.4f\" % (hdfn_outlier_frac, rerun_outlier_frac))" + "print(f\"HDFN catastrophic outlier frac is: {hdfn_outlier_frac:.4f}\\nnew sed outlier frac: {newsed_outlier_frac:.4f}\\ncustom prior catastrophic oulier frac is: {rerun_outlier_frac:.4f}\")" ] }, { @@ -728,8 +824,19 @@ "id": "9cc3e1a9-6431-4d14-b4a8-3b88f6259204", "metadata": {}, "source": [ - "So, our custom prior has some effect on results, but it does not dominate. That is a good thing, as again, we do not want our prior to dominate photo-z calculations for high signal-to-noise data. Also, in all cases above we are using the same template set, and the template set used is also part of the implicit prior of the code, and can have a much larger effect on the results: our chi^2 values, and thus likelihoods for each galaxy at each redshift, are measured relative to the fluxes predicted for the templates. The combination of the templates and prior, and optimization of both will influence resultant photo-z performance. However, optimization of SED template sets is beyond the scope of this simple demo notebook." + "We see that the COSMOS SED has a smaller fraction of absolute outliers now that we have removed the dependence on the reduced sigma, so overall it just performs better, but at the expense of a longer runtime. The template set used is also part of the implicit prior of the code, and can have a much larger effect on the results: our chi^2 values, and thus likelihoods for each galaxy at each redshift, are measured relative to the fluxes predicted for the templates. The combination of the templates and prior, and optimization of both will influence resultant photo-z performance. However, optimization of SED template sets is beyond the scope of this simple demo notebook.\n", + "\n", + "\n", + "In terms of the prior and comparing the HDFN prior, our custom prior has some effect on results, but it does not dominate. That is a good thing, as again, we do not want our prior to dominate photo-z calculations for high signal-to-noise data. " ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "61d3e68d-c6be-4b66-9339-b9cab493a602", + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": {