search.json

[
  {
    "objectID": "posts/make-smart-choices/index.html",
    "href": "posts/make-smart-choices/index.html",
    "title": "Make Smart Choices. Use Multilevel Models.",
    "section": "",
    "text": "Two of the more exciting areas of growth in philosophy of science are large-scale automatic text analysis studies and cultural evolutionary models of scientific communities.\nText analysis studies provide really useful ways of summarizing trends both in philosophy of science and the scientific literature. When we do intellectual history with classical methods, we tend to focus on the heavy hitting, widely cited papers. This makes good sense because people can only read so many papers in their short lives. So focus on the big ones. But most papers are not widely cited. They are just normal little contributions. These mundane papers get less attention in intellectual history. They tend to be boring. Understanding what ordinary work looks like is just as important for understanding a period as understanding the revolutionary papers. Machines don’t get bored of reading mundane papers. Hence the appeal of automated text analysis.\nCultural evolutionary models imagine science is shaped by a process akin to natural selection. Some methods spread through the population of scientists. Others die out. If we could figure out the mechanisms that reward certain kinds of work, then selection-based models could provide some understanding of what shapes scientific practice in the long run.\nA natural thought is that these two methods could be brought together. Text analysis provides the data to test, fit, or otherwise evaluate the cultural evolution models. So far, no one has been able to pull this off in a compelling manner. Rafael Ventura has an exciting new pre-print that is a good step in this direction. Ventura is looking at how formal methods have spread over time inside philosophy of science. The goal is to see whether there is any selection for model-based papers over other papers. In other words, is the incentive structure one that rewards modeling throughout philosophy of science?\nHis data collection has two steps. First, he organizes the published philosophy of science literature into 16 topics. The topics are based on co-citation networks. The intuitive picture (I’m no expert in bibliometric techniques so this is loose!) was that two papers that tend to cite the same references will tend to be in the same topic. If two papers cite a bunch of non-overlapping references, they will tend to be in different topics. Second, he classified papers by whether they used formal methods or not and tracked how the frequency of formal methods changes over time in the 16 topics. Full details on the data collection process are found here: http://philsci-archive.pitt.edu/20885/ .\nThe bit that interests me is that choice to put papers into discrete topics based on co-citation. The 16 topics are not, of course, really entirely isolated research programs. Papers across the topics will cite some of the same works, experience some of the same incentive structures, and cultural shifts in one area of philosophy of science tend to diffuse into other areas. So the choice of 16 topics is somewhat artificial. This isn’t to say it’s a bad choice - there will definitely be a lot of cultural selection concentrated inside each topic. But the choice does introduce some limitations on the results. It would be nice if we could find a way around that, a way of quantifying the causal influence between research communities as well as within them. That’s what this post is all about.\nThe past few months, I’ve been playing around with Bayesian statistical packages in Python. It is mostly just for curiosity’s sake. But now when I read papers with data analysis and their underlying data is public, it can be a lot of fun to reanalyze the results. That’s what I’ve done here. Specifically, I’m using a technique from Bayesian statistics called multi-level modeling.\nThe headline is that at least one of the major findings from Ventura’s paper reverses when the same data is reanalyzed with multi-level modeling. He suggests that there is no selection for formal models across all of philosophy of science. But there might be selection for modeling within particular subdisciplines. I find a stronger pattern of selection for formal models at both levels - the whole of philosophy of science and many of the local clusters."
  },
  {
    "objectID": "posts/make-smart-choices/index.html#prior-simulation",
    "href": "posts/make-smart-choices/index.html#prior-simulation",
    "title": "Make Smart Choices. Use Multilevel Models.",
    "section": "Prior simulation",
    "text": "Prior simulation\nWhen working with Bayesian methods, we have to select priors. I used a strategy known as weakly regularizing priors. The goal is to pick out priors that penalize really extreme effects but still allow for enough uncertainty that the data can drive the ultimate analysis. To calibrate the priors before including the data, I sample from my prior distributions and plots 50 lines at a time to give me a sense of what plausible relationships might look like.\nBelow is the priors I ended up using. For the slopes, I assumed a normal distribution centered around 0 with a standard deviation of 0.3. For the intercepts, I assumed a normal distribution centered around 0 with a standard deviation of 3.\n\n# prior simulation\n\nx_prior = np.arange(21)\n\nfor i in range(50):\n    \n    # sample a value from the priors\n\n    a = stats.norm(0,3).rvs()\n    b = stats.norm(0,0.3).rvs()\n    \n    # plug into the inverse logit function\n\n    p = np.exp(a + b*x_prior) / (1 + np.exp(a + b*x_prior))\n\n    plt.plot(p)\n\n\n\n\nLater, we’ll find that my results depart from those found by Ventura in a couple of places. I suspect that is partly due to the influence of the priors. So I want to spend a bit longer justifying mine.\nMany people are skeptical of the use of priors in statistics. Isn’t it cheating to build assumptions into the model, rather than letting the data do the work? The trouble with prior skepticism is that all analyzes use priors, it’s just that Bayesian analysis uses them explicitly. Other modeling techniques will often tacitly assume flat priors on the possible intercept and slopes. Let me show you what the models look like with much flatter priors.\n\n# prior simulation\n\nx_prior = np.arange(21)\n\nfor i in range(50):\n\n    a = stats.norm(0,5).rvs()\n    b = stats.norm(0,2).rvs()\n\n    p = np.exp(a + b*x_prior) / (1 + np.exp(a + b*x_prior))\n\n    plt.plot(p)\n\n\n\n\nHere I expanded the standard of deviation around the intercept and the slope. The irony is that increasing uncertainty at the level of the priors can decrease uncertainty at the level of predictions. Most predicted models here assume really sharp slopes and implausibly faster growth rates for formal methods in philosophy. Most of these lines shoot from the top to the bottom in the span of 5 or 6 years, suggesting that philosophy could have made a complete revolution in methodology. These predictions are implausible. (A careful explanation of why non-informative priors can be problematic can be found in Richard McElreath’s statistical rethinking book https://xcelab.net/rm/statistical-rethinking/). Hence my preference for the weakly informative priors described above."
  },
  {
    "objectID": "posts/make-smart-choices/index.html#sampling",
    "href": "posts/make-smart-choices/index.html#sampling",
    "title": "Make Smart Choices. Use Multilevel Models.",
    "section": "Sampling",
    "text": "Sampling\nThis code fits 16 logistic regressions, one for each topic.\n\nwith pm.Model() as no_pool:\n    \n    # priors\n    \n    a = pm.Normal('a',0,3,shape=16)\n    b = pm.Normal('b',0,0.3,shape=16)\n    \n    # link function\n\n    p = pm.invlogit(a + b*x)\n    \n    # outcome distribution\n\n    y = pm.Binomial('y',p=p,n=n,observed=k)\n    \n    # sampler\n    \n    trace_no_pool = pm.sample(progressbar=False);\n\n\ntrace_no_pool = az.from_json(\"trace_no_pool\")\naz.plot_trace(trace_no_pool);\n\nC:\\Users\\dsaun\\anaconda3\\envs\\pymc_env\\Lib\\site-packages\\arviz\\utils.py:184: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.\n  numba_fn = numba.jit(**self.kwargs)(self.function)\n\n\n\n\n\nThese plots provide a check on whether the computational estimation strategy was effective. There is little to discuss here, except the good sign that nothing blew up during sampling.\n\naz.plot_forest(trace_no_pool,var_names='b',combined=True,hdi_prob=0.95,quartiles=False);\n\n\n\n\nI’m just listing the slope parameters on each regression (ignoring the intercepts for now). Dots represent the highest posterior point estimate for each slope or, simply, the slope of best fit. The lines represent uncertainty - 95% of the posterior distribution fits within the line. This is like a 95% confidence interval in classic statistics.\nThe general pattern is that most topics experience a modest growth trend for formal methods. A few don’t: Confirmation (2), Metaphysics (8), Relativity (10), and Realism (12). (The numbers that index topics in Ventura’s paper and my analysis will be slightly different. I count up from 0 and go through 15. He counts from 1 and goes to 16. I do this just because the Bayesian fitting software uses this convention. So if you are switching between papers, just keep this in mind.)"
  },
  {
    "objectID": "posts/make-smart-choices/index.html#posterior-predictive-checking",
    "href": "posts/make-smart-choices/index.html#posterior-predictive-checking",
    "title": "Make Smart Choices. Use Multilevel Models.",
    "section": "Posterior predictive checking",
    "text": "Posterior predictive checking\nThese numbers are hard to interpret just as numbers. Let’s extract predictions from what the estimated models and compare them against the observed data.\n\npost_pred_no_pool = pm.sample_posterior_predictive(trace_no_pool,model=no_pool,progressbar=False)\n\n\npost_pred_no_pool = az.from_json(\"post_pred_no_pool\")\nf,ax = plt.subplots(4,4,sharex=True,sharey=True,figsize=(8,6))\n\nfor k in range(50):\n    predictions = np.array(post_pred_no_pool.posterior_predictive.y[0][k].T)\n    ceiling = n.T\n    \n    for i in range(4):\n        for j in range(4):\n            \n            proportions = predictions[i*4+j] / ceiling[i*4+j]\n            ax[i][j].plot(proportions,'o',alpha=0.2,color='tab:orange')\n            \nfor i in range(4):\n    for j in range(4):\n        ax[i][j].plot(proportion_per_year.iloc[i*4+j].values,'-',markersize=2)\n\n\n\n\nThis plot displays the predicted proportion of formal papers in each year. Uncertainty is represented by the spread of the dots and their opacity.\nOne good check is whether the model captures a few plausible intuitive stories. For example, topic 11 is on row 3, column 4. It experiences huge growth in formal methods. This is decision theory. So it makes good sense that nearly half of papers published in this area are in the business of building models. (The other half are likely papers reflecting on the methodology or concepts of decision theory.)\nOne thing to notice is that uncertainty at the beginning of the period tends to be pretty large. This makes sense because we have very little data for the beginning of each period. Some of these topics only have a handful of papers published in them in the year 2000. So it’s like the sample size is very small for the beginning. But it tends to grow in the later years so the estimate clamps down.\nThe high beginning uncertainty is also a reason why splitting the community into 16 topics might introduce some artifacts into the statistical analysis. In the year 2000, each subtopic was not as clearly distinguished as they were in, say, 2010. The common narrative is that philosophy of science sorta splintered into many sub-specialties at the end of the very end of the 20th century and then these subdisciplines consolidated during the early 2000s. So we should also expect more overlap in causal selective forces for the early years, something not reflected in this analysis. Instead, we get big initial uncertainty because each subtopic is not well-established yet.\n\na_means = [np.array(trace_no_pool.posterior['a'][0][:,i].mean()) for i in range(16)]\nb_means = [np.array(trace_no_pool.posterior['b'][0][:,i].mean()) for i in range(16)]\na_means = np.array(a_means)\nb_means = np.array(b_means)\n\nf,ax = plt.subplots(4,4,sharex=True,sharey=True,figsize=(8,6))\n\nx_pred = np.arange(21)\n\nfor i in range(4):\n    for j in range(4):\n        a = a_means[i*4+j] \n        b = b_means[i*4+j]\n        p = np.exp(a + b*x_pred) / (1 + np.exp(a + b*x_pred))\n        ax[i][j].plot(p)\n        ax[i][j].plot(proportion_per_year.iloc[i*4+j].values,'o',markersize=3)\n\n\n\n\nA second way of visualizing the model is to strip away the uncertainty and plot the best performing logistic curves for each community. These are what the model thinks the likely trend is in each place."
  },
  {
    "objectID": "posts/make-smart-choices/index.html#prior-simulation-1",
    "href": "posts/make-smart-choices/index.html#prior-simulation-1",
    "title": "Make Smart Choices. Use Multilevel Models.",
    "section": "Prior simulation",
    "text": "Prior simulation\nAgain I conduct a prior simulation. This time is tricker though because I have priors representing the overall population effects and the those priors shape the subsequent topic effects. So we want to get a prior simulation that allows for a wide range of possible population-topic relationships plus the old goal of giving each topic a range of plausible effects.\n\n# prior simulation\n\nf,ax = plt.subplots(4,4,sharex=True,sharey=True,figsize=(8,6))\n\nhamu = stats.norm(0,2).rvs(16)\nhasig = stats.expon(scale=2).rvs(16)\nhbmu = stats.norm(0,0.2).rvs(16)\nhbsig = stats.expon(scale=0.7).rvs(16)\n\nx_prior = np.arange(21)\n\nfor i in range(4): \n    for j in range(4):\n        \n        amu = hamu[i*4+j]\n        asig = hasig[i*4+j]\n        bmu = hbmu[i*4+j]\n        bsig = hbsig[i*4+j]\n        \n        for l in range(30):\n\n            a = stats.norm(amu,asig).rvs()\n            b = stats.norm(bmu,bsig).rvs()\n\n            p = np.exp(a + b*x_prior) / (1 + np.exp(a + b*x_prior))\n\n            ax[i][j].plot(p)"
  },
  {
    "objectID": "posts/make-smart-choices/index.html#sampling-1",
    "href": "posts/make-smart-choices/index.html#sampling-1",
    "title": "Make Smart Choices. Use Multilevel Models.",
    "section": "Sampling",
    "text": "Sampling\n\n#n.T\n\nwith pm.Model() as partial_pool:\n    \n    # hyperparameters\n    \n    hamu = pm.Normal('hamu',0,2)\n    hasig = pm.Exponential('hasig',2)\n    hbmu = pm.Normal('hbmu',0,0.2)\n    hbsig = pm.Exponential('hbsig',0.7)\n    \n    # regular parameters\n\n    a = pm.Normal('a',hamu,hasig,shape=16)\n    b = pm.Normal('b',hbmu,hbsig,shape=16)\n    \n    # link function\n\n    p = pm.invlogit(a + b*x)\n    \n    # outcome distribution\n\n    y = pm.Binomial('y',p=p,n=n,observed=k)\n    \n    # sampler\n    \n    trace_partial_pool = pm.sample(progressbar=False);\n\nLet’s compare the slopes estimated with multi-level models with those estimated by 16 independent models.\n\ntrace_partial_pool = az.from_json(\"trace_partial_pool\")\naz.plot_forest([trace_partial_pool,trace_no_pool],var_names='b',combined=True,hdi_prob=0.95,quartiles=False);\n\n\n\n\nOne thing to notice is that the level of uncertainty has shrunk across the board. In basically every place, the blue line (multi-level / partial pooling) is shorter than the orange line (no pooling). This is because multi-level models can use more information to inform the estimate in each topic. Information is pooled across topics to form better expectations about each individual topic. It’s like we’ve increased our sample size but we never had to collect new data. We just had to use it more efficiently.\nA second thing to notice is that that several topics either switched from a negative to positive slope or they moved closer to positivity. This suggests that the multi-level model is detecting even stronger selection for formal methods than the previous analysis is. Why would this be? I suspect it largely depends on the interaction effect between intercepts and slopes. When the intercept starts really low, we are more likely to estimate a positive slope. When the intercept starts higher up, we are more likely to estimate a neutral or negative slope. The multi-level model now starts most intercepts fairly low. Here’s a comparison of the two estimates by intercept. The blue lines tend to be more starkly negative for topics 2, 8 and 12, allowing their slopes to head toward positivity.\nI plot the intercept comparisons below.\n\naz.plot_forest([trace_partial_pool,trace_no_pool],var_names='a',combined=True,hdi_prob=0.95,quartiles=False);"
  },
  {
    "objectID": "posts/make-smart-choices/index.html#posterior-predictive-checking-1",
    "href": "posts/make-smart-choices/index.html#posterior-predictive-checking-1",
    "title": "Make Smart Choices. Use Multilevel Models.",
    "section": "Posterior predictive checking",
    "text": "Posterior predictive checking\nLet’s look at the prediction plots now and see what changed once we introduced partial pooling.\n\npost_pred_partial_pool = pm.sample_posterior_predictive(trace_partial_pool,model=partial_pool,progressbar=False)\n\n\npost_pred_partial_pool = az.from_json(\"post_pred_partial_pool\")\nf,ax = plt.subplots(4,4,sharex=True,sharey=True,figsize=(8,6))\n\nfor k in range(50):\n    predictions = np.array(post_pred_partial_pool.posterior_predictive.y[0][k].T)\n    ceiling = np.transpose(n)\n    \n    for i in range(4):\n        for j in range(4):\n            \n            proportions = predictions[i*4+j] / ceiling[i*4+j]\n            ax[i][j].plot(proportions,'o',alpha=0.2,color='tab:orange')\n            \nfor i in range(4):\n    for j in range(4):\n        ax[i][j].plot(proportion_per_year.iloc[i*4+j].values,'-',markersize=2)\n\n\n\n\nOne thing to notice that there is less initial uncertainty in many of these plots. This highlights on the advantages of multi-level models - less uncertainty in the intercepts means more plausible estimates of the slope.\n\na_means_pp = [np.array(trace_partial_pool.posterior['a'][0][:,i].mean()) for i in range(16)]\nb_means_pp = [np.array(trace_partial_pool.posterior['b'][0][:,i].mean()) for i in range(16)]\na_means_pp = np.array(a_means_pp)\nb_means_pp = np.array(b_means_pp)\n\nf,ax = plt.subplots(4,4,sharex=True,sharey=True,figsize=(8,6))\n\nx_pred = np.arange(21)\n\nfor i in range(4):\n    for j in range(4):\n        a = a_means_pp[i*4+j] \n        b = b_means_pp[i*4+j]\n        p = np.exp(a + b*x_pred) / (1 + np.exp(a + b*x_pred))\n        ax[i][j].plot(p)\n        ax[i][j].plot(proportion_per_year.iloc[i*4+j].values,'o',markersize=3)\n\n\n\n\nThere are small divergences in the trend lines. I’ll zoom in one difference in the last section.\nFinally, we have the trend line for the entire population of studies. It is subtly but confidently positive, suggesting an overall tendency to select for formal methods in philosophy of science over the last several years. The observed trends are plotted and faded behind it.\n\na = np.array(trace_partial_pool.posterior['hamu'].mean((\"chain\",\"draw\")))\nb = np.array(trace_partial_pool.posterior['hbmu'].mean((\"chain\",\"draw\")))\n\np = np.exp(a + b*x_pred) / (1 + np.exp(a + b*x_pred))\n\nfor i in range(4):\n    for j in range(4):\n        plt.plot(proportion_per_year.iloc[i*4+j].values,'-',alpha=0.4)\n\nplt.plot(p,color=\"black\")\nplt.ylim([0,1]);\n\n\n\n\nVentura’s analysis noted that there was no overall selection for formal methods across philosophy of science. Interestingly, my analysis finds a small and very confident positive slope for the population-level effect. The population-level slope is 0.018 with the bottom of the credibility interval at 0.002 and the top at 0.034. So under classical statistics, this would be like a statistically significant effect.\nShould we prefer the multi-level trend estimate? I suspect so but the reason lies in how multi-level models handle each individual cluster. So I’ll turn to that next."
  },
  {
    "objectID": "posts/break-your-toys-and-put-them-back-together/index.html",
    "href": "posts/break-your-toys-and-put-them-back-together/index.html",
    "title": "Break your toys and glue them back together",
    "section": "",
    "text": "To me, the most interesting question in the social sciences is what statistical practice will look like in 20 years. Criticisms of null hypothesis significance testing has become mainstream. Even if p-values still dominant the publication landscape, many of other approaches to analysis are making it into major journals. At the same time, the software that lets people build more customizable, interesting models has been getting faster and more accessible. Sometimes it feels like our biggest challenge is that we don’t quite know what to do with all this power. As a community we need more examples of what another statistical practice might look like.\nI previously wrote about a case where I took a toy mechanistic model and fit it to the small dataset with 10 islands in Oceania. What’s cool about this sort of analysis is that you don’t need to substitute out your mechanistic model for a linear regression when performing data analysis. That sort of substitution can introduce all sorts of complications in reasoning from one model to another. The approach where you fit a mechanistic model is much clearer and, when the model inevitably fails, the failures take on a importance within the framework of the theory.\nThis post is a sequel. I’m going to continue working with that case study and show a small bit about how to iteratively tailor a toy mechanistic model to the messiness of the world."
  },
  {
    "objectID": "posts/break-your-toys-and-put-them-back-together/index.html#fitting",
    "href": "posts/break-your-toys-and-put-them-back-together/index.html#fitting",
    "title": "Break your toys and glue them back together",
    "section": "Fitting",
    "text": "Fitting\nWe’ll import the same data.\n\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport pymc as pm\nimport numpy as np\nimport arviz as az\nimport networkx as nx\nimport pytensor.tensor as pt\nfrom scipy import stats\n\nseed = 1235\nrng = np.random.default_rng(seed)\n\ntry:\n    dk = pd.read_csv(\"Kline\",sep=\";\")\nexcept FileNotFoundError:\n    dk = pd.read_csv(\"https://raw.githubusercontent.com/pymc-devs/pymc-resources/main/Rethinking_2/Data/Kline\",sep=\";\")\n\ntimes = [3000,3000,3000,3500,3000,3350,600,3350,2845,800]\ndk['initial_settlement (BP)'] = times\ndk\n\n\n\n\n\n\n\n\nculture\npopulation\ncontact\ntotal_tools\nmean_TU\ninitial_settlement (BP)\n\n\n\n\n0\nMalekula\n1100\nlow\n13\n3.2\n3000\n\n\n1\nTikopia\n1500\nlow\n22\n4.7\n3000\n\n\n2\nSanta Cruz\n3600\nlow\n24\n4.0\n3000\n\n\n3\nYap\n4791\nhigh\n43\n5.0\n3500\n\n\n4\nLau Fiji\n7400\nhigh\n33\n5.0\n3000\n\n\n5\nTrobriand\n8000\nhigh\n19\n4.0\n3350\n\n\n6\nChuuk\n9200\nhigh\n40\n3.8\n600\n\n\n7\nManus\n13000\nlow\n28\n6.6\n3350\n\n\n8\nTonga\n17500\nhigh\n55\n5.4\n2845\n\n\n9\nHawaii\n275000\nlow\n71\n6.6\n800\n\n\n\n\n\n\n\nRename our variables to T for time, N for population size, and tools for the total tools present in a society.\n\nT = dk['initial_settlement (BP)'].values\nN = dk.population.values\ntools = dk.total_tools.values\nlabel_id = dk.culture.values\n\nThe model can be expressed compactly with:\n\nwith pm.Model() as m0:\n    \n    a = pm.Gamma('a',mu=0.05,sigma=0.1)\n    b = pm.Gamma('b',mu=0.005,sigma=0.01)\n    \n    mu = T * (-a + b*(np.euler_gamma + np.log(N)))\n\n    Y = pm.Poisson(\"Y\", mu=mu,observed=tools)\n\nThe priors need some justication. First, the Gamma distributions have really small means. That is because the population and time values are quite large. If you push their values up a bit, you’d have explosive growth in technology. Second, I tuned the priors by looking at prior predictive simulations. I want the bulk of the prior predictions to be below 1000. Some of our islands only have about 1000 people. It is certainly possible that everyone on an island is an inventor but its unlikely. I simply took draws from the prior and kept adjusting the parameters until the growth rates were fairly stable and represented wide uncertainty.\n\npp = pm.draw(mu,draws=100)\npp\n\nshort_label = [i[:3] for i in label_id]\n\nfor val in pp[:,:]:\n    plt.plot(short_label,val,'o',alpha=0.3)\n\n\n\n\n\nwith m0:\n    trace0 = pm.sample(2000,target_accept=0.99,progressbar=False,idata_kwargs={\"log_likelihood\":True},random_seed=rng)\n\ntrace0.to_json(\"trace0\")\n\nWe face a bit of challenge in fitting the model - the parameters \\(a\\) and \\(b\\) are collinear. Each represents a certain kind of difficulty and they can counterbalance each other. If \\(a\\) is high, the skill is uniformly difficult to learn. But if \\(b\\) is also high, it makes up for that because it provides more opportunities for novel innovation. So there are several combinations of \\(a\\) and \\(b\\) parameters that provide the same accuracy in predicting the oceania dataset. The figure below illustrates how the MCMC algorithm explores the parameter space. It travels back and forth on this narrow band of roughly equivalent posterior probability.\n\ntrace0 = az.from_json(\"trace0\")\naz.plot_pair(trace0);\n\n\n\n\nCollinearity is a medium-sized problem. It will inflate the uncertainty on our parameters. If we really cared to pin down the value of \\(a\\), we’d need extra information to first pin down the value of \\(b\\), or visa-versa. Or, we should scrape this model and build a new one where there is only one difficulty parameter. In our case, I don’t particularly care what value each parameter takes on. I’m more interested in whether our model can explain the available data.\nBelow I plot the predictions the model makes about island societies, given the estimated parameter values. I’ve plotted 50 random predictions to help represent the uncertainty propogating through the model. The model performs pretty well on 6 of the island societies, suggesting it could be an candidate explanation for dynamics of technical development in those societies. However, in Hawaii, Manus, Trobriand and Chuuk, the model doesn’t have a good way to predict them. Hawaii and Chuuk both have more tools that they are supposed, given their development time and population size. Meanwhile, Manus and Trobriand have too few tools.\nAt this point, it doesn’t look like the demographic theory of technology is very plausible. The worry persists even if you dealt with the collinearity problem. If you pushed the growth rate up a little bit, you’d over shoot Trobriand and Manus even more. If you pushed the growth rate down a little, the problem of Hawaii and Chuuk gets worse.\n\npost_pred_m0 = pm.sample_posterior_predictive(model=m0,trace=trace0,progressbar=False)\npredictions = post_pred_m0.posterior_predictive['Y'].values.reshape((8000,10))\n\nfig, ax = plt.subplots(figsize=(9,5))\n\nfor index in range(len(label_id)):\n    ax.text(np.log(N)[index], tools[index], label_id[index], size=12)\n\nfor k in range(50):\n    i = np.random.randint(low=0,high=4000)\n    plt.plot(np.log(N),predictions[i],'o',alpha=0.2,color=\"tab:blue\",markersize=5)\n    \nplt.plot(np.log(N),tools,'o',color=\"black\")\n    \nplt.xlabel(\"Log population size\");\n\nSampling: [Y]\n\n\n\n\n\nWhen it comes to fitting toy mechanistic models we have to be careful. They break more easily than linear regression. Frankly, I think that’s a good thing. If your model fits every dataset, you cannot learn from it. However! If a toy model breaks for opaque reasons, then it is also hard to learn from. In our case, it’s not initially easy to tell whether the demographic theory of technology is bunk or whether there are some simple features of the Oceanic societies that we have to add before the model works well.\nThe world is a complicated place. Many of those complications are case-specific - there are unique features of the history and geography of oceania that influence the distribution of technology. Toy models are usually developed to capture some core general features that should recur across a range of cases. Out-of-the-box they usually don’t fit any cases well.\nThe way I envision a productive scientific workflow is this: first we fit the toy model. Then we slowly layer on extra features to tailor it to some specific domain. If the toy model actually does capture some core mechanism in the world, it should start to fit nicely after a bit of tailoring. If the fit of the model doesn’t improve too much once we add the extra features, we should think there is something wrong with the core theory."
  },
  {
    "objectID": "posts/break-your-toys-and-put-them-back-together/index.html#a-surprisingly-compact-expression",
    "href": "posts/break-your-toys-and-put-them-back-together/index.html#a-surprisingly-compact-expression",
    "title": "Break your toys and glue them back together",
    "section": "A surprisingly compact expression",
    "text": "A surprisingly compact expression\nFirst, we’ll expand the second term. That means multiply the original differential equation bit by \\(T_{f} - T_{s}\\)\n\\[ (-a + b*(\\gamma + \\ln(n_{s})))*T_{s} + (-a + b*(\\gamma + \\ln(n_{f})))*T_{f} - (-a + b*(\\gamma + \\ln(n_{f})))*T_{s} \\]\nNotice that the first and the last terms of the equations are almost the same - they have the core difference equation multiplied by \\(T_{s}\\). The only difference is that one uses the population of the settled society, \\(\\ln(n_{s})\\), and the other uses the population of the founding society, \\(\\ln(n_{f})\\). It will turn out we can exploit this fact to perform a lot of simplification. What is left will show that we only need to care about the ratio of population sizes to incorporate founder effects. It will also mean that we can leave the middle term completely untouched.\nLet’s distribute the time terms on the left and right sides of this expression.\n\\[ (-aT_{s}  + bT_{s}(\\gamma + \\ln(n_{s})))+ (-a + b(\\gamma + \\ln(n_{f})))T_{f} - (-aT_{s} + bT_{s}(\\gamma + \\ln(n_{f}))) \\]\nMultiply the third term by negative 1 to take care of the minus sign in front.\n\\[ -aT_{s}  + bT_{s}(\\gamma + \\ln(n_{s})) + (-a + b(\\gamma + \\ln(n_{f})))T_{f} + aT_{s} - bT_{s}(\\gamma + \\ln(n_{f})) \\]\nThe \\(-aT_{s}\\) term in the front cancels out the \\(aT_{s}\\) term toward the end.\n\\[  bT_{s}(\\gamma + \\ln(n_{s})) + (-a + b(\\gamma + \\ln(n_{f})))T_{f} - bT_{s}(\\gamma + \\ln(n_{f}))\\]\nDistribute the \\(bT_{s}\\) terms at the front and back.\n\\[  bT_{s}\\gamma + bT_{s}\\ln(n_{s}) + (-a + b*(\\gamma + \\ln(n_{f})))T_{f} - bT_{s}\\gamma - bT_{s}\\ln(n_{f})\\]\nThe \\(bT_{s}\\gamma\\)’s cancel.\n\\[  bT_{s}\\ln(n_{s}) + (-a + b*(\\gamma + \\ln(n_{f})))T_{f} - bT_{s}\\ln(n_{f})\\]\nRearrange the expression to move the \\(bT_{s}\\ln(n_{f})\\) term to the front.\n\\[  bT_{s}\\ln(n_{s}) - bT_{s}\\ln(n_{f}) + (-a + b(\\gamma + \\ln(n_{f})))T_{f} \\]\nFactor out \\(bT_{s}\\) from the front.\n\\[  bT_{s}(\\ln(n_{s}) - \\ln(n_{f})) + (-a + b(\\gamma + \\ln(n_{f})))T_{f} \\]\nLastly, use the fact that the difference between logarithms is just the logarithm of the quotient.\n\\[  bT_{s}(\\ln(n_{s} / n_{f})) + (-a + b(\\gamma + \\ln(n_{f})))T_{f} \\]\nWe’re done. We’ve arrived at what I call the founder model. The second half of the expression is the familiar expression for the founder society - it tells us what happens if we just let the founder society do all the innovating. The first half of the expression adjusts the original model based on how much time the settled society has been around and a ratio of their population sizes. When the settled society is bigger, the adjustment factor is positive. When the settled society is smaller, the adjustment factor is negative."
  },
  {
    "objectID": "posts/break-your-toys-and-put-them-back-together/index.html#fitting-the-founder-model",
    "href": "posts/break-your-toys-and-put-them-back-together/index.html#fitting-the-founder-model",
    "title": "Break your toys and glue them back together",
    "section": "Fitting the founder model",
    "text": "Fitting the founder model\nAppreciably, this model assumes that each island immediately takes on its mature population size upon being settled. That’s silly. Population size is also a dynamic process. So some fairly important features of the world are left out. If you feel like that’s too big of an omission, I can understand. At the very least it makes salient exactly why this sort of analysis cannot succeed without substantially more data and thinking. I care a lot more about understanding the techniques and capacities of mathematical modeling than actually answering the question of whether the demographic theory is any good.\nTo program it up, it’s a bit tedious. We’ll need a unique equation for each island and lot of indexing. Regardless, we get the luxury of keeping just two free parameters.\n\nwith pm.Model() as m1:\n    # Specify prior distributions\n    a = pm.Gamma('a',mu=0.05,sigma=0.1)\n    b = pm.Gamma('b',mu=0.005,sigma=0.01)\n\n    y_yap = T[3] * (-a + b*(np.euler_gamma + np.log(N[3])))\n    y_manus = T[7] * (-a + b*(np.euler_gamma + np.log(N[7])))\n    y_trobriand = T[5] * (-a + b*(np.euler_gamma + np.log(N[5])))\n\n    # descendents of trobriand\n\n    y_santa_cruz = b*T[2]*pt.log(N[2] / N[5]) + y_trobriand\n    y_tikopia = b*T[1]*pt.log(N[1] / N[5]) + y_trobriand\n    y_malekula = b*T[0]*pt.log(N[0] / N[5]) + y_trobriand\n\n    # descendents of malekula\n\n    y_fiji = b*T[4]*pt.log(N[4] / N[0]) + y_malekula\n\n    # descendents of fiji\n\n    y_chuuk = b * T[6] * pt.log(N[6] / N[4]) + y_fiji\n    y_tonga = b * T[8] * pt.log(N[8] / N[4]) + y_fiji\n\n    # descendents of tonga\n\n    y_hawaii = b * T[9] * pt.log(N[9] / N[8]) + y_tonga\n\n    mu = pt.as_tensor([y_malekula,y_tikopia,y_santa_cruz,y_yap,y_fiji,y_trobriand,y_chuuk,y_manus,y_tonga,y_hawaii])\n\n    Y = pm.Poisson(\"Y\", mu=mu,observed=tools)\n\nDespite the strange functional specification, the model fits as easily as the first.\n\nwith m1:\n    trace1 = pm.sample(2000,target_accept=0.99,progressbar=False,idata_kwargs={\"log_likelihood\":True},random_seed=rng)\ntrace1.to_json('trace1')\n\nWe can now visualize the predictions. The model does a lot better explaining those islands that previously had too many tools. Chuuk is now squarely within the range of model uncertainties. Hawaii is closer. Manus and Trobriand are still a bit stubborn. This makes good sense - the structural changes we made meant that islands will, in general, have more tools than before. So if Trobriand and Manus previously had surprisingly few tools, we don’t have many new modeling tricks to explain that.\n\ntrace1 = az.from_json(\"trace1\")\npost_pred_m1 = pm.sample_posterior_predictive(model=m1,trace=trace1,progressbar=False)\npredictions = post_pred_m1.posterior_predictive['Y'].values.reshape((8000,10))\n\nfig, ax = plt.subplots(figsize=(8,5))\n\nfor index in range(len(label_id)):\n    ax.text(np.log(N)[index], tools[index], label_id[index], size=12)\n\nfor k in range(50):\n    i = np.random.randint(low=0,high=4000)\n    plt.plot(np.log(N),predictions[i],'o',alpha=0.2,color=\"tab:blue\",markersize=5)\n    \nplt.plot(np.log(N),tools,'o',color=\"black\")\n    \nplt.xlabel(\"Log population size\");\n\nSampling: [Y]\n\n\n\n\n\nOur visual intuition that the founder model is doing better is validated by model comparison statistics.\n\naz.compare({\"Standard model\":trace0,\n            \"Founder model\":trace1})\n\n\n\n\n\n\n\n\nrank\nelpd_loo\np_loo\nelpd_diff\nweight\nse\ndse\nwarning\nscale\n\n\n\n\nFounder model\n0\n-45.805932\n5.748673\n0.000000\n0.96632\n6.236006\n0.000000\nTrue\nlog\n\n\nStandard model\n1\n-115.594256\n20.337151\n69.788323\n0.03368\n34.748214\n32.368356\nTrue\nlog\n\n\n\n\n\n\n\nWhat I really like about this strategy of modeling is that we’ve improve the fit just by incorporating domain knowledge. Often, in statistical modeling, the techniques to improve fit involve making the model more flexible. With more adjustable parameters, a model will always fit better. But the improvement in fit doesn’t mean we are discovering the true mechanical structure behind the data. It often only means our model is more flexible. But here we didn’t increase the parametric flexibility. All we did was thought about the problem and did algebra. I think that’s an underappreciated but highly powerful technique.\nI hope my position has become clear - toy models are easy to break but we should break them. Only by breaking them do we get our most informative analyses. When we slowly try to put them back together, we can often come up with clever structural adjustments to the model that improve fit without introducing inappropriate sorts of flexibility."
  },
  {
    "objectID": "about.html",
    "href": "about.html",
    "title": "About",
    "section": "",
    "text": "I write about philosophy of science, cultural evolution and Bayesian statistics."
  },
  {
    "objectID": "index.html",
    "href": "index.html",
    "title": "imagination machine",
    "section": "",
    "text": "Break your toys and glue them back together\n\n\n\n\n\n\n\n\n\n\n\n\nSep 13, 2023\n\n\nDaniel Saunders\n\n\n\n\n\n\n  \n\n\n\n\nIf none of the above, then what?\n\n\n\n\n\n\n\n\n\n\n\n\nJul 19, 2023\n\n\nDaniel Saunders\n\n\n\n\n\n\n  \n\n\n\n\nMake Smart Choices. Use Multilevel Models.\n\n\n\n\n\n\n\n\n\n\n\n\nDec 22, 2022\n\n\nDaniel Saunders\n\n\n\n\n\n\nNo matching items"
  },
  {
    "objectID": "posts/if-none-of-the-above-then-what/index.html",
    "href": "posts/if-none-of-the-above-then-what/index.html",
    "title": "If none of the above, then what?",
    "section": "",
    "text": "A couple days ago, Richard McElreath had a lovely blog post on why the Bayes factor isn’t the cure to what ails us. No p-values or No confidence intervals, either. No metric of any kind can replace the need for clear model-first thinking as the justification for scientific claims. A common reaction to Richard’s post is to ask the obvious question, “if none of the above, then what?” The question underscores just how difficult it is to imagine another kind of science.\nA casual aquaintance with the p-value wars of the last decade leaves one with the impression that we are searching for some metric which we can stamp on a scientific paper to let everyone know we did a good job. If not the p-value, then surely something else: a cross-validation score, a low AIC score, a big effect size, a big likelihood ratio. Maybe a good paper is the one that has (N = 1,000,000,000) in the abstract! There is a pretense to these conversations: if we just tweak our metric correctly, we can also tweak the publication pipeline and stabilize science. Bayesians in Richard’s camp are asking a provocative question - what if the pretense is wrong? How could we justify a scientific claim in the absence of any of the metrics?\nIt turns out the answer is not mysticism. A humble cycle of model-building and fitting is usually all we need. The metrics listed above have their proper place. It’s just that they are rarely what is called for. A model-first approach makes it obvious when they are useful and how to deploy them. The philosophical arguments for this approach have been spelled out before (Gelman and Shalizi 2013). But it is still quite hard to imagine what it looks like from these abstract methodological descriptions. The rest of this post provides a compact and easy to follow example of the full model-first workflow. The promise is that, by the end, we won’t have needed a metric to assess our model and you won’t miss them. We’ll learn something important without ever reaching for them.\nWe are going to study a model of technological innovation. The central claim is that big populations innovate more than small populations. We’ll build up the model from basic intuitions and evaluate it against a historical dataset of island societies in Oceania. It will feel a bit too easy and that’s the point. Once we carefully model our problem, there is no need to argue about p-values, bayes factors, confidence intervals or the like."
  },
  {
    "objectID": "posts/if-none-of-the-above-then-what/index.html#footnotes",
    "href": "posts/if-none-of-the-above-then-what/index.html#footnotes",
    "title": "If none of the above, then what?",
    "section": "Footnotes",
    "text": "Footnotes\n\n\nFor many technologies, there will not be some continuous improvement that can be equated directly with a skill level. Instead, there is some further function that maps skillfulness onto discrete stages of improvement in a gadget or even the variety of gadgets a person can produce. Henrich sets this complexity aside to illustrate a general relationship that should hold approximately in the complex case and we’ll do the same I think this is a pretty big simplification. Simplifications like this serve as an opportunity for future work, fleshing out the theory and seeing what is plausible in light of empirical evidence. To be fair, people have challenged Henrich on this one (Vaesen et al. 2016).↩︎\nIf you are wondering whether we can reliably measure the complexity of a tool based on ethnographic reports written during the period of European colonization, you are not alone (Jerven 2011). But the part of this case study that is interesting for our purposes does not concern data quality. I will assume these measures are appropriate for the sake of the larger argument.↩︎\nWe don’t have to use a differential equation solver. We could just integrate the differential equation to get an analytical expression for the number of tools as function of \\(a,b,n,t\\). It would be faster and more accurate too. But I had already worked out the differential equation approach by the time I realized it was an easy integration problem so here’s what we get. Still nice to know the fitting differential equations ain’t too hard.↩︎"
  }
]