Lesson10 - Difference between the OLS results and sns.lmplot graph #745

Open
wz2000xx opened this issue Nov 20, 2023 · 7 comments

@wz2000xx

Hi,

In lesson 10, we fitted the relationship between child mortality, children per woman, and income group, and got the following results.

Screenshot 2023-11-19 at 19 29 14

In my understanding, the intercept for the Lower-middle income group, for example, should be -22.8577 + 15.7559 = -7.1018. However, the graph created using sns.lmplot shows an intercept of around 6 for that income group, as shown below.

Screenshot 2023-11-19 at 19 32 06

I don't understand why this is the case. This problem also exists for other income groups and categorical variables, such as species in the iris dataset.

Thank you very much!

@RayNele

RayNele commented Nov 22, 2023

pretty sure the coefficient column in the table is the slope, not the intercept.

it's only the intercept for the row labelled intercept.

source

@wz2000xx
Author

> pretty sure the coefficient column in the table is the slope, not the intercept.
>
> it's only the intercept for the row labelled intercept.
>
> source

Yes, but if I understand this correctly, the calculation for the child mortality would be:

child mortality = 16.3017*(children per woman) + 22.2875*(income group - low) + 15.7559*(income group - lower middle) - 5.337*(income group - upper middle) - 22.8577

and if we are calculating for the Lower-middle income group, for example, then the calculation above becomes:

child mortality = 16.3017*(children per woman) + 22.2875*(0) + 15.7559*(1) - 5.337*(0) - 22.8577
child mortality = 16.3017*(children per woman) + 15.7559 - 22.8577
child mortality = 16.3017*(children per woman) - 7.1018

and the intercept becomes -7.1018, which is not what the plot shows. Also, since the slope would be 16.3017 for every group, why do the lines in the plot not have the same slope?
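
A note on the arithmetic above: the following is a minimal sketch that recomputes the per-group intercepts from the coefficients quoted in this thread. It assumes the model is additive (one shared slope plus a 0/1 indicator per income group) and that "High" is the reference group with no coefficient of its own; neither is stated explicitly in the thread.

```python
# Coefficients as quoted in this thread (statsmodels OLS summary).
# Assumption: "High" is the reference (baseline) income group, so its
# offset is zero.
intercept = -22.8577
slope = 16.3017  # coefficient on children_per_woman
group_offsets = {
    "High": 0.0,
    "Low": 22.2875,
    "Lower middle": 15.7559,
    "Upper middle": -5.337,
}

# In an additive model each group's line is:
#   child_mortality = (intercept + offset) + slope * children_per_woman
group_intercepts = {g: intercept + off for g, off in group_offsets.items()}

for group, b0 in group_intercepts.items():
    print(f"{group:13s} child_mortality = {b0:8.4f} + {slope:.4f} * children_per_woman")
```

Under these assumptions every group shares the slope 16.3017 and only the intercept shifts, so the lines implied by this model are parallel by construction.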

@RayNele

RayNele commented Nov 23, 2023

Before I answer, I'm going to push back on + 22.2875*(income group - low): it doesn't make sense to me, because income group is a categorical variable and low is one of its categories, so 'income group - low' is not a continuous variable to which you can assign numbers such as 0. That is, what does income group - low = 0 mean to you?

I'll also add the disclaimer that stats is way out of my field, but this is my understanding:

child_mortality = (lower-middle coef)(children_per_woman) - (intercept)?(children_per_woman_coef)
child_mortality(lower-middle) = 15.75(children_per_woman) - ?

I put a question mark there because I'll admit that I don't know exactly how this intercept value is calculated. So, to show you that at least the claim I made about the slope is true, I plotted two lines on the lmplot FacetGrid using the map_dataframe function

fig.map_dataframe(lambda data, **kws: plt.axline((0, 0), slope=15.75,  color='green'))
fig.map_dataframe(lambda data, **kws: plt.axline((0, 0), slope=22.28, color='blue'))

which you will be able to see below:
image

You can see that the lines I plotted, blue for low and green for lower-middle, are both parallel to their respective lines.

Again, I'll admit I am unsure; this is what I came up with from my existing understanding, and I'm happy to hear any corrections. As for the intercept, I found evidence that some estimation may be going on to work out the intercept value from those two numbers, but I'm not sure enough to make claims about that.

@wz2000xx
Author

Hi, thank you for your explanation.

However, I see it differently. I understand that "income group - low" is a categorical variable and is not continuous. What I mean is that if the sample comes from the "low" group, the "income group - low" variable is 1, and otherwise it is 0. So the term "22.2875*(income group - low)" is 22.2875 for samples in the "low" group and zero for all others.

I said that the values are calculated as

child mortality = 16.3017*(children per woman) + 22.2875*(income group - low) + 15.7559*(income group - lower middle) - 5.337*(income group - upper middle) - 22.8577

with rather high certainty because I re-calculated the fitted values in the above way and got identical results. I did that by first fitting the model and retrieving the parameters:

import pandas as pd
import statsmodels.formula.api as smf

resuls = smf.ols('child_mortality ~ children_per_woman + income_group', world_data_2014).fit()
parameters = resuls.params

Then I subset the dataset to include only the data from the lower-middle group, calculated the values using the equation I assumed above, and compared them to the fitted values of the OLS results:

W2014_subset = world_data_2014[world_data_2014.income_group == "Lower middle"]
LW_child_mor = W2014_subset["children_per_woman"]*parameters[4]+parameters[2]*1+parameters[0]
LW_fitted = resuls.fittedvalues[world_data_2014.income_group == "Lower middle"]

compare_table = pd.DataFrame({'My_cal': LW_child_mor, 'Fitted Values': LW_fitted})
compare_table.head(10)

and got the following results:

image

Same for the low group:

W2014_subset2 = world_data_2014[world_data_2014.income_group == "Low"]
LW_child_mor2 = W2014_subset2["children_per_woman"]*parameters[4]+parameters[1]*1+parameters[0]
LW_fitted2 = resuls.fittedvalues[world_data_2014.income_group == "Low"]
compare_table = pd.DataFrame({'My_cal': LW_child_mor2, 'Fitted Values': LW_fitted2})
compare_table.head(10)

and got the following results:

image

Therefore, I think my assumed calculation equation is correct. And if that's the case, I don't understand why the graph plotted using sns.lmplot doesn't match the slope and intercept in the OLS results. Maybe it's because they are using different underlying fitting mechanisms?
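
On the question of different fitting mechanisms: the following is a minimal sketch on made-up data, not the lesson's dataset. It compares (1) a separate simple regression for each group, which is what seaborn's lmplot draws for each hue level, with (2) an additive model of the form `y ~ x + group`, which has one shared slope. It assumes the plot in question was drawn with `hue="income_group"`; the thread does not show the plotting call. The closed form used for the shared slope is the standard pooled within-group estimate for a dummy-variable model, written out by hand here rather than taken from the lesson.

```python
from statistics import mean

def simple_ols(xs, ys):
    """Least-squares (intercept, slope) for y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

# Synthetic data: two groups whose true slopes differ (2 and 5).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
groups = {
    "A": (xs, [1.0 + 2.0 * x for x in xs]),
    "B": (xs, [3.0 + 5.0 * x for x in xs]),
}

# (1) Separate fit per group: each group gets its own slope and intercept.
separate = {g: simple_ols(x, y) for g, (x, y) in groups.items()}

# (2) Additive model y ~ x + group: one shared slope, one intercept per group.
# The shared slope pools the within-group variation of every group.
sxx = sxy = 0.0
for x, y in groups.values():
    mx, my = mean(x), mean(y)
    sxx += sum((xi - mx) ** 2 for xi in x)
    sxy += sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
shared_slope = sxy / sxx
additive = {
    g: (mean(y) - shared_slope * mean(x), shared_slope)
    for g, (x, y) in groups.items()
}

print("separate fits:", separate)
print("additive fit: ", additive)
```

In this toy case the separate fits recover slopes 2 and 5, while the additive model settles on a single compromise slope and adjusts each intercept to compensate. That pattern would be consistent with the mismatch described above, where both the slopes and the intercepts in the plot differ from the OLS table.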

Lastly, I used the markup function on my MacBook to draw lines over the original blue and green lines on the graph plotted using sns.lmplot, then moved them to line up with the lines you made. I don't think they are really parallel, as shown below:

original plot:
image

green line:
image

blue line:
image

@RayNele

RayNele commented Nov 23, 2023 via email

@RayNele

RayNele commented Nov 23, 2023

Ya unfortunately there's no way to access the underlying stats.

However, your suggested formula still doesn't really make sense to me since as you mentioned, it would result in all four lines having the same slope. I'm not saying you're wrong, since you've checked your work and it seems correct, but I just can't understand how the mortality-child# relationship could be the same across the four conditions.
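
On the shared slope: a formula of the form `y ~ x + group` imposes one slope on all groups by construction. A model with interaction terms lets each group have its own slope; in statsmodels formula syntax that might be written as `child_mortality ~ children_per_woman * income_group`, although nobody in this thread tried it. The sketch below uses hypothetical per-group lines (not values from the lesson) to show how such a model's coefficients would encode them: offsets and slope differences relative to a reference group. The coefficient names are illustrative only.

```python
# Hypothetical per-group lines as (intercept, slope); not from the lesson data.
per_group = {
    "High": (2.0, 4.0),
    "Low": (10.0, 20.0),
    "Lower middle": (6.0, 15.0),
}
reference = "High"  # assumed baseline group
a_ref, b_ref = per_group[reference]

# Coefficients in the style an interaction model would report them.
coefs = {"Intercept": a_ref, "x": b_ref}
for g, (a, b) in per_group.items():
    if g != reference:
        coefs[f"group[{g}]"] = a - a_ref      # shift in intercept
        coefs[f"x:group[{g}]"] = b - b_ref    # shift in slope

def predict(group, x):
    """Prediction from the interaction-style coefficients."""
    y = coefs["Intercept"] + coefs["x"] * x
    if group != reference:
        y += coefs[f"group[{group}]"] + coefs[f"x:group[{group}]"] * x
    return y

print(coefs)
print(predict("Low", 3.0))
```

Read this way, a group's slope is the base slope plus its interaction coefficient, and its intercept is the base intercept plus its group offset.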

@wz2000xx
Author

Hi, sorry I forgot to reply, I was reading it on my phone on my way to a lecture....Sorry!

Yes, I understand what you said. I will try to figure it out and do more research. Thank you anyway and have a great night!
