Lesson10 - Difference between the OLS results and sns.lmplot graph #745
Comments
Pretty sure the coefficient column in the table is the slope, not the intercept. It's only the intercept for the row labelled Intercept. |
Yes, but if I understand this correctly, the calculation for child mortality would be:
child mortality = 16.3017*(children per woman) + 22.2875*(income group - low) + 15.7559*(income group - lower middle) - 5.337*(income group - upper middle) - 22.8577
If we are calculating for the Lower-middle income group, for example, the calculation above becomes:
child mortality = 16.3017*(children per woman) + 22.2875*(0) + 15.7559*(1) - 5.337*(0) - 22.8577
so the intercept becomes -22.8577 + 15.7559 = -7.1018, which is not the case in the plot. Also, since the slope would be 16.3017 for every group, why do the lines in the plot not have the same slope? |
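To make the arithmetic above concrete, here is a minimal sketch that turns the coefficients quoted in this thread into one line per income group. It assumes "High" is the reference group absorbed into the intercept (the thread never says so explicitly), and the names `group_line` and `predict` are made up for illustration.

```python
# Coefficients as quoted in this thread (OLS table from the lesson).
# Assumption: "High" is the reference income group, so its offset is 0.
INTERCEPT = -22.8577
SLOPE = 16.3017  # coefficient on children_per_woman
GROUP_OFFSETS = {
    "High": 0.0,
    "Low": 22.2875,
    "Lower middle": 15.7559,
    "Upper middle": -5.337,
}


def group_line(group):
    """Return (intercept, slope) of the fitted line for one income group."""
    return INTERCEPT + GROUP_OFFSETS[group], SLOPE


def predict(children_per_woman, group):
    """Fitted child mortality under the additive (no interaction) model."""
    intercept, slope = group_line(group)
    return intercept + slope * children_per_woman


for name in GROUP_OFFSETS:
    intercept, slope = group_line(name)
    print(f"{name:13s} intercept={intercept:9.4f} slope={slope:.4f}")
```

Under this model each dummy coefficient only moves the intercept, so all four lines are parallel by construction.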
Hi, thank you for your explanation. However, I see it a bit differently. I understand that "income group - low" is a categorical variable and is not continuous. I am just saying that if the sample comes from the "low" group, then the "income group - low" variable would be 1, and otherwise 0. So the specific term "22.2875*(income group - low)" would be 22.2875 for samples in the "low" group, and zero otherwise. I said that the values are calculated as child mortality = 16.3017*(children per woman) + 22.2875*(income group - low) + 15.7559*(income group - lower middle) - 5.337*(income group - upper middle) - 22.8577 with rather high certainty because I re-calculated the fitted values this way and got identical results. I did that by first fitting the model and retrieving the parameters:
then subset the dataset to include only the data from the lower-middle group, calculated the values using the equation I assumed above, and compared them to the fitted values of the OLS results:
and got the following results: Same for the low group:
and got the following results: Therefore, I think my assumed calculation equation is correct. And if that's the case, I don't understand why the graph plotted using sns.lmplot doesn't match the slope and intercept in the OLS results. Maybe it's because they use different underlying fitting mechanisms? Lastly, I used the markup function on my MacBook to draw lines over the original blue and green lines on the graph plotted using sns.lmplot, then moved them to match the lines you made. I don't think they are really parallel, as shown below: |
haha yeah you got me there. I'm not sure why lmplot deviates from your
calculations. Maybe the OLS fit is different from the one used in lmplot.
tried to find documentation on this but came up blank. great question!
maybe someone more experienced has a better answer.
…On Wed, Nov 22, 2023, 10:31 PM Amber ***@***.***> wrote:
Hi, thank you for your explanation.
However, I think I have some other opinions. I understand that "income
group - low" is a categorical variable and is not continuous, I am just
saying that if the sample comes from the "low" group, then the "income
group - low" variable would be 1, and otherwise 0. So the calculation for
the specific term "22.2875*(income group - low)" would be 22.2875 for
samples in the "low" group, or it will just be zero.
I said that the values are calculated as
child mortality = 16.3017*(children per woman) + 22.2875*(income group -
low) + 15.7559*(income group - lower middle) - 5.337*(income group - upper
middle) - 22.8577
with rather high certainty because I re-calculated the fitted values in
the above way and got identical results. I did that by first fitting the
model and retrieving the parameters:
import pandas as pd
import statsmodels.formula.api as smf
resuls = smf.ols('child_mortality ~ children_per_woman + income_group', world_data_2014).fit()
parameters = resuls.params
then subset the dataset to include only the data from the lower-middle group,
calculated the values using the equation I assumed above, and compared them
to the fitted values of the OLS results:
W2014_subset = world_data_2014[world_data_2014.income_group == "Lower middle"]
# positions in params: 0 = Intercept, 2 = Lower middle offset, 4 = children_per_woman slope
LW_child_mor = W2014_subset["children_per_woman"]*parameters.iloc[4]+parameters.iloc[2]*1+parameters.iloc[0]
LW_fitted = resuls.fittedvalues[world_data_2014.income_group == "Lower middle"]
compare_table = pd.DataFrame({'My_cal': LW_child_mor, 'Fitted Values': LW_fitted})
compare_table.head(10)
and got the following results:
[image: image]
<https://user-images.githubusercontent.com/145036115/285092140-bbfbf854-b683-44d5-b2d4-3dbf31a29f80.png>
Same for the low group:
W2014_subset2 = world_data_2014[world_data_2014.income_group == "Low"]
# positions in params: 0 = Intercept, 1 = Low offset, 4 = children_per_woman slope
LW_child_mor2 = W2014_subset2["children_per_woman"]*parameters.iloc[4]+parameters.iloc[1]*1+parameters.iloc[0]
LW_fitted2 = resuls.fittedvalues[world_data_2014.income_group == "Low"]
compare_table = pd.DataFrame({'My_cal': LW_child_mor2, 'Fitted Values': LW_fitted2})
compare_table.head(10)
and got the following results:
[image: image]
<https://user-images.githubusercontent.com/145036115/285093021-3a4c76e0-f44e-42b8-af35-b38f7f031577.png>
Therefore, I think my assumed calculation equation is correct. And if
that's the case, I don't understand why the graph plotted using sns.lmplot
doesn't match the slope and intercept in the OLS results. Maybe
it's because they use different underlying fitting mechanisms?
Lastly, I used the markup function on my MacBook to draw lines over the
original blue and green lines on the graph plotted using sns.lmplot, then
moved them to match the lines you made. I don't think they are really
parallel, as shown below:
original plot:
[image: image]
<https://user-images.githubusercontent.com/145036115/285093459-8336946e-a9b5-4c0b-823d-c935d93f04e2.png>
green line:
[image: image]
<https://user-images.githubusercontent.com/145036115/285093518-eac2e00c-e0a8-461c-9c02-c327c2e4c3dc.png>
blue line:
[image: image]
<https://user-images.githubusercontent.com/145036115/285093584-3d13f350-c321-432d-a201-e2cf612c682e.png>
|
Ya unfortunately there's no way to access the underlying stats. However, your suggested formula still doesn't really make sense to me since as you mentioned, it would result in all four lines having the same slope. I'm not saying you're wrong, since you've checked your work and it seems correct, but I just can't understand how the mortality-child# relationship could be the same across the four conditions. |
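One likely explanation for the mismatch, going by how seaborn documents lmplot: with `hue`, it fits a separate regression to each subset of the data, so every group gets its own slope and intercept. The formula `child_mortality ~ children_per_woman + income_group` has no interaction term, so it forces one shared slope and lets only the intercept vary by group. The sketch below uses made-up synthetic data (not the lesson's dataset) and hypothetical helper names to show that the two approaches give different lines for the same points.

```python
from collections import defaultdict

# Synthetic data, invented for illustration: (group, x, y).
data = [
    ("A", 1.0, 3.0), ("A", 2.0, 5.0), ("A", 3.0, 7.0),     # y = 1 + 2x
    ("B", 1.0, 10.5), ("B", 2.0, 11.0), ("B", 3.0, 11.5),  # y = 10 + 0.5x
]


def group_stats(rows):
    """Per group: mean of x, mean of y, centered sum of squares and cross products."""
    by_group = defaultdict(list)
    for g, x, y in rows:
        by_group[g].append((x, y))
    stats = {}
    for g, pts in by_group.items():
        n = len(pts)
        xbar = sum(x for x, _ in pts) / n
        ybar = sum(y for _, y in pts) / n
        sxx = sum((x - xbar) ** 2 for x, _ in pts)
        sxy = sum((x - xbar) * (y - ybar) for x, y in pts)
        stats[g] = (xbar, ybar, sxx, sxy)
    return stats


def separate_fits(rows):
    """One independent least-squares line per group (what a per-hue fit does)."""
    fits = {}
    for g, (xbar, ybar, sxx, sxy) in group_stats(rows).items():
        slope = sxy / sxx
        fits[g] = (ybar - slope * xbar, slope)
    return fits


def common_slope_fits(rows):
    """Least squares for y ~ x + group: one shared slope, per-group intercepts."""
    stats = group_stats(rows)
    slope = sum(s[3] for s in stats.values()) / sum(s[2] for s in stats.values())
    return {g: (ybar - slope * xbar, slope) for g, (xbar, ybar, _, _) in stats.items()}


print("separate:", separate_fits(data))
print("common slope:", common_slope_fits(data))
```

If the goal is an OLS table that matches the plot, a formula with an interaction, for example `child_mortality ~ children_per_woman * income_group`, should let the slope differ by group; I have not run that against the lesson data, so treat it as a suggestion to try.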
Hi, sorry I forgot to reply, I was reading it on my phone on my way to a lecture....Sorry! Yes, I understand what you said. I will try to figure it out and do more research. Thank you anyway and have a great night! |
Hi,
In lesson 10, we fitted a model of child mortality on children per woman and income group, and got the following results.
In my understanding, for example, for the Lower-middle income group, the intercept should be -22.8577 + 15.7559 = -7.1018, however, the graph created using sns.lmplot shows an intercept of around 6 for that income group as shown below.
I don't understand why this is the case. This problem also exists for other income groups and categorical variables, such as species in the iris dataset.
Thank you very much!