머신러닝 정리.canvas

{
	"nodes":[
		{"id":"2969f5ddb96bf49b","type":"text","text":"- Interval\n- Ratio\n- Discrete\n","x":-342,"y":-500,"width":141,"height":107},
		{"id":"5222d78e38d88a93","type":"text","text":"- Ordinal\n- Nominal","x":-186,"y":-500,"width":139,"height":107},
		{"id":"281815e60a582d94","type":"text","text":"Data","x":-240,"y":-225,"width":103,"height":50,"color":"4"},
		{"id":"34bb680a05836778","type":"text","text":"Numerical\n","x":-323,"y":-340,"width":103,"height":60},
		{"id":"5c88d52ce34aeaae","type":"text","text":"Categorical","x":-173,"y":-335,"width":112,"height":50},
		{"id":"ab4d9e7ca63e7486","type":"text","text":"Visualization","x":-420,"y":-178,"width":104,"height":60,"color":"4"},
		{"id":"a118fa474c14b969","type":"text","text":"Bar plot - categorical, discrete","x":-736,"y":-246,"width":250,"height":60},
		{"id":"8ad1c6fa167d7a35","type":"text","text":"Box plot","x":-807,"y":-172,"width":250,"height":60},
		{"id":"b0c9f43252f1e408","type":"text","text":"Histogram - numerical","x":-877,"y":-78,"width":250,"height":60},
		{"id":"ccca9dd81d60974a","type":"text","text":"Pie chart","x":-934,"y":27,"width":250,"height":60},
		{"id":"231c162dbbecb5e1","type":"text","text":"Probability","x":-47,"y":-67,"width":82,"height":60,"color":"4"},
		{"id":"71ecbdfd793dbabf","type":"text","text":"Statistics","x":-237,"y":-67,"width":97,"height":60,"color":"4"},
		{"id":"a57da3e1a70f6ebe","type":"text","text":"Inferential statistics - testing","x":-566,"y":340,"width":250,"height":60},
		{"id":"ed980d6d0388996a","type":"text","text":"Descriptive statistics - analysis","x":-573,"y":120,"width":250,"height":60},
		{"id":"2d4e4aa3b848a4c5","type":"text","text":"- Mean, median, mode (central tendency)\n- Variance\n- Kurtosis, skewness\n- Spectrum","x":-933,"y":144,"width":306,"height":136},
		{"id":"14023f0cf1e85fbd","type":"text","text":"- p-value\n- t / F / chi-square values\n- Confidence intervals\n- Hypothesis testing","x":-914,"y":370,"width":232,"height":130},
		{"id":"409fe67de148e573","type":"text","text":"- Accuracy\n- Precision\n- Resolution","x":-619,"y":490,"width":251,"height":90},
		{"id":"3c1d2086c07223ce","type":"text","text":"Distributions\n- Gaussian\n- t\n- F\n- Chi-square\n .\n- Unimodal\n- Bimodal","x":-665,"y":609,"width":245,"height":251},
		{"id":"345c5ead364b2b84","type":"text","text":"- Median, mean\n- Dispersion\n- Interquartile range\n- QQ plot\n- Entropy","x":-762,"y":954,"width":220,"height":186},
		{"id":"66fa53b713badf0f","type":"text","text":"- Data cleaning","x":102,"y":103,"width":197,"height":99},
		{"id":"8baafc495bd276c3","type":"text","text":"Scaling","x":174,"y":-46,"width":250,"height":60},
		{"id":"cba44d1b74847bf4","type":"text","text":"- min-max\n- z-score (normally distributed data)\n- modified z-score (non-normal data)","x":492,"y":-120,"width":228,"height":134},
		{"id":"a5eb80b88a01d1de","type":"text","text":"Outlier removal","x":521,"y":-279,"width":250,"height":60},
		{"id":"d890a4060b61fe30","type":"text","text":"- Multivariate outlier removal\n- Data trimming\n- ","x":293,"y":-375,"width":250,"height":60},
		{"id":"175649a14f111021","type":"text","text":"Nonlinear data transformations","x":-102,"y":308,"width":250,"height":60},
		{"id":"6c16b372f693d2fe","type":"text","text":"- Probability vs. proportion\n- Discrete, ordinal, nominal\n- Probability vs. odds\n- Probability functions: probability mass function (PMF), probability density function (PDF)\n- Cumulative distribution function (CDF)","x":-47,"y":-310,"width":350,"height":162},
		{"id":"089fd6758057dcea","type":"text","text":"- Estimating sampling distributions\n- Monte Carlo sampling\n- Sampling variability\n- Expected value\n- Conditional probability","x":6,"y":-620,"width":234,"height":156},
		{"id":"7e3e9268b61e5a13","type":"text","text":"Generalization","x":-227,"y":123,"width":250,"height":60,"color":"4"},
		{"id":"3a70afecaf8ae196","type":"text","text":"Law of large numbers\n- The average of the sample means converges to the population mean as more samples are drawn","x":271,"y":-620,"width":250,"height":108},
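		{"id":"e22951f6e95680e0","type":"text","text":"Central limit theorem (CLT)\n- For almost any data distribution (with finite variance), the distribution of the means of repeated random samples approaches a Gaussian as the sample size grows.\n- Summing independent variables, whether or not each one is Gaussian, gives a distribution that is closer to Gaussian; the more variables are summed, the closer it gets.","x":308,"y":-920,"width":372,"height":212},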
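		{"id":"7dc27cb2f60da8cc","type":"text","text":"Hypothesis testing","x":-43,"y":558,"width":250,"height":60,"color":"4"},
		{"id":"0e393383ecc6ce9f","type":"text","text":"Models\n- As simple as possible\n- Only as complex as necessary\n\n- Residuals: what the model fails to explain; smaller residuals mean a more accurate fit","x":-100,"y":766,"width":248,"height":154},
		{"id":"60ce1c75c56393c6","type":"text","text":"- Strong hypotheses\n- Weak hypotheses\n- Non-hypotheses\n\n\n- Null hypothesis\n- Alternative hypothesis\n\nOur job\n- Show that the data are unlikely under the null hypothesis, so that it can be rejected\n","x":174,"y":735,"width":292,"height":285},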
		{"id":"7730212384478361","type":"text","text":"p-value: the probability, assuming the null hypothesis is true, of observing data at least as extreme as the data actually observed","x":65,"y":1178,"width":250,"height":60,"color":"1"},
		{"id":"fe5fe5d330c186f0","type":"text","text":"Proportion of a normal distribution within ±z\n- 1: 68.3%\n- 2: 95.5%\n- 3: 99.7%\n\np-z pairs (one-tailed)\n- p-value .05 = 1.64\n- p-value .01 = 2.33\n- p-value .001 = 3.09\n\np-z pairs (two-tailed)\n- p-value .05 = 1.96\n- p-value .01 = 2.58\n- p-value .001 = 3.29","x":-33,"y":1372,"width":253,"height":428,"color":"1"},
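		{"id":"4373f07b891fca9f","type":"text","text":"Degrees of freedom: the number of values that are free to vary\n\nStatistically, the degrees of freedom determine the shape of the null-hypothesis distribution\n\nAll else being equal, higher degrees of freedom make it more likely that the null hypothesis is rejected","x":520,"y":736,"width":280,"height":224},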
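		{"id":"2e19ce1f0f2a8ce5","type":"text","text":"True positive: a value below the threshold, in the range consistent with the alternative hypothesis and outside the null hypothesis\nFalse positive (Type I error): a value below the threshold, in the range that actually belongs to the null hypothesis\nTrue negative: a value above the threshold, in the range consistent with the null hypothesis and outside the alternative hypothesis\nFalse negative (Type II error): a value above the threshold, in the range that actually belongs to the alternative hypothesis","x":353,"y":1235,"width":607,"height":137},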
		{"id":"84f0f08170c49d34","type":"file","file":"머신러닝 강의/pic/9.Hypothesis testing/109.all.png","x":353,"y":1440,"width":686,"height":300},
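		{"id":"93e92bc5bc7dfc49","type":"text","text":"How to reduce Type I and Type II errors\n- Increase the distance between the two distributions.\n- Reduce the width of the two distributions.\n\t- Increase the degrees of freedom -> larger sample size -> narrower distributions\n\t- Reduce variability -> lower variance gives narrower distributions","x":1082,"y":1319,"width":358,"height":221},
		{"id":"5e9b2df1422c85e0","type":"text","text":"- Parametric statistics\n\t- Based on estimating the parameters of an assumed distribution\n\t- Suited to large sample sizes\n- Nonparametric statistics\n\t- Do not rely on such parameter estimates\n\t- Suited to small sample sizes\n\t- Suited to non-numerical data","x":624,"y":222,"width":376,"height":238},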
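		{"id":"6d70b5b00c09a622","type":"text","text":"Multiple comparisons - the problem that arises when several hypotheses are tested at the same time\n\nFWE (familywise error) = the probability of at least one Type I error when several tests are run together\n\nSolutions - note that these do not predict or adjust the true error rate exactly; they are only attempts to control it\n- Independent tests: $FWE = 1-(1-\\alpha)^n \\leq n\\alpha$\n- Dependent tests: the exact value depends on the dependence between tests, but $FWE \\leq n\\alpha$ still bounds it\n- Bonferroni correction: test each hypothesis at $\\alpha/N$\n\nWhy $\\alpha$ equals the Type I error rate: the inference assumes the null hypothesis is true, so every value beyond the threshold is a Type I error.","x":1018,"y":800,"width":422,"height":406},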
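		{"id":"2a48f668ae88620f","type":"text","text":"Cross-validation\n- Used to estimate variance, to avoid bias from one particular data split, and to compute classification accuracy\n\n- K-fold cross validation\n\t- If the test set is 10% of the data, set k=10. Typically k = 10~20\n\n Ideally the test and training data are independent, but that is not guaranteed.","x":1320,"y":403,"width":505,"height":310},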
		{"id":"9d46a84c291b7f8c","type":"text","text":"t-test-family","x":235,"y":2025,"width":250,"height":60,"color":"4"},
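		{"id":"12abecb377e3f208","type":"text","text":"t-test\n\nA statistical test that compares the means of two groups\n- Alternative hypothesis: the two group means differ\n- Null hypothesis: pooling the two groups makes no real difference, i.e. the group means do not differ\n- 0: test whether a group mean differs significantly from 0\n\nGeneral formula\n# $$t_k = \\frac{(\\overline{x}-\\overline{y})}{s/\\sqrt{n}}$$\n\nWays to maximize the t-value:\n- increase the number of samples (n),\n- reduce the standard deviation (s), or\n- increase the difference between the groups ($\\overline{x}-\\overline{y}$)\n\nThese three factors cannot always be controlled.","x":19,"y":2269,"width":521,"height":611},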
		{"id":"c5a87d762d153961","type":"text","text":"Parametric tests","x":-132,"y":3021,"width":250,"height":60},
		{"id":"c6a330fa810d7c4f","type":"text","text":"one-sample t-test\n\nCompares a single sample with a population value\n\n# $$t_{n-1} = \\frac{\\textcolor[rgb]{0.39, 0.58, 0.93}{\\overline{x}-\\mu}}{\\textcolor[rgb]{0.35,0.9,0.9}{s/\\sqrt{n}}}$$\n\n$\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{x}: \\text{sample mean}}$\n$\\textcolor[rgb]{0.39, 0.58, 0.93} {\\mu : H_0 \\text{ value}}$\n\n$\\textcolor[rgb]{0.35,0.9,0.9}  {s: \\text{standard deviation}}$\n$\\textcolor[rgb]{0.35,0.9,0.9}  {n: \\text{number of data points}}$\n\n$n-1$: degrees of freedom\n-> i.e. the t-value is evaluated against the t-distribution with $n-1$ degrees of freedom","x":-440,"y":3235,"width":396,"height":491},
		{"id":"f90a930e7decd24d","type":"text","text":"two-sample-t-test\n\n<span style=\"color:rgb(118, 147, 234)\">Tests whether two sets of numbers were drawn from the same distribution</span> \n\nThe numerator is always the same\n<span style=\"color:rgb(205, 205, 81)\">The denominator depends on whether the groups are paired or not, whether the variances are equal or not, and whether the sample sizes are equal or not. The form below is the pooled-variance version for unpaired groups with equal variances. </span> \n# $$t_{\\textcolor[rgb]{1, 0.6, 0.2}{df}} = \\frac{\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{x_1}}-\\textcolor[rgb]{0.35,0.9,0.9} {\\overline{x_2}}}{\\sqrt{\\frac{(\\textcolor[rgb]{0.39, 0.58, 0.93} {n_1}-1)\\textcolor[rgb]{0.39, 0.58, 0.93} {s_1^2}+(\\textcolor[rgb]{0.35,0.9,0.9} {n_2}-1)\\textcolor[rgb]{0.35,0.9,0.9} {s_2^2}}{\\textcolor[rgb]{0.39, 0.58, 0.93} {n_1}+\\textcolor[rgb]{0.35,0.9,0.9} {n_2}-2}}\\sqrt{\\frac{1}{\\textcolor[rgb]{0.39, 0.58, 0.93} {n_1}}+\\frac{1}{\\textcolor[rgb]{0.35,0.9,0.9} {n_2}}}}$$\n\n# $$\\textcolor[rgb]{1, 0.6, 0.2}{df} = \\textcolor[rgb]{0.39, 0.58, 0.93} {n_1} + \\textcolor[rgb]{0.35,0.9,0.9} {n_2}-2$$","x":19,"y":3231,"width":586,"height":429},
		{"id":"ca6c48b0324ef94b","type":"text","text":"Nonparametric tests","x":680,"y":2991,"width":250,"height":60},
		{"id":"bbeb403717779fb4","type":"text","text":"Wilcoxon signed-rank test\n\n- For non-normally distributed data\n- One sample\n- <span style=\"color:rgb(236, 158, 111)\">Two dependent (paired) samples</span>\n\nRemove tied pairs | remove values equal to the null-hypothesis value\nRank the (absolute) differences rather than the values themselves\nSum the ranks of only the pairs where x<y -> W\nConvert W to a z-value\n","x":1201,"y":2837,"width":378,"height":308},
		{"id":"ebe205ad6c00e530","type":"text","text":"Mann-Whitney U test\n- Suited to two independent samples\n- Computed separately for the larger and the smaller sample\n- Yields the U statistic","x":1201,"y":2575,"width":374,"height":160},
		{"id":"668f5a92425f848e","type":"text","text":"Permutation testing\n\nBuild the null-hypothesis distribution from the actual data (by shuffling it), then interpret the observed t-test value by comparing it with that distribution","x":1201,"y":3231,"width":250,"height":179},
		{"id":"5dad725776417e94","type":"text","text":"When the null-hypothesis distribution is close to normal\n# $$Z = \\frac{\\textcolor {yellow}{obs}-E[\\textcolor[rgb]{1, 0.6, 0.2}{H_0}]}{std[\\textcolor[rgb]{1, 0.6, 0.2}{H_0}]}$$","x":1594,"y":3227,"width":286,"height":183},
		{"id":"f7975a8865713391","type":"text","text":"Works for a null distribution of any shape (mind which tail you test)\n# $$p_c = \\frac{\\sum(\\textcolor[rgb]{1, 0.6, 0.2}{H_0}>\\textcolor {yellow}{obs})}{N_\\textcolor[rgb]{1, 0.6, 0.2}{H_0}}$$","x":1586,"y":3512,"width":294,"height":214},
		{"id":"b986d0c7d6610687","type":"text","text":"confidence intervals on parameters\n","x":-1108,"y":1673,"width":348,"height":67,"color":"4"},
		{"id":"fd96fd05a96cf49f","type":"text","text":"Confidence interval \n- <span style=\"color:rgb(205, 205, 81)\">The probability that an unknown population parameter falls within a given range of values across repeated samples of the same experiment</span>\n\n# $$P(L<\\mu<U) = c$$\n$L = \\text{lower bound of the confidence interval}$\n$U = \\text{upper bound of the confidence interval}$\n$c = \\text{confidence level (proportion)}$\n\n<span style=\"color:rgb(116, 195, 194)\">Confidence intervals are affected by <span style=\"color:rgb(205, 205, 81)\">sample size</span> and <span style=\"color:rgb(236, 158, 111)\">variance</span></span> ","x":-1199,"y":1884,"width":439,"height":356},
		{"id":"038e0d07d23e5ee9","type":"text","text":"Formula\n- Computing a confidence interval from a single sample\n\n# $$C.I =\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{x}} \\pm \\textcolor[rgb]{0.35,0.9,0.9} {t^*(k)}\\frac{\\textcolor[rgb]{1, 0.6, 0.2}{s}}{\\sqrt{\\textcolor {yellow}{n}}}$$\ns (the standard deviation) must be an appropriate description of the variability in the data","x":-1483,"y":2364,"width":403,"height":356},
		{"id":"9f3c6f07539d9378","type":"text","text":"Nonparametric","x":-1627,"y":1833,"width":250,"height":60},
		{"id":"a4d794af84de6967","type":"text","text":"Parametric","x":-880,"y":1777,"width":250,"height":60},
		{"id":"0c6067c38acd5e13","type":"text","text":"Computing confidence intervals with bootstrapping\n\n- Used when only a single sample is available\n- Gives a confidence interval for a population value estimated from that one sample rather than from the population itself\n\nBuild a bootstrap dataset by randomly picking data points from the sample, and repeat this many times\n\n<span style=\"color:rgb(116, 195, 194)\">Original dataset</span>  $\\textcolor[rgb]{0.39, 0.58, 0.93} {X_1,X_2,X_3,X_4,X_5,X_6,X_7,X_8,X_9,X_{10}}$  $\\textcolor[rgb]{1, 0.6, 0.2}{\\overline{x}=y_\\circ}$\n           Resample with replacement\n\nSampling with replacement: each draw picks one point from the sample at random, so the same point can be picked more than once\n\n<span style=\"color:rgb(116, 195, 194)\">Bootstrap dataset 1</span>  $\\textcolor[rgb]{0.39, 0.58, 0.93} {X_6,X_6,X_1,X_8,X_9,X_7,X_8,X_{10},X_1,X_{2}}$  $\\textcolor[rgb]{1, 0.6, 0.2}{\\overline{x}=y_1}$\n\n<span style=\"color:rgb(116, 195, 194)\">Bootstrap dataset 2</span>  $\\textcolor[rgb]{0.39, 0.58, 0.93} {X_8,X_7,X_3,X_3,X_{10},X_1,X_8,X_4,X_9,X_5}$  $\\textcolor[rgb]{1, 0.6, 0.2}{\\overline{x}=y_2}$\n\t\t\t\t\t\t\t.\n\t\t\t\t\t\t\t.\n\t\t\t\t\t\t\t.\n\t\t\t\t\t\t500 times\n\t\t\t\t\t\t$\\textcolor[rgb]{1, 0.6, 0.2}{y_1,y_2,...,y_{500}}$\n\nCollect these bootstrap means into a distribution, then compute the confidence interval from it\n![[132.Pasted image 20240917100828.png]]","x":-2280,"y":2061,"width":653,"height":739},
		{"id":"cb4b50ab3c27ed51","type":"text","text":"correlation","x":-2318,"y":1160,"width":250,"height":60,"color":"4"},
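		{"id":"fbf9f3f247c02df2","type":"text","text":"###  Correlation\n\n\n- Positive correlation: when one variable increases, the other also increases\n- Negative correlation: when one variable decreases, the other increases\n\nCorrelations that exist even though none is visible\n- Even if a plot shows no apparent relationship, a nonzero correlation coefficient with a significant p-value can be treated as a statistically significant relationship.\n\nA correlation of 0 vs. no relationship\n- A correlation coefficient of 0 is not the same as having no relationship.\n- When the correlation coefficient is 0 there may be no linear relationship between the two variables, but a nonlinear relationship can still exist.\n\nCorrelation and causation\n- Correlation only describes how variables are related; it cannot tell you which is the cause and which is the effect.\n\n- Causation and correlation are different things.\n","x":-2599,"y":1351,"width":599,"height":512},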
		{"id":"5de28f8fa8ef2405","type":"text","text":"Standard deviation\n $$\\sigma = \\sqrt{\\sigma^2} = \\sqrt{\\frac{1}{n-1}\\sum^n_{i=1}(x_i-\\overline{x})^2}$$\n\nVariance\n $$\\sigma^2 = \\frac{1}{n-1}\\sum^n_{i=1}(x_i-\\overline{x})^2$$\nz-score\n $$z_i = \\frac{x_i-\\overline{x}}{\\sigma_x}$$","x":-1600,"y":57,"width":480,"height":352},
		{"id":"fbb1160be4f269a2","type":"text","text":"Divide by N for a population, N-1 for a sample","x":-1497,"y":-87,"width":297,"height":50},
		{"id":"ff916cc57e332c7f","type":"text","text":"Correlation vs. covariance\n\nThe correlation coefficient is just a scaled version of the covariance\n\nCovariance\n# $$\\textcolor[rgb]{0.39, 0.58, 0.93} {c} = \\frac{1}{n-1}\\sum^n_{i=1}(\\textcolor[rgb]{1, 0.6, 0.2}{x_i}-\\textcolor[rgb]{1, 0.6, 0.2}{\\overline{x}})(\\textcolor {yellow}{y_i}-\\textcolor {yellow}{\\overline{y}})$$\n\nCorrelation coefficient (Pearson)\n# $$\\textcolor[rgb]{0.35,0.9,0.9} {r} = \\frac{\\sum^n_{i=1}(\\textcolor[rgb]{1, 0.6, 0.2}{x_i}-\\textcolor[rgb]{1, 0.6, 0.2}{\\overline{x}})({\\textcolor {yellow}{y_i}-\\textcolor {yellow}{\\overline{y}}})}{\\sqrt{\\sum^n_{i=1}(\\textcolor[rgb]{1, 0.6, 0.2}{x_i}-\\textcolor[rgb]{1, 0.6, 0.2}{\\overline{x}})^2\\sum^n_{i=1}(\\textcolor {yellow}{y_i}-\\textcolor {yellow}{\\overline{y}})^2}}$$\n\n\nComputing a p-value from the correlation coefficient\n# $$t_{n-2} = \\frac{\\textcolor[rgb]{0.35,0.9,0.9} {r}\\sqrt{n-2}}{\\sqrt{1-\\textcolor[rgb]{0.35,0.9,0.9} {r}^2}}$$\nThe larger the sample size and the correlation coefficient, the larger the t-value; the p-value is then smaller, so the result is more likely to be significant.\n","x":-3600,"y":2218,"width":920,"height":765},
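		{"id":"2b7a8fd1c0cb34fa","type":"text","text":"Partial correlation\n\nMeasures the correlation between two variables after removing the influence (shared variance) of other variables\n\n![[144.Pasted image 20240922135455.png]]\n\nNo need to memorize the formula","x":-4160,"y":2431,"width":400,"height":340},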
		{"id":"13ce34d4d27e6b49","type":"text","text":"Code: generating data with a specified correlation coefficient\n\n```python\nimport numpy as np\n\n# data simulation parameters\nN = 100\nr = .6 # desired correlation coefficient\n\n# start with random numbers\nx = np.random.randn(N)\ny = np.random.randn(N)\n\n# impose the correlation on y\n# rescale y so that its correlation with x is approximately r\ny = x*r + y*np.sqrt(1-r**2)\n```\n\nThe sample correlation will deviate somewhat from r because of sampling variability.","x":-3440,"y":3200,"width":465,"height":380},
		{"id":"df12c19288f30521","type":"text","text":"Pearson correlation coefficient\n\n- Suited to measuring linear relationships in roughly normally distributed data\n- Sensitive to outliers\n- Not appropriate for nonlinear relationships\n- As a result, datasets with clearly different distributions can have the same Pearson coefficient","x":-3140,"y":1920,"width":500,"height":240},
		{"id":"d8265769904608ac","type":"text","text":"Spearman correlation coefficient\n\n- Convert the data to ranks\n- Compute the Pearson coefficient on the ranks -> robust to outliers\n- Compute the p-value\n\nAppropriate for monotonic relationships","x":-3535,"y":1675,"width":395,"height":245},
		{"id":"6a0704b5c66e223f","type":"text","text":"Fisher-Z transformation\n\nTransforms correlation coefficients (bounded in [-1, 1], roughly uniformly distributed) into an approximately Gaussian distribution\n\n# $$z_\\textcolor[rgb]{0.35,0.9,0.9} {r} =\\frac{1}{2}\\ln(\\frac{1+\\textcolor[rgb]{0.35,0.9,0.9} {r}}{1-\\textcolor[rgb]{0.35,0.9,0.9} {r}}) =\\text{arctanh}(\\textcolor[rgb]{0.35,0.9,0.9}{r})$$\n\nThe Fisher-z transformation is not always necessary\n\nWhen it is needed:\n- when computing several correlations,\n- when computing correlations for many individuals,\n- when analyzing those correlation coefficients together, or\n- when a follow-up analysis such as a t-test or ANOVA assumes a normal distribution","x":-3640,"y":1080,"width":480,"height":447},
		{"id":"8b1640a5516d3759","type":"text","text":"Kendall correlation coefficient\n\n- Used for ordinal data\n- <span style=\"color:rgb(118, 147, 234)\">Converts the data to rank concordances (the relative signs of the differences between values within each variable)</span>\n\n# $$ \\textcolor[rgb]{1, 0.6, 0.2}{\\tau} = K^{-1}\\sum{\\mathrm{sgn}({\\textcolor[rgb]{0.39, 0.58, 0.93} {\\tilde{x}}_i-\\textcolor[rgb]{0.39, 0.58, 0.93} {\\tilde{x}}_{i:}})\\mathrm{sgn}({\\textcolor[rgb]{0.35,0.9,0.9} {\\tilde{y}}_i-\\textcolor[rgb]{0.35,0.9,0.9} {\\tilde{y}}_{i:}})}$$","x":-4535,"y":1684,"width":535,"height":276},
		{"id":"86ce7860d2619526","type":"text","text":"Cosine similarity\n\nVery similar to the Pearson formula.\nFor mean-centered data (mean of 0) -> Pearson correlation and cosine similarity are identical\n\n# $$COS(\\textcolor[rgb]{0.35,0.9,0.9} {\\theta})= \\frac{\\sum^n_{i=1}\\textcolor[rgb]{1, 0.6, 0.2}{x_i}\\textcolor {yellow}{y_i}}{\\sqrt{\\sum^n_{i=1}\\textcolor[rgb]{1, 0.6, 0.2}{x_i}^2}\\sqrt{\\sum^n_{i=1}\\textcolor {yellow}{y_i}^2}}$$","x":-4380,"y":2100,"width":505,"height":280},
		{"id":"459d47079d9191d6","type":"text","text":"- The core idea is to analyze patterns of variance in a dataset\n- Analyze the total variance of the dataset and determine how much of it is attributable to specific factors\n\n## Setting up an ANOVA in four steps\n\n<span style=\"color:rgb(118, 147, 234)\">Step 1:</span>  <span style=\"color:rgb(116, 195, 194)\">Review the experimental design and decide whether ANOVA really is the appropriate method</span>\n\n<span style=\"color:rgb(118, 147, 234)\">Step 2:</span>  <span style=\"color:rgb(116, 195, 194)\">Identify the independent and dependent variables</span>\n\n<span style=\"color:rgb(118, 147, 234)\">Step 3:</span>  <span style=\"color:rgb(116, 195, 194)\">Create a table of factors and levels</span>\n\n<span style=\"color:rgb(118, 147, 234)\">Step 4:</span>  <span style=\"color:rgb(116, 195, 194)\">Compute the model and interpret the results</span>\n","x":-3008,"y":660,"width":690,"height":340},
		{"id":"9dd67eb31b8ae32f","type":"text","text":"ANOVA","x":-2788,"y":440,"width":250,"height":60,"color":"4"},
		{"id":"66fd1f8c0e2af2f3","type":"text","text":"## The way of the ANOVA\n\n<span style=\"color:rgb(118, 147, 234)\">n-way:</span> <span style=\"color:rgb(116, 195, 194)\"> the number of factors</span> \n\n<span style=\"color:rgb(236, 158, 111)\">Examples:</span> \n<span style=\"color:rgb(118, 147, 234)\">One-way ANOVA:</span>  <span style=\"color:rgb(116, 195, 194)\">The effect of day of the week on iPhone purchases</span>\n\n---\n## Repeated-measures ANOVA\n\n<span style=\"color:rgb(118, 147, 234)\">rmANOVA:</span>  <span style=\"color:rgb(116, 195, 194)\">At least one factor involves multiple measurements from the same individual</span> \n\n<span style=\"color:rgb(236, 158, 111)\">Example:</span>\n\n<span style=\"color:rgb(118, 147, 234)\">Research question:</span><span style=\"color:rgb(116, 195, 194)\"> The effect of snack type on mood</span> \n<span style=\"color:rgb(118, 147, 234)\">Experiment:</span><span style=\"color:rgb(116, 195, 194)\"> Volunteers eat chocolate for 2 days, potato chips for 2 days, and ice cream for 2 days (in random order) </span> \n\n---\n## Balanced vs. unbalanced ANOVA\n\nBalanced: every cell contains the same number of data points\nUnbalanced: cells contain different numbers of data points\n\n---\n## Dummy-coding variables\n\n<span style=\"color:rgb(118, 147, 234)\">Dummy-coding:</span> <span style=\"color:rgb(116, 195, 194)\">Converting a categorical variable into numbers. It works best with {0,1} coding </span>\n\n---\n## ANOVA vs. MANOVA\n\n<span style=\"color:rgb(118, 147, 234)\">ANOVA:</span>  <span style=\"color:rgb(116, 195, 194)\">A single dependent variable (with as many independent variables as appropriate)</span>\n- There can be any number of independent variables, but only one dependent variable\n<span style=\"color:rgb(118, 147, 234)\">MANOVA (multivariate ANOVA):</span> <span style=\"color:rgb(116, 195, 194)\">Multiple dependent variables (with as many independent variables as appropriate)</span>\n- Has several dependent variables as well as several independent variables\n<span style=\"color:rgb(118, 147, 234)\">Example:</span> <span style=\"color:rgb(116, 195, 194)\">The effect of medication type and age group on Covid-19 symptoms and total medical costs</span>\n- Dependent variables: symptom severity, medical costs\n\n---\n## Fixed vs. random effects ANOVAs\n\n<span style=\"color:rgb(118, 147, 234)\">Fixed effects:  </span> <span style=\"color:rgb(116, 195, 194)\">The number of levels of the factor is fixed (e.g., housing type: dormitory, apartment, house) \n</span>\n- The number of levels is actually fixed, or at least reasonably so\n\n<span style=\"color:rgb(118, 147, 234)\">Random effects:</span> <span style=\"color:rgb(116, 195, 194)\">The levels of the factor are random or continuous (e.g., age, nurse, salary)</span>\n- The levels are a random sample from a larger population of possible levels\n- Refers to the variation among the individual levels of the random factor \n\t- For example, if school class is the random factor, grades will differ from class to class, and the random effect describes that class-to-class variation in grades.\n\n<span style=\"color:rgb(118, 147, 234)\">Mixed effects:</span> <span style=\"color:rgb(116, 195, 194)\">Some factors are fixed; some are random</span> \n\n---\n## Assumptions of ANOVA\n\n<span style=\"color:rgb(118, 147, 234)\">Independence:</span>  <span style=\"color:rgb(116, 195, 194)\">The data are sampled independently of each other from the population you want to generalize to</span>\n\n<span style=\"color:rgb(118, 147, 234)\">Normality:</span>  <span style=\"color:rgb(116, 195, 194)\">The \"residuals\" (the variance left unexplained after fitting the model) follow a Gaussian distribution.</span>\n\n<span style=\"color:rgb(118, 147, 234)\">Homogeneity of variance (a.k.a. <span style=\"color:rgb(230, 122, 122)\">homoscedasticity</span>):</span> <span style=\"color:rgb(116, 195, 194)\">The variance within each cell is roughly the same</span>\n","x":-2260,"y":-643,"width":593,"height":1753},
		{"id":"8fb5b2e3921e0976","type":"text","text":"# sum of squares\n\n## Competing hypothesis of ANOVA\n\n<span style=\"color:rgb(116, 195, 194)\">The null hypothesis is that the means of all groups (all cells of the ANOVA table) are statistically indistinguishable</span>\n<span style=\"color:rgb(236, 158, 111)\">The alternative hypothesis is that at least one group mean differs from the mean of at least one other group or cell</span>\n\n## ANOVA as a partition of sum of squares\n# $$F = \\frac{\\textcolor[rgb]{0.35,0.9,0.9} {\\text{\"Explained\" variance}}}{\\textcolor[rgb]{1,.45,0}{\\text{\"Unexplained\" variance}}} = \\frac{\\textcolor[rgb]{0.35,0.9,0.9} {\\text{Due to factors}}}{\\textcolor[rgb]{1,.45,0}{\\text{Natural variation}}}$$\nExplained variance - between-group sum of squares: how distinct each group is from the other groups\n- The larger the value, the more the groups differ from one another\nUnexplained variance - within-group sum of squares: how consistent the values are within each group\n- The smaller the value, the more consistent the pattern within each group\n\n## ANOVA as a partition of sum of squares\n# $$SS_{\\text{Total}} = \\sum^{\\textcolor[rgb]{1,.45,0}{\\text{levels}}}_{\\textcolor[rgb]{1,.45,0}{j}=1}\\sum^{\\textcolor {yellow}{\\text{individuals}}}_{\\textcolor {yellow}{i}=1}(\\textcolor[rgb]{0.39, 0.58, 0.93} {x}_{\\textcolor {yellow}{i}\\textcolor[rgb]{1,.45,0}{j}}-\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{x}})^2\\ \\ \\ \\ \\ df_\\text{Total} = N -1$$\n- Variability around the grand mean of all the data, summed over every individual data point. \n# $$SS_{\\text{Between}} = \\sum^{\\textcolor[rgb]{1,.45,0}{\\text{levels}}}_{\\textcolor[rgb]{1,.45,0}{j}=1}(\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{x}}_{\\textcolor[rgb]{1,.45,0}{j}}-\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{x}})^2n_\\textcolor[rgb]{1, 0.6, 0.2}{j}\\ \\ \\ \\ \\ df_\\text{Between}=\\textcolor[rgb]{1,.45,0}{k}-1$$\n- Also measured around the grand mean, but unlike the total sum of squares it uses each group (level) mean instead of the individual values, multiplied by the number of data points in that group. In other words, each group is weighted by its size\n\n# $$SS_\\text{{Within}} = \\sum^{\\textcolor[rgb]{1,.45,0}{\\text{levels}}}_{\\textcolor[rgb]{1,.45,0}{j}=1}\\sum^{\\textcolor {yellow}{\\text{individuals}}}_{\\textcolor {yellow}{i}=1}(\\textcolor[rgb]{0.39, 0.58, 0.93} {x}_{\\textcolor {yellow}{i}\\textcolor[rgb]{1,.45,0}{j}}-\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{x}}_{\\textcolor[rgb]{1,.45,0}{j}})^2\\ \\ \\ \\ \\ df_\\text{Within}= N-\\textcolor[rgb]{1,.45,0}{k}$$\n- Like the total sum of squares it sums over every individual data point, but the deviations are taken from each point's own group mean rather than the grand mean. In other words, it measures the variability of the individual values relative to their group\n# $$SS_\\text{total} = SS_\\text{between} + SS_\\text{Within}$$\n\n","x":-3860,"y":-320,"width":820,"height":1321},
		{"id":"fb94604147d2460b","type":"text","text":"# F-test and ANOVA table\n\n## Sum of squares to mean square\n# $$MS_\\text{Between} = \\frac{SS_\\text{Between}}{df_\\text{Between}}=\\frac{\\sum^\\textcolor[rgb]{1,.45,0}{\\text{levels}}_{\\textcolor[rgb]{1,.45,0}{j}=1}(\\textcolor[rgb]{0.39, 0.58, 0.93}{\\overline{x}}_\\textcolor[rgb]{1,.45,0}{j}-\\textcolor[rgb]{0.39, 0.58, 0.93}{\\overline{x}})^2n_\\textcolor[rgb]{1,.45,0}{j}}{\\textcolor[rgb]{1,.45,0}{k}-1}$$\n# $$MS_\\text{Within} = \\frac{SS_\\text{Within}}{df_\\text{Within}}=\\frac{\\sum^\\textcolor[rgb]{1,.45,0}{\\text{levels}}_{\\textcolor[rgb]{1,.45,0}{j}=1}\\sum^\\textcolor {yellow}{\\text{individuals}}_{\\textcolor {yellow}{i}=1}(\\textcolor[rgb]{0.39, 0.58, 0.93}{{x}}_{\\textcolor {yellow}{i}\\textcolor[rgb]{1,.45,0}{j}}-\\textcolor[rgb]{0.39, 0.58, 0.93}{\\overline{x}}_\\textcolor[rgb]{1,.45,0}{j})^2}{N-\\textcolor[rgb]{1,.45,0}{k}}$$\n## The F-test\n# $$F_{\\textcolor[rgb]{1, 0.6, 0.2}{k}-1,N-\\textcolor[rgb]{1, 0.6, 0.2}{k}}=\\frac{MS_\\text{Between}}{MS_{\\text{Within}}}$$\n## The ANOVA table\n![](161.Pasted%20image%2020240928093556.png)\n\n","x":-4720,"y":100,"width":737,"height":980},
		{"id":"1e9b1208243a7a98","type":"text","text":"## omnibus F-test and post-hoc comparisons\n\n## The 'problem' with the ANOVA F-test\n\n![](162.Pasted%20image%2020240928100347.png)\nA significant p-value in an ANOVA means that at least one group mean differs significantly from the others.\n\t\n\tHowever, the p-value does not say which groups differ -> this is the limitation of the ANOVA F-test\n\tSo, to find out which groups actually differ from each other, you need a combination of data visualization and post-hoc t-tests\n---\n## Thanks, Tukey, for the test\n\n<span style=\"color:rgb(205, 205, 81)\">Solution:</span> <span style=\"color:rgb(236, 158, 111)\">The Tukey test allows post-hoc comparisons while controlling the familywise error rate.</span> \n\n\tPost-hoc comparison: testing individual conditions after the ANOVA has already been found significant \n\nHow does the Tukey test work?\n\n\tThe goal is to compare the means of two groups\n\tThis is conceptually similar to a t-test -> it looks at the difference between means\n# $$\\textcolor[rgb]{0.35,0.9,0.9} {q}=\\frac{\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{x}_b}-\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{x}_s}}{\\sqrt{MS_{\\text{Within}}}\\sqrt{2/n}}$$\n<span style=\"color:rgb(116, 195, 194)\">q</span> has j, n-j degrees of freedom\nj: the total number of comparisons\nn: the total number of data values","x":-5480,"y":0,"width":648,"height":1080},
		{"id":"85d82c08db0ba93b","type":"text","text":"## two -way ANOVA\n\n## Extension to two-way ANOVA\n# $$SS_\\text{Total}=\\sum^\\textcolor[rgb]{1, 0.6, 0.2}{\\text{levels B}}_{\\textcolor[rgb]{1, 0.6, 0.2}{k}=1}\\sum^\\textcolor[rgb]{1,.4,0}{\\text{levels A}}_{\\textcolor[rgb]{1,.4,0}{j}=1}\\sum^\\textcolor {yellow}{\\text{individuals}}_{\\textcolor {yellow}{i}=1}(\\textcolor[rgb]{0.39, 0.58, 0.93} {x}_{\\textcolor {yellow}{i}\\textcolor[rgb]{1,.4,0}{j}\\textcolor[rgb]{1, 0.6, 0.2}{k}}-\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{{x}}})^2\\ df_{\\text{Total}}=N - 1$$\n# $$SS_\\text{Btwn\\textcolor[rgb]{1,.4,0}{A}}=\\textcolor[rgb]{1, 0.6, 0.2}{b}n\\sum^\\textcolor[rgb]{1,.4,0}{\\text{levels A}}_{\\textcolor[rgb]{1,.4,0}{j}=1}(\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{x}}_{\\textcolor[rgb]{1,.4,0}{j}}-\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{{x}}})^2\\ df_{\\text{Btwn\\textcolor[rgb]{1,.4,0}{A}}}=\\textcolor[rgb]{1,.4,0}{a} - 1$$\n# $$SS_\\text{Btwn\\textcolor[rgb]{1, 0.6, 0.2}{B}}=\\textcolor[rgb]{1,.4,0}{a}n\\sum^\\textcolor[rgb]{1, 0.6, 0.2}{\\text{levels B}}_{\\textcolor[rgb]{1, 0.6, 0.2}{k}=1}(\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{x}}_{\\textcolor[rgb]{1, 0.6, 0.2}{k}}-\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{{x}}})^2\\ df_{\\text{Btwn\\textcolor[rgb]{1, 0.6, 0.2}{B}}}=\\textcolor[rgb]{1, 0.6, 0.2}b - 1$$\ninteraction\nMeasures the variability attributable to the factors interacting with each other\n# $$SS_{\\text{\\textcolor[rgb]{1,.4,0}{A}}\\times\\text{\\textcolor[rgb]{1, 0.6, 0.2}{B}}}=n\\sum^\\textcolor[rgb]{1, 0.6, 0.2}{\\text{levels B}}_{\\textcolor[rgb]{1, 0.6, 0.2}{k}=1}\\sum^\\textcolor[rgb]{1,.4,0}{\\text{levels A}}_{\\textcolor[rgb]{1,.4,0}{j}=1}(\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{x}}_{\\textcolor[rgb]{1,.4,0}{j}\\textcolor[rgb]{1, 0.6, 0.2}{k}}-\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{{x}}}_\\textcolor[rgb]{1, 0.6, 0.2}{k}-\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{x}}_\\textcolor[rgb]{1,.4,0}{j}+\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{x}})^2\\ df_{\\text{\\textcolor[rgb]{1,.4,0}{A}}\\times\\text{\\textcolor[rgb]{1, 0.6, 0.2}{B}}}=(\\textcolor[rgb]{1,.4,0}{a}-1) (\\textcolor[rgb]{1, 0.6, 0.2}{b}-1)$$\n# $$SS_\\text{Within}=\\sum^\\textcolor[rgb]{1, 0.6, 0.2}{\\text{levels B}}_{\\textcolor[rgb]{1, 0.6, 0.2}{k}=1}\\sum^\\textcolor[rgb]{1,.4,0}{\\text{levels A}}_{\\textcolor[rgb]{1,.4,0}{j}=1}\\sum^\\textcolor {yellow}{\\text{individuals}}_{\\textcolor {yellow}{i}=1}(\\textcolor[rgb]{0.39, 0.58, 0.93} {x}_{\\textcolor {yellow}{i}\\textcolor[rgb]{1,.4,0}{j}\\textcolor[rgb]{1, 0.6, 0.2}{k}}-\\textcolor[rgb]{0.39, 0.58, 0.93} {\\overline{{x}}}_{\\textcolor[rgb]{1,.4,0}{j}\\textcolor[rgb]{1, 0.6, 0.2}{k}})^2\\ df_{\\text{Within}}=N - \\textcolor[rgb]{1,.4,0}{a}\\textcolor[rgb]{1, 0.6, 0.2}{b}$$\nDo not compute these by hand.","x":-6240,"y":120,"width":581,"height":840},
		{"id":"9a571e05bdf7ed58","type":"text","text":"## ANOVA vs. regression\n\n<span style=\"color:rgb(118, 147, 234)\">ANOVA is used when <span style=\"color:rgb(116, 195, 194)\">all IVs are discrete (usually categorical)</span>.</span> \n\n<span style=\"color:rgb(118, 147, 234)\">Regression is used when <span style=\"color:rgb(116, 195, 194)\">at least some IVs are continuous</span>.</span> \n\n\tRegression is like an extension of correlation\n\tIf some, or at least one, of the independent variables are continuous, regression is the better choice.\n---\n## The five steps to model-fitting\n\n<span style=\"color:rgb(118, 147, 234)\">Step1:</span>  <span style=\"color:rgb(116, 195, 194)\">Define the equation(s) underlying the model</span>\n<span style=\"color:rgb(118, 147, 234)\">Step2:</span>  <span style=\"color:rgb(116, 195, 194)\">Map the data onto the model equations</span>\n<span style=\"color:rgb(118, 147, 234)\">Step3:</span>  <span style=\"color:rgb(116, 195, 194)\">Convert the equations into a matrix-vector equation</span>\n<span style=\"color:rgb(118, 147, 234)\">Step4:</span> <span style=\"color:rgb(116, 195, 194)\">Compute the parameters</span>\n<span style=\"color:rgb(118, 147, 234)\">Step5:</span> <span style=\"color:rgb(116, 195, 194)\">Evaluate the model statistically</span>\n\n\n---\n### General **LINEAR** model\n\n<span style=\"color:rgb(118, 147, 234)\">The GLM (ANOVA, regression, correlation) is a linear model</span>\n\n<span style=\"color:rgb(116, 195, 194)\">Linear means scalar multiplication and addition of the regressors</span>\n\n<span style=\"color:rgb(205, 205, 81)\">Things like logarithms, square roots, squares, and trigonometric functions are nonlinear.</span> \n\n<span style=\"color:rgb(236, 158, 111)\">The data may contain nonlinearities; only the model must be linear in its parameters</span> \n\tIf the parameters enter nonlinearly, it is no longer a linear model but a nonlinear one\n\n<span style=\"color:rgb(230, 122, 122)\">Sometimes nonlinearities in the data can be linearized in an interpretable way.</span> ","x":-6144,"y":-891,"width":784,"height":791},
		{"id":"01ff75e28d9cdefd","type":"text","text":"15. Regression","x":-5877,"y":-1140,"width":250,"height":60,"color":"4"},
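		{"id":"47ced4cca55e1cd8","type":"text","text":"## Least-squares solution to GLM\n\n## Linear least-squares via left inverse\n\nIn the regression equation, what we need to solve for is beta\nHowever, **X** is a matrix rather than a number, so instead of dividing we multiply by an inverse (here, the left inverse).\n\n![](170.Pasted%20image%2020240930141153.png)\n\n## Conditions on X for left inverse to exist\n![](170.Pasted%20image%2020240930142034.png)\n1. The design matrix must have more rows than columns (more observations than regressors).\n2. The independent variables must be linearly independent of each other (no multicollinearity).\n\tOur goal is to predict the dependent variable from the independent variables\n\tnot to predict one independent variable from another","x":-6800,"y":-891,"width":656,"height":1131},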
		{"id":"9ae120d2585d703e","type":"text","text":"## Evaluating regression models- R2 and F\n\n## Evaluating a model fit with $R^2$\n![](171.Pasted%20image%2020241001084423.png)\n$R^2$ : a measure of how much of the variability in the observed outcome is explained by the model's predictions\n\n---\n## Evaluating model statistical significance with F\n![](171.Pasted%20image%2020241001093315.png)\nF statistic: tests whether all the independent variables in the model, taken together, have an effect on the dependent variable\n\n---\n## Significance of individual $\\beta$ coefficients\n![](171.Pasted%20image%2020241001102054.png)\n$\\beta$ coefficients: show the size and direction of each independent variable's effect on the dependent variable; their significance is assessed with a p-value","x":-7380,"y":-891,"width":580,"height":1171},
		{"id":"e14e55c3f716ec68","type":"text","text":"## Simple regression\n\n## What is a \"Simple regression\"?\n\n<span style=\"color:rgb(205, 205, 81)\">Simple regression</span> <span style=\"color:rgb(230, 122, 122)\"> has one independent variable and one dependent variable</span>\n<span style=\"color:rgb(230, 122, 122)\">(strictly speaking, two independent variables if the intercept is counted)</span> \n\n![](172.Pasted%20image%2020241001134747.png)\n\n![](172.Pasted%20image%2020241001135013.png)\nResidual = the observed data value minus the model's predicted value for that data point\n\n\n----\n## Actual and predicted data\n\n![](172.Pasted%20image%2020241001141151.png)\n\nIntercept: 63, meaning someone who sleeps 0 hours would spend 63 dollars\n\tAccording to the instructor this is not really interpretable; it is included only so that the best-fit line can start from the intercept\n$\\beta_1$ (slope): -2.5, each additional hour of sleep is associated with spending 2.5 dollars less\n\n$R^2$ : .36 \nF(1,8) : 4.54\np = .066: contrary to what visual intuition suggests about this relationship, we cannot conclude that hours of sleep have a significant effect on the amount people spend on food. The result is only marginal (it does not reach p < .05)\n\n<span style=\"color:rgb(118, 147, 234)\">conclusion:</span>  <span style=\"color:rgb(116, 195, 194)\">The model does not fit the data to a statistically significant degree. This may be due to the small sample size (N=10, p=.066 -> N=40, p=.039)\n</span> \n\nWith a very small sample, you are likely to get a non-significant F statistic even if the effect really exists in the population","x":-8000,"y":-891,"width":620,"height":1271},
		{"id":"d304adf823cba61d","type":"text","text":"## Multiple regression\n\n## Regression table\n\n![](175.Pasted%20image%2020241002101545.png)\n#### df = N-k = 30-4 =26\n\nk: the number of parameters (beta coefficients)\n\n---\n## Visualizing the results\n\n\tSimple regression can be shown in two dimensions,\n\tbut multiple regression (sleep, study, exam score) would need a 3D plot, which is hard to read \n\tSo one of the variables has to be discretized\n\tThat is, pick one of the regressors and discretize it\n\n![](175.Pasted%20image%2020241002103636.png)\n\n\n---\n## Precise interpretation of $\\beta$ in regression\n\n![](175.Pasted%20image%2020241002121258.png)\n\n<span style=\"color:rgb(205, 205, 81)\">Interpretation:</span> $\\beta_2$<span style=\"color:rgb(230, 122, 122)\"> reflects, with all other variables held constant</span>, <span style=\"color:rgb(116, 195, 194)\">the effect of a change in <span style=\"color:rgb(236, 158, 111)\">h</span> on <span style=\"color:rgb(118, 147, 234)\">y</span></span>\n","x":-8720,"y":-891,"width":720,"height":1331},
		{"id":"eec2ee1300722629","type":"text","text":"## Standardized regression coefficients\n\n## Difficulties with interpreting $\\beta$'s\n\n<span style=\"color:rgb(118, 147, 234)\">Unstandardized </span>$\\beta$ <span style=\"color:rgb(118, 147, 234)\">coefficients change with the scale of the independent variable (IV)</span>\n\n<span style=\"color:rgb(116, 195, 194)\">Unstandardized</span> $\\beta$<span style=\"color:rgb(116, 195, 194)\"> coefficients can be difficult or impossible to compare across variables</span>\n\n<span style=\"color:rgb(205, 205, 81)\">\nBecause of these difficulties,</span> $\\beta$<span style=\"color:rgb(205, 205, 81)\"> needs to be standardized</span>\n\n![](176.Pasted%20image%2020241002133954.png)\n\n![](176.Pasted%20image%2020241002134755.png)\n\n---\n## The scales of the $\\beta$'s\n\n<span style=\"color:rgb(118, 147, 234)\">Unstandardized</span> $\\beta$ <span style=\"color:rgb(118, 147, 234)\">coefficients reflect the scale of the data (independent and dependent variables). This can make interpretation easier, but it can get in the way of comparisons across variables or models\n</span>\n\n<span style=\"color:rgb(116, 195, 194)\">Standardized</span> $\\beta$<span style=\"color:rgb(116, 195, 194)\"> coefficients are expressed in standard-deviation units, independent of the scale of the data.</span>\n\n<span style=\"color:rgb(205, 205, 81)\">Both are correct, and neither is better. Depending on the situation, one may be more natural or easier to interpret</span>\n\n<span style=\"color:rgb(236, 158, 111)\">Importantly, standardization does not affect the statistics themselves!</span> \n","x":-9460,"y":-891,"width":740,"height":1191},
		{"id":"1bc5d6bcd3910548","type":"text","text":"## Polynomial regression models\n\n## Polynomial regression\n\n![](178.Pasted%20image%2020241003084112.png)\n\tk-th order polynomial regression\n\n---\n## Model order selection, thanks to Bayes\n\n![](178.Pasted%20image%2020241003091649.png)\n\n$n\\ln(SS_\\epsilon)$ : the natural log of the residual sum of squares, multiplied by n\nn : number of data points\n$\\ln$ : natural log \n\n$k\\ln(n)$ : the natural log of n, multiplied by k. This term is a penalty that grows with the number of parameters\nk : number of parameters\n\n\tThe next step is to fit models of many different orders.\n\tThen plot the BIC of each model\n\n![](178.Pasted%20image%2020241003093248.png)\n\tWhat we are looking for is the point where the BIC is lowest\n\tThe order with the minimum BIC is the best model order\n","x":-10080,"y":-891,"width":620,"height":1251},
		{"id":"0f3700fce98adede","type":"text","text":"## The most important equation in regression\n\n# $$\\boldsymbol{y = X\\beta}$$","x":-6800,"y":-1017,"width":656,"height":126,"color":"1"},
		{"id":"51ec9868b695bbd3","type":"text","text":"## Logistic regression\n\n## Why take the log of probabilities?\n\n![](181.Pasted%20image%2020241004092725.png)\n\n$\\frac{p}{1-p}$ : odds\n\n<span style=\"color:rgb(236, 158, 111)\">Main point:</span> <span style=\"color:rgb(205, 205, 81)\">The log of small values has a larger dynamic range, which is easier to work with in optimization problems</span>\n\n\tOn the blue line (the odds), differences between small values are too tiny to tell apart. \n\tTaking the natural log makes those differences clear.\n\n---\n## Example logistic regression\n\n\tThe probability that each individual passes\n\n![](181.Pasted%20image%2020241004102951.png)\n\n\tLeft of the center line are the people who failed the exam; right of it are the people who passed\n\nQuadrants 2 and 3: people who passed/failed as the model predicted","x":-10660,"y":-891,"width":580,"height":1311},
		{"id":"f5a611e3bdb96dd9","type":"text","text":"## Under- and over-fitting\n\n## Over-and under-fitting : summary\n\n#### <span style=\"color:rgb(118, 147, 234)\">Overfitting</span>\n- <span style=\"color:rgb(230, 122, 122)\">Overly sensitive to noise.</span>\n- <span style=\"color:rgb(116, 195, 194)\">Increased sensitivity to subtle effects</span>\n- <span style=\"color:rgb(230, 122, 122)\">Reduced generalizability</span>\n- <span style=\"color:rgb(230, 122, 122)\">Models with many parameters are difficult to estimate</span>\n#### <span style=\"color:rgb(118, 147, 234)\">Underfitting</span> \n- <span style=\"color:rgb(116, 195, 194)\">Less sensitive to outliers</span>\n- <span style=\"color:rgb(230, 122, 122)\">Less likely to detect true effects</span>\n- <span style=\"color:rgb(230, 122, 122)\">Reduced generalizability</span>\n- <span style=\"color:rgb(116, 195, 194)\">Parameters are better estimated</span>\n- <span style=\"color:rgb(116, 195, 194)\">Good results with less data</span>","x":-11180,"y":-891,"width":520,"height":511},
		{"id":"3c18b91d4f472713","type":"text","text":"## Comparing \"nested\" models\n\n## F test for model comparison\n\n# $$ SS_{\\epsilon} = \\sum(y_i - \\hat{y}_i)^2$$\n<span style=\"color:rgb(205, 205, 81)\">Sum of squared errors:</span> measures how well the model fits the data (smaller is better)\n\n![](185.Pasted%20image%2020241006092440.png)\np , k = number of parameters in each model\n\n\tThe better a model fits the data, the smaller its sum of squares\n\tSo the difference between the two models' sums of squares is expected to be positive\n\n<span style=\"color:rgb(205, 205, 81)\">F is statistically significant</span>: the additional parameters improve the model. \n<span style=\"color:rgb(116, 195, 194)\">Prefer the more complex model.</span>\n\n<span style=\"color:rgb(205, 205, 81)\">F is not significant:</span> the model with fewer parameters fits as well as the model with more parameters.\n<span style=\"color:rgb(118, 147, 234)\">Prefer the simpler model</span> \n","x":-11760,"y":-891,"width":580,"height":791},
		{"id":"74ebdb5ffbd351f5","type":"text","text":"## What to do about missing data\n\n## Missing data, option 1: complete removal\n\n![](186.Pasted%20image%2020241006101825.png)\n\n<span style=\"color:rgb(118, 147, 234)\">Used for:</span>\n- <span style=\"color:rgb(116, 195, 194)\">Paired data</span>\n- <span style=\"color:rgb(116, 195, 194)\">Analyses across variables</span> \n\n\tAppropriate when you have plenty of data\n\n---\n## Missing data, option 2: selective removal\n\n![](186.Pasted%20image%2020241006102235.png)\n\n<span style=\"color:rgb(118, 147, 234)\">Used for:</span>\n- <span style=\"color:rgb(116, 195, 194)\">Unpaired data</span>\n- <span style=\"color:rgb(116, 195, 194)\">Within-variable analyses</span> \n\n---\n## Missing data, option 3: replacement\n\n![](186.Pasted%20image%2020241006102613.png)\n\n<span style=\"color:rgb(118, 147, 234)\">Used for:</span>\n- <span style=\"color:rgb(116, 195, 194)\">Small datasets</span> \n\nNot necessarily a good approach. Replaced values are not real observations: filling in the mean distorts the distribution (it artificially shrinks the variance) and can add error to later analyses\n\n---\n## Missing data, option 4: prediction\n\n![](186.Pasted%20image%2020241006102908.png)\n\n<span style=\"color:rgb(118, 147, 234)\">Used for:</span>\n- <span style=\"color:rgb(116, 195, 194)\">Datasets with enough data and columns for a predictive model to produce useful estimates</span> \n\nIf there are not many columns, this is no better than mean replacement\n","x":-12320,"y":-891,"width":560,"height":2311},
		{"id":"1fa6220ffd6a62d4","type":"text","text":"16. Statistical power and sample sizes","x":-5968,"y":-2400,"width":250,"height":100,"color":"4"},
		{"id":"6a09c05663bc590a","type":"text","text":"## What is statistical power and why is it important?\n\n## Statistical decisions\n\n![](187.Pasted%20image%2020241006120839.png)\n\nStatistical power = 1-$\\beta$\n\n<span style=\"color:rgb(205, 205, 81)\">What is statistical power? </span> \n<span style=\"color:rgb(118, 147, 234)\">Power is the probability of rejecting the null hypothesis when the null hypothesis is actually false</span> \n<span style=\"color:rgb(118, 147, 234)\">a.k.a., the probability of finding an effect when the effect really exists in the world</span> \n\n<span style=\"color:rgb(116, 195, 194)\">Expressed as a probability (0~1, 0%~100%)</span>\n\n<span style=\"color:rgb(116, 195, 194)\">You want statistical power to be as high as possible</span> \n\n\n---\n## How to maximize power(statistical power)\n\n<span style=\"color:rgb(116, 195, 194)\">Power increases with:</span>\n1. <span style=\"color:rgb(116, 195, 194)\">Sample size</span>\n2. <span style=\"color:rgb(116, 195, 194)\">Effect size</span>\n3. <span style=\"color:rgb(116, 195, 194)\">A more lenient (larger) </span>$\\alpha$<span style=\"color:rgb(116, 195, 194)\">(e.g., p</span><<span style=\"color:rgb(116, 195, 194)\">.1)</span> \n\n<span style=\"color:rgb(230, 122, 122)\">Power decreases with:</span>\n1. <span style=\"color:rgb(230, 122, 122)\">Variability</span>\n2. <span style=\"color:rgb(230, 122, 122)\">A stricter (smaller)</span> $\\alpha$<span style=\"color:rgb(230, 122, 122)\">(e.g., p</span><<span style=\"color:rgb(230, 122, 122)\">.01)</span> \n","x":-6663,"y":-2400,"width":695,"height":1020},
		{"id":"4ad3be5559780e91","type":"text","text":"## Estimating statistical power and sample size\n## Is it all so simple?\n\n<span style=\"color:rgb(118, 147, 234)\">The formulas for more complex statistical models (ANOVA, multiple regression) are more complicated, but</span> <span style=\"color:rgb(116, 195, 194)\">the principle is the same</span>\n\n<span style=\"color:rgb(205, 205, 81)\">In practice, it is best to use an online statistical power calculator</span>","x":-7102,"y":-2400,"width":439,"height":284},
		{"id":"b575bb375d9dcdf9","type":"text","text":"## K-means clustering\n\nBasic k-means algorithm\n\n1. <span style=\"color:rgb(118, 147, 234)\">Choose a value for k.</span> \n2. <span style=\"color:rgb(116, 195, 194)\">Create k centroids at random locations in the data space</span> \n3. <span style=\"color:rgb(205, 205, 81)\">Compute the squared distance (error) from each data point to each of the k centroids</span> \n4. <span style=\"color:rgb(236, 158, 111)\">Assign each data point to its nearest centroid</span> \n5. <span style=\"color:rgb(230, 122, 122)\">Move each centroid to the mean of all data points assigned to it</span> \n6. Repeat steps 3-5 until convergence\n\n---\n## Difficulties with k-means\n\n1. <span style=\"color:rgb(118, 147, 234)\">The appropriate k can be hard to know in advance</span>\n2. <span style=\"color:rgb(116, 195, 194)\">Evaluating whether a given k is appropriate is also difficult</span>\n3. <span style=\"color:rgb(205, 205, 81)\">Multidimensional clustering can be difficult or impossible to visualize.</span>\n4. <span style=\"color:rgb(236, 158, 111)\">Repeated runs of the algorithm can give different results.</span>\n5. <span style=\"color:rgb(230, 122, 122)\">Can be suboptimal when clusters differ in size.</span>\n6. Not all clusters are distance-based.","x":-7640,"y":-2400,"width":538,"height":600},
		{"id":"bacd139092bca522","type":"text","text":"## Clustering via dbscan\n\n---\n## What is dbscan?\n\n<span style=\"color:rgb(118, 147, 234)\">Density-based spatial clustering of applications with noise</span>\n\n![](194.Pasted%20image%2020241008124512.png)\n\n---\n## Overview of dbscan algorithm\n\n![](194.Pasted%20image%2020241008125250.png)\n\n\tPick a random data point and check whether other data points lie within a given distance of it (epsilon)\n\tA minimum number of points can be set as a parameter: a group counts as a cluster only if at least n points are together (minimum points)\n---\n## Dbscan parameters\n\n<span style=\"color:rgb(205, 205, 81)\">Epsilon(</span>$\\epsilon$<span style=\"color:rgb(205, 205, 81)\">)</span>\n1. <span style=\"color:rgb(236, 158, 111)\">The step size used to search for cluster members</span>\n2. <span style=\"color:rgb(236, 158, 111)\">Too small -> a cluster gets split into several</span>\n3. <span style=\"color:rgb(236, 158, 111)\">Too large -> separate clusters get merged into one</span>\n\n<span style=\"color:rgb(118, 147, 234)\">Minimum points</span>\n1. <span style=\"color:rgb(236, 158, 111)\">The minimum number of points required to count as a cluster</span>\n2. <span style=\"color:rgb(236, 158, 111)\">Too small -> many clusters are created</span>\n3. <span style=\"color:rgb(236, 158, 111)\">Too large -> small clusters that really exist are ignored</span>","x":-8280,"y":-2400,"width":640,"height":1446},
		{"id":"cdb3c3ceb48c8262","type":"text","text":"- Elbow test\n- Silhouette test","x":-7281,"y":-2480,"width":179,"height":80,"color":"3"},
		{"id":"763c8e1022ef12f6","type":"text","text":"## k-means vs. dbscan\n\n<span style=\"color:rgb(118, 147, 234)\">K-means :</span>\n- <span style=\"color:rgb(116, 195, 194)\">Based on distance to the centroids</span>\n- <span style=\"color:rgb(205, 205, 81)\">Considers global distances</span>\n- <span style=\"color:rgb(236, 158, 111)\">You specify the number of clusters; the algorithm determines the distance thresholds</span>\n- <span style=\"color:rgb(230, 122, 122)\">Works well for spherical clusters</span>\n- <span style=\"color:rgb(116, 195, 194)\">Sensitive to scaling differences across dimensions</span>\n- <span style=\"color:rgb(205, 205, 81)\">Every point is assigned to a cluster</span>\n\n<span style=\"color:rgb(118, 147, 234)\">Dbscan:</span>\n- <span style=\"color:rgb(116, 195, 194)\">Based on distance to neighbors</span>\n- <span style=\"color:rgb(205, 205, 81)\">Based entirely on local distances</span>\n- <span style=\"color:rgb(236, 158, 111)\">You specify the distance threshold; the algorithm determines the number of clusters</span>\n- <span style=\"color:rgb(230, 122, 122)\">Works well for clusters of any shape</span>\n- <span style=\"color:rgb(116, 195, 194)\">Extremely sensitive to scaling and to differences between clusters</span>\n- <span style=\"color:rgb(205, 205, 81)\">Points can remain unlabeled</span> ","x":-7874,"y":-2920,"width":468,"height":520},
		{"id":"7137f322fc59646b","type":"text","text":"## K-nearest neighbor classification\n\n## The goal of KNN categorization\n\n![](197.Pasted%20image%2020241010090940.png)\n\n\tThere is a blue team and a red team. When a new green circle arrives, does this new data point belong to the blue team or the red team?\n\tWith k=3, find the 3 data points nearest to the new point and check which team has more of them. -> red team for k=3, blue team for k=5\n\n\tKNN starts from data that are already labeled\n\tK-means starts from unlabeled data\n","x":-8840,"y":-2400,"width":560,"height":723},
		{"id":"9d88880dd2648304","type":"text","text":"## Principal components analysis(PCA)\n\n## The idea of PCA\n\n![199.Pasted image 20241011122142](../pic/16.%20Clustrering%20and%20dimension-reduction/199.Pasted%20image%2020241011122142.png)\n\n\tThe important axes in these data are not the x and y axes\n\tThe data have other axes (orange, red) that capture the correlation structure much better\n\n\tThe idea of PCA is to identify vectors, i.e., directions in the multivariate correlation space\n\n\tThe same data look like this in PC (principal component) space\n![199.Pasted image 20241011122552](../pic/16.%20Clustrering%20and%20dimension-reduction/199.Pasted%20image%2020241011122552.png)\n\n\tInstead of using the x,y variables as dimensions, use PC1,PC2 as dimensions\n\tThe data are unchanged; only the orientation is adjusted\n\t\n\n\n\tOne thing you can do after applying PCA is data compression\n\tData compression: flattening some of the dimensions of the data space\n\n","x":-9320,"y":-2400,"width":480,"height":1446,"color":"3"},
		{"id":"fda4a00aad26bb87","type":"text","text":"## Independent components analysis (ICA)\n\n## ICA assumptions\n\tICA starts from two assumptions: the source signals follow non-Gaussian distributions, and random mixtures of the signals tend toward a Gaussian distribution\n\n![](202.Pasted%20image%2020241015104617.png)\n\n---\n## How ICA works\n\n![](202.Pasted%20image%2020241015105205.png)\n\n<span style=\"color:rgb(116, 195, 194)\">Whiten the data to remove covariance.</span> \n\n\tWhitening: a step carried out via PCA; essentially it transforms the covariance matrix into one whose off-diagonal elements are all 0\n\n![](202.Pasted%20image%2020241015105225.png)\n\n\tAn example of the whitened data. You can see that the variables are still related\n\tICA finds these relationships \n\tThe data themselves, however, have had all linear correlation removed\n\tSo the correlation between PC1 and PC2 is 0\n\n\n<span style=\"color:rgb(118, 147, 234)\">Linear dependencies are removed, but shared information remains.</span> \n\n<span style=\"color:rgb(236, 158, 111)\">Rotate the axes (obliquely) to minimize the shared information</span> \n\n![](202.Pasted%20image%2020241015110807.png)\n","x":-9840,"y":-2400,"width":520,"height":1446,"color":"3"},
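		{"id":"60ab1126536ec931","type":"text","text":"**PCA** \nA technique that reduces dimensionality along the **directions that maximize variance** to extract the **main features of the data**\n\n **ICA**\nA technique that separates mixed data (Gaussian), made up of **independent signals** that follow **non-Gaussian distributions**, to recover the **original signals**","x":-9500,"y":-2600,"width":360,"height":200},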
		{"id":"bd6713aa474d7938","type":"text","text":"17. Signal detection theory","x":-5960,"y":-5680,"width":268,"height":91,"color":"4"},
		{"id":"99c63828f3413e94","type":"text","text":"## The two perspectives of the world\n\n## Experiment: Do <span style=\"color:rgb(118, 147, 234)\">you</span><span style=\"color:rgb(118, 147, 234)\"><span style=\"color:rgb(118, 147, 234)\"> see the light?\n</span></span>\n![](204.Pasted%20image%2020241015142041.png)\n\n\tSignal detection theory uses formal terms for the four response categories\n","x":-6480,"y":-5680,"width":520,"height":600},
		{"id":"fd4934550d8fa21e","type":"text","text":"## d-prime\n\nd-prime (d')\n\n<span style=\"color:rgb(116, 195, 194)\">d-prime is a measure of discriminability</span> \n\n<span style=\"color:rgb(118, 147, 234)\">We need to distinguish all 'yes' responses (hits + false alarms) from correct 'yes' responses (hits)</span>\n\n<span style=\"color:rgb(236, 158, 111)\">d' is therefore a more accurate measure of performance</span>\n\n---\n## Algorithm to compute d'\n\n1. <span style=\"color:rgb(116, 195, 194)\">Convert hits and false alarms to proportions of the total number of actual events of each type</span> \n![](205.Pasted%20image%2020241016075602.png)\n\n#### Probability of a hit\n\n<span style=\"color:rgb(116, 195, 194)\">p(H)</span> = <span style=\"color:rgb(118, 147, 234)\">Hit</span> / <span style=\"color:rgb(205, 205, 81)\">(Hit+Miss)</span> \n\n- Note : this is simply a conditional probability! \n- p('yes' | Present)\n\n#### Probability of a false alarm\n\n<span style=\"color:rgb(116, 195, 194)\">p(FA)</span> = <span style=\"color:rgb(118, 147, 234)\">FA</span> / <span style=\"color:rgb(205, 205, 81)\">(FA+CR)</span>\n\n---\n\n2. <span style=\"color:rgb(118, 147, 234)\"> Convert the proportions to standard z-scores</span>\n\n\tConverting probabilities to standard z-scores gives more sensitivity to extreme values\n\n![](205.Pasted%20image%2020241016080829.png)\n\n---\n3. <span style=\"color:rgb(236, 158, 111)\">d' = <span style=\"color:rgb(118, 147, 234)\">z(H)</span> - <span style=\"color:rgb(118, 147, 234)\">z(FA)</span></span>\n\n![](205.Pasted%20image%2020241016080944.png)\n","x":-7000,"y":-5680,"width":520,"height":2080},
		{"id":"2269645d314e17b9","type":"text","text":"## Response bias\n\n## Analysis: Compute response bias\n\n1. <span style=\"color:rgb(116, 195, 194)\">Convert hits and false alarms to proportions: hits out of all signal-present trials, false alarms out of all signal-absent trials</span>\n\t![](207.Pasted%20image%2020241016113635.png)\n\n2. <span style=\"color:rgb(118, 147, 234)\">Convert the proportions to standard z-scores</span>\n\t![](207.Pasted%20image%2020241016113648.png)\n\n3. <span style=\"color:rgb(236, 158, 111)\">Take the negative mean: </span><span style=\"color:rgb(236, 158, 111)\">-[z(FA)+z(H)]/2</span> \n\t![](207.Pasted%20image%2020241016113708.png)\n\n---\n## How to interpret response bias?\n\n![](207.Pasted%20image%2020241016114605.png)\n\n\t0 : no bias\n\tPositive : tends to say 'no' more often / more misses and correct rejections\n\tNegative : tends to say 'yes' more often / more hits and false alarms\n\n","x":-7480,"y":-5680,"width":480,"height":1040},
		{"id":"44daa3ba7d0e9cc9","type":"text","text":"## d' : how well the participant distinguishes signal from noise","x":-7000,"y":-5800,"width":520,"height":120,"color":"1"},
		{"id":"e327b5ab4ca4a6fb","type":"text","text":"## Response bias: which way the participant's responses are biased","x":-7480,"y":-5800,"width":480,"height":120,"color":"1"},
		{"id":"a89419e6a75d1b8e","type":"text","text":"## F-score\n\n## Precision and recall\n\n![](209.Pasted%20image%2020241016122516.png)\n### <span style=\"color:rgb(236, 158, 111)\">Precision</span>  = $\\frac{\\text{Hits}}{\\text{Hits+FA}}$\n\n### <span style=\"color:rgb(205, 205, 81)\">Recall</span> = $\\frac{\\text{Hits}}{\\text{Hits+Miss}}$\n\n---\n## F-score\n\n![](209.Pasted%20image%2020241016122824.png)\n\n\tNo matter how many hits there are, the value can never exceed 1\n\tIf all response categories have the same count, the F-score is .5\n\n\tHigher values mean higher precision and recall -> better performance\n","x":-7920,"y":-5680,"width":440,"height":1040},
		{"id":"2baad504b71e237f","type":"text","text":"## F-score:  how high precision and recall are","x":-7920,"y":-5800,"width":440,"height":120,"color":"1"},
		{"id":"72b222f0d1635c08","type":"text","text":"## ROC curves : visualize a model's classification performance, reflecting d' and response bias","x":-8560,"y":-5800,"width":640,"height":120,"color":"1"},
		{"id":"3fcb15c6414985c4","type":"text","text":"## Receiver operating characteristics (ROC)\n\n## Isosensitivity curves in \"yes\" space\n\n<span style=\"color:rgb(118, 147, 234)\">Motivation, part 1:</span> <span style=\"color:rgb(116, 195, 194)\">The same d' value can be obtained from many different combinations of p(H) and p(FA)</span>\n\n<span style=\"color:rgb(118, 147, 234)\">Motivation, part 2:</span> <span style=\"color:rgb(116, 195, 194)\">Hits and false alarms are separate events. </span>\n\n<span style=\"color:rgb(118, 147, 234)\">Therefore:</span> <span style=\"color:rgb(236, 158, 111)\">We can build a 2D space based on hits and false alarms.</span> \n\n![](210.Pasted%20image%2020241016143443.png)\n\n\tWhen d' is 1, the hit rate can be anywhere from 25% to 90% -> this is motivation 1\n\n![](Pasted%20image%2020241016143931.png)\n\n\tResponse bias can go below 0\n\n\tThese d' and response-bias curves can be drawn from real data, and can be drawn anywhere in this space\n","x":-8560,"y":-5680,"width":640,"height":1395},
		{"id":"094c1d5eb8984b1f","x":-6600,"y":-7012,"width":632,"height":1132,"type":"text","text":"# 219. Marriage and divorce data in the US\n## Code\n\n```python\nimport numpy as np\nimport matplotlib.pyplot as plt \nimport scipy.stats as stats\nfrom sklearn.decomposition import PCA\nimport pandas as pd\n```\n\n```python\n# data urls\n\nmarriage_url = 'https://www.cdc.gov/nchs/data/dvs/state-marriage-rates-90-95-99-19.xlsx'\ndivorce_url  = 'https://www.cdc.gov/nchs/data/dvs/state-divorce-rates-90-95-99-19.xlsx'\n```\n\n```python\ndata = pd.read_excel(marriage_url,header=5)\ndata\n```\n\n```python\n# remove irrelevant rows\ndata.drop([0,52,53,54,55,56,57],axis=0,inplace=True)\ndata\n```\n\n```python\n# replace --- with nan\ndata = data.replace({'---':np.nan})\ndata\n```\n\n```python\n# replace nan's with column median\n# the median is used on the assumption that marriage rates are roughly similar across states in a given year\n# with outliers present, the median (not the mean) keeps the fill value from being pulled by them\ndata.fillna(data.iloc[:,1:].median(), inplace=True)\ndata\n```\n\n```python\n# extract to matrices\nyearM = data.columns[1:].to_numpy().astype(float)\nyearM\n\nstatesM = data.iloc[:,0]\nstatesM\n# convert the DataFrame to a NumPy array (the instructor's personal preference)\n# astype(float) guards against columns left as object dtype after the '---' replacement\nM = data.iloc[:,1:].to_numpy().astype(float)\nnp.round(M,2)\n```\n\n```python\n# make some plots\n\nfig, ax = plt.subplots(3,1,figsize=(8,5))\n\n# x: year, y: marriages per 1,000 people in each state\nax[0].plot(yearM,M.T)\nax[0].set_ylabel('M. rate (per 1k)')\nax[0].set_title('Marriage rates over time')\n\n# in the 'Marriage rates over time' plot, one state (yellow) has unusually high values; this outlier makes the other states look flat by comparison\n# so the data are z-normalized to put every state on the same numerical range before plotting\n# the data show that marriage rates have declined overall in every state over roughly 30 years\nax[1].plot(yearM,stats.zscore(M.T))\nax[1].set_ylabel('M. rate (per 1k)')\nax[1].set_title('M. 
rate (z-norm)')\n\n# notice that x-axis is non-constant\n# mean over the rows of M (one row per state) -> the average across all states for each year\n# the square markers show that the years on the x-axis are not evenly spaced -> 1990, 1995, then yearly data from 1999 on\nax[2].plot(yearM,np.mean(M,axis=0),'ks-',markerfacecolor='w',markersize=8)\nax[2].set_xlabel('Year')\nax[2].set_ylabel('M. rate (per 1k)')\nax[2].set_title('State-average')\n# QUESTION: Is this the same as the US average?\n# Me : not exactly. Each state's rate is already normalized per 1,000 people, but this unweighted mean counts every state equally regardless of its population.\n# The US rate is the population-weighted mean of the state rates, so the two match only if all states had the same population. With raw marriage counts per state and year, you would sum the counts and divide by the total population\nplt.tight_layout()\nplt.show()\n\n# in the image below, values appear to increase from left to right, but the DataFrame lists the years in reverse order, so the time axis must be read right to left\n# the line plots place each point at its x value (the year), so they run in ascending order\n# imshow draws the columns in stored order, without sorting\nplt.imshow(stats.zscore(M,axis=1),aspect='auto')\nplt.xticks([])\nplt.xlabel('Year')\nplt.ylabel('State index')\nplt.colorbar()\nplt.show()\n```\n![](222.Pasted%20image%2020241017212520.png)\n![](222.Pasted%20image%2020241017212531.png)\n\n```python\n# barplot of average marriage rate\n\n# average over time\n# the [State-average] plot above used the mean across states for each year [np.mean(M,axis=0)]; here we take each state's mean across years [np.mean(M,axis=1)]\nmeanMarriageRate = np.mean(M,axis=1)\n\n# sort index\nsidx_M = np.argsort(meanMarriageRate)\n\nfig= plt.figure(figsize=(12,4))\nplt.bar(statesM.iloc[sidx_M],meanMarriageRate[sidx_M])\nplt.xticks(rotation=90)\nplt.ylabel('M. rate (per 1k)')\nplt.title('Marriage rates per state')\nplt.show()\n\n# QUESTION:\n# IS Nevada a non-representative datapoint or error?\n# ME : the value is unusual and not representative, but it is valid data, so there is no reason to remove it. 
It simply differs a lot from the other states\n\n\n# Emphasis: it is important to visualize the data in many different ways.\n#       Visualization lets you inspect the data, spot patterns, and find unusual, non-representative outliers\n```\n![](222.Pasted%20image%2020241017212550.png)\n\n```python\n# show the correlation matrix\n\n# imshow assumes the axis values change linearly\n# strictly speaking this plot is not accurate (there are no data between 1990 and 1995, but the image is drawn as if there were)\n\n# seeing a correlation matrix in which every element is strongly correlated, the instructor wonders how many patterns are really in these data.\n# perhaps a single feature is enough to explain the whole dataset.\n# PCA is run to check this possibility\nplt.imshow(np.corrcoef(M.T),vmin=.9,vmax=1,\n    extent=[yearM[0],yearM[-1],yearM[-1],yearM[0]])\n\nplt.colorbar()\nplt.show()\n```\n![](222.Pasted%20image%2020241017212609.png)\n\n```python\n# PCA\n\npca = PCA().fit(M)\n\n# scree plot\n# the reason for running PCA here is to examine the scree plot\n# the scree plot shows the eigenvalues, here as the percent of variance each component explains\nplt.plot(100*pca.explained_variance_ratio_,'ks-',markerfacecolor='w',markersize=10)\nplt.ylabel('Percent variance explained')\nplt.xlabel('Component number')\nplt.title('PCA scree plot of marriage data')\nplt.show()\n\nprint(100*pca.explained_variance_ratio_[0])\n# the plot shows that the first component explains almost all of the variance.\n# the first component explains 98.2% of the variance, i.e. a single feature of the data can explain the whole dataset\n# this matches the overall decline over time seen in the data, so it is not surprising.\n```\n![](222.Pasted%20image%2020241017212628.png)\n\n## Repeat for divorce data\n```python\n# import the data\n# normally it is not recommended to import, process, clean, and fill the data all in one code cell\n# it is better to inspect the data after each line of code runs\ndata = pd.read_excel(divorce_url,header=5)\ndata.drop([0,52,53,54,55,56,57],axis=0,inplace=True)\ndata = data.replace({'---':np.nan})\ndata.fillna(data.iloc[:,1:].median(),inplace=True)\n\n# need to check that the divorce data line up exactly with the marriage data (same years and states)\nyearD = data.columns[1:].to_numpy().astype(float)\nstatesD = data.iloc[:,0]\n# astype(float) guards against columns left as object dtype after the '---' replacement\nD = data.iloc[:,1:].to_numpy().astype(float)\n```\n\n```python\n# make some plots\n\nfig, ax = plt.subplots(3,1,figsize=(8,5))\n\nax[0].plot(yearD,D.T)\nax[0].set_ylabel('D. rate (per 1k)')\nax[0].set_title('Divorce rates over time')\n\nax[1].plot(yearD,stats.zscore(D.T))\nax[1].set_ylabel('D. 
rate (per 1k)')\nax[1].set_title('D. rate (z-norm)')\n\n# notice that x-axis is non-constant\nax[2].plot(yearD,np.mean(D,axis=0),'ks-',markerfacecolor='w',markersize=8)\nax[2].set_xlabel('Year')\nax[2].set_ylabel('D. rate (per 1k)')\nax[2].set_title('State-average')\n\nplt.tight_layout()\nplt.show()\n\nplt.imshow(stats.zscore(D,axis=1),aspect='auto')\nplt.xticks([])\nplt.xlabel('Year')\nplt.ylabel('State index')\nplt.colorbar()\nplt.show()\n\n\nmeanDivorceRate = np.mean(D,axis=1)\n# sort index\nsidx_D = np.argsort(meanDivorceRate)\n\nfig= plt.figure(figsize=(12,4))\nplt.bar(statesD.iloc[sidx_D],meanDivorceRate[sidx_D])\nplt.xticks(rotation=90)\nplt.ylabel('D. rate (per 1k)')\nplt.title('Divorce rates per state')\nplt.show()\n\n\n# unlike marriage, divorce varies more -> divorce rates change more dynamically than marriage rates\n# with vmin=.9 the colors saturated and hid the actual values, so vmin was lowered\nplt.imshow(np.corrcoef(D.T),vmin=.7,vmax=1, # vmin changed from .9 to .7\n    extent=[yearD[0],yearD[-1],yearD[-1],yearD[0]])\n\nplt.colorbar()\nplt.show()\n\n# PCA\n\npca = PCA().fit(D)\n\n# scree plot\n# most of the variance is explained by one component (the overall decline over time), but other factors are present too; hard to interpret clearly\nplt.plot(100*pca.explained_variance_ratio_,'ks-',markerfacecolor='w',markersize=10)\nplt.ylabel('Percent variance explained')\nplt.xlabel('Component number')\nplt.title('PCA scree plot of divorce data')\nplt.show()\n\nprint(100*pca.explained_variance_ratio_[0])\n```\n![](222.Pasted%20image%2020241017212728.png)\n![](222.Pasted%20image%2020241017212738.png)\n![](222.Pasted%20image%2020241017212754.png)\n![](222.Pasted%20image%2020241017212804.png)\n![](222.Pasted%20image%2020241017212815.png)\n\n```python\n# check if marriage and divorce datasets have the same year/state order\n\n# should be zero\nprint('Comparison of year vectors: ')\nprint(np.sum(yearD-yearM))\n\n# should be TRUE\nprint('')\nprint('Comparison of states vectors: ')\nprint(statesM.equals(statesD))\n# ... 
uh oh...\n\n# compare\ntmpStateNames = pd.concat([statesM,statesD],axis=1)\nprint(tmpStateNames)\n\n# find the difference: the output (array([4]),) means the value in the 5th row differs\nnp.where(tmpStateNames.iloc[:,0] != tmpStateNames.iloc[:,1])\n```\n\n```python\n# btw, you can also correlate over states\nfig = plt.figure(figsize=(12,12))\nplt.imshow(np.corrcoef(D),vmin=0,vmax=1)\nplt.xticks(ticks=range(len(statesD)),labels=statesD,rotation=90)\nplt.yticks(ticks=range(len(statesD)),labels=statesD)\nplt.colorbar()\nplt.show()\n```\n![](222.Pasted%20image%2020241017212841.png)\n\n## Now for some inferential statistics\n```python\n# Correlate M and D over time per state\n# correlation between marriage and divorce rates across years -> were each state's declines in marriage and divorce rates correlated over the 30 years?\n\n# significance threshold (the Bonferroni correction is currently commented out)\npvalThresh = .05#/51 # uncomment '/51' to Bonferroni-correct for the 50 states plus Washington DC -> a more conservative threshold\n\nfig = plt.figure(figsize=(6,10))\n\ncolor = 'rg'\nfor si in range(len(statesM)):\n\n    # compute correlation\n    r,p = stats.pearsonr(M[si,:],D[si,:])\n\n    # plot the data point\n    plt.plot([r,1],[si,si],'-',color=[.5,.5,.5])\n    plt.plot(r,si,'ks',markerfacecolor=color[bool(p<pvalThresh)])\n\nplt.ylabel('State')\nplt.xlabel('Correlation')\nplt.title('Marriage-divorce correlations per state')\nplt.yticks(range(len(statesM)),labels=statesD)\nplt.tick_params(axis='y',which = 'both',labelleft=False,labelright=True)\nplt.xlim([-1,1])\nplt.ylim([-1,51])\nplt.plot([0,0],[-1,51],'k--')\nplt.show()\n\n## most states show a positive correlation, but some are not significant (red markers). Montana and Minnesota have correlations close to 0. 
\n```\n![](222.Pasted%20image%2020241017212909.png)\n\n```python\n# have marriage/divorce rates really declined over time?\n\nfig,ax = plt.subplots(2,1,figsize=(12,6))\n\n# initialize slope differences vector\nMvsD = np.zeros(len(statesM))\n\nfor rowi in range(len(statesM)):\n\n    # run regression (includes the intercept!)\n    bM,intercept,r,pM,seM = stats.linregress(yearM,M[rowi,:])\n    bD,intercept,r,pD,seD = stats.linregress(yearD,D[rowi,:])\n\n    # normalize beta coefficients\n    bM = bM / seM\n    bD = bD / seD\n\n    # plot the slope values\n    # a few values here are not significant, and in Washington DC the marriage rate increased over time\n    ax[0].plot([rowi,rowi],[bM,bD],'k')\n    ax[0].plot(rowi,bM,'ko',markerfacecolor=color[bool(pM<pvalThresh)])\n    ax[0].plot(rowi,bD,'ks',markerfacecolor=color[bool(pD<pvalThresh)])\n\n    # plot the slope differences\n    # a positive value means that state's divorce rate declined faster (in normalized units)\n    # the color is set only through the keyword, so the format string carries no color code\n    ax[1].plot([rowi,rowi],[bM-bD,0],'-',color=[.7,.7,.7])\n    ax[1].plot([rowi,rowi],[bM-bD,bM-bD],'o',color=[.7,.7,.7])\n\n    # store the slope differences for subsequent t-test\n    MvsD[rowi] = bM-bD\n\n\n# make the plot look nicer\nfor i in range(2):\n    ax[i].set_xticks(range(51))\n    ax[i].set_xticklabels(statesD,rotation=90)\n    ax[i].set_xlim([-1,51])\n    ax[i].plot([-1,52],[0,0],'k--')\n\nax[0].set_ylabel('Decrease per year (norm.)')\n# raw string so the backslash in the LaTeX label is not treated as an escape\nax[1].set_ylabel(r'$\\Delta$M - $\\Delta$D')\n\n\n### ttest on whether the M-vs-D rates are really different\nt,p = stats.ttest_1samp(MvsD,0)\ndf = len(MvsD)-1\n\n# set the title\nax[1].set_title('Marriage vs. divorce: t(%g)=%f, p =%f'%(df,t,p))\n\nplt.tight_layout()\nplt.show()\n\n# over the past 30 years fewer people married, but divorces fell even more\n# however, the effect is not large, and it does not appear in every state\n```\n![](222.Pasted%20image%2020241017212933.png)"},
		{"id":"d2ed46a5dfa0d658","x":-5968,"y":-7012,"width":250,"height":60,"color":"4","type":"text","text":"18. A real-world data journey"},
		{"id":"c9b788a5265b4673","x":-7200,"y":-7012,"width":600,"height":1132,"type":"text","text":"# 222. Take-home message\n\nThe morals of the story\n\n![](222.Pasted%20image%2020241017210012.png)\n\n![](222.Pasted%20image%2020241017210234.png)\n\n\tIn data work, it is essential to build an understanding of the data by iterating over visualization, cleaning, processing, and descriptive statistics.\n\n<span style=\"color:rgb(118, 147, 234)\">When in doubt, visualize:</span> <span style=\"color:rgb(116, 195, 194)\">often, and in several different ways</span>\n\n<span style=\"color:rgb(118, 147, 234)\">Watch out for missing or unusual data:</span> <span style=\"color:rgb(116, 195, 194)\">visualizing descriptive statistics will help</span>\n\n<span style=\"color:rgb(118, 147, 234)\">Run lots of sanity checks:</span> <span style=\"color:rgb(116, 195, 194)\">check data/matrix sizes, compare vectorized results against a for-loop, and compare descriptive statistics before and after cleaning. Check that results from different data selections really are different, and check for correlation coefficients of r=1.</span> \n\n<span style=\"color:rgb(236, 158, 111)\">Be critical:</span><span style=\"color:rgb(205, 205, 81)\"> Always assume something is wrong, and stay suspicious until that is shown to be false (be a scientist!)</span>\n\n<span style=\"color:rgb(236, 158, 111)\">Keep an open mind:</span> <span style=\"color:rgb(205, 205, 81)\">Form hypotheses, but do not assume they are right. Listen to what the data are \"telling you\"</span>\n\n<span style=\"color:rgb(236, 158, 111)\">Be curious:</span> <span style=\"color:rgb(205, 205, 81)\">Your first thought should always be <span style=\"color:rgb(255, 192, 0)\">\"Huh? Why is that?\"</span> and your second thought should be <span style=\"color:rgb(255, 192, 0)\">\"Can I test that?\"</span>\n</span> \n"}
	],
	"edges":[
		{"id":"ba7104d9704d54ef","fromNode":"281815e60a582d94","fromSide":"top","toNode":"34bb680a05836778","toSide":"bottom"},
		{"id":"2b5510b54d5fafd4","fromNode":"281815e60a582d94","fromSide":"top","toNode":"5c88d52ce34aeaae","toSide":"bottom"},
		{"id":"f6d3d6f43080f521","fromNode":"34bb680a05836778","fromSide":"top","toNode":"2969f5ddb96bf49b","toSide":"bottom"},
		{"id":"cdabf9758ec14e53","fromNode":"5c88d52ce34aeaae","fromSide":"top","toNode":"5222d78e38d88a93","toSide":"bottom"},
		{"id":"205dc56b451de602","fromNode":"ab4d9e7ca63e7486","fromSide":"left","toNode":"a118fa474c14b969","toSide":"right"},
		{"id":"e13f45f7601612a2","fromNode":"ab4d9e7ca63e7486","fromSide":"left","toNode":"8ad1c6fa167d7a35","toSide":"right"},
		{"id":"78a6bb85e055c453","fromNode":"ab4d9e7ca63e7486","fromSide":"left","toNode":"b0c9f43252f1e408","toSide":"right"},
		{"id":"6f0e4a525d97a58e","fromNode":"ab4d9e7ca63e7486","fromSide":"left","toNode":"ccca9dd81d60974a","toSide":"right"},
		{"id":"2adf5ce23d14db6d","fromNode":"71ecbdfd793dbabf","fromSide":"left","toNode":"ed980d6d0388996a","toSide":"right"},
		{"id":"dee17d7cf3a51937","fromNode":"71ecbdfd793dbabf","fromSide":"left","toNode":"a57da3e1a70f6ebe","toSide":"right"},
		{"id":"a00dfb9f71dfe808","fromNode":"ed980d6d0388996a","fromSide":"left","toNode":"2d4e4aa3b848a4c5","toSide":"right"},
		{"id":"4e2fcccefece5736","fromNode":"a57da3e1a70f6ebe","fromSide":"left","toNode":"14023f0cf1e85fbd","toSide":"right"},
		{"id":"1b8b674cfac4fdda","fromNode":"71ecbdfd793dbabf","fromSide":"left","toNode":"409fe67de148e573","toSide":"right"},
		{"id":"2886d2ff0c24b460","fromNode":"71ecbdfd793dbabf","fromSide":"left","toNode":"3c1d2086c07223ce","toSide":"right"},
		{"id":"7ca42d06343db597","fromNode":"3c1d2086c07223ce","fromSide":"bottom","toNode":"345c5ead364b2b84","toSide":"top"},
		{"id":"05f81fa737f4ac64","fromNode":"7e3e9268b61e5a13","fromSide":"right","toNode":"66fa53b713badf0f","toSide":"left"},
		{"id":"d0aa82e94a041a59","fromNode":"66fa53b713badf0f","fromSide":"top","toNode":"8baafc495bd276c3","toSide":"bottom"},
		{"id":"885304656e7311e2","fromNode":"8baafc495bd276c3","fromSide":"right","toNode":"cba44d1b74847bf4","toSide":"left"},
		{"id":"7d79a2bc78f46903","fromNode":"cba44d1b74847bf4","fromSide":"top","toNode":"a5eb80b88a01d1de","toSide":"bottom"},
		{"id":"8692192b10e9f8fb","fromNode":"a5eb80b88a01d1de","fromSide":"top","toNode":"d890a4060b61fe30","toSide":"bottom"},
		{"id":"16a75435c9d9a4b5","fromNode":"7e3e9268b61e5a13","fromSide":"bottom","toNode":"175649a14f111021","toSide":"top"},
		{"id":"01bb20a63540f336","fromNode":"231c162dbbecb5e1","fromSide":"top","toNode":"6c16b372f693d2fe","toSide":"bottom"},
		{"id":"69cec997a5295471","fromNode":"231c162dbbecb5e1","fromSide":"top","toNode":"089fd6758057dcea","toSide":"bottom"},
		{"id":"0ec0754b7cba4038","fromNode":"231c162dbbecb5e1","fromSide":"top","toNode":"3a70afecaf8ae196","toSide":"bottom"},
		{"id":"b56ea2a682a98838","fromNode":"3a70afecaf8ae196","fromSide":"top","toNode":"e22951f6e95680e0","toSide":"bottom"},
		{"id":"3744288203f73fc8","fromNode":"7dc27cb2f60da8cc","fromSide":"bottom","toNode":"0e393383ecc6ce9f","toSide":"top"},
		{"id":"4960b68ecc88ccc8","fromNode":"7dc27cb2f60da8cc","fromSide":"bottom","toNode":"60ce1c75c56393c6","toSide":"top"},
		{"id":"36a3483e4939b401","fromNode":"60ce1c75c56393c6","fromSide":"bottom","toNode":"7730212384478361","toSide":"top"},
		{"id":"611eacca76248ddc","fromNode":"7730212384478361","fromSide":"bottom","toNode":"fe5fe5d330c186f0","toSide":"top"},
		{"id":"394f71a14d3a73c8","fromNode":"7dc27cb2f60da8cc","fromSide":"bottom","toNode":"4373f07b891fca9f","toSide":"top"},
		{"id":"ae23fe16dea56f95","fromNode":"60ce1c75c56393c6","fromSide":"bottom","toNode":"2e19ce1f0f2a8ce5","toSide":"top"},
		{"id":"9cab41281a33afe1","fromNode":"2e19ce1f0f2a8ce5","fromSide":"bottom","toNode":"84f0f08170c49d34","toSide":"top"},
		{"id":"3e71eb12a42debfc","fromNode":"2e19ce1f0f2a8ce5","fromSide":"right","toNode":"93e92bc5bc7dfc49","toSide":"left"},
		{"id":"d02ecb4c61c61e4e","fromNode":"7dc27cb2f60da8cc","fromSide":"right","toNode":"5e9b2df1422c85e0","toSide":"left"},
		{"id":"ecbaf7a212d29836","fromNode":"2e19ce1f0f2a8ce5","fromSide":"top","toNode":"6d70b5b00c09a622","toSide":"bottom"},
		{"id":"6938e21a000a3ca7","fromNode":"7dc27cb2f60da8cc","fromSide":"right","toNode":"2a48f668ae88620f","toSide":"left"},
		{"id":"287c373155656aad","fromNode":"9d46a84c291b7f8c","fromSide":"bottom","toNode":"12abecb377e3f208","toSide":"top"},
		{"id":"6e4156cba0d98419","fromNode":"c5a87d762d153961","fromSide":"bottom","toNode":"c6a330fa810d7c4f","toSide":"top"},
		{"id":"10f298e422d8b87b","fromNode":"c5a87d762d153961","fromSide":"bottom","toNode":"f90a930e7decd24d","toSide":"top"},
		{"id":"a30296efbd2d3c4a","fromNode":"12abecb377e3f208","fromSide":"bottom","toNode":"ca6c48b0324ef94b","toSide":"left"},
		{"id":"b028970681da71a7","fromNode":"ca6c48b0324ef94b","fromSide":"right","toNode":"bbeb403717779fb4","toSide":"top"},
		{"id":"729b3587a48fa4f0","fromNode":"ca6c48b0324ef94b","fromSide":"right","toNode":"ebe205ad6c00e530","toSide":"bottom"},
		{"id":"e2c1a5ef792b1640","fromNode":"ca6c48b0324ef94b","fromSide":"bottom","toNode":"668f5a92425f848e","toSide":"top"},
		{"id":"d65785e7044aa7b8","fromNode":"c5a87d762d153961","fromSide":"top","toNode":"12abecb377e3f208","toSide":"bottom"},
		{"id":"25ef73a3a31d023d","fromNode":"668f5a92425f848e","fromSide":"right","toNode":"5dad725776417e94","toSide":"left"},
		{"id":"ac193c495d10443a","fromNode":"668f5a92425f848e","fromSide":"right","toNode":"f7975a8865713391","toSide":"left"},
		{"id":"a86894c068fa1623","fromNode":"a4d794af84de6967","fromSide":"bottom","toNode":"fd96fd05a96cf49f","toSide":"top"},
		{"id":"f683d88a2a97995a","fromNode":"fd96fd05a96cf49f","fromSide":"bottom","toNode":"038e0d07d23e5ee9","toSide":"top"},
		{"id":"68ac7859e84c0da2","fromNode":"b986d0c7d6610687","fromSide":"bottom","toNode":"9f3c6f07539d9378","toSide":"top"},
		{"id":"af0a9dc67601624d","fromNode":"a4d794af84de6967","fromSide":"top","toNode":"b986d0c7d6610687","toSide":"bottom"},
		{"id":"98b5a5f73078bfa6","fromNode":"9f3c6f07539d9378","fromSide":"bottom","toNode":"0c6067c38acd5e13","toSide":"top"},
		{"id":"3f0bb4977a0a3b3a","fromNode":"cb4b50ab3c27ed51","fromSide":"bottom","toNode":"fbf9f3f247c02df2","toSide":"top"},
		{"id":"62d0e20708de8104","fromNode":"df12c19288f30521","fromSide":"bottom","toNode":"ff916cc57e332c7f","toSide":"top"},
		{"id":"9de982857db740bb","fromNode":"ff916cc57e332c7f","fromSide":"bottom","toNode":"13ce34d4d27e6b49","toSide":"top"},
		{"id":"5622fee0475c3e9d","fromNode":"2d4e4aa3b848a4c5","fromSide":"left","toNode":"5de28f8fa8ef2405","toSide":"right"},
		{"id":"958660443410b4d1","fromNode":"5de28f8fa8ef2405","fromSide":"top","toNode":"fbb1160be4f269a2","toSide":"bottom"},
		{"id":"28603fac45f490f7","fromNode":"ff916cc57e332c7f","fromSide":"left","toNode":"2b7a8fd1c0cb34fa","toSide":"right"},
		{"id":"61d398b037355480","fromNode":"fbf9f3f247c02df2","fromSide":"bottom","toNode":"df12c19288f30521","toSide":"top"},
		{"id":"6e088576d2dd6255","fromNode":"fbf9f3f247c02df2","fromSide":"bottom","toNode":"d8265769904608ac","toSide":"top"},
		{"id":"14f8a29a8e8a1c54","fromNode":"fbf9f3f247c02df2","fromSide":"left","toNode":"6a0704b5c66e223f","toSide":"right"},
		{"id":"3a2d7b3e41561cc9","fromNode":"fbf9f3f247c02df2","fromSide":"bottom","toNode":"8b1640a5516d3759","toSide":"top"},
		{"id":"507bbb7eb77b91b2","fromNode":"df12c19288f30521","fromSide":"left","toNode":"86ce7860d2619526","toSide":"right"},
		{"id":"6f4332036f3178b4","fromNode":"9dd67eb31b8ae32f","fromSide":"bottom","toNode":"459d47079d9191d6","toSide":"top"},
		{"id":"5b43cae7609fc2a3","fromNode":"459d47079d9191d6","fromSide":"right","toNode":"66fd1f8c0e2af2f3","toSide":"left"},
		{"id":"a86213b27a5e8e04","fromNode":"9dd67eb31b8ae32f","fromSide":"bottom","toNode":"8fb5b2e3921e0976","toSide":"top"},
		{"id":"70a198b4fca00aae","fromNode":"8fb5b2e3921e0976","fromSide":"left","toNode":"fb94604147d2460b","toSide":"right"},
		{"id":"7ecaeff002e17237","fromNode":"fb94604147d2460b","fromSide":"left","toNode":"1e9b1208243a7a98","toSide":"right"},
		{"id":"5422f83a57d507bf","fromNode":"1e9b1208243a7a98","fromSide":"left","toNode":"85d82c08db0ba93b","toSide":"right"},
		{"id":"6eca3cdc7fd77d03","fromNode":"01ff75e28d9cdefd","fromSide":"bottom","toNode":"9a571e05bdf7ed58","toSide":"top"},
		{"id":"37e8e262b3756f53","fromNode":"01ff75e28d9cdefd","fromSide":"bottom","toNode":"47ced4cca55e1cd8","toSide":"top"}
	]
}