Data Analysis
STATISTICS
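1. REGRESSION: a method for finding the relationship and trend between a continuous independent variable and a dependent variable --> Least Squares (find the model that minimizes the error)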
X: [1, 2, 3, 4, 5]
Y: [2, 4, 5, 4, 5]
1. FIND DEVIATION OF X and Y
deviation of x = X - x̄
deviation of y = Y - ȳ
x̄ = 3, ȳ = 4
deviation of x: [-2, -1, 0, 1, 2]
deviation of y: [-2, 0, 1, 0, 1]
2. CALCULATE PRODUCT OF DEVIATIONS
(deviation of x) * (deviation of y)
product of deviations: [4, 0, 0, 0, 2]
3. CALCULATE SQUARED DEVIATION OF X
(deviation of x)²
squared deviation of x: [4, 1, 0, 1, 4]
4. CALCULATE SLOPE (β₁)
β₁ = Σ(product of deviations) / Σ(squared deviation of x)
Sum of product of deviations = 4 + 0 + 0 + 0 + 2 = 6
Sum of squared deviation of x = 4 + 1 + 0 + 1 + 4 = 10
β₁ = 6 / 10 = 0.6
5. CALCULATE INTERCEPT (β₀)
β₀ = ȳ - β₁ * x̄
β₀ = 4 - 0.6 * 3 = 2.2
6. WRITE REGRESSION EQUATION
Y = β₀ + β₁ * X
Y = 2.2 + 0.6 * X
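The six steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a definitive implementation; the variable names (dev_x, dev_y, beta0, beta1) are made up for this sketch, and np.polyfit is used only as a cross-check.

```python
import numpy as np

# Data from the worked example
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Step 1: deviations from the means
dev_x = x - x.mean()
dev_y = y - y.mean()

# Steps 2-4: slope = sum of products / sum of squared x deviations
beta1 = (dev_x * dev_y).sum() / (dev_x ** 2).sum()

# Step 5: intercept from the means and the slope
beta0 = y.mean() - beta1 * x.mean()

# Step 6: the fitted line is Y = beta0 + beta1 * X
print(f"Y = {beta0:.1f} + {beta1:.1f} * X")

# Cross-check with NumPy's degree-1 least-squares fit
# (polyfit returns coefficients from highest degree down: slope, then intercept)
slope, intercept = np.polyfit(x, y, 1)
```

np.polyfit should agree with the hand-derived slope and intercept up to floating-point rounding.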
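2. VARIANCE (one variable) & COVARIANCE (multiple variables): variance measures how far a random variable lies from its expected value (mean); covariance measures how two variables vary together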
VARIANCE (σ²) = Σ (xᵢ - μ)² / (n - 1)
COV(X, Y) = Σ (xᵢ - X̄)(yᵢ - Ȳ) / (n - 1)
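As a rough sketch, both formulas can be written directly in NumPy. The example reuses the X and Y values from the regression section as sample data (an assumption made for illustration), and the variable names are hypothetical.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
n = len(x)

# Sample variance: squared deviations from the mean, divided by (n - 1)
var_x = ((x - x.mean()) ** 2).sum() / (n - 1)

# Sample covariance: products of paired deviations, divided by (n - 1)
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)

# NumPy equivalents: ddof=1 selects the (n - 1) divisor,
# and np.cov uses (n - 1) by default
var_x_np = np.var(x, ddof=1)
cov_xy_np = np.cov(x, y)[0, 1]

print(var_x, cov_xy)
```

Note that np.var divides by n unless ddof=1 is passed, so the default would not match the (n - 1) formula used here.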
1. COVARIANCE MATRIX
Step 1: Data:
X: [2, 3, 5, 7, 10]
Y: [6, 9, 12, 15, 18]
Step 2: Calculate the Means (μx and μy):
μx = (2 + 3 + 5 + 7 + 10) / 5 = 5.4
μy = (6 + 9 + 12 + 15 + 18) / 5 = 12
Step 3: Calculate the Covariance:
Cov(X, Y)
= Σ((xi - μx) * (yi - μy)) / (n - 1)
= (20.4 + 7.2 + 0 + 4.8 + 27.6) / (5 - 1)
= 60 / 4
= 15
Cov(X, X)
= Σ((xi - μx)^2) / (n - 1)
= ((-3.4)^2 + (-2.4)^2 + (-0.4)^2 + 1.6^2 + 4.6^2) / (5 - 1)
= 41.2 / 4
= 10.3
Cov(Y, Y)
= Σ((yi - μy)^2) / (n - 1)
= ((-6)^2 + (-3)^2 + 0^2 + 3^2 + 6^2) / (5 - 1)
= 90 / 4
= 22.5
Step 4: Assemble the Covariance Matrix:
| Cov(X, X) Cov(X, Y) |
| Cov(Y, X) Cov(Y, Y) |
Covariance matrix:
| 10.3 15 |
| 15 22.5 |
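One way to check the hand arithmetic is to build the matrix twice: once following the steps above, and once with np.cov. This is only a sketch; the names dx, dy, manual, and cov_matrix are made up for illustration.

```python
import numpy as np

x = np.array([2, 3, 5, 7, 10], dtype=float)
y = np.array([6, 9, 12, 15, 18], dtype=float)
n = len(x)

# Deviations from the means
dx = x - x.mean()
dy = y - y.mean()

# Manual 2x2 covariance matrix with the (n - 1) divisor
manual = np.array([
    [(dx * dx).sum(), (dx * dy).sum()],
    [(dy * dx).sum(), (dy * dy).sum()],
]) / (n - 1)

# np.cov treats each argument as one variable and divides by (n - 1) by default
cov_matrix = np.cov(x, y)

print(manual)
print(cov_matrix)
```

The two printed matrices should match, and both are symmetric because Cov(X, Y) = Cov(Y, X).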
3. Correlation Coefficient: Normalized Covariance
ρ(X, Y) = Cov(X, Y) / (σX * σY)
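A minimal sketch of the correlation formula, assuming the X and Y data from the covariance matrix example and sample standard deviations (ddof=1) so the divisor matches the covariance. np.corrcoef is included as a cross-check.

```python
import numpy as np

x = np.array([2, 3, 5, 7, 10], dtype=float)
y = np.array([6, 9, 12, 15, 18], dtype=float)

# Covariance divided by the product of the two standard deviations.
# ddof=1 keeps the (n - 1) divisor consistent with np.cov.
cov_xy = np.cov(x, y)[0, 1]
rho = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Cross-check with NumPy's correlation matrix
rho_np = np.corrcoef(x, y)[0, 1]

print(rho, rho_np)
```

Because the covariance is divided by both standard deviations, the result is unitless and always falls between -1 and 1.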
4. Z-score (Normalization): Z = (X - μ) / σ
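The z-score formula can be sketched as below. The sample values and the choice of the population standard deviation (NumPy's default, ddof=0) are assumptions for illustration; pass ddof=1 to std if the sample version is wanted.

```python
import numpy as np

x = np.array([2, 3, 5, 7, 10], dtype=float)

# Z = (X - mean) / standard deviation, applied element-wise.
# x.std() uses ddof=0 (population standard deviation) by default.
z = (x - x.mean()) / x.std()

print(z)
print(z.mean(), z.std())
```

After the transformation, the values have mean 0 and standard deviation 1, up to floating-point rounding.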
NUMPY
PANDAS
1. Count the number of sales per month
sales['Month'].value_counts()
2.
france_states = sales.loc[sales['Country'] == 'France', 'State'].value_counts()
i. FILTER Country == France
ii. SELECT STATES COLUMN
iii. COUNT ROWS PER STATE (value_counts)
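To try the first two queries without the real dataset, a small stand-in DataFrame can be used. The rows below are hypothetical; only the column names (Month, Country, State) come from the notes.

```python
import pandas as pd

# Hypothetical stand-in for the `sales` DataFrame (made-up rows)
sales = pd.DataFrame({
    "Month": ["January", "January", "February", "March"],
    "Country": ["France", "France", "Germany", "France"],
    "State": ["Paris", "Nord", "Bayern", "Paris"],
})

# Query 1: number of rows (sales) for each month
sales_per_month = sales["Month"].value_counts()

# Query 2: filter rows to France, select the State column, count rows per state
france_states = sales.loc[sales["Country"] == "France", "State"].value_counts()

print(sales_per_month)
print(france_states)
```

value_counts returns a Series indexed by the distinct values, sorted by count in descending order, so the most frequent month or state appears first.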
3.
sales.loc[(sales['Customer_Gender'] == 'M') & (sales['Revenue'] == 500)].shape[0]
i. FILTER Customer_Gender == M && Revenue == 500
ii. C