DATA ANALYSIS

로그앤 2023. 8. 15. 17:10

STATISTICS

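1. REGRESSION: A method for finding the relationship and trend between a continuous independent variable and a dependent variable --> Least Squares (find the model that minimizes the sum of squared errors)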

X: [1, 2, 3, 4, 5]
Y: [2, 4, 5, 4, 5]

1. FIND DEVIATION OF X and Y
deviation of x = X - x̄
deviation of y = Y - ȳ
deviation of x: [-2, -1, 0, 1, 2]
deviation of y: [-2, 0, 1, 0, 1]
2. CALCULATE PRODUCT OF DEV
deviation of x * deviation of y

product of deviations: [4, 0, 0, 0, 2]
3. CALCULATE SQUARED DEV OF X
deviation of x * deviation of x
squared deviation of x: [4, 1, 0, 1, 4]
4. CALCULATE SLOPE (β₁)
β₁ = Σ(product of deviations) / Σ(squared deviation of x)

Sum of product of deviations =
4 + 0 + 0 + 0 + 2 = 6
Sum of squared deviation of x =
4 + 1 + 0 + 1 + 4 = 10

β₁ = 6 / 10 = 0.6
5. CALCULATE INTERCEPT (β₀)
β₀ = ȳ - β₁ * x̄
β₀ = 4 - 0.6 * 3 = 2.2
6. WRITE REGRESSION EQUATION
Y = β₀ + β₁ * X
Y = 2.2 + 0.6 * X
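
The steps above can be sketched in plain Python. This is only a minimal illustration of the hand calculation, not a library routine; the helper name `fit_line` is made up for this example.

```python
def fit_line(xs, ys):
    """Fit Y = b0 + b1 * X by least squares, following the manual steps."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n

    # Step 1: deviations of X and Y from their means
    dev_x = [x - x_mean for x in xs]
    dev_y = [y - y_mean for y in ys]

    # Step 2: sum of the products of deviations
    sum_products = sum(dx * dy for dx, dy in zip(dev_x, dev_y))

    # Step 3: sum of the squared deviations of X
    sum_squares = sum(dx * dx for dx in dev_x)

    # Step 4: slope, Step 5: intercept
    b1 = sum_products / sum_squares
    b0 = y_mean - b1 * x_mean
    return b0, b1


X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(X, Y)

# Step 6: regression equation
print(f"Y = {b0:.1f} + {b1:.1f} * X")  # Y = 2.2 + 0.6 * X
```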
   

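2. VARIANCE (one variable) & COVARIANCE (two or more variables): variance measures how far a random variable lies from its expected value (mean); covariance measures how two variables vary together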

           VARIANCE (σ²) = Σ (xᵢ - μ)² / (n - 1)

           COV(X, Y) = Σ (xᵢ - μx)(yᵢ - μy) / (n - 1)
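
As a quick check of the two formulas, here is a minimal sketch in plain Python using the (n - 1) denominator. The helper names `sample_cov` and `sample_var` are made up for this example, and the data is the X and Y from the regression example.

```python
def sample_cov(xs, ys):
    """Covariance of two equal-length lists, dividing by (n - 1)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    return sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)


def sample_var(xs):
    """Variance is the covariance of a variable with itself."""
    return sample_cov(xs, xs)


X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

print(sample_var(X))     # 2.5  (10 / 4)
print(sample_cov(X, Y))  # 1.5  (6 / 4)
```

Note that dividing the covariance by the variance of X (1.5 / 2.5) gives the same slope, 0.6, as the regression example.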

1. COVARIANCE MATRIX

Step 1: Start with the Data:

X: [2, 3, 5, 7, 10]
Y: [6, 9, 12, 15, 18]

 

Step 2: Calculate the Means (μx and μy):

μx = (2 + 3 + 5 + 7 + 10) / 5 = 5.4
μy = (6 + 9 + 12 + 15 + 18) / 5 = 12

Step 3: Calculate the Covariance:

Cov(X, Y) 

          = Σ((xi - μx) * (yi - μy)) / (n - 1)
          = (20.4 + 7.2 + 0 + 4.8 + 27.6) / (5 - 1)
          = 60 / 4
          = 15

Step 4: Calculate the Variances, Cov(X, X) and Cov(Y, Y):

Cov(X, X) 

          = Σ((xi - μx)^2) / (n - 1)
          = ((-3.4)^2 + (-2.4)^2 + (-0.4)^2 + 1.6^2 + 4.6^2) / (5 - 1)
          = 41.2 / 4
          = 10.3
Cov(Y, Y) 

          = Σ((yi - μy)^2) / (n - 1)
          = ((-6)^2 + (-3)^2 + 0^2 + 3^2 + 6^2) / (5 - 1)
          = 90 / 4
          = 22.5

Step 5: Assemble the Covariance Matrix:

 

| Cov(X, X)  Cov(X, Y) |
| Cov(Y, X)  Cov(Y, Y)  |          

Covariance matrix:
| 10.3      15   |
| 15        22.5 |
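
Hand arithmetic like this is easy to get wrong, so it can help to recompute the matrix by machine. A minimal sketch with NumPy, assuming it is installed: `np.cov` divides by (n - 1) by default, which matches the formula used here. The variable name `cov_matrix` is just for illustration.

```python
import numpy as np

X = np.array([2, 3, 5, 7, 10])
Y = np.array([6, 9, 12, 15, 18])

# np.cov treats each argument as one variable and returns
# [[Cov(X, X), Cov(X, Y)],
#  [Cov(Y, X), Cov(Y, Y)]]
cov_matrix = np.cov(X, Y)
print(cov_matrix)

# The matrix is symmetric because Cov(X, Y) == Cov(Y, X)
print(np.allclose(cov_matrix, cov_matrix.T))  # True
```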

 

3. Correlation Coefficient: Normalized Covariance 

ρ(X, Y) = Cov(X, Y) / (σX * σY)
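
A minimal sketch of this formula in plain Python, reusing the X and Y from the covariance matrix example. The helper name `pearson_r` is made up for this example, and the (n - 1) denominator is assumed for both the covariance and the standard deviations (it cancels out in the ratio).

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient: covariance divided by the product of the standard deviations."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)
    std_x = math.sqrt(sum((x - x_mean) ** 2 for x in xs) / (n - 1))
    std_y = math.sqrt(sum((y - y_mean) ** 2 for y in ys) / (n - 1))
    return cov_xy / (std_x * std_y)


X = [2, 3, 5, 7, 10]
Y = [6, 9, 12, 15, 18]
r = pearson_r(X, Y)
print(round(r, 3))  # 0.985
```

Because of the normalization, the result always falls between -1 and 1 regardless of the units of X and Y.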

4. Z-score (Normalization): Z = (X - μ) / σ
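
A minimal sketch of the z-score formula using only the standard library. It assumes σ is the population standard deviation (`statistics.pstdev`); swap in `statistics.stdev` if the sample version is wanted. The helper name `z_scores` is made up for this example.

```python
import statistics


def z_scores(values):
    """Return (x - mean) / std for every x in values."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mu) / sigma for v in values]


X = [1, 2, 3, 4, 5]
z = z_scores(X)
print([round(v, 3) for v in z])
```

After the transformation the values have mean 0 and standard deviation 1, so variables measured on different scales become comparable.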

NUMPY

PANDAS

1. Return the number of sales per month

sales['Month'].value_counts()
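
To see what this returns, here is a small sketch on a toy table. The `sales` DataFrame below is made-up sample data, not the dataset used in these notes; only the `Month` column name is taken from the line above.

```python
import pandas as pd

# Hypothetical sample data: one row per sale
sales = pd.DataFrame({
    "Month": ["January", "February", "January", "March", "January", "March"],
})

# value_counts() counts the rows for each distinct value,
# sorted from most frequent to least frequent
month_counts = sales["Month"].value_counts()
print(month_counts)
# January appears 3 times, March 2 times, February once
```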

2. Count the number of sales per state in France

france_states = sales.loc[sales['Country'] == 'France', 'State'].value_counts()

i. FILTER rows where Country == 'France'

ii. SELECT the State column

iii. COUNT the number of rows for each state
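
The three steps can also be written out one at a time, which makes the one-liner easier to read. This is a sketch on made-up sample data; the intermediate names `is_france` and `france_state_column` are hypothetical and only the `Country` and `State` column names come from the code above.

```python
import pandas as pd

# Hypothetical sample data
sales = pd.DataFrame({
    "Country": ["France", "France", "Germany", "France"],
    "State": ["Nord", "Seine (Paris)", "Hessen", "Nord"],
})

# i. FILTER: boolean mask that is True for rows where Country is France
is_france = sales["Country"] == "France"

# ii. SELECT: .loc[row_mask, column_label] keeps matching rows and one column
france_state_column = sales.loc[is_france, "State"]

# iii. COUNT: number of rows for each state
france_states = france_state_column.value_counts()
print(france_states)
```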

 

3. Count the sales to male customers with a revenue of exactly 500

sales.loc[(sales['Customer_Gender'] == 'M') & (sales['Revenue'] == 500)].shape[0]

i. FILTER rows where Customer_Gender == 'M' AND Revenue == 500

ii. C