EDA & Pre-Processing

April 2, 2022 · 2 min read

EDA and Data Preprocessing

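What is EDA?

EDA (Exploratory Data Analysis) is the initial stage of a data analysis, and it is important enough to take up roughly 70 to 80% of the time in the whole analysis process.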
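What it covers
- Pattern discovery: using visualization tools
- Discovering the characteristics of the data
- Hypothesis testing: using statistics, graphics, visual representations, and so on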
Types
- Graphic: inspect the data through figures, charts, and other visual elements
- Non-Graphic: inspect the data through summary statistics
Target
- Univariate
- Multi-Variate: the main purpose is to look at the relationships among several variables
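
As a rough illustration of the non-graphic side, the sketch below looks at one column at a time (univariate) and then at one column in relation to another (multivariate). It assumes pandas; the toy DataFrame and the column names `height`, `weight`, and `group` are made up for this example.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 58.0, 66.0, 80.0],
    "group": ["a", "b", "a", "b"],
})

# Univariate, non-graphic: summary statistics of a single column.
height_summary = df["height"].describe()
group_counts = df["group"].value_counts()

# Multivariate, non-graphic: a statistic of one variable
# computed per level of another variable.
mean_weight_by_group = df.groupby("group")["weight"].mean()

print(height_summary)
print(group_counts)
print(mean_weight_by_group)
```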

Combining the type of EDA with the target gives the categories below.

(Source: https://chartio.com/learn/charts/what-is-a-scatter-plot/)

Multi-Non Graphic
- The main goal is to look at the relationships within the data
- Types: Cross-Tabulation, Cross-Statistic (Correlation, Covariance)
  - For numerical features, a cross-statistics table can be color-coded and visualized like a heatmap
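
A minimal sketch of multivariate non-graphic EDA, assuming pandas. The data and the column names (`gender`, `passed`, `hours`, `score`) are invented for illustration. The cross-tabulation is built from the two categorical columns, and the correlation and covariance matrices from the two numerical ones.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "passed": ["yes", "no", "yes", "yes", "no", "no"],
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "score": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
})

# Cross-tabulation: frequency table of two categorical columns.
table = pd.crosstab(df["gender"], df["passed"])

# Cross-statistics on the numerical columns.
corr = df[["hours", "score"]].corr()  # Pearson correlation matrix
cov = df[["hours", "score"]].cov()    # sample covariance matrix (ddof=1)

print(table)
print(corr)
print(cov)
```

The `corr` matrix is the kind of table that would typically be color-coded as a heatmap; plotting is left out here to keep the sketch free of extra dependencies.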

Missing Data
- isna, isnull, notna, notnull, dropna, fillna
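
The methods above might be used roughly as follows. This is a sketch on a made-up DataFrame (the `age` and `city` columns are hypothetical), not a recommended cleaning recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, 40.0],
    "city": ["Seoul", "Busan", None, "Seoul"],
})

missing_mask = df.isna()              # isnull() is an alias of isna()
present_mask = df.notna()             # notnull() is an alias of notna()
missing_per_column = df.isna().sum()  # number of missing values per column

complete_rows = df.dropna()           # drop rows that contain any missing value
filled = df.fillna({"age": df["age"].mean(), "city": "Unknown"})

print(missing_per_column)
print(complete_rows)
print(filled)
```
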
Data Frame
- index, columns, dtypes, info, select_dtypes, loc, iloc, insert, head, tail, apply, aggregate, drop, rename, replace, nsmallest, nlargest, sort_values, sort_index, value_counts, describe, shape
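
A few of these in action, as a sketch on a hypothetical DataFrame (the `name`, `price`, and `qty` columns are made up):

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "price": [300, 100, 400, 200],
    "qty": [1, 5, 2, 5],
})

n_rows, n_cols = df.shape                                  # (rows, columns)
numeric = df.select_dtypes(include="number")               # numeric columns only
first_two = df.iloc[:2]                                    # position-based selection
expensive = df.loc[df["price"] > 150, ["name", "price"]]   # boolean/label selection
top2 = df.nlargest(2, "price")                             # two highest prices
by_price = df.sort_values("price")                         # ascending sort
qty_counts = df["qty"].value_counts()                      # frequency of each value
renamed = df.rename(columns={"qty": "quantity"})           # rename a column
doubled = df["price"].apply(lambda p: p * 2)               # element-wise function

print(df.describe())
```
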
Visualization
- plot, plot.area, plot.bar, plot.barh, plot.box, plot.density, plot.hexbin, plot.hist, plot.kde, plot.line, plot.pie, plot.scatter

The whole process of shaping data to fit the purpose of the analysis is called data preprocessing.
No matter how good a model is, if the data fed into it is of poor quality, the results will be poor as well (GIGO: Garbage In, Garbage Out).

Data cleaning, Data integration, Data transformation, Data reduction

Data cleaning

This is the process of removing noise or correcting inconsistencies.
- Missing Values
  - Ignore the tuple (drop records that contain missing values)
  - Manual Fill (enter the values by hand)
  - Global Constant (“Unknown”)
  - Imputation (All mean, Class mean, Inference mean, Regression, etc.)
- Noisy Data
  - Binning
  - Regression
  - Outlier analysis
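
Some of the cleaning options listed above can be sketched with pandas as follows. The data and column names (`cls`, `x`) are hypothetical, the 1.5 × IQR rule is just one common convention for flagging outliers, and regression-based imputation or smoothing is left out to keep the example short.

```python
import numpy as np
import pandas as pd

# --- Missing values (hypothetical toy data) ---
df = pd.DataFrame({
    "cls": ["a", "a", "b", "b"],
    "x": [1.0, np.nan, 10.0, np.nan],
})

dropped = df.dropna(subset=["x"])           # ignore the tuple
all_mean = df["x"].fillna(df["x"].mean())   # impute with the overall mean
class_mean = df["x"].fillna(                # impute with the mean of the same class
    df.groupby("cls")["x"].transform("mean")
)

# --- Noisy data: equal-frequency binning, then smoothing by bin means ---
values = pd.Series([4.0, 8.0, 15.0, 21.0, 21.0, 24.0, 25.0, 28.0, 34.0])
bins = pd.qcut(values, q=3, labels=False)   # three bins with equal counts
smoothed = values.groupby(bins).transform("mean")

# --- Outlier analysis with the 1.5 * IQR rule ---
data = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 100.0])
q1, q3 = data.quantile(0.25), data.quantile(0.75)
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(class_mean.tolist())
print(smoothed.tolist())
print(outliers.tolist())
```
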
Integration
- The process of combining data that is split across several pieces into one, so that it is easier to analyze.
- Uses merge
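
A minimal sketch of integration with `pd.merge`, assuming two hypothetical tables, `customers` and `orders`, that share a `customer_id` key:

```python
import pandas as pd

# Hypothetical tables that share the key column "customer_id".
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Kim", "Lee", "Park"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3, 4],
    "amount": [100, 200, 300, 400],
})

# Inner join: keep only keys present in both tables.
inner = pd.merge(customers, orders, on="customer_id", how="inner")

# Left join: keep every customer; customers without orders get NaN in "amount".
left = pd.merge(customers, orders, on="customer_id", how="left")

print(inner)
print(left)
```

Which `how` to use depends on whether unmatched rows should survive; a left join keeps them visible as missing values, which then feed back into the cleaning step.
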
Transformation
- The work of converting the form of the data; it is sometimes also called scaling.
- Uses normalization
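
Two common ways to normalize, written directly in pandas as a sketch. The columns `x` and `y` are made up, and which scaling is appropriate depends on the model and the data.

```python
import pandas as pd

# Hypothetical toy data with columns on very different scales.
df = pd.DataFrame({
    "x": [0.0, 5.0, 10.0],
    "y": [100.0, 200.0, 300.0],
})

# Min-max normalization: rescale every column to the range [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())

# Z-score standardization: mean 0 and (population) standard deviation 1.
z_score = (df - df.mean()) / df.std(ddof=0)

print(min_max)
print(z_score)
```
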
Reduction
- Reducing the data in a meaningful way; it serves a purpose similar to dimension reduction.
- Related to the curse of dimensionality; uses PCA
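
To make the PCA step concrete, here is a sketch that computes principal components with NumPy's SVD on synthetic data. In practice a library implementation such as scikit-learn's `PCA` class would usually be used; plain NumPy is used here only to keep the example self-contained, and the data is randomly generated for illustration.

```python
import numpy as np

# Synthetic data: 100 samples, 3 features, where the third feature is
# almost a linear combination of the first two (so 2 dimensions suffice).
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
third = base[:, 0] + base[:, 1] + 0.01 * rng.normal(size=100)
X = np.column_stack([base[:, 0], base[:, 1], third])

# PCA via SVD of the centered data matrix.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of the total variance carried by each principal component.
explained_variance_ratio = S**2 / np.sum(S**2)

# Project onto the first k principal components.
k = 2
X_reduced = X_centered @ Vt[:k].T

print(explained_variance_ratio)
print(X_reduced.shape)
```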

Source: https://computer-science-student.tistory.com/172

Source: https://min23th.tistory.com/40