Seoul Green Zones (1) - Data Preprocessing

* Tool used: Google Colab

* Data source: Seoul Open Data Plaza (서울시 열린데이터 광장)

https://data.seoul.go.kr/dataList/OA-1321/S/1/datasetView.do

Before we start

This analysis started from a big data analysis contest held by Gwangjin-gu, Seoul.

I wanted to look at the effect of 'greening', which helps reduce midsummer heat and humidity as well as pollution.

Before looking at Gwangjin-gu as a single district, the goal of this analysis is to check how green space is distributed across Seoul as a whole.

 

 

 Contents

0. Loading the file

1. Removing unused columns

2. Type conversion

3. Checking and fixing object (string) variables

 

 

0. Loading the file

It depends on your environment, but I used Colab, which Google provides.

To read a file from Colab, first save it to Google Drive.

# Load the file
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')  # mount Google Drive in Colab (skip if already mounted)

path = r'/content/drive/MyDrive/data'  # folder that contains the file
file = r'/녹지대_데이터/서울시 녹지대 위치정보 (좌표계_ WGS1984).csv'
green_df = pd.read_csv(path + file, encoding='EUC-KR')

 


ํŒŒ์ผ ์ธ์ฝ”๋”ฉ ํ˜•์‹ ์ฐพ๋Š” ๋ฐฉ๋ฒ• 

# The chardet module must be installed first (pip install chardet)
import chardet  # module that guesses the encoding of a byte string

def detect_encoding(file_path):
    # Read the raw bytes and let chardet guess the encoding
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read())
    return result['encoding']

detect_encoding(path + file)  # pass the path of the file to check
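
If chardet is not available, or its guess looks wrong, a rough fallback is to try a short list of likely encodings and keep the first one that decodes without error. The sketch below uses only the standard library. The function name `first_working_encoding` and the candidate list are assumptions for illustration, not part of the original workflow. A clean decode does not prove the encoding is right, so treat the result as a hint.

```python
import tempfile
from pathlib import Path

def first_working_encoding(file_path, candidates=("utf-8", "euc-kr", "cp949")):
    """Return the first candidate encoding that decodes the whole file, else None."""
    raw = Path(file_path).read_bytes()
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes, try the next one
        return enc
    return None

# Small demo on a temporary file written in EUC-KR
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes("서울시 녹지대\n".encode("euc-kr"))
    detected = first_working_encoding(sample)

print(detected)
```

The candidate order matters: cp949 is a superset of EUC-KR, so listing it first would report cp949 for any EUC-KR file.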

 

1. Removing columns

Inspect the table loaded as a DataFrame, then remove the columns that are entirely missing or that will not be used.

# Remove columns (year established, creation date, landscaping amount, photo file name)
green_df = green_df.drop(['녹지대조성년도', '생성일', '조경량', '사진파일명'], axis=1)
# axis=1 because the drop targets are columns
green_df
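
Rather than spotting fully missing columns by eye, they can be listed programmatically. The following is a minimal sketch on a made-up frame. `toy_df` and its column names are hypothetical stand-ins for `green_df`.

```python
import pandas as pd

# Toy frame standing in for green_df; the column names are made up
toy_df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "photo": [None, None, None],   # entirely missing
    "lat": ["37.5", " ", "37.6"],
})

# Columns in which every value is missing
empty_cols = toy_df.columns[toy_df.isna().all()].tolist()
print(empty_cols)

# Share of missing values per column, to help decide what else to drop
print(toy_df.isna().mean())

# drop(columns=...) is equivalent to drop(..., axis=1)
cleaned_df = toy_df.drop(columns=empty_cols)
```

Note that a column holding only blank strings such as ' ' is not counted as missing here, which is the situation the next section runs into.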

 

 

2. Type conversion

Later, I will plot the data on a map using the latitude and longitude coordinates.

Checking info() shows that latitude (위도) and longitude (경도) are of type object, so I convert them to float.

At first I kept getting an error. The message suggested that there was a blank value, so I checked, converted the blanks to missing values, and carried on.

Error message: ValueError: Unable to parse string " "

 

# Replace blank values with NA, then drop the rows with missing coordinates
coord_cols = ['위도', '경도']  # latitude, longitude
green_df[coord_cols] = green_df[coord_cols].replace(r'^\s*$', pd.NA, regex=True)  # blank -> missing, in both columns
green_df = green_df.dropna(subset=coord_cols)
green_df[coord_cols] = green_df[coord_cols].astype(float)
green_df.info()
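
An alternative worth knowing is `pd.to_numeric` with `errors="coerce"`, which turns every unparseable value into NaN in a single step. This is a sketch on a made-up frame, not the method used above. `toy_df`, `lat` and `lon` are hypothetical names.

```python
import pandas as pd

# Toy coordinates with a blank and an empty string mixed in
toy_df = pd.DataFrame({
    "lat": ["37.55", " ", "37.54"],
    "lon": ["127.08", "127.07", ""],
})
coords = ["lat", "lon"]

# Anything that cannot be parsed as a number becomes NaN
toy_df[coords] = toy_df[coords].apply(pd.to_numeric, errors="coerce")

# Keep only the rows where both coordinates survived
toy_df = toy_df.dropna(subset=coords).reset_index(drop=True)
print(toy_df.dtypes)
```

Because coercion silently discards any malformed value, not only blanks, it is worth counting the resulting NaN values before dropping them.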

 

3. Checking and fixing object (string) variables

The example here is the 녹지대분류 (green zone category) variable.

(Perhaps because several people entered the data, there were many typos and many values that meant the same thing under different names.)

# Check the current values
green_df['녹지대분류'].value_counts()

# Fix them
# Blank -> '미분류' (unclassified). Series.replace matches whole values only;
# str.replace(' ', ...) would also rewrite spaces inside other labels.
green_df['녹지대분류']= green_df['녹지대분류'].replace(' ','미분류')
green_df['녹지대분류']= green_df['녹지대분류'].str.replace('도록변녹지','도로변녹지')
green_df['녹지대분류']= green_df['녹지대분류'].str.replace('노로변녹지','도로변녹지')
green_df['녹지대분류']= green_df['녹지대분류'].str.replace('도로변녹지대','도로변녹지')
green_df['녹지대분류']= green_df['녹지대분류'].str.replace('쉼터','휴식공간')

# Pattern-based merges: values that match the pattern are collapsed into one label
green_df['녹지대분류']= green_df['녹지대분류'].str.replace('노변분리.*','노변분리대', regex = True)
green_df['녹지대분류']= green_df['녹지대분류'].str.replace('자투리.*','자투리녹지', regex = True)
green_df['녹지대분류']= green_df['녹지대분류'].str.replace('.*공공.*','공공건물', regex=True)
green_df['녹지대분류']= green_df['녹지대분류'].str.replace('.*하천변.*','하천변', regex=True)
green_df['녹지대분류']= green_df['녹지대분류'].str.replace('.*지하철.*','지하철환기구주변', regex=True)
green_df['녹지대분류']= green_df['녹지대분류'].str.replace('.*건물조.*','건물조경', regex=True)
green_df['녹지대분류']= green_df['녹지대분류'].str.replace('.*문화재.*','문화재주변', regex=True)

# Check the result
green_df['녹지대분류'].value_counts()
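
The repeated replace calls can also be written as one ordered rule table, which may be easier to extend as new variants turn up. The sketch below is one possible layout, not the code used in this post. The labels are taken from the post, while `RULES` and `normalize_category` are hypothetical names, and only a subset of the rules is shown. Unlike `str.replace`, each pattern here has to match the whole value.

```python
import re

# Ordered (pattern, canonical label) pairs; the first full match wins
RULES = [
    (r"\s*", "미분류"),                                # blank -> unclassified
    (r"(?:도록변|노로변|도로변)녹지대?", "도로변녹지"),   # typos and variants of the roadside label
    (r"쉼터", "휴식공간"),
    (r"노변분리.*", "노변분리대"),
    (r"자투리.*", "자투리녹지"),
    (r".*공공.*", "공공건물"),
    (r".*하천변.*", "하천변"),
]
COMPILED = [(re.compile(pattern), label) for pattern, label in RULES]

def normalize_category(value):
    """Map a raw category string to its canonical label; leave other values as-is."""
    if not isinstance(value, str):
        return value  # keep NaN / None untouched
    for pattern, label in COMPILED:
        if pattern.fullmatch(value):
            return label
    return value  # no rule applies

print(normalize_category("도로변녹지대"))
```

On the real data this could be applied with `green_df['녹지대분류'].map(normalize_category)`, followed by `value_counts()` to see which labels still need a rule.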

 
