Machine Learning_3. Data Splitting

2023. 7. 11. 00:24

When doing machine learning, you need to split your data appropriately!

import random

random.seed(42) # setting the seed keeps the random values from changing between runs

inds = random.sample(range(150), int(150*0.7))
inds[:5]

# random.sample(range(150), int(150*0.7)) : randomly picks 70% of the 150 indices (105 of them), without duplicates
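To see what the seed actually buys you, here is a minimal sketch using only the standard library. The names `n_rows`, `n_train`, `first`, and `second` are made up for this illustration; the idea is simply that re-seeding with the same value reproduces the same sample.

```python
import random

n_rows = 150
n_train = int(n_rows * 0.7)  # 70% of 150 rows

# Seed, then draw the training indices
random.seed(42)
first = random.sample(range(n_rows), n_train)

# Re-seed with the same value and draw again
random.seed(42)
second = random.sample(range(n_rows), n_train)

same_after_reseed = first == second            # identical draw after re-seeding
no_duplicates = len(set(first)) == n_train     # sample() never repeats an index

print(n_train, same_after_reseed, no_duplicates)
```

Without the second `random.seed(42)` call, the second draw would continue from the generator's current state and would generally differ from the first.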

Practicing data splitting with iris

import seaborn as sns

iris = sns.load_dataset('iris')
iris.loc[inds, :] # select the rows at the random indices chosen above

train_data = iris.loc[inds, :]
train_data.sort_index() # sort by index in ascending order (returns a new DataFrame)

train_data.sort_index(inplace=True)
train_data[:5]

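# inplace = True : 'in place' means the change is made on the object itself, so the existing DataFrame is overwritten with the sorted result instead of a new DataFrame being returned.

# train_data's index is 0, 1, 7, 8, ...

# train_data's values are [5.1, 3.5, 1.4, 0.2, 'setosa'], [4.9, 3.0, ...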
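The difference between `sort_index()` and `sort_index(inplace=True)` is easy to check on a tiny frame. This is just a sketch with a made-up three-row DataFrame (`df` and the `value` column are not from the iris data).

```python
import pandas as pd

# A small frame whose index is deliberately out of order
df = pd.DataFrame({"value": [30, 10, 20]}, index=[2, 0, 1])

# Without inplace: a sorted copy comes back, df itself is untouched
sorted_copy = df.sort_index()
order_before = list(df.index)

# With inplace=True: df itself is sorted and the call returns None
result = df.sort_index(inplace=True)
order_after = list(df.index)

print(order_before, order_after, result)
```

So `train_data.sort_index()` on its own only displays a sorted view in a notebook, while the `inplace=True` call is the one that actually changes `train_data`.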

test_data = iris.loc[~iris.index.isin(train_data.index)]
test_data[:10]

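# iris.loc : can take boolean values to select a subset of rows → handy when the DataFrame is already built

# ~ : inversion (True → False, False → True)

# isin() : tells you, element by element, whether each value of a DataFrame or Series (here, the index) is among the given values (True/False)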
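The `~` plus `isin()` combination is easier to follow on a handful of rows. A minimal sketch, assuming a made-up five-row frame and a hand-picked `train_idx` list:

```python
import pandas as pd

df = pd.DataFrame({"value": list("abcde")})  # default index 0..4
train_idx = [0, 2, 3]                        # pretend these rows were sampled for training

# Boolean mask: True where the row's index is one of the training indices
in_train = df.index.isin(train_idx)

train = df.loc[in_train]    # rows 0, 2, 3
test = df.loc[~in_train]    # ~ flips the mask, leaving rows 1, 4

print(in_train.tolist())
print(list(train.index), list(test.index))
```

Because the mask is evaluated in the original row order, the rows selected this way come out in the same order as in `df`.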

test_data.sort_index(inplace = True)

train_data = iris.sample(frac=0.7)
test_data = iris.loc[~iris.index.isin(train_data.index)]
train_data.shape, test_data.shape

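# iris.sample(frac=0.7) : randomly samples 70% of the rows (used here as the training set)

# test_data : checks which rows of iris have an index that appears in train_data, then inverts that mask to keep the remaining rows

# train_data has shape (105, 5), and test_data has shape (45, 5)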
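One thing to watch with `sample(frac=0.7)` is that it draws different rows on every run unless you pass `random_state`. Here is a small sketch on a stand-in frame with the same shape as iris (150 rows, 5 columns); the column names `c0`..`c4` and the variable names are invented for the example.

```python
import numpy as np
import pandas as pd

# Stand-in for iris: 150 rows x 5 columns
df = pd.DataFrame(np.arange(750).reshape(150, 5),
                  columns=["c0", "c1", "c2", "c3", "c4"])

# random_state makes the 70% sample reproducible
train = df.sample(frac=0.7, random_state=42)
test = df.loc[~df.index.isin(train.index)]

# Drawing again with the same random_state gives the same rows
train_again = df.sample(frac=0.7, random_state=42)

print(train.shape, test.shape, train.equals(train_again))
```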

# Splitting the dataset (test/train data)

import seaborn as sns

iris = sns.load_dataset('iris')

# separate x (features) and y (target)
x = iris.iloc[:, :-1]
y = iris.iloc[:, -1]

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

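# test_size : the proportion of the data that goes into the test set

# random_state : the seed used when shuffling the data before the split (42 here). Fix this value while tuning hyperparameters so that the dataset split doesn't change on every run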
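To check the `random_state` behaviour yourself, here is a rough sketch. It loads iris from `sklearn.datasets` rather than seaborn (an assumption made so the example runs without a download), and the helper `split` is a made-up name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# iris as NumPy arrays: X has shape (150, 4), y has shape (150,)
X, y = load_iris(return_X_y=True)

def split(seed):
    # Hypothetical helper: same 70/30 split, only the seed varies
    return train_test_split(X, y, test_size=0.3, random_state=seed)

a = split(42)
b = split(42)
c = split(0)

# Element 1 of each result is the test feature matrix
same_seed_same_split = np.array_equal(a[1], b[1])
other_seed_same_split = np.array_equal(a[1], c[1])
shapes = (a[0].shape, a[1].shape)

print(shapes, same_seed_same_split, other_seed_same_split)
```

With the same seed the two test sets match row for row, while a different seed should normally give a different test set.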

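▶ train : used to build the model (8 parts out of 10)

▶ test : used to evaluate performance (2 parts out of 10) -> hold-out (deliberately left unused during training)

▶ Split the train (8) data again into train and validation → build the model on the train data and check it on the validation data. Once you're satisfied, merge the two back together, retrain, and then run the test data through the model to confirm.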
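The train / validation / test workflow above could look roughly like this. It is a sketch under a few assumptions: iris comes from `sklearn.datasets`, `KNeighborsClassifier` is just a placeholder model chosen for illustration, and the 25% validation share of the development data is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
idx = np.arange(len(y))  # row ids, carried along to check the splits

# 1) Hold out 20% as the test set (the "2" part)
X_dev, X_test, y_dev, y_test, idx_dev, idx_test = train_test_split(
    X, y, idx, test_size=0.2, random_state=42)

# 2) Split the remaining "8" part into train and validation
X_train, X_val, y_train, y_val, idx_train, idx_val = train_test_split(
    X_dev, y_dev, idx_dev, test_size=0.25, random_state=42)

# 3) Fit on train, check on validation
model = KNeighborsClassifier(n_neighbors=5)  # placeholder model
model.fit(X_train, y_train)
val_score = model.score(X_val, y_val)

# 4) If satisfied, refit on train + validation, then evaluate once on test
model.fit(X_dev, y_dev)
test_score = model.score(X_test, y_test)

sizes = (len(idx_train), len(idx_val), len(idx_test))
print(sizes, val_score, test_score)
```

The test set is touched only in the last step, which is what keeps it a fair hold-out.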

y_test.value_counts() 
# value_counts : tells you how many rows there are of each class
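If the class counts in the test set come out uneven, `train_test_split` has a `stratify` parameter that keeps the class proportions the same in both parts. The original code does not use it, so treat this as an optional variation; iris is again loaded from `sklearn.datasets` as an assumption.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 50 samples per class

# Plain random split: class counts in the test set can be uneven
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Stratified split: class proportions are preserved
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

plain_counts = pd.Series(y_test_plain).value_counts()
strat_counts = pd.Series(y_test_strat).value_counts()

print(plain_counts.to_dict())
print(strat_counts.to_dict())
```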
