Next, let's look at how each individual variable relates to Class. To make this easier to see, we plot the variables directly:
- # Histogram matrix showing the distribution of every variable
- crecreditcard_data.hist(figsize=(15,15),bins=50)
- plt.show()
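The histograms show each variable's overall distribution but do not by themselves measure its relationship to Class. As a rough complement, the sketch below ranks features by their Pearson correlation with the binary label and prints per-class means. It is only an illustration: `demo` and its columns `V1` and `V2` are hypothetical synthetic stand-ins for crecreditcard_data, not the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for crecreditcard_data (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 1000
labels = (rng.random(n) < 0.05).astype(int)    # roughly 5% "fraud" rows
demo = pd.DataFrame({
    "V1": rng.normal(size=n) - 2.0 * labels,   # shifted downward for the fraud class
    "V2": rng.normal(size=n),                  # generated independently of the label
    "Class": labels,
})

# Pearson correlation of every feature with the binary label.
corr_with_class = demo.corr()["Class"].drop("Class")

# Rank features by the strength (absolute value) of that correlation.
ranked = corr_with_class.abs().sort_values(ascending=False)
print(ranked)

# Per-class means give a quick view of how each feature shifts between groups.
class_means = demo.groupby("Class").mean()
print(class_means)
```

On the real data the same two lines, `crecreditcard_data.corr()["Class"]` and `crecreditcard_data.groupby("Class").mean()`, should apply unchanged, assuming every column is numeric.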
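6. Modeling and analysis with three methods
This part builds and analyzes models with three methods: logistic regression, random forest, and support vector machine (SVM). Each is covered in turn below.
Prepare the data: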
- # Split the data into a fraud group and a normal group, then sample each proportionally to build the training and test sets
- # Group by class
- Fraud=crecreditcard_data[crecreditcard_data.Class == 1]
- Normal=crecreditcard_data[crecreditcard_data.Class == 0]
- # Training set: 70% of each group
- x_train=Fraud.sample(frac=0.7)
- x_train=pd.concat([x_train,Normal.sample(frac=0.7)],axis=0)
- # Test set: every row that was not sampled into the training set
- x_test=crecreditcard_data.loc[~crecreditcard_data.index.isin(x_train.index)]
- # Labels
- y_train=x_train.Class
- y_test=x_test.Class
- # Drop the label and time columns from the feature sets
- x_train=x_train.drop(['Class','Time'],axis=1)
- x_test=x_test.drop(['Class','Time'],axis=1)
- # Check the shapes
- print(x_train.shape,y_train.shape,
- '\n',x_test.shape,y_test.shape)
- (199364, 29) (199364,)
- (85443, 29) (85443,)
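The manual sampling above keeps the fraud ratio the same in both sets. A similar split can be obtained with scikit-learn's `train_test_split` and its `stratify` argument, which also accepts a `random_state` for reproducibility. The sketch below is an alternative, not the method used in this article; `demo` and its values are hypothetical synthetic data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for crecreditcard_data (hypothetical values).
rng = np.random.default_rng(42)
n = 2000
demo = pd.DataFrame({
    "Time": np.arange(n),
    "V1": rng.normal(size=n),
    "Amount": rng.exponential(scale=50.0, size=n),
    "Class": (np.arange(n) % 100 == 0).astype(int),   # exactly 1% positives
})

# Drop the label and time columns, as in the article.
features = demo.drop(["Class", "Time"], axis=1)
labels = demo["Class"]

# stratify=labels keeps the class ratio (approximately) equal in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels,
    test_size=0.3,
    stratify=labels,
    random_state=0,
)
print(x_tr.shape, y_tr.shape, '\n', x_te.shape, y_te.shape)
print(y_tr.mean(), y_te.mean())   # fraction of positives in each part
```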
6.1 Logistic regression
- from sklearn import metrics
- import scipy.optimize as op
- from sklearn.linear_model import LogisticRegression
- # sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
- from sklearn.model_selection import KFold,cross_val_score
- from sklearn.metrics import (precision_recall_curve,
- auc,roc_auc_score,
- roc_curve,recall_score,
- classification_report)
- lrmodel = LogisticRegression(penalty='l2')
- lrmodel.fit(x_train, y_train)
- # Inspect the model
- print('lrmodel')
- print(lrmodel)
- lrmodel
- LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
- intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
- penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
- verbose=0, warm_start=False)
- # Inspect the confusion matrix
- ypred_lr=lrmodel.predict(x_test)
- print('confusion_matrix')
- print(metrics.confusion_matrix(y_test,ypred_lr))
- confusion_matrix
- [[85284 11]
- [ 56 92]]
- # Inspect the classification report
- print('classification_report')
- print(metrics.classification_report(y_test,ypred_lr))
- classification_report
-              precision    recall  f1-score   support
-           0       1.00      1.00      1.00     85295
-           1       0.89      0.62      0.73       148
- avg / total       1.00      1.00      1.00     85443
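The fraud-class row of the report can be recomputed by hand from the confusion matrix printed earlier, which makes it clearer what each figure means. The sketch below hard-codes that matrix (rows are true classes, columns are predicted classes, as scikit-learn lays it out) and also compares accuracy with a trivial baseline that labels every transaction as normal.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = np.array([[85284, 11],
               [56, 92]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the transactions flagged as fraud, how many really were
recall = tp / (tp + fn)      # of the real frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

# Accuracy of a trivial model that predicts "normal" for every transaction.
baseline_accuracy = (tn + fp) / cm.sum()

print('precision:%.2f recall:%.2f f1:%.2f' % (precision, recall, f1))
print('accuracy:%f baseline:%f' % (accuracy, baseline_accuracy))
```

The baseline comes out at about 0.9983, so the model's accuracy is only marginally higher than doing nothing. With classes this imbalanced, the fraud-class recall of 0.62 (56 of 148 frauds missed) is the more informative number.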
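- # Check the prediction accuracy and the area under the ROC curve (AUC)
- print('Accuracy:%f'%(metrics.accuracy_score(y_test,ypred_lr)))
- # Note: the AUC here is computed from hard 0/1 predictions, so it equals the mean of the two class recalls
- print('Area under the curve:%f'%(metrics.roc_auc_score(y_test,ypred_lr)))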
- Accuracy:0.999216
- Area under the curve:0.810746
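Because `roc_auc_score` above receives the hard predictions `ypred_lr`, the value 0.810746 is really the mean of the two class recalls, (85284/85295 + 92/148) / 2, also known as balanced accuracy. A threshold-independent AUC needs continuous scores such as `predict_proba`. The sketch below shows the difference on synthetic data; the dataset from `make_classification` and every parameter value are assumptions for illustration, not the credit card data.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (hypothetical stand-in for the credit card set).
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(penalty='l2', max_iter=1000)
model.fit(X_tr, y_tr)

hard_pred = model.predict(X_te)             # 0/1 labels at the default 0.5 threshold
scores = model.predict_proba(X_te)[:, 1]    # estimated probability of class 1

auc_hard = metrics.roc_auc_score(y_te, hard_pred)
balanced = metrics.balanced_accuracy_score(y_te, hard_pred)
auc_scores = metrics.roc_auc_score(y_te, scores)

print('AUC from hard labels:%f' % auc_hard)
print('Balanced accuracy:%f' % balanced)
print('AUC from probabilities:%f' % auc_scores)
```

The first two printed values coincide, while the third summarizes ranking quality across all thresholds. On the article's model the equivalent call would be `metrics.roc_auc_score(y_test, lrmodel.predict_proba(x_test)[:, 1])`.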