Analyzing Credit Card Fraud with Python: You Can't Fool a Programmer

Published: 2019-10-13 20:29:54 Category: Tutorials Source: 一枚程序媛呀

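Overview: Preface: This article looks at model selection on a large dataset (284,807 records). I consulted a number of references, but most were not clear enough, so I put a great deal of effort into writing this up in the hope that it helps. The article takes a data analyst's point of view: given the data, how do we look for patterns, and which model should we choose to build an anti-fraud model? The approach is business-oriented.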

6.2 Random Forest Model

from sklearn import metrics  # used by the evaluation calls below
from sklearn.ensemble import RandomForestClassifier

rfmodel = RandomForestClassifier()
rfmodel.fit(x_train, y_train)
# Inspect the model
print('rfmodel')
rfmodel

Output:
rfmodel
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
# Confusion matrix (rows: actual class, columns: predicted class)
ypred_rf = rfmodel.predict(x_test)
print('confusion_matrix')
print(metrics.confusion_matrix(y_test, ypred_rf))

Output:
confusion_matrix
[[85291     4]
 [   34   114]]
# Classification report
print('classification_report')
print(metrics.classification_report(y_test, ypred_rf))

Output:
classification_report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     85295
          1       0.97      0.77      0.86       148

avg / total       1.00      1.00      1.00     85443
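# Accuracy and area under the ROC curve (AUC)
print('Accuracy:%f' % (metrics.accuracy_score(y_test, ypred_rf)))
print('Area under the curve:%f' % (metrics.roc_auc_score(y_test, ypred_rf)))

Output:
Accuracy:0.999625
Area under the curve:0.902009

Note: random_state is not set, so results vary from run to run. These two figures do not follow from the confusion matrix shown above (which implies an accuracy of 0.999555), so they appear to come from a separate run.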
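
The AUC above is computed from hard 0/1 predictions (the output of `rfmodel.predict`), which gives the ROC curve only a single operating point. AUC is more commonly computed from continuous scores, which in scikit-learn would come from `rfmodel.predict_proba(x_test)[:, 1]`. The following is a minimal pure-Python sketch of the difference. The eight transactions, their scores, and the helper name `pairwise_auc` are made up for illustration and are not taken from the article's dataset.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair, which matches the trapezoidal ROC area.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Made-up example: 6 legitimate (0) and 2 fraudulent (1) transactions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
# Hypothetical fraud probabilities, as a model's predict_proba might return.
scores = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.40, 0.90]
# Hard labels at a 0.5 cut-off, roughly what predict() would return.
hard_labels = [1 if s >= 0.5 else 0 for s in scores]

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, hard_labels)
print('AUC from scores: %f' % auc_from_scores)
print('AUC from hard labels: %f' % auc_from_labels)
```

On this toy data the two values differ (10/12 versus 8/12), because thresholding throws away the ranking information inside each predicted class.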

6.3 Support Vector Machine (SVM)

# SVM classification
from sklearn.svm import SVC

svcmodel = SVC(kernel='sigmoid')
svcmodel.fit(x_train, y_train)
# Inspect the model
print('svcmodel')
svcmodel

Output:
svcmodel
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
# Confusion matrix (rows: actual class, columns: predicted class)
ypred_svc = svcmodel.predict(x_test)
print('confusion_matrix')
print(metrics.confusion_matrix(y_test, ypred_svc))

Output:
confusion_matrix
[[85197    98]
 [  142     6]]
# Classification report
print('classification_report')
print(metrics.classification_report(y_test, ypred_svc))

Output:
classification_report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     85295
          1       0.06      0.04      0.05       148

avg / total       1.00      1.00      1.00     85443
# Accuracy and area under the ROC curve (AUC)
print('Accuracy:%f' % (metrics.accuracy_score(y_test, ypred_svc)))
print('Area under the curve:%f' % (metrics.roc_auc_score(y_test, ypred_svc)))

Output:
Accuracy:0.997191
Area under the curve:0.519696
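
The SVM figures above all follow directly from the four counts in the confusion matrix. As a sanity check, the sketch below recomputes them in plain Python. The variable names are illustrative; scikit-learn lays out the binary matrix as [[TN, FP], [FN, TP]]. It also shows that `roc_auc_score` applied to hard 0/1 predictions reduces to the mean of recall and specificity.

```python
# Counts copied from the SVM confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp = 85197, 98
fn, tp = 142, 6

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # share of flagged transactions that really are fraud
recall = tp / (tp + fn)         # share of fraud cases that were caught
specificity = tn / (tn + fp)    # share of legitimate transactions left alone
f1 = 2 * precision * recall / (precision + recall)
auc_hard = (recall + specificity) / 2  # ROC area when only hard labels are available

print('Accuracy:%f' % accuracy)
print('Precision:%.2f Recall:%.2f F1:%.2f' % (precision, recall, f1))
print('Area under the curve:%f' % auc_hard)
```

The recomputed values match the reported ones (accuracy 0.997191, AUC 0.519696). Only 6 of the 148 fraud cases are caught, which the 99.7% accuracy completely hides.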

7. Summary

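Judging by the performance of the three models, random forest has the lowest false-alarm rate (legitimate transactions wrongly flagged as fraud).
Do not look only at accuracy. High accuracy does not necessarily mean a good model, especially on severely imbalanced data like this project's. For example, suppose we have a dataset of 1000 patients, 990 healthy and 10 with cancer, and we want a model that finds the 10 cancer patients. If a model correctly predicts all 990 healthy people but finds none of the 10 patients, its accuracy is still 99%, yet the model is useless because it fails at the actual goal of finding the patients.
When modeling an extremely imbalanced dataset like this one, undersampling, oversampling, or similar techniques should be used to balance the data so that the predictions are meaningful. The next article will improve on this point.
Models and algorithms are not inherently superior or inferior; they simply perform differently in different situations, and this should be kept in perspective.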
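
The second and third points can be made concrete with a small pure-Python sketch. The first part reproduces the 1000-patient example; the second shows random undersampling of the majority class, one of the resampling approaches mentioned above. The helper `undersample_indices` and the fixed seed are illustrative assumptions, not the author's implementation, and in practice resampling would normally be applied to the training split only.

```python
import random

# Part 1: the 1000-patient example (0 = healthy, 1 = cancer).
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # a "model" that predicts healthy for everyone

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = found / sum(y_true)
print('Accuracy:%f Recall:%f' % (accuracy, recall))


# Part 2: random undersampling (hypothetical helper).
def undersample_indices(labels, seed=0):
    """Return sorted row indices with every minority (1) row kept and an
    equally sized random subset of the majority (0) rows."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    kept_majority = rng.sample(majority, len(minority))
    return sorted(minority + kept_majority)


balanced_idx = undersample_indices(y_true)
balanced_labels = [y_true[i] for i in balanced_idx]
print('Rows kept:%d Positives:%d' % (len(balanced_labels), sum(balanced_labels)))
```

The all-healthy model scores 99% accuracy with a recall of zero, and undersampling leaves 20 rows split evenly between the two classes. The cost of undersampling is that most majority-class rows are discarded, which is one reason oversampling is sometimes preferred.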

(Editor: 威海站长网)
