3. Relationship between fraud and the distribution over time
- # Descriptive statistics of Time for each class, to see how each relates to time
- print('Normal')
- print(crecreditcard_data.Time[crecreditcard_data.Class == 0].describe())
- print('-' * 25)
- print('Fraud')
- print(crecreditcard_data.Time[crecreditcard_data.Class == 1].describe())
- Normal
- count 284315.000000
- mean 94838.202258
- std 47484.015786
- min 0.000000
- 25% 54230.000000
- 50% 84711.000000
- 75% 139333.000000
- max 172792.000000
- Name: Time, dtype: float64
- -------------------------
- Fraud
- count 492.000000
- mean 80746.806911
- std 47835.365138
- min 406.000000
- 25% 41241.500000
- 50% 75568.500000
- 75% 128483.000000
- max 170348.000000
- Name: Time, dtype: float64
- # Two stacked histograms of Time sharing one x-axis: fraud on top, normal below
- f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12, 6))
- bins = 50
- ax1.hist(crecreditcard_data.Time[crecreditcard_data.Class == 1], bins=bins)
- ax1.set_title('Fraud', fontsize=22)
- ax1.set_ylabel('Number of transactions', fontsize=15)
- ax2.hist(crecreditcard_data.Time[crecreditcard_data.Class == 0], bins=bins)
- ax2.set_title('Normal', fontsize=22)
- plt.xlabel('Time (seconds)', fontsize=15)
- plt.xticks(fontsize=15)
- plt.ylabel('Number of transactions', fontsize=15)
- # plt.yticks(fontsize=22)
- plt.show()
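Fraud shows no clear relationship with time and no periodicity;
normal transactions are clearly periodic, with a roughly bimodal shape. Time tops out at 172,792 seconds, about 48 hours, so the two peaks are consistent with two daily cycles.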
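The periodicity reading above can be checked by folding Time into hour-of-day slots and counting transactions per slot. The following is a minimal sketch under two assumptions that the original does not state: that Time is the number of seconds elapsed since the first transaction, and that the first transaction falls at hour 0. The function name `transactions_per_hour` and the sample values are illustrative only.

```python
from collections import Counter

SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24


def transactions_per_hour(times_in_seconds):
    """Count transactions in each hour-of-day slot (0-23).

    Assumes each value is seconds elapsed since the first transaction,
    and (hypothetically) that the first transaction falls at hour 0.
    """
    counts = Counter(
        int(t // SECONDS_PER_HOUR) % HOURS_PER_DAY for t in times_in_seconds
    )
    # Return a dense list so that empty hours show up as 0
    return [counts.get(hour, 0) for hour in range(HOURS_PER_DAY)]


# Toy values for illustration, not taken from the dataset
sample_times = [0.0, 406.0, 3600.0, 86400.0, 90000.0, 172792.0]
per_hour = transactions_per_hour(sample_times)
print(per_hour)
```

Because the function only iterates over its argument, a column such as `crecreditcard_data.Time[crecreditcard_data.Class == 1]` could be passed in directly. A flat profile for fraud next to a clear day and night pattern for normal transactions would support the conclusion above.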
4. Relationship between fraud and amount, and the amount distributions
- # Descriptive statistics of Amount for each class
- print('Fraud')
- print(crecreditcard_data.Amount[crecreditcard_data.Class == 1].describe())
- print('-' * 25)
- print('Normal')
- print(crecreditcard_data.Amount[crecreditcard_data.Class == 0].describe())
- Fraud
- count 492.000000
- mean 122.211321
- std 256.683288
- min 0.000000
- 25% 1.000000
- 50% 9.250000
- 75% 105.890000
- max 2125.870000
- Name: Amount, dtype: float64
- -------------------------
- Normal
- count 284315.000000
- mean 88.291022
- std 250.105092
- min 0.000000
- 25% 5.650000
- 50% 22.000000
- 75% 77.050000
- max 25691.160000
- Name: Amount, dtype: float64
- # Two stacked histograms of Amount sharing one x-axis: fraud on top, normal below
- f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12, 6))
- bins = 30
- ax1.hist(crecreditcard_data.Amount[crecreditcard_data.Class == 1], bins=bins)
- ax1.set_title('Fraud', fontsize=22)
- ax1.set_ylabel('Number of transactions', fontsize=15)
- ax2.hist(crecreditcard_data.Amount[crecreditcard_data.Class == 0], bins=bins)
- ax2.set_title('Normal', fontsize=22)
- plt.xlabel('Amount ($)', fontsize=15)
- plt.xticks(fontsize=15)
- plt.ylabel('Number of transactions', fontsize=15)
- # plt.yscale acts on the current axes, so only the lower (Normal) panel is log-scaled
- plt.yscale('log')
- plt.show()
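Amounts are generally low in both classes (medians of 9.25 for fraud and 22.00 for normal), so the Amount column appears to offer little reference value for the analysis.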
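One way to test the "generally low" reading is to ask what share of each class falls at or below some threshold and compare the two shares. The sketch below is only an illustration: the helper name `share_at_or_below`, the threshold of 100, and the sample values are assumptions and do not come from the original analysis.

```python
def share_at_or_below(amounts, threshold):
    """Return the fraction of amounts that are <= threshold."""
    amounts = list(amounts)
    if not amounts:
        raise ValueError("amounts must not be empty")
    return sum(1 for a in amounts if a <= threshold) / len(amounts)


# Toy values for illustration, not taken from the dataset
fraud_amounts = [1.0, 1.0, 9.25, 99.99, 512.0]
normal_amounts = [5.0, 22.0, 77.0, 150.0]

fraud_share = share_at_or_below(fraud_amounts, 100)
normal_share = share_at_or_below(normal_amounts, 100)
print(fraud_share, normal_share)
```

With the real data, the two arguments would be `crecreditcard_data.Amount[crecreditcard_data.Class == 1]` and the matching `Class == 0` selection. Similar shares across several thresholds would back up the view that Amount separates the classes poorly.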
5. Relationship between each independent variable (V1-V28) and the dependent variable
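To see whether each variable is related to the normal/fraud label, we inspect them one by one with distplot, which makes the comparison easier to read: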
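- import warnings
- import matplotlib.gridspec as gridspec
- warnings.filterwarnings('ignore')  # silence warnings, including distplot's deprecation notice
- # The 28 anonymised features: every column except Time, Amount and Class
- v_features = [x for x in crecreditcard_data.columns if x not in ['Time', 'Amount', 'Class']]
- plt.figure(figsize=(12, 28 * 4))
- gs = gridspec.GridSpec(28, 1)
- for i, cn in enumerate(v_features):
-     ax = plt.subplot(gs[i])
-     # sns.distplot has been deprecated since seaborn 0.11
-     sns.distplot(crecreditcard_data[cn][crecreditcard_data.Class == 1], bins=50, color='red')    # fraud
-     sns.distplot(crecreditcard_data[cn][crecreditcard_data.Class == 0], bins=50, color='green')  # normal
-     ax.set_xlabel('')
-     ax.set_title('Histogram: ' + str(cn))
- # Save once, after all 28 panels are drawn (the file name means "each variable vs. class")
- plt.savefig('各个变量与class的关系.png', transparent=False, bbox_inches='tight')
- plt.show()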
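Red denotes fraud; green denotes normal.
- The larger the overlap between the two distributions, the less the variable separates fraud from normal, e.g. V15;
- The smaller the overlap, the greater the variable's influence on the dependent variable, e.g. V14;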
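The overlap described in the two bullets above can be turned into a single number per variable. The sketch below bins both samples on shared edges, normalises each histogram to sum to 1, and adds up the bin-wise minimum. This is a histogram approximation of the overlap, not the area under the smoothed curves that distplot draws, and the function name `histogram_overlap` is hypothetical.

```python
def histogram_overlap(sample_a, sample_b, bins=50):
    """Overlap of two normalised histograms on shared bins, in [0, 1].

    1.0 means the binned distributions coincide; 0.0 means no shared bin.
    """
    sample_a, sample_b = list(sample_a), list(sample_b)
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    lo = min(min(sample_a), min(sample_b))
    hi = max(max(sample_a), max(sample_b))
    if lo == hi:
        return 1.0  # every value is identical
    width = (hi - lo) / bins

    def normalised_counts(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp so that x == hi lands in the last bin
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        return [c / len(sample) for c in counts]

    p = normalised_counts(sample_a)
    q = normalised_counts(sample_b)
    return sum(min(pi, qi) for pi, qi in zip(p, q))


# Toy values for illustration, not taken from the dataset
separated = histogram_overlap([0, 1, 2], [10, 11, 12], bins=4)
identical = histogram_overlap([1, 2, 3, 4], [1, 2, 3, 4], bins=4)
print(separated, identical)
```

Applied to each column in turn, with the `Class == 1` and `Class == 0` selections as the two samples, a value near 1 would correspond to a variable like V15 and a value near 0 to one like V14. The result depends on the bin count, so it is best used for ranking variables rather than as an absolute measure.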
(Editor: 威海站长网)