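This tutorial is for readers who already know the basics of pandas. Many points are mentioned only in passing, without detailed explanation.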
DataFrame and Series
Each column of a DataFrame is a Series.
Basic attributes and methods: shape, index, columns, values, dtypes, describe(), head(), tail()
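As a quick illustration of the points above, here is a minimal sketch. The table and its column names are made up for the example and are not from the original notes.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [1.5, 2.0, 3.5]})

col = df["score"]                     # selecting one column returns a Series
is_series = isinstance(col, pd.Series)

n_rows, n_cols = df.shape             # (number of rows, number of columns)
column_names = list(df.columns)
index_labels = list(df.index)         # default integer index: 0, 1, 2
first_two = df.head(2)                # first two rows
last_one = df.tail(1)                 # last row
summary = df.describe()               # summary statistics of the numeric columns
```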
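Statistics and missing values
Series.count() returns the number of non-null values; Series.value_counts() returns how many times each distinct value occurs.
df.isnull() returns True wherever df holds a missing value.
df.notnull() returns True wherever df holds a non-missing value.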
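A small sketch of these four calls, assuming a toy Series and DataFrame invented for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing entries.
s = pd.Series(["x", "y", "x", np.nan])
non_null_total = s.count()        # missing values are not counted
per_value = s.value_counts()      # occurrences of each distinct value (NaN skipped by default)

df = pd.DataFrame({"A": [1.0, np.nan], "B": ["u", None]})
null_mask = df.isnull()           # True where a value is missing
not_null_mask = df.notnull()      # the exact opposite mask
```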
Renaming columns
    df.rename(columns={'key': 'key2'}, inplace=True)
Changing data types: astype()
See also: isin(), unique(), value_counts()
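The renaming and type-conversion calls above might be combined like this. It is only a sketch: the table and the `val` column are assumptions made for the example.

```python
import pandas as pd

# Hypothetical table whose 'key' column holds numbers stored as strings.
df = pd.DataFrame({"key": ["1", "2", "2"], "val": [10, 20, 30]})

renamed = df.rename(columns={"key": "key2"})    # without inplace=True a new DataFrame is returned
renamed["key2"] = renamed["key2"].astype(int)   # convert the strings to integers

distinct = renamed["key2"].unique()             # array of distinct values
mask = renamed["val"].isin([10, 30])            # element-wise membership test
```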
Data cleaning
Dropping values: drop()
    df.drop(labels, axis=1)  # axis=1 drops the columns with the given labels; the default (axis=0) drops rows
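A minimal sketch of both directions of drop(), using a made-up three-column table:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

without_b = df.drop("b", axis=1)   # drop a column by label
without_rows = df.drop([0, 2])     # default axis=0: drop rows by index label
```

Both calls return new objects, so `df` itself is left unchanged.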
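Dropping missing values: dropna()
    df.dropna()            # drop rows that contain any missing value
    df.dropna(axis=1)      # drop columns that contain any missing value
    df.dropna(how='all')   # drop only rows in which every value is missing
    df.dropna(thresh=3)    # keep only rows with at least 3 non-missing values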
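To see how the options differ, the sketch below runs them on one small table. The data is invented so that each option keeps a different set of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 0..4 hold 3, 2, 0, 2 and 1 non-missing values.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0, 5.0],
    "b": [1.0, 2.0, np.nan, 4.0, np.nan],
    "c": [1.0, 2.0, np.nan, np.nan, np.nan],
})

complete_rows = df.dropna()            # only fully populated rows survive
complete_cols = df.dropna(axis=1)      # every column has a NaN here, so none survive
not_all_nan = df.dropna(how="all")     # only the all-NaN row is removed
at_least_two = df.dropna(thresh=2)     # rows need at least 2 non-missing values
```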
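Filling missing values: fillna()
    df.fillna(0)                # fill every missing value with 0
    df.fillna({1: 0, 2: 0.5})   # per-column fill values, keyed by column label
    df.ffill()                  # forward fill; fillna(method='ffill') is deprecated in recent pandas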
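A short sketch of the three filling styles. The column labels 1 and 2 mirror the dictionary keys used above; the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the columns are labelled with the integers 1 and 2.
df = pd.DataFrame({1: [1.0, np.nan, 3.0], 2: [np.nan, 5.0, np.nan]})

zeros = df.fillna(0)                    # one fill value everywhere
per_column = df.fillna({1: 0, 2: 0.5})  # a different fill value per column
forward = df.ffill()                    # carry the last seen value forward
```

Note that forward filling cannot fill a gap at the very top of a column, because there is no earlier value to carry forward.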
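Replacing values: replace()
    df['A'].replace(-999, np.nan)            # replace a single value
    obj.replace([-999, 1000], np.nan)        # replace several values with one value
    obj.replace([-999, 1000], [np.nan, 0])   # replace pairwise: -999 -> NaN, 1000 -> 0
    obj.replace({-999: np.nan, 1000: 0})     # the same mapping written as a dict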
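The four forms of replace() can be compared on one Series. This is a sketch with invented sentinel values.

```python
import numpy as np
import pandas as pd

# Hypothetical Series in which -999 and 1000 act as sentinel values.
obj = pd.Series([1.0, -999.0, 2.0, 1000.0])

one_value = obj.replace(-999, np.nan)               # single value
many_to_one = obj.replace([-999, 1000], np.nan)     # several values, one replacement
pairwise = obj.replace([-999, 1000], [np.nan, 0])   # list to list, position by position
by_mapping = obj.replace({-999: np.nan, 1000: 0})   # dict form of the pairwise call
```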
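Handling duplicates: duplicated(), unique(), drop_duplicates()
    df.duplicated()              # True for each row that repeats an earlier row
    df.duplicated('key')         # judge duplicates by the 'key' column only
    df['A'].unique()             # distinct values of column A
    df.drop_duplicates(['k1'])   # keep the first row for each value of k1
    df.drop_duplicates(['k1', 'k2'], keep='last')  # keep the last row per (k1, k2); keep='last' replaces the removed take_last=True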
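A sketch of these calls on a small made-up table, so the effect of each is easy to check by eye:

```python
import pandas as pd

# Hypothetical data: row 1 is an exact copy of row 0.
df = pd.DataFrame({
    "k1": ["one", "one", "two", "two"],
    "k2": [1, 1, 2, 3],
    "A": [5, 5, 6, 7],
})

row_dupes = df.duplicated()        # compares whole rows
key_dupes = df.duplicated("k1")    # compares the k1 column only
distinct_a = df["A"].unique()
first_per_k1 = df.drop_duplicates(["k1"])
last_per_pair = df.drop_duplicates(["k1", "k2"], keep="last")
```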
Sorting
Sorting by index
    df.sort_index()                          # sort rows by index label
    df.sort_index(axis=1, ascending=False)   # sort columns by name, descending
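Sorting by value
    s = pd.Series([4, 6, np.nan, 2, np.nan])
    s.sort_values()                 # Series.order() was removed from pandas; use sort_values()
    df.sort_values(by=['a', 'b'])   # sort by column a, then by column b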
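Both kinds of sorting can be sketched as follows. The DataFrame and its index labels are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with an unsorted string index.
df = pd.DataFrame({"a": [3, 3, 1], "b": [2, 1, 2]}, index=["z", "x", "y"])

by_index = df.sort_index()                              # rows ordered by index label
by_columns_desc = df.sort_index(axis=1, ascending=False)  # columns in reverse name order
by_values = df.sort_values(by=["a", "b"])               # a first, b breaks ties

s = pd.Series([4, 6, np.nan, 2, np.nan])
sorted_s = s.sort_values()                              # missing values go last by default
```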
Ranking
    a = pd.Series([7, -5, 7, 4, 2, 0, 4])
    a.rank()   # ties receive the average of their ranks by default
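The sketch below ranks the same Series and also shows two other tie-breaking methods. The `method` argument is not mentioned in the notes above and is included only for comparison.

```python
import pandas as pd

a = pd.Series([7, -5, 7, 4, 2, 0, 4])

average_rank = a.rank()               # default: tied values share the mean of their ranks
first_rank = a.rank(method="first")   # ties broken by order of appearance
min_rank = a.rank(method="min")       # tied values all get the lowest rank of the group
```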
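Index operations
Re-indexing: reindex()
Conforms the index or the columns to a new set of labels. By default it acts on the index and returns a new DataFrame.
    df2 = df1.reindex(['a', 'b', 'c', 'd', 'e'])                # labels missing from df1 are filled with NaN
    df2 = df1.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)  # fill them with 0 instead
    df1 = df1.reindex(['a', 'b', 'c', 'd', 'e'])                # reindex() has no inplace option; reassign the result
    states = ['Texas', 'Utah', 'California']
    df2 = df1.reindex(columns=states)                           # re-index the columns
set_index()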
Turns one or more columns of the DataFrame into the index. Passing several columns is the way to build a hierarchical index.
    adult.set_index(['race', 'sex'], inplace=True)
reset_index()
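The inverse of the hierarchical index built with set_index(): it removes the hierarchical index, turns the index levels back into columns, and restores the plain integer index.
    df.reset_index()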
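The three index operations above can be sketched together. The contents of `df1` and the small `adult` table are stand-ins invented for the example; only the column names `race` and `sex` come from the notes.

```python
import pandas as pd

# Hypothetical data: df1 has index labels a, c, d only.
df1 = pd.DataFrame({"Texas": [1, 2, 3], "Ohio": [4, 5, 6]}, index=["a", "c", "d"])

df2 = df1.reindex(["a", "b", "c", "d", "e"])                # new labels b and e get NaN
df3 = df1.reindex(["a", "b", "c", "d", "e"], fill_value=0)  # new labels get 0 instead
states = ["Texas", "Utah", "California"]
df4 = df1.reindex(columns=states)                           # unknown columns get NaN

# Hypothetical stand-in for the `adult` table.
adult = pd.DataFrame({
    "race": ["A", "A", "B"],
    "sex": ["F", "M", "F"],
    "age": [30, 40, 50],
})
indexed = adult.set_index(["race", "sex"])   # two columns become a two-level index
restored = indexed.reset_index()             # levels become columns again
```

Here set_index() is called without `inplace=True`, so `adult` keeps its original shape and the round trip can be compared against it.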
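Selecting data
Slicing with [] works on rows only (row/index). The range is half-open: the start is included, the end is excluded.
    df[0:3], df[:4], df[4:]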
where: boolean lookup
isin
    s.isin([1, 2, 3])                  # True for each element of s found in the list
    df['A'].isin([1, 2, 3])
    df.loc[df['A'].isin([5.8, 5.1])]   # rows whose A value is 5.8 or 5.1
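A minimal sketch of row slicing, boolean selection, where() and isin() side by side, on made-up data:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"A": [5.1, 4.9, 5.8, 6.0], "B": [1, 2, 3, 4]})

first_three = df[0:3]                        # rows at positions 0, 1, 2 (end excluded)
filtered = df[df["A"] > 5.0]                 # a boolean mask keeps only matching rows
masked = df["A"].where(df["A"] > 5.0)        # same length; non-matching entries become NaN
picked = df.loc[df["A"].isin([5.8, 5.1])]    # rows whose A value is in the list
```

The difference between the mask and where() is the shape of the result: the mask drops rows, while where() keeps every position and blanks out the ones that fail the condition.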
query: combines several conditions into one selection; & means "and", | means "or".
    df.query("A > 5.0 & (B > 3.5 | C < 1.0)")
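For comparison, the sketch below evaluates the same condition once with query() and once with an explicit boolean mask. The data is invented for the example.

```python
import pandas as pd

# Hypothetical data: only rows 0 and 2 satisfy the condition.
df = pd.DataFrame({
    "A": [5.5, 4.0, 6.0, 5.2],
    "B": [4.0, 4.0, 3.0, 3.0],
    "C": [2.0, 0.5, 0.5, 2.0],
})

via_query = df.query("A > 5.0 & (B > 3.5 | C < 1.0)")
# The equivalent mask needs parentheses around every comparison.
via_mask = df[(df["A"] > 5.0) & ((df["B"] > 3.5) | (df["C"] < 1.0))]
```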
loc: selects by label
    df.loc[1:4, ['petal_length', 'petal_width']]   # label slice: both ends are included
    df.loc[df['sepal_length'] > 6, 'test'] = 1     # conditional assignment to a column
    df.loc[df['sepal_length'] <= 6, 'test'] = 0
    df['test2'] = 0
    df.loc[(df['petal_length'] > 2) & (df['petal_width'] > 0.3), 'test2'] = 1
    df.loc[(df['sepal_length'] > 6) & (df['sepal_width'] > 3), 'test2'] = 2
iloc: selects by position
    df.iloc[1:4, :]                                # position slice: the end is excluded
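The different slice semantics of loc and iloc are easy to miss, so here is a minimal sketch. The rows are made up; only the column names follow the notes.

```python
import pandas as pd

# Hypothetical iris-like rows with the default integer index 0..4.
df = pd.DataFrame({
    "sepal_length": [5.1, 6.4, 7.0, 5.8, 6.1],
    "petal_length": [1.4, 4.5, 4.7, 1.2, 4.0],
    "petal_width": [0.2, 1.5, 1.4, 0.2, 1.3],
})

by_label = df.loc[1:4, ["petal_length", "petal_width"]]  # labels 1 to 4, end included
by_position = df.iloc[1:4, :]                            # positions 1 to 3, end excluded

# Conditional assignment: the new 'test' column is created on first use.
df.loc[df["sepal_length"] > 6, "test"] = 1
df.loc[df["sepal_length"] <= 6, "test"] = 0
```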
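ix: mixed selection by label and by position. It was inefficient and best avoided, and it has since been removed from pandas (as of version 1.0); use loc or iloc instead.
    df1.ix[0:3, ['sepal_length', 'petal_width']]   # runs only on old pandas versions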
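On current pandas, one possible way to mix positions and labels is to translate one into the other first. This is a sketch of two such translations, not the only approach; the data and the string index are assumptions made for the example.

```python
import pandas as pd

# Hypothetical data with a string index, so labels and positions differ.
df1 = pd.DataFrame(
    {
        "sepal_length": [5.1, 6.4, 7.0, 5.8],
        "sepal_width": [3.5, 3.2, 3.2, 2.7],
        "petal_width": [0.2, 1.5, 1.4, 1.2],
    },
    index=["r0", "r1", "r2", "r3"],
)
cols = ["sepal_length", "petal_width"]

# Rows by position, columns by label: turn the positions into labels, then use loc.
via_loc = df1.loc[df1.index[0:3], cols]

# Or turn the column labels into positions, then use iloc.
via_iloc = df1.iloc[0:3, df1.columns.get_indexer(cols)]
```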
map and lambda
    alist = [1, 2, 3, 4]
    map(lambda s: s + 1, alist)
    df['sepal_length'].map(lambda s: s * 2 + 1)[0:3]
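A runnable sketch of both uses of map. In Python 3 the built-in map returns a lazy iterator, so it is wrapped in list() here; the column values are made up.

```python
import pandas as pd

alist = [1, 2, 3, 4]
plus_one = list(map(lambda s: s + 1, alist))   # built-in map, materialised as a list

# Hypothetical column values for illustration.
df = pd.DataFrame({"sepal_length": [5.0, 4.5, 4.0, 3.5]})
transformed = df["sepal_length"].map(lambda s: s * 2 + 1)[0:3]   # Series.map, first 3 results
```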
apply and applymap
apply and applymap both operate on a DataFrame: the former works on one row or one column at a time, the latter on every element.
These are techniques for applying a function to an element, a column, or a whole DataFrame.
Map: iterates over each element of a Series.
    df['column1'].map(lambda x: 10 + x)     # adds 10 to each element of column1
    df['column2'].map(lambda x: 'AV' + x)   # prepends 'AV' to each element of column2 (a string column)
Apply: as the name suggests, applies a function along an axis of the DataFrame.
    df[['column1', 'column2']].apply(sum)   # returns one sum per column
    df0[['data1']].apply(lambda s: s + 1)
ApplyMap: applies a function to every element of the DataFrame.
    func = lambda x: x + 2
    df.applymap(func)   # adds 2 to every element (all columns must be numeric); named DataFrame.map since pandas 2.1
contains
    df_obj[df_obj['套餐'].str.contains(r'.*?语音CDMA.*')]
    df[df['商品名称'].str.contains('四件套')]
    df[df['商品名称'].str.contains(r'.*四件套.*')]   # matches the same rows as the plain substring above
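The column-wise, row-wise and element-wise variants, plus a str.contains filter, can be sketched as below. The numbers and product names are invented; the element-wise call falls back to applymap only when DataFrame.map is not available, which is an assumption about older pandas installations.

```python
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({"column1": [1, 2, 3], "column2": [10, 20, 30]})

col_sums = df.apply(sum)                              # function receives one column at a time
row_sums = df.apply(lambda row: row.sum(), axis=1)    # axis=1: one row at a time

func = lambda x: x + 2
# Element-wise: DataFrame.map in pandas >= 2.1, applymap in older versions.
elementwise = df.map(func) if hasattr(df, "map") else df.applymap(func)

# Hypothetical product names; '商品名称' means "product name".
goods = pd.DataFrame({"商品名称": ["纯棉四件套", "枕头", "四件套加厚"]})
matches = goods[goods["商品名称"].str.contains("四件套")]   # rows whose name contains the substring
```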