Data processing with pandas

2022-09-28


Reading a CSV file into a pandas DataFrame:

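# import pandas as pd
# df=pd.read_csv("/xxx/xx")
# df=pd.read_csv("/xxx/",nrows=15) # read only the first 15 rows
# df=pd.read_csv("/xxx/",skiprows=9,nrows=5) # skip the first 9 lines of the file (the header line counts), then read 5 rows
# df=pd.read_csv("/xxx/.csv", usecols=[0])  # read only the first column
# print(df['0'].values)  # all values of the column labeled '0'; use df.iloc[:, 0] for the first column by position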
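
The read_csv options above can be tried without a file on disk. The following is a minimal sketch, assuming a made-up in-memory CSV (`csv_text`, with hypothetical columns `id` and `score`). It also shows one way to skip data rows while keeping the header, since an integer `skiprows` counts the header line too.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk: a header plus 20 data rows.
csv_text = "id,score\n" + "\n".join(f"{i},{i * 10}" for i in range(20))

# First 15 data rows only.
first_rows = pd.read_csv(io.StringIO(csv_text), nrows=15)

# skiprows=9 would also skip the header line; a range starting at 1 keeps it
# and skips file lines 1..9 (the first nine data rows) instead.
window = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 10), nrows=5)

# First column only, selected by position.
first_col = pd.read_csv(io.StringIO(csv_text), usecols=[0])

print(first_rows.shape)
print(window["id"].tolist())
print(first_col.columns.tolist())
```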

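df.info()   # summary: index, column dtypes, non-null counts, memory usage
df.head(10)  # first ten rows
df.index    # row index
df.columns  # all column names
df.values   # underlying data as a NumPy array
df.describe() # summary statistics (count, mean, std, min, quartiles, max) for every numeric column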

Building a DataFrame from a dict:

data={'country':['aaa','bbb','ccc'],
      'population':[10,12,23]}

df_data=pd.DataFrame(data)  # convert the dict to a DataFrame
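
Once the dict is a DataFrame, each column has its own dtype, and a column can be converted with `astype`. A small sketch using the same data; the float conversion is only an illustration, not something the notes above require.

```python
import pandas as pd

data = {"country": ["aaa", "bbb", "ccc"],
        "population": [10, 12, 23]}

df_data = pd.DataFrame(data)   # dict keys become column names
print(df_data.dtypes)          # one dtype per column

# Convert the integer column to float; astype returns a new Series.
df_data["population"] = df_data["population"].astype("float64")
print(df_data["population"].tolist())
```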

Setting the index:

df=df.set_index('Name')  # use the 'Name' column as the index
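
A short sketch of what set_index does, assuming a made-up two-row table (`people` is a hypothetical name). It returns a new DataFrame and leaves the original unchanged, and `reset_index` turns the index back into a column.

```python
import pandas as pd

# Hypothetical sample data for illustration.
people = pd.DataFrame({"Name": ["Ann", "Bob"], "Age": [31, 45]})

indexed = people.set_index("Name")    # new DataFrame; people itself is unchanged
bob_age = indexed.loc["Bob", "Age"]   # rows can now be looked up by name

restored = indexed.reset_index()      # move the index back into a column
print(bob_age)
print(restored.columns.tolist())
```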

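loc and iloc

loc selects by label (for example, strings).
iloc selects by integer position.

loc:

# df=df.set_index('5')    # make the column named '5' the index; set_index returns a new DataFrame, so assign the result
# df.loc['Iris-virginica']   # all rows whose index label is Iris-virginica (assumes the index holds these string labels)

df.loc[df['Sex']=='male','Age'] # Age values for rows where Sex is male
df.loc[df['Sex']=='male','Age'].mean()  # mean of those Age values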

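Selecting specific rows with iloc:

# print(df.iloc[0])  # first row
# print(df.iloc[:3]) # first three rows
# print(df.iloc[:,0]) # first column
# print(df.iloc[:,:2]) # first two columns
# print(df.iloc[0:5,1:3]) # first 5 rows, columns at positions 1 and 2 (the second and third columns)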
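
One difference between loc and iloc that is easy to miss: a label slice with loc includes its end label, while a position slice with iloc excludes its end position. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical frame with string labels as the index.
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]},
                  index=["w", "x", "y", "z"])

by_label = df.loc["w":"y", "A"]   # label slice: "y" is included, so 3 rows
by_pos = df.iloc[0:2, 0]          # position slice: 2 is excluded, so 2 rows

print(by_label.tolist())
print(by_pos.tolist())
```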

Plain indexing also works:

df[3:10]  # rows 4 through 10 (positions 3 to 9)
df["column name"]  # one column
df[['Age','Fare']] # the Age and Fare columns
df[['Age','Fare']][:5] # first 5 rows only

Selecting a specific column:

age=df['Age']  # the column named Age
                # age is a Series
age.index     # index of age
age.values[:5] # first 5 values

Boolean indexing

df['Fare']>40  # boolean Series: True where Fare > 40, False otherwise
df[df['Fare']>5] # all rows where Fare > 5
df[df['Sex']=='male'] # all rows where Sex is male
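
Boolean conditions can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch on made-up passenger data; `passengers` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical passenger table.
passengers = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [22, 38, 71, 26],
    "Fare": [7.25, 71.28, 34.65, 7.92],
})

# Rows that are male AND paid a fare above 5.
mask = (passengers["Sex"] == "male") & (passengers["Fare"] > 5)

selected = passengers[mask]                    # full rows matching the mask
mean_age = passengers.loc[mask, "Age"].mean()  # one column of those rows

print(len(selected))
print(mean_age)
```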

Other operations

(df['Age']>70).sum()   # number of rows where Age > 70
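
Summing a boolean Series works as a count because True counts as 1 and False as 0. By the same logic, the mean gives the share of rows that match. A small sketch with made-up ages:

```python
import pandas as pd

# Hypothetical ages for illustration.
ages = pd.Series([22, 38, 71, 80, 26], name="Age")

over_70 = ages > 70                     # boolean Series
count_over_70 = int(over_70.sum())      # True counts as 1
share_over_70 = float(over_70.mean())   # fraction of True values

print(count_over_70, share_over_70)
```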

groupby operations

Create the data:

df=pd.DataFrame({'key':['A','B','C','A','A','B','C','A','B'],
                'data':[0,2,3,4,5,6,7,8,9]})

Sum of data for the rows where key == 'A':

(df.loc[df['key']=='A','data']).sum()

out:
    17

groupby()

df.groupby('key').sum() # group rows by key and sum each group
Out[59]: 
        data
    key      
    A      17
    B      17
    C      10
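
groupby can also compute several statistics per group in one pass with `agg`. A sketch on the same data; the choice of sum, mean, and count is only an example.

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "A", "B", "C", "A", "B"],
                   "data": [0, 2, 3, 4, 5, 6, 7, 8, 9]})

# One row per key, one column per statistic.
summary = df.groupby("key")["data"].agg(["sum", "mean", "count"])

# The group sum matches the manual boolean-mask sum for the same key.
manual_a = df.loc[df["key"] == "A", "data"].sum()

print(summary)
print(manual_a)
```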

Numeric operations:

df=pd.DataFrame([[1,2,3],[4,5,6]],index=['a','b'],columns=['A','B','C'])
    
    Out[62]: 
       A  B  C
    a  1  2  3
    b  4  5  6

df.sum()

# df.sum() sums each column by default
df.sum()

    Out[63]: 
    A    5
    B    7
    C    9
    dtype: int64

df.sum(axis=0)

# df.sum(axis=0): axis=0 sums down each column, axis=1 sums across each row
# axis=0
    df.sum(axis=0)
    Out[64]: 
    A    5
    B    7
    C    9
    dtype: int64

df.sum(axis=1)

# axis=1
    df.sum(axis=1)
    Out[65]: 
    a     6
    b    15
    dtype: int64

df.sum(axis='columns')

# axis='columns' is the same as axis=1
df.sum(axis='columns')
    Out[66]: 
    a     6
    b    15
    dtype: int64
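
The axis argument names the axis that gets collapsed. A compact sketch on the same 2x3 frame, using the string names `'index'` and `'columns'` next to the numeric forms:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  index=["a", "b"], columns=["A", "B", "C"])

per_column = df.sum(axis="index")    # same as axis=0: collapse rows, one total per column
per_row = df.sum(axis="columns")     # same as axis=1: collapse columns, one total per row

print(per_column.tolist())
print(per_row.tolist())
```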

df.mean(axis=0)

# mean of each column
    Out[67]: 
    A    2.5
    B    3.5
    C    4.5
    dtype: float64

df.mean(axis=1)

# mean of each row
    Out[68]: 
    a    2.0
    b    5.0
    dtype: float64

df.min()

    # minimum of each row
    df.min(axis=1)
    Out: 
    a    1
    b    4
    dtype: int64


    # minimum of each column; same as df.min(axis=0)
    df.min()
    Out[69]: 
    A    1
    B    2
    C    3
    dtype: int64

The commands below use a dataset of this form:

Out[76]: 
         1    2    3    4            5
    0  5.1  3.5  1.4  0.2  Iris-setosa
    1  4.9  3.0  1.4  0.2  Iris-setosa
    2  4.7  3.2  1.3  0.2  Iris-setosa
    3  4.6  3.1  1.5  0.2  Iris-setosa
    4  5.0  3.6  1.4  0.2  Iris-setosa
        ...

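df.cov()

    # covariance matrix (symmetric) of the numeric columns
    # on pandas 2.0 and later, call df.cov(numeric_only=True), because column 5 holds strings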
        1         2         3         4
    1  0.685694 -0.039268  1.273682  0.516904
    2 -0.039268  0.188004 -0.321713 -0.117981
    3  1.273682 -0.321713  3.113179  1.296387
    4  0.516904 -0.117981  1.296387  0.582414

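df.corr()

    # correlation coefficients (Pearson by default)
    # on pandas 2.0 and later, pass numeric_only=True to skip the string column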
          1         2         3         4
    1  1.000000 -0.109369  0.871754  0.817954
    2 -0.109369  1.000000 -0.420516 -0.356544
    3  0.871754 -0.420516  1.000000  0.962757
    4  0.817954 -0.356544  0.962757  1.000000
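
The two matrices are related: each Pearson correlation is the covariance divided by the product of the two columns' standard deviations. A sketch that checks this on made-up numbers (the columns `x`, `y`, `z` are hypothetical, not the Iris data):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data for illustration.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                   "y": [2.0, 4.1, 5.9, 8.2],
                   "z": [4.0, 3.0, 2.5, 1.0]})

cov = df.cov()   # sample covariance (ddof=1)
std = df.std()   # sample standard deviation (ddof=1)

# corr[i, j] = cov[i, j] / (std[i] * std[j])
manual_corr = cov / np.outer(std, std)

print(np.allclose(manual_corr.to_numpy(), df.corr().to_numpy()))
```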

df['1'].value_counts()

    # counts of each distinct value in column '1', sorted by count in descending order by default
    # for ascending order: df['1'].value_counts(ascending=True)
    
Out[77]: 
    5.0    10     # the value 5.0 occurs 10 times
    6.3     9
    5.1     9
    6.7     8
    5.7     8
    5.5     7
    5.8     

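    #df['1'].value_counts(ascending=True,bins=5)
    # split the range of column '1' into 5 equal-width bins and count the values in each bin; ascending=True sorts by count, smallest first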

    Out[78]: 
        (7.18, 7.9]      11
        (6.46, 7.18]     24
        (4.295, 5.02]    32
        (5.02, 5.74]     41
        (5.74, 6.46]     42
        Name: 1, dtype: int64
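
With bins, the result is ordered by count, not by interval. To read the bins from lowest to highest, sort the result by its index. A sketch on a small made-up Series, using 3 bins to keep it short:

```python
import pandas as pd

# Hypothetical measurements for illustration.
values = pd.Series([4.3, 5.0, 5.1, 5.2, 6.3, 7.0, 7.9])

by_count = values.value_counts(bins=3)   # 3 equal-width bins, largest count first
by_interval = by_count.sort_index()      # same counts, lowest interval first

print(by_count.tolist())
print(by_interval.tolist())
```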

df['1'].count()

    Number of non-missing values in the column

    df['1'].count()
    Out[80]: 150
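
count() skips missing values, so it can be smaller than the number of rows. A sketch with a made-up Series containing NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
col = pd.Series([5.1, np.nan, 4.7, np.nan, 5.0])

non_missing = int(col.count())     # NaN is not counted
total = len(col)                   # every row, missing or not
missing = int(col.isna().sum())    # number of NaN entries

print(non_missing, total, missing)
```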

Object operations

To be continued...

Copyright： Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.