PythonPlaza - Python & AI

Pandas

Pandas (stands for Python Data Analysis) is an open-source software library that is intended for data manipulation and analysis.It evolves around two primary Data structures: Series (1D) and DataFrame (2D).Pandas is built on top of NumPy, and efficiently manages large datasets an also offer ways for data cleaning, transformation, and analysis.Pandas also provide tools for working with time series data, including date range generation and frequency conversion. For example, we can convert date or time columns into pandas’ datetime type using pd.to_datetime(), or specify parse_dates=True during CSV loading. Pandas integrates with other Python libraries like NumPy, Matplotlib, and scikit-learn.

What is Pandas mainly used for?

With pandas, you can perform a wide range of data operations, such as follows
• Pandas are very effective in Data Analysis for various Machine Learning algorithms. It can be used to read and write data from various file formats like CSV, Excel and SQL databases.

• Machine Learning algorithms (both supervised and unsupervised) need clean data. Pandas can clean and prepare data by handling missing values and filtering entries.

• Sometimes data can be obtained from multiple sources, Pandas can help in merging and joining the multiple datasets.

• Pandas can reshape the data through pivoting and stacking operations.

• Pandas can also be used for statistical analysis and generating descriptive statistics.

• Pandas can help in visualizing data with plotting capabilities.

Create DataFrame

You can create a pandas DataFrame using various data structures from a Python dictionary, a list of lists, or a list of dictionaries and passing the data into DataFrame() methof of Pandas. The Pandas library should be imported.

import pandas as pd

Here are some common methods for creating a DataFrame:



Code Example 1:
#Create an empty Dataframe
import pandas as pd 
df=pd.DataFrame()

#print
print(df)

#Output:
Empty DataFrame
Columns: []
Index: []

Code Example 2:
#Create DataFrame from List
row1=[1,'John', 3.65]
row2=[2,'Mary', 3.76]
row3=[3,'Mike', 3.60]
row4=[4,'Amy',  3.75]

data=[row1,row2,row3,row4]
column_names=['Id','Name','GPA']

df=pd.DataFrame(data, columns=column_names)

#print
print(df)

#Output:
   Id  Name   GPA
0   1  John  3.65
1   2  Mary  3.76
2   3  Mike  3.60
3   4   Amy  3.75



Code Example 3:
#Create DataFrame from Dictionary

dict={'name':["Vinny","Tracy","Mike","John"],
       'age': [26, 25, 27, 29],
       'weight':[160, 145, 148, 151]
       }

df = pd.DataFrame(dict)

#print
print(df)


#Output:
    name  age  weight
0  Vinny   26     160
1  Tracy   25     145
2   Mike   27     148
3   John   29     151



Create Dataframe from Excel sheet

It is a most common way of creating a DataFrame. Use the read_excel() method and pass the path of the excel file as the parameter. If you have data in multiple excel sheets and you want data to create a DataFrame only from a particular excel sheet name, pass the name of the sheet as the second parameter. Let's see some examples.


Code Example 4:
#Create Dataframe from Excel
file_path = 'my_data.xlsx'

# Create the DataFrame
df = pd.read_excel(file_path)

# View the first 5 rows of the DataFrame
print(df.head())

Code Example 5:
#Read a specific sheet by name
file_path = 'my_data.xlsx'

# Create the DataFrame
df = pd.read_excel(file_path,sheet_name='Sheet1')

# View the first 5 rows of the DataFrame
print(df.head())


Handing Categorial text data

In Pandas workflows, LabelEncoder is a tool that is used to convert categorical text data into numerical labels (integers). This is a critical preprocessing step because string values cannot be processed by most of the machine learning algorithms and hence require numerical input for training and prediction. We can achieve this using LabelEncoder from sklearn. Let's see an example below.



Code Example 6:
#Dataframe with categotical values for gender
import pandas as pd
from sklearn.preprocessing import LabelEncoder


dict={'name':["Vinny","Tracy","Mike","John"],
       'gender':['M', 'F', 'M', 'M'],
       'weight':[160, 145, 148, 151]
       }

df = pd.DataFrame(dict)

le = LabelEncoder()

df['gender']=le.fit_transform(df['gender'])
print(df)

#Output:

    name  gender  weight
0  Vinny    1     160
1  Tracy    0     145
2   Mike    1     148
3   John    1     151


Dropping a column in DataFrame

To drop a column in DataFrame can be done by the following:

df = df.drop(columns=['col1', 'col2'])

By default drop() returns a new DataFrame. To modify the original object directly, use inplace=True.

df = df.drop(columns=['col1', 'col2'], inplace=True)



Code Example 7:
import pandas as pd
dict={'name':["Vinny","Tracy","Mike","John"],
       'gender':['M', 'F', 'M', 'M'],
       'weight':[160, 115, 148, 151],
       'age':[36, 38, 43, 35]
       }

df = pd.DataFrame(dict)
updated_df = df.drop(columns=['gender', 'weight'])
print(updated_df)

#Output:
    name  age
0  Vinny   36
1  Tracy   38
2   Mike   43
3   John   35

#use inplace=True to modify the original
DataFrame directly:
df = df.drop(columns=['gender', 'weight'], inplace=True)
print(df)

#Output:
    name  age
0  Vinny   36
1  Tracy   38
2   Mike   43
3   John   35

Convert a date to a numeric format

There are 2 ways to convert Date to a numeric format.
1.) Convert to Unix Timestamp (Seconds or Nanoseconds) and then divide by 1000000000 (10 to the power of 9)

2.) 2. Convert to YYYYMMDD Integer

Let's look at both examples.



Code Example 8:
#Convert to Unix Timestamp (Nanoseconds),
Then divide by 1000000000.

import pandas as pd

dict={"dat": ["2024 03 01 19:23:15.887", 
"2024 03 02 11:23:15.887"]}

df=pd.DataFrame(dict)

df['numeric_date'] = pd.to_datetime(df['dat']).astype(int) // 1000000000

print(df)

#Output:

     dat                       numeric_date
0   2024 03 01 19:23:15.887    1709320995
1   2024 03 02 11:23:15.887    1709378595


Code Example 9:
Convert to YYYYMMDD Integer

dict={"dat": ["2024 03 01 19:23:15.887", 
"2024 03 02 11:23:15.887"]}

df=pd.DataFrame(dict)
df['date_int'] = pd.to_datetime(df['dat']).dt.strftime('%Y%m%d').astype(int)

print(df)

#Output:

      dat                    date_int
0   2024 03 01 19:23:15.887  20240301
1   2024 03 02 11:23:15.887  20240302


Code Example 10:
Convert to YYYYMMDD Integer
import pandas as pd

dict={"dat": ["12/28/1974", 
"8/8/1980"]}

df=pd.DataFrame(dict)
df['date_int'] = pd.to_datetime(df['dat']).dt.strftime('%Y%m%d').astype(int)

print(df)

#Output:
 
          dat  date_int
0  12/28/1974  19741228
1    8/8/1980  19800808


Append different DataFrames

Sometimes you may need to append the columns from different DataFrames and create a new DataFrame. Let's see an example.



Code Example 11:
import pandas as pd
# Create a sample DataFrame
data = {'c1': ['A', 'B', 'A', 'C', 'B'],
        'c2': ['X', 'Y', 'X', 'Z', 'Y'],
        'o1': [10, 20, 30, 40, 50],
        'o2': [1.1, 2.2, 3.3, 4.4, 5.5],
        'x1':[4,6,7,8,2],
        'y1':[7,8,3,9,1]
}

df1=pd.DataFrame(data)
        
df2=df1[['x1','y1']]
df3=df1['o1']     
 
df_final = pd.concat([df2, df3], axis=1)

print(df_final)

#Output:

   x1  y1  o1
0   4   7  10
1   6   8  20
2   7   3  30
3   8   9  40
4   2   1  50


Handling negative values in DataFrames

Sometimes you may need to eliminate the negative values DataFrames, or convert the negative values to 0. Let's see some examples.



Code Example 12:
import pandas as pd
# Create a sample DataFrame
data = {'c1': [1, 5, -4, 5, 7],
        'c2': [4, 5, 7, 9, 1],
        'c3': [10, -3, 5, 4, 7],
        'c4': [1, 2, 9, 4, 5],

}

df=pd.DataFrame(data)
         
 
#Filter for only rows where a specific column's
#value is greater than or equal to zero.

df = df[df['c1'] >= 0]
df = df[df['c3'] >= 0]

print(df)

#Output: 
  c1  c2  c3  c4
0   1   4  10   1
3   5   9   4   4
4   7   1   7   5


Code Example 13:
#Replace negative values with 0
import pandas as pd
# Create a sample DataFrame
data = {'c1': [1, 5, -4, 5, 7],
        'c2': [4, 5, 7, 9, 1],
        'c3': [10, -3, 5, 4, 7],
        'c4': [1, 2, 9, 4, 5],

}

df=pd.DataFrame(data)
         
 df['c1'] = df['c1'].clip(lower=0) 
 df['c3'] = df['c3'].clip(lower=0)
print(df)

#Output: 
   c1  c2  c3  c4
0   1   4  10   1
1   5   5   0   2
2   0   7   5   9
3   5   9   4   4
4   7   1   7   5



How to remove spaces or "" in the DataFrame



Code Example 14:
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {'c1': [1, 5, "  ", 5, 7],
        'c2': [4, 5, 7, 9, 1],
        'c3': [10, -3, 5, 4, 7],
        'c4': [1, 23, '', 4, 5.],

}

df=pd.DataFrame(data)
df=df.replace(r'^\s*$', np.nan, regex=True).dropna()
#If you want Daframe to have only integers, then use astype(int)
df = df.astype(int)
print(df)

#Output: 
   c1  c2  c3  c4
0   1   4  10   1
1   5   5  -3  23
3   5   9   4   4
4   7   1   7   5

Remove rows with missing data in Excel files.



Code Example 15:

import pandas as pd
df = pd.read_excel('file_data.xlsx')
# Drop rows where at least one element is missing
df_cleaned = df.dropna()
# Drop rows only if all columns are missing
df_cleaned_all = df.dropna(how='all')
df_cleaned.to_excel('cleaned_file.xlsx', index=False)

Head() and Tail() method in Pandas.



Code Example 16:

import pandas as pd

# Creating a small sample DataFrame
data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
        'Score': [81, 91, 78, 92, 88, 75, 95]}
df = pd.DataFrame(data)

# 1. Using head() to see the first 5 rows (default)
print("--- First 5 Rows (head) ---")
print(df.head())

#Output: 
  Name  Score
0    A     81
1    B     91
2    C     78
3    D     92
4    E     88

# 2. Using head(n) to see a custom number of top rows
print("\n--- First 3 Rows (head(3)) ---")
print(df.head(3))

#Output: 
  Name  Score
0    A     81
1    B     91
2    C     78

# 3. Using tail() to see the last 5 rows (default)
print("\n--- Last 5 Rows (tail) ---")
print(df.tail())

#Output: 
  Name  Score
2    C     78
3    D     92
4    E     88
5    F     75
6    G     95

# 4. Using tail(n) to see a custom number of bottom rows
print("\n--- Last 2 Rows (tail(2)) ---")
print(df.tail(2))

#Output: 
  Name  Score
5    F     75
6    G     95


Test



Code Example 15:

import pandas as pd
df = pd.read_excel('file_data.xlsx')
# Drop rows where at least one element is missing
df_cleaned = df.dropna()
# Drop rows only if all columns are missing
df_cleaned_all = df.dropna(how='all')
df_cleaned.to_excel('cleaned_file.xlsx', index=False)



About Us  | Contact Us | Sitemap  | Privacy Policy