PythonPlaza - Python & AI

Pandas

Pandas (the name derives from "panel data" and doubles as shorthand for "Python data analysis") is an open-source library for data manipulation and analysis. It revolves around two primary data structures: Series (1D) and DataFrame (2D). Pandas is built on top of NumPy, manages large datasets efficiently, and offers tools for data cleaning, transformation, and analysis. It also provides tools for working with time series data, including date range generation and frequency conversion. For example, we can convert date or time columns into pandas’ datetime type using pd.to_datetime(), or have pd.read_csv() parse them during CSV loading through its parse_dates argument. Pandas integrates with other Python libraries such as NumPy, Matplotlib, and scikit-learn.
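
As a minimal sketch of the date handling described above, the snippet below parses a date column in both ways, generates a date range, and converts daily data to a 2-day frequency. The CSV text and the column names date and sales are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content; the column names are assumptions for this sketch
csv_text = "date,sales\n2024-01-01,10\n2024-01-02,12\n2024-01-03,9\n2024-01-04,15\n"

# Option 1: load first, then convert the column with pd.to_datetime()
df = pd.read_csv(io.StringIO(csv_text))
df["date"] = pd.to_datetime(df["date"])

# Option 2: ask read_csv to parse the named column while loading
df2 = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Date range generation: four consecutive daily timestamps
days = pd.date_range(start="2024-01-01", periods=4, freq="D")

# Frequency conversion: total sales per 2-day window
two_day_totals = df.set_index("date")["sales"].resample("2D").sum()
print(two_day_totals)
```

Note that parse_dates=True asks read_csv to try parsing the index, while a list of column names, as used here, targets specific columns.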

What is Pandas mainly used for?

With pandas, you can perform a wide range of data operations, such as the following:
• Pandas is very effective for the data analysis that precedes machine learning. It can read and write data in various formats, such as CSV files, Excel files, and SQL databases.

• Machine learning algorithms (both supervised and unsupervised) need clean data. Pandas can clean and prepare data by handling missing values and filtering entries.

• When data comes from multiple sources, Pandas can merge and join the datasets.

• Pandas can reshape the data through pivoting and stacking operations.

• Pandas can also be used for statistical analysis and generating descriptive statistics.

• Pandas can help in visualizing data with plotting capabilities.
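
Several of the operations listed above can be sketched in a few lines. The two small tables below, along with names such as students and majors, are invented for illustration; this is one possible way to chain these steps, not the only one.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration; one GPA is missing
students = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Name": ["John", "Mary", "Mike", "Amy"],
    "GPA": [3.65, np.nan, 3.60, 3.75],
})
majors = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Major": ["Math", "Physics", "Math", "Physics"],
})

# Cleaning: drop rows that have a missing GPA
clean = students.dropna(subset=["GPA"])

# Filtering: keep only students with a GPA above 3.62
high = clean[clean["GPA"] > 3.62]

# Merging: join the two tables on the shared Id column
merged = clean.merge(majors, on="Id", how="inner")

# Reshaping: mean GPA per major via a pivot table
by_major = merged.pivot_table(index="Major", values="GPA", aggfunc="mean")

# Descriptive statistics: count, mean, std, min, quartiles, max
print(merged["GPA"].describe())
```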

Create DataFrame

You can create a pandas DataFrame from several Python data structures, such as a dictionary, a list of lists, or a list of dictionaries, by passing the data to the pd.DataFrame() constructor. The pandas library must be imported first:

import pandas as pd

Here are some common methods for creating a DataFrame:



Code Example 1:
#Create an empty Dataframe
import pandas as pd 
df=pd.DataFrame()

#print
print(df)

#Output:
Empty DataFrame
Columns: []
Index: []

Code Example 2:
#Create DataFrame from List
row1=[1,'John', 3.65]
row2=[2,'Mary', 3.76]
row3=[3,'Mike', 3.60]
row4=[4,'Amy',  3.75]

data=[row1,row2,row3,row4]
column_names=['Id','Name','GPA']

df=pd.DataFrame(data, columns=column_names)

#print
print(df)

#Output:
   Id  Name   GPA
0   1  John  3.65
1   2  Mary  3.76
2   3  Mike  3.60
3   4   Amy  3.75



Code Example 3:
#Create DataFrame from Dictionary

# Use a name other than "dict" so the built-in dict type is not shadowed
data={'name':["Vinny","Tracy","Mike","John"],
       'age': [26, 25, 27, 29],
       'weight':[160, 145, 148, 151]
       }

df = pd.DataFrame(data)

#print
print(df)


#Output:
    name  age  weight
0  Vinny   26     160
1  Tracy   25     145
2   Mike   27     148
3   John   29     151
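
A DataFrame can also be built from a list of dictionaries, where each dictionary is one row. The sketch below reuses made-up values similar to the example above and leaves one key out on purpose, to show that pandas fills the gap with NaN.

```python
import pandas as pd

# Each dictionary is one row; the keys become column names
records = [
    {"name": "Vinny", "age": 26, "weight": 160},
    {"name": "Tracy", "age": 25, "weight": 145},
    {"name": "Mike", "age": 27},  # no weight given for this row
]

df = pd.DataFrame(records)

# The missing weight shows up as NaN
print(df)
```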



Create DataFrame from Excel sheet

Reading from an Excel file is one of the most common ways of creating a DataFrame. Use the read_excel() function and pass the path of the Excel file as the first argument. By default, the first sheet is read. If the workbook contains multiple sheets and you want to build the DataFrame from one particular sheet, pass its name through the sheet_name parameter. Let's see some examples.


Code Example 4:
#Create Dataframe from Excel
file_path = 'my_data.xlsx'

# Create the DataFrame
df = pd.read_excel(file_path)

# View the first 5 rows of the DataFrame
print(df.head())

Code Example 5:
#Read a specific sheet by name
file_path = 'my_data.xlsx'

# Create the DataFrame
df = pd.read_excel(file_path,sheet_name='Sheet1')

# View the first 5 rows of the DataFrame
print(df.head())


Variance

Variance is another number that indicates how spread out the values are.
In fact, if you take the square root of the variance, you get the standard deviation!
Or the other way around, if you multiply the standard deviation by itself, you get the variance!
To calculate the variance, follow these steps:

Find the variance of 32, 111, 138, 28, 59, 77, 97
1. Find the mean:
(32+111+138+28+59+77+97) / 7 ≈ 77.4
2. For each value: find the difference from the mean:

32 - 77.4 = -45.4
111 - 77.4 = 33.6
138 - 77.4 = 60.6
28 - 77.4 = -49.4
59 - 77.4 = -18.4
77 - 77.4 = -0.4
97 - 77.4 = 19.6

3. For each difference: find the square value:

(-45.4)² = 2061.16
(33.6)² = 1128.96
(60.6)² = 3672.36
(-49.4)² = 2440.36
(-18.4)² = 338.56
(-0.4)² = 0.16
(19.6)² = 384.16

4. The variance is the average of these squared differences:

(2061.16+1128.96+3672.36+2440.36+338.56+0.16+384.16) / 7 ≈ 1432.2
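
The four steps can be sketched in plain Python, without any library doing the work. The variable names are chosen for illustration. The last lines also check the relationship between variance and standard deviation mentioned earlier.

```python
import math

speed = [32, 111, 138, 28, 59, 77, 97]

# Step 1: the mean
mean = sum(speed) / len(speed)

# Steps 2 and 3: squared difference from the mean for each value
squared_diffs = [(value - mean) ** 2 for value in speed]

# Step 4: the variance is the average of the squared differences
variance = sum(squared_diffs) / len(speed)

# The standard deviation is the square root of the variance
std_dev = math.sqrt(variance)

print(round(mean, 1), round(variance, 1), round(std_dev, 1))
```

Because this version keeps the unrounded mean, its result differs slightly in the later decimals from the hand calculation, which rounds the mean to 77.4.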



Code Example 12:
#Variance using Numpy
import numpy
speed = [32,111,138,28,59,77,97]
x = numpy.var(speed)
print(x)

#Output:
1432.2448979591834
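
One point worth knowing when moving between NumPy and pandas: numpy.var() divides by n by default (population variance), while the pandas Series.var() method divides by n - 1 by default (sample variance). The sketch below shows how the ddof parameter switches between the two conventions, using the same speed values.

```python
import numpy as np
import pandas as pd

speed = [32, 111, 138, 28, 59, 77, 97]

# NumPy default (ddof=0): divide by n, the population variance
population_var = np.var(speed)

# ddof=1: divide by n - 1, the sample variance
sample_var = np.var(speed, ddof=1)

# pandas Series.var() uses ddof=1 by default, so it matches sample_var
pandas_var = pd.Series(speed).var()

# Passing ddof=0 makes pandas agree with NumPy's default
pandas_population_var = pd.Series(speed).var(ddof=0)

print(population_var, sample_var, pandas_var, pandas_population_var)
```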


Percentiles

A percentile is the value below which a given percentage of the values fall.
Example: Let's say we have an array that contains the ages of every person living on a street.
ages = [5,31,43,48,50,41,7,11,15,39]
What is the 75th percentile? The answer is 42.5, meaning that about 75% of the people are 42.5 or younger.

The NumPy module has a function for finding the specified percentile:



Code Example 13:
import numpy
ages = [5,31,43,48,50,41,7,11,15,39]
x = numpy.percentile(ages, 75)

print(x)

#Output:
42.5
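
The same calculation can be done in pandas. As a small sketch with the same ages list: numpy.percentile() accepts several percentiles at once on a 0-100 scale, while the pandas quantile() method expects fractions between 0 and 1.

```python
import numpy as np
import pandas as pd

ages = [5, 31, 43, 48, 50, 41, 7, 11, 15, 39]

# NumPy accepts a list of percentiles on a 0-100 scale
quartiles = np.percentile(ages, [25, 50, 75])

# pandas uses quantile() with fractions between 0 and 1
q75 = pd.Series(ages).quantile(0.75)

print(quartiles, q75)
```

Both functions interpolate linearly between neighbouring values by default, which is why the 75th percentile can be 42.5 even though no one on the street is that age.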