Understand Machine Learning Data with Descriptive Statistics in Python

In this tutorial we will learn how to understand machine learning data using descriptive statistics in Python. Descriptive statistics are essential for getting to know your data before modeling: they summarize the distribution, central tendency, and variability of each attribute in your dataset.

This is the dataset I use to illustrate the topic:

https://www.kaggle.com/datasets/ananthr1/weather-prediction

1.Peek Data:

Here I review the first 5 rows of my data using the head() function on the Pandas DataFrame.

import pandas
# Assumption: local copy of the CSV from the Kaggle page; read_csv() cannot parse the page URL itself.
filename = "weather.csv"
# Assumption: the file has a header line and six columns, with 'date' first.
names = ['date','precipitation','temp_max','temp_min','wind','weather']
data = pandas.read_csv(filename, names=names, header=0)
peek = data.head(5)
print(peek)
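To try head() without downloading anything, a small in-memory CSV is enough. The sketch below assumes a hypothetical five-row sample with the same column names; the values are made up for illustration and are not taken from the dataset.

```python
import io
import pandas

# Made-up sample rows (illustrative only, not real dataset values).
SAMPLE_CSV = """date,precipitation,temp_max,temp_min,wind,weather
2020-01-01,0.0,10.0,3.0,2.5,sun
2020-01-02,4.2,9.0,2.0,3.5,rain
2020-01-03,1.1,8.5,1.5,4.0,rain
2020-01-04,0.0,11.0,4.0,1.5,sun
2020-01-05,0.3,7.0,0.5,2.0,fog
"""

names = ['date', 'precipitation', 'temp_max', 'temp_min', 'wind', 'weather']
# header=0 tells pandas to discard the file's own header line and use `names` instead.
sample = pandas.read_csv(io.StringIO(SAMPLE_CSV), names=names, header=0)

peek = sample.head(3)  # keep only the first three rows
print(peek)
```

Calling head() with no argument returns the first five rows, so the number only needs to be passed when you want a different count.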

2.Dimensions of your Data:

Here we find the shape and size of the dataset by printing the shape property of the Pandas DataFrame.

import pandas
# Assumption: local copy of the CSV from the Kaggle page; read_csv() cannot parse the page URL itself.
filename = "weather.csv"
# Assumption: the file has a header line and six columns, with 'date' first.
names = ['date','precipitation','temp_max','temp_min','wind','weather']
data = pandas.read_csv(filename, names=names, header=0)
shape = data.shape
print(shape)

The result is listed as rows, then columns. Here it is 1462 rows and 6 columns.

(1462, 6)
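The row count depends on how the header line is handled, which is easy to check on a toy file. The sketch below uses a made-up three-column CSV (the names a, b, c are placeholders) to show that passing names without header=0 makes pandas count the header line as a data row.

```python
import io
import pandas

# Toy CSV: one header line followed by three data lines (placeholder values).
csv_text = "a,b,c\n1,2,3\n4,5,6\n7,8,9\n"
names = ['a', 'b', 'c']

# Without header=0 the header line is read as an ordinary row.
header_as_data = pandas.read_csv(io.StringIO(csv_text), names=names)
# With header=0 the header line is replaced by `names`.
clean = pandas.read_csv(io.StringIO(csv_text), names=names, header=0)

rows, cols = clean.shape  # shape is a (rows, columns) tuple
print(header_as_data.shape, clean.shape)
print(rows, cols)
```

If the reported row count is one higher than expected, this is a likely cause worth ruling out.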

3.Data Type For Each Attribute:

Here we list the data types the DataFrame uses for each attribute, using the dtypes property.

import pandas
# Assumption: local copy of the CSV from the Kaggle page; read_csv() cannot parse the page URL itself.
filename = "weather.csv"
# Assumption: the file has a header line and six columns, with 'date' first.
names = ['date','precipitation','temp_max','temp_min','wind','weather']
data = pandas.read_csv(filename, names=names, header=0)
types = data.dtypes
print(types)

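The result looks like this (pandas reports floating-point columns as float64 and text columns typically as object):

precipitation    float64
temp_max         float64
temp_min         float64
wind             float64
weather           object
dtype: object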
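Knowing the types is useful because the later statistics only apply to numeric columns. A minimal sketch, assuming a made-up three-row frame, of converting a text date column and picking out the numeric columns:

```python
import pandas

# Made-up rows for illustration only.
frame = pandas.DataFrame({
    'date': ['2020-01-01', '2020-01-02', '2020-01-03'],
    'wind': [1.5, 2.0, 3.5],
    'weather': ['sun', 'rain', 'sun'],
})

# Dates read from CSV arrive as text; convert them to real datetimes.
frame['date'] = pandas.to_datetime(frame['date'])

# select_dtypes(include='number') keeps integer and float columns only.
numeric_columns = frame.select_dtypes(include='number').columns.tolist()

print(frame.dtypes)
print(numeric_columns)
```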

4.Descriptive Statistics:

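Here we use the describe() function on the Pandas DataFrame to summarize each numeric attribute (count, mean, standard deviation, minimum, quartiles, and maximum).

import pandas
# Assumption: local copy of the CSV from the Kaggle page; read_csv() cannot parse the page URL itself.
filename = "weather.csv"
# Assumption: the file has a header line and six columns, with 'date' first.
names = ['date','precipitation','temp_max','temp_min','wind','weather']
data = pandas.read_csv(filename, names=names, header=0)
pandas.set_option('display.width', 100)
# The full option name is 'display.precision'
pandas.set_option('display.precision', 3)
description = data.describe()
print(description)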
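The summary returned by describe() is itself a DataFrame, so individual statistics can be looked up by row label. A small sketch on made-up temperature values:

```python
import pandas

# Made-up values for illustration only.
frame = pandas.DataFrame({
    'temp_max': [10.0, 12.0, 14.0, 16.0],
    'temp_min': [2.0, 4.0, 6.0, 8.0],
})

description = frame.describe()

# Rows are labeled by statistic, columns by attribute.
mean_temp_max = description.loc['mean', 'temp_max']
median_temp_min = description.loc['50%', 'temp_min']

print(description)
print(mean_temp_max, median_temp_min)
```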

5.Class Distribution:

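Here we check how balanced the class values are by counting the rows in each class. In this dataset the class attribute is the weather column.

import pandas
# Assumption: local copy of the CSV from the Kaggle page; read_csv() cannot parse the page URL itself.
filename = "weather.csv"
# Assumption: the file has a header line and six columns, with 'date' first.
names = ['date','precipitation','temp_max','temp_min','wind','weather']
data = pandas.read_csv(filename, names=names, header=0)
# Group by the class column ('weather') and count the rows in each group
class_counts = data.groupby('weather').size()
print(class_counts)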
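Raw counts are easier to judge as proportions. The sketch below, on a made-up list of labels, uses value_counts() as an alternative to groupby().size() and also reports each class's share of the rows.

```python
import pandas

# Made-up class labels for illustration only.
labels = pandas.Series(['rain', 'rain', 'sun', 'rain', 'fog', 'sun'], name='weather')

counts = labels.value_counts()                # rows per class, largest first
shares = labels.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```

A class with a very small share may need special handling later, for example resampling or class weights.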

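6.Correlation:

Correlation describes the relationship between two variables and how they change together. We can use the corr() function on the Pandas DataFrame to calculate a correlation matrix. With method='pearson' each value ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation).

import pandas
# Assumption: local copy of the CSV from the Kaggle page; read_csv() cannot parse the page URL itself.
filename = "weather.csv"
# Assumption: the file has a header line and six columns, with 'date' first.
names = ['date','precipitation','temp_max','temp_min','wind','weather']
data = pandas.read_csv(filename, names=names, header=0)
pandas.set_option('display.width', 100)
pandas.set_option('display.precision', 3)
# numeric_only=True skips the text columns, which have no correlation
correlations = data.corr(method='pearson', numeric_only=True)
print(correlations)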
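The extremes of the Pearson coefficient are easy to see on toy data. In the sketch below the column names x, y, z and their values are made up: y rises exactly with x, while z falls exactly as x rises.

```python
import pandas

# Made-up values: y = 2 * x, z decreases linearly as x increases.
frame = pandas.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0],
    'y': [2.0, 4.0, 6.0, 8.0],
    'z': [8.0, 6.0, 4.0, 2.0],
    'label': ['a', 'b', 'a', 'b'],  # text column, excluded below
})

# numeric_only=True leaves the text column out of the matrix.
correlations = frame.corr(method='pearson', numeric_only=True)

print(correlations)
```

The matrix is symmetric, and its diagonal is always 1 because every column correlates perfectly with itself.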

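7.Skew:

Skew measures how asymmetric a distribution is. We can use the skew() function on the Pandas DataFrame to calculate the skew of each numeric attribute: a positive value means a longer right tail, a negative value means a longer left tail, and values close to zero show less skew.

import pandas
# Assumption: local copy of the CSV from the Kaggle page; read_csv() cannot parse the page URL itself.
filename = "weather.csv"
# Assumption: the file has a header line and six columns, with 'date' first.
names = ['date','precipitation','temp_max','temp_min','wind','weather']
data = pandas.read_csv(filename, names=names, header=0)
# numeric_only=True skips the text columns
skew = data.skew(numeric_only=True)
print(skew)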
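The sign of the skew can be checked on toy data. In the sketch below both columns are made up: one is symmetric around its mean and the other has a single large value that stretches the right tail.

```python
import pandas

# Made-up values for illustration only.
frame = pandas.DataFrame({
    'symmetric': [1.0, 2.0, 3.0, 4.0, 5.0],
    'right_tailed': [1.0, 1.0, 1.0, 2.0, 10.0],
})

skew = frame.skew()

print(skew)
```

The symmetric column comes out at essentially zero, while the right-tailed column is positive.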

 
