In this tutorial we will learn about descriptive statistics for machine learning in Python. Descriptive statistics are essential for understanding machine learning data. They summarize the distribution, central tendency, and variability of your dataset.
This is the dataset I have used to explore the topic:
https://www.kaggle.com/datasets/ananthr1/weather-prediction
1. Peek at the Data:
Here I review the first 5 rows of my data using the head() function on the Pandas DataFrame.
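    import pandas

    # Download the CSV from the Kaggle page first; read_csv cannot read the page URL.
    # The local filename and the added 'date' column name are assumptions.
    path = "seattle-weather.csv"
    names = ['date', 'precipitation', 'temp_max', 'temp_min', 'wind', 'weather']
    # header=0 assumes the file has its own header row, which these names replace
    data = pandas.read_csv(path, names=names, header=0)
    peek = data.head(5)
    print(peek)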
2. Dimensions of Your Data:
Here we find the number of rows and columns in the dataset by printing the shape property of the Pandas DataFrame.
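    import pandas

    # Download the CSV from the Kaggle page first; read_csv cannot read the page URL.
    # The local filename and the added 'date' column name are assumptions.
    path = "seattle-weather.csv"
    names = ['date', 'precipitation', 'temp_max', 'temp_min', 'wind', 'weather']
    # header=0 assumes the file has its own header row, which these names replace
    data = pandas.read_csv(path, names=names, header=0)
    shape = data.shape
    print(shape)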
The result is listed as rows, then columns. Here there are 1462 rows and 6 columns.
(1462, 6)
3. Data Type for Each Attribute:
Here we list the data type the DataFrame uses for each attribute, using the dtypes property.
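    import pandas

    # Download the CSV from the Kaggle page first; read_csv cannot read the page URL.
    # The local filename and the added 'date' column name are assumptions.
    path = "seattle-weather.csv"
    names = ['date', 'precipitation', 'temp_max', 'temp_min', 'wind', 'weather']
    # header=0 assumes the file has its own header row, which these names replace
    data = pandas.read_csv(path, names=names, header=0)
    types = data.dtypes
    print(types)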
The result shows precipitation, temp_max, temp_min, and wind stored as floats (float64) and weather stored as strings (shown as object in most pandas versions).
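4. Descriptive Statistics:
Here we use the describe() function on the Pandas DataFrame to summarize each numeric attribute: count, mean, standard deviation, minimum, quartiles, and maximum.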
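    import pandas

    # Download the CSV from the Kaggle page first; read_csv cannot read the page URL.
    # The local filename and the added 'date' column name are assumptions.
    path = "seattle-weather.csv"
    names = ['date', 'precipitation', 'temp_max', 'temp_min', 'wind', 'weather']
    # header=0 assumes the file has its own header row, which these names replace
    data = pandas.read_csv(path, names=names, header=0)
    pandas.set_option('display.width', 100)
    # The option is named 'display.precision'; plain 'precision' is rejected by recent pandas
    pandas.set_option('display.precision', 3)
    description = data.describe()
    print(description)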
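To make the rows of the describe() output less of a black box, here is a minimal plain-Python sketch of the same summary values for a single numeric column. The sample values and the summarize name are made up for illustration, and the quartiles assume linear interpolation between data points, which is my understanding of the pandas default.

```python
import statistics

# Hypothetical sample values standing in for one numeric column
values = [2.0, 4.0, 4.0, 5.0, 10.0]

def summarize(column):
    """Return the summary values describe() reports for a numeric column."""
    # method='inclusive' interpolates linearly between data points
    q1, median, q3 = statistics.quantiles(column, n=4, method='inclusive')
    return {
        'count': len(column),
        'mean': statistics.mean(column),
        'std': statistics.stdev(column),  # sample standard deviation (divides by n - 1)
        'min': min(column),
        '25%': q1,
        '50%': median,
        '75%': q3,
        'max': max(column),
    }

summary = summarize(values)
for name, value in summary.items():
    print(f"{name:>5}: {value:.3f}")
```

Comparing the mean with the 50% value is a quick first hint of skew: in this sample the mean sits above the median because of the single large value.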
5. Class Distribution:
Here we count how many rows belong to each class (each value of the weather column) to see how balanced the classes are.
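    import pandas

    # Download the CSV from the Kaggle page first; read_csv cannot read the page URL.
    # The local filename and the added 'date' column name are assumptions.
    path = "seattle-weather.csv"
    names = ['date', 'precipitation', 'temp_max', 'temp_min', 'wind', 'weather']
    # header=0 assumes the file has its own header row, which these names replace
    data = pandas.read_csv(path, names=names, header=0)
    # The class label lives in the 'weather' column; there is no 'class' column
    class_counts = data.groupby('weather').size()
    print(class_counts)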
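Raw counts are easier to judge once they are turned into proportions. A minimal plain-Python sketch, assuming a short made-up list of labels in place of the real weather column (class_distribution is a hypothetical helper name):

```python
from collections import Counter

# Hypothetical labels standing in for the 'weather' column
labels = ['rain', 'sun', 'rain', 'drizzle', 'sun', 'rain']

def class_distribution(values):
    """Return {label: (count, share of all rows)} for a list of class labels."""
    counts = Counter(values)
    total = len(values)
    return {label: (n, n / total) for label, n in counts.items()}

distribution = class_distribution(labels)
for label, (n, share) in sorted(distribution.items()):
    print(f"{label:<8} {n:>3} {share:.1%}")
```

A class whose share is far below the others is a sign that the dataset is imbalanced and may need special handling before training.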
6. Correlation:
Correlation measures the relationship between two variables. We can use the corr() function on the Pandas DataFrame to calculate a correlation matrix.
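    import pandas

    # Download the CSV from the Kaggle page first; read_csv cannot read the page URL.
    # The local filename and the added 'date' column name are assumptions.
    path = "seattle-weather.csv"
    names = ['date', 'precipitation', 'temp_max', 'temp_min', 'wind', 'weather']
    # header=0 assumes the file has its own header row, which these names replace
    data = pandas.read_csv(path, names=names, header=0)
    pandas.set_option('display.width', 100)
    pandas.set_option('display.precision', 3)
    # Pearson correlation needs numeric columns, so text columns are dropped first
    correlations = data.select_dtypes(include='number').corr(method='pearson')
    print(correlations)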
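For reference, the Pearson coefficient reported for each pair of columns can be sketched in plain Python. The pearson function name and the sample values below are made up for illustration; this is a sketch of the textbook formula, not the pandas implementation.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Note: a constant column makes sxx or syy zero, so r is undefined there
    return sxy / math.sqrt(sxx * syy)

# Hypothetical sample values
temp_max = [10.0, 12.0, 14.0, 16.0, 18.0]
temp_min = [4.0, 5.0, 6.0, 7.0, 8.0]   # rises together with temp_max
wind = [6.0, 5.0, 4.0, 3.0, 2.0]       # falls as temp_max rises

r_same = pearson(temp_max, temp_min)
r_opposite = pearson(temp_max, wind)
print(r_same, r_opposite)
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear relationship.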
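7. Skew:
Skew measures how asymmetric the distribution of an attribute is. We can use the skew() function on the Pandas DataFrame to calculate it for each numeric attribute.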
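    import pandas

    # Download the CSV from the Kaggle page first; read_csv cannot read the page URL.
    # The local filename and the added 'date' column name are assumptions.
    path = "seattle-weather.csv"
    names = ['date', 'precipitation', 'temp_max', 'temp_min', 'wind', 'weather']
    # header=0 assumes the file has its own header row, which these names replace
    data = pandas.read_csv(path, names=names, header=0)
    # Skew is only defined for numeric columns, so text columns are dropped first
    skew = data.select_dtypes(include='number').skew()
    print(skew)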
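To show what a skew value means, here is a minimal plain-Python sketch of the bias-adjusted sample skewness. To my understanding this is the variant that skew() reports, but treat that as an assumption; the sample_skew name and the sample values are made up for illustration.

```python
import statistics

def sample_skew(values):
    """Bias-adjusted sample skewness; needs at least 3 values that are not all equal."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least 3 values")
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    cubed = sum((v - mean) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * cubed / s ** 3

# Hypothetical sample values
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [2.0, 4.0, 4.0, 5.0, 10.0]  # one large value stretches the right tail

skew_symmetric = sample_skew(symmetric)
skew_right = sample_skew(right_tailed)
print(skew_symmetric, skew_right)
```

A value near 0 suggests a roughly symmetric distribution, a positive value a longer right tail, and a negative value a longer left tail.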