eda_on_iris.py
# -*- coding: utf-8 -*-
"""EDA_On_Iris.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1aVhHn2N8WSNNAZVa7Y2qbpCoBG9QTw6U
**In this notebook, I have done the Exploratory Data Analysis of the famous Iris dataset and tried to gain useful insights from the data.**
"""
# Commented out IPython magic to ensure Python compatibility.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# %matplotlib inline
"""**pandas** is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.
Load the Iris dataset CSV file.
"""
iris_df = pd.read_csv("/content/iris.csv")
# Show the first 5 rows of the dataset
iris_df.head()
# Number of rows and columns
iris_df.shape
iris_df.columns
# Data type and number of unique values of each column
def inspect_data(data):
    return pd.DataFrame({"Data Type":data.dtypes,"No of Level":data.apply(lambda x: x.nunique(),axis=0)})
inspect_data(iris_df)
iris_df.info()
"""**Statistical Insights**"""
iris_df.describe()
plt.title('Species Count')
sns.countplot(x='species',data=iris_df);
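The class balance shown by the count plot can also be checked numerically. This is a minimal sketch, assuming a hypothetical `toy_df` with illustrative labels stands in for `iris_df` so that the snippet runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for iris_df; the labels are illustrative only.
toy_df = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"]
})

# Absolute and relative frequency of each class
counts = toy_df["species"].value_counts()
shares = toy_df["species"].value_counts(normalize=True)

# The classes are balanced when every class has the same count
is_balanced = bool(counts.nunique() == 1)

print(counts)
print(shares.round(3))
print("balanced:", is_balanced)
```

Applied to the real data, the same two `value_counts` calls on `iris_df['species']` would give the numbers behind the count plot.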
"""**Visualising relations between variables**"""
plt.title('Comparison between sepal width and length')
sns.scatterplot(x='sepal_length',y='sepal_width',data=iris_df);
plt.figure(figsize=(16,9))
plt.title('Comparison between sepal width and length on the basis of species')
sns.scatterplot(x='sepal_length',y='sepal_width',hue='species',data=iris_df,s=50);
"""From the above visualization, we can see that the setosa species has shorter but wider sepals. Versicolor lies roughly in the middle for both length and width, while virginica has longer sepals and smaller sepal widths.
**Comparison between petal length and petal width**
"""
plt.title('Comparison between petal width and length')
sns.scatterplot(x='petal_length',y='petal_width',data=iris_df);
"""According to species"""
plt.title('Comparison between petal length and petal width on the basis of species')
sns.scatterplot(x='petal_length',y='petal_width',hue='species',data=iris_df,s=50);
"""We see that setosa has the smallest petal lengths and petal widths, Versicolor has intermediate petal lengths and widths, and virginica has the largest petal lengths and widths.
**Visualization using Pair Plots**
"""
sns.pairplot(iris_df,vars = iris_df.columns[:4], hue="species");
cat_col = iris_df.dtypes[iris_df.dtypes=='object'].index
print('Number of Categorical Features:',len(cat_col),'\n')
print(cat_col)
Num_col = iris_df.dtypes[iris_df.dtypes!='object'].index
print('Number of Numerical Features:',len(Num_col),'\n')
print(Num_col)
#Summary of Numeric Features
iris_df[Num_col].describe()
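One way to read a summary like the one above is to compare the mean and median of each column and to look at the sample skewness. This is a minimal sketch under stated assumptions: `toy_df` and its two columns are hypothetical and chosen only to contrast a symmetric column with a right-skewed one.

```python
import pandas as pd

# Hypothetical columns: one symmetric, one with a long right tail.
toy_df = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 2.0, 2.0, 14.0],
})

# Collect mean, median and sample skewness per column
summary = pd.DataFrame({
    "mean": toy_df.mean(),
    "median": toy_df.median(),
    "skew": toy_df.skew(),
})

# A large gap between mean and median hints at a skewed distribution
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```

On the real data the same table could be built from `iris_df[Num_col]`; a gap close to zero and a skewness close to zero would support leaving the features untransformed.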
"""The mean and median of each numeric feature are close to each other, so no **data transformation** is required.
The data you have available may not be in the right format or may require transformations to make it more useful. Data transformation activities and techniques include:
1. Categorical encoding
2. Dealing with skewed data
3. Bias mitigation
4. Scaling
5. Rank transformation
6. Power functions
https://towardsdatascience.com/data-preparation-for-machine-learning-cleansing-transformation-feature-engineering-d2334079b06d
**Correlation Between Variables**
https://www.mygreatlearning.com/blog/covariance-vs-correlation/
"""
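# Restrict to the numeric columns; recent pandas versions raise an error on the string 'species' column
iris_df[Num_col].corr()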
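To make explicit what each entry of the correlation matrix means, here is a minimal sketch of the Pearson correlation coefficient, which is the default method of `DataFrame.corr`. The helper `pearson_r` and the sample values are hypothetical and only illustrate the formula.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations (the 1/n factors cancel)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Hypothetical petal measurements in cm, for illustration only
petal_length = [1.4, 1.5, 4.5, 4.7, 5.9, 6.1]
petal_width = [0.2, 0.3, 1.5, 1.4, 2.1, 2.3]

r = pearson_r(petal_length, petal_width)
print(round(r, 3))
```

Values near +1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear relationship.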
"""**Histograms**
A histogram shows how many values of each feature fall within each bin (value range).
"""
fig, axes = plt.subplots(2, 2, figsize=(16,9))
axes[0,0].set_title("Distribution of Sepal Width")
axes[0,0].hist(iris_df['sepal_width'], bins=5);
axes[0,1].set_title("Distribution of Sepal Length")
axes[0,1].hist(iris_df['sepal_length'], bins=7);
axes[1,0].set_title("Distribution of Petal Width")
axes[1,0].hist(iris_df['petal_width'], bins=5);
axes[1,1].set_title("Distribution of Petal Length")
axes[1,1].hist(iris_df['petal_length'], bins=6);
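When `bins` is an integer, as in the calls above, the data range is split into that many equal-width intervals and the values in each interval are counted, with the last interval also including the maximum. The sketch below imitates that counting in plain Python; `equal_width_counts` is a hypothetical helper, and values that fall exactly on a bin edge may be assigned differently by the library because of floating-point rounding.

```python
def equal_width_counts(values, bins):
    """Count values in `bins` equal-width intervals spanning [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; bin width would be zero")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin instead of a new one
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical values, for illustration only
values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
counts = equal_width_counts(values, bins=5)
print(counts)
```

Changing the number of bins changes the shape of the histogram, which is why the four panels above use slightly different bin counts.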
"""**Univariate Analysis of our columns**"""
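# sns.distplot is deprecated; histplot with a KDE overlay is the closest replacement
sns.FacetGrid(iris_df,hue="species",height=5).map(sns.histplot,"petal_width",kde=True,stat="density").add_legend();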
"""We see the setosa is easily separable while some portions of Versicolor and virginica are mixed."""
sns.FacetGrid(iris_df,hue="species",height=5).map(sns.histplot,"petal_length",kde=True,stat="density").add_legend();
"""Again we see that on the basis of petal length setosa is separable while the other two are still overlapping."""
sns.FacetGrid(iris_df,hue="species",height=5).map(sns.histplot,"sepal_length",kde=True,stat="density").add_legend();
"""We see it is quite tough to separate the species on the basis of sepal_length alone."""
sns.FacetGrid(iris_df,hue="species",height=5).map(sns.histplot,"sepal_width",kde=True,stat="density").add_legend();
"""The overlap between species is even more pronounced in the case of sepal_width.
**Box Plot**
"""
fig, axes = plt.subplots(2, 2, figsize=(16,9))
sns.boxplot( y="petal_width", x= "species", data=iris_df, orient='v' , ax=axes[0, 0])
sns.boxplot( y="petal_length", x= "species", data=iris_df, orient='v' , ax=axes[0, 1])
sns.boxplot( y="sepal_length", x= "species", data=iris_df, orient='v' , ax=axes[1, 0])
sns.boxplot( y="sepal_width", x= "species", data=iris_df, orient='v' , ax=axes[1, 1])
plt.show()
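By default, seaborn box plots draw the whiskers out to the furthest points within 1.5 times the interquartile range (IQR) of the box and mark anything beyond as an outlier. The sketch below applies that rule by hand; `iqr_outliers` and the sample values are hypothetical, and the "inclusive" quartile method is an assumption chosen because it interpolates linearly between data points.

```python
from statistics import quantiles

def iqr_outliers(values, whis=1.5):
    """Return the points outside [Q1 - whis*IQR, Q3 + whis*IQR] and the fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)

# Hypothetical measurements in cm, for illustration only
sample = [1.3, 1.4, 1.4, 1.5, 1.5, 1.6, 1.7, 3.0]
outliers, (lower, upper) = iqr_outliers(sample)
print("fences:", round(lower, 4), round(upper, 4))
print("outliers:", outliers)
```

Running the same check per species on the real columns would list the points drawn as individual markers in the box plots.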
"""1. The box plots show that setosa generally has the smallest petal widths, petal lengths and sepal lengths, with a few outliers.
2. The Versicolor species has intermediate values for these features.
3. The virginica species has the largest petal widths, petal lengths and sepal lengths compared with the others.
**Violin Plot**
"""
fig, axes = plt.subplots(2, 2, figsize=(16,9))
sns.violinplot(y="petal_width", x= "species", data=iris_df, orient='v' , ax=axes[0, 0])
sns.violinplot(y="petal_length", x= "species", data=iris_df, orient='v' , ax=axes[0, 1])
sns.violinplot(y="sepal_length", x= "species", data=iris_df, orient='v' , ax=axes[1, 0])
sns.violinplot(y="sepal_width", x= "species", data=iris_df, orient='v' , ax=axes[1, 1])
plt.show()
"""#**Conclusion**
1. The dataset is balanced, i.e. an equal number of records is present for each of the three species.
2. There are four numerical columns and one categorical column, which is the target column.
3. A strong correlation is present between petal width and petal length.
4. The setosa species is the most easily distinguishable because of its small petal dimensions.
5. The Versicolor and Virginica species overlap and are sometimes hard to separate, although Versicolor usually has intermediate feature sizes and virginica has larger ones.
"""
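The separability statements in the conclusion can be checked by comparing the value range of a feature for each species. This is a minimal sketch, assuming hypothetical rows and helper names (`value_ranges`, `ranges_overlap`); the numbers are illustrative and are not taken from the dataset.

```python
def value_ranges(rows, feature):
    """Map each species to the (min, max) of `feature` over its rows."""
    ranges = {}
    for row in rows:
        value = row[feature]
        lo, hi = ranges.get(row["species"], (value, value))
        ranges[row["species"]] = (min(lo, value), max(hi, value))
    return ranges

def ranges_overlap(a, b):
    """True when the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical rows in cm, for illustration only
rows = [
    {"species": "setosa", "petal_length": 1.3},
    {"species": "setosa", "petal_length": 1.7},
    {"species": "versicolor", "petal_length": 3.5},
    {"species": "versicolor", "petal_length": 5.0},
    {"species": "virginica", "petal_length": 4.8},
    {"species": "virginica", "petal_length": 6.4},
]

ranges = value_ranges(rows, "petal_length")
print(ranges)
print("setosa vs versicolor overlap:",
      ranges_overlap(ranges["setosa"], ranges["versicolor"]))
print("versicolor vs virginica overlap:",
      ranges_overlap(ranges["versicolor"], ranges["virginica"]))
```

A species whose range does not overlap any other can be separated with a single threshold on that feature, while overlapping ranges mean at least one more feature is needed.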