# -*- coding: utf-8 -*-
"""EDA_On_Iris.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1aVhHn2N8WSNNAZVa7Y2qbpCoBG9QTw6U

**This notebook performs an Exploratory Data Analysis (EDA) of the well-known Iris dataset to draw useful insights from the data.**
"""

# Commented out IPython magic to ensure Python compatibility.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# %matplotlib inline

"""**pandas** is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

Load the Iris dataset from a CSV file.
"""

iris_df = pd.read_csv("/content/iris.csv")

# Show the first 5 rows of the dataset
iris_df.head()

# Number of rows and columns
iris_df.shape

iris_df.columns

# Data type and number of unique values of each column
def inspect_data(data):
  return pd.DataFrame({"Data Type": data.dtypes, "No of Level": data.nunique()})
inspect_data(iris_df)

iris_df.info()

"""**Statistical Insights**"""

iris_df.describe()

plt.title('Species Count')
# Pass the column by keyword; recent seaborn versions reject positional data vectors
sns.countplot(x='species', data=iris_df);
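
"""The balance shown in the count plot can also be checked numerically. This is a minimal sketch: `class_balance` is a hypothetical helper (not part of the original notebook) and `toy_species` is made-up data used only to show the call."""

import pandas as pd


def class_balance(df, target):
    # Count rows per class and express each count as a share of all rows.
    counts = df[target].value_counts()
    return pd.DataFrame({"count": counts, "share": counts / len(df)})


# Hypothetical toy frame; on the real data the call would be
# class_balance(iris_df, "species").
toy_species = pd.DataFrame({"species": ["setosa", "setosa", "versicolor", "virginica"]})
balance = class_balance(toy_species, "species")
print(balance)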

"""**Visualising relations between variables**"""

plt.title('Comparison between sepal width and length')
sns.scatterplot(x='sepal_length', y='sepal_width', data=iris_df);

plt.figure(figsize=(16,9))
plt.title('Comparison between sepal width and length by species')
sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=iris_df, s=50);

"""The plot above shows that setosa has shorter but wider sepals. Versicolor lies near the middle for both length and width, while virginica has longer sepals and smaller sepal widths than setosa.

**Comparison between petal length and petal width**
"""

plt.title('Comparison between petal width and length')
sns.scatterplot(x='petal_length', y='petal_width', data=iris_df);

"""By species"""

plt.title('Comparison between petal length and petal width by species')
sns.scatterplot(x='petal_length', y='petal_width', hue='species', data=iris_df, s=50);

"""Setosa has the smallest petal lengths and widths, versicolor has intermediate values, and virginica has the largest petal lengths and widths.

**Visualization using Pair Plots**
"""

sns.pairplot(iris_df,vars = iris_df.columns[:4], hue="species");
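
"""The per-species differences visible in the plots above can be backed with group averages. A minimal sketch, assuming a hypothetical helper `species_means` (not in the original notebook); `toy_petals` holds made-up values for illustration only."""

import pandas as pd


def species_means(df, target):
    # Average every numeric column within each class of the target column.
    return df.groupby(target).mean(numeric_only=True)


# Hypothetical toy frame; on the real data the call would be
# species_means(iris_df, "species").
toy_petals = pd.DataFrame({
    "petal_length": [1.4, 1.6, 4.5, 5.5],
    "petal_width": [0.2, 0.4, 1.5, 2.0],
    "species": ["setosa", "setosa", "versicolor", "virginica"],
})
means = species_means(toy_petals, "species")
print(means)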

cat_col = iris_df.dtypes[iris_df.dtypes=='object'].index

print('Number of Categorical Features:',len(cat_col),'\n')
print(cat_col)

Num_col = iris_df.dtypes[iris_df.dtypes!='object'].index
print('Number of Numerical Features:',len(Num_col),'\n')
print(Num_col)
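
"""Once the numeric columns are identified, the mean and median of each can be placed side by side, together with the skewness, to judge whether a transformation is worth considering. A minimal sketch: `mean_median_gap` is a hypothetical helper (not in the original notebook) and `toy_numeric` is made-up data."""

import pandas as pd


def mean_median_gap(df):
    # Compare mean and median per numeric column; a skewness near 0
    # suggests a roughly symmetric distribution.
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "gap": numeric.mean() - numeric.median(),
        "skew": numeric.skew(),
    })


# Hypothetical toy frame: column "a" is symmetric, column "b" has one large value.
# On the real data the call would be mean_median_gap(iris_df).
toy_numeric = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [1, 1, 1, 1, 11]})
gap_report = mean_median_gap(toy_numeric)
print(gap_report)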

# Summary of numeric features
iris_df[Num_col].describe()

"""The mean and median of each numeric feature do not differ greatly, so no **data transformation** is required.

The data you have available may not be in the right format or may require transformations to make it more useful. Data transformation activities and techniques include:

1. Categorical encoding
2. Dealing with skewed data
3. Bias mitigation
4. Scaling
5. Rank transformation
6. Power functions

https://towardsdatascience.com/data-preparation-for-machine-learning-cleansing-transformation-feature-engineering-d2334079b06d

**Correlation Between Variables**

https://www.mygreatlearning.com/blog/covariance-vs-correlation/
"""

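# Restrict to the numeric columns; the string-valued 'species' column cannot be correlated
iris_df[Num_col].corr()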
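
"""To read the strongest relationship off a correlation matrix without scanning it by eye, the upper triangle can be flattened and ranked. A minimal sketch: `strongest_pair` is a hypothetical helper (not in the original notebook) and `toy_xyz` is made-up data."""

import numpy as np
import pandas as pd


def strongest_pair(corr):
    # Keep only the upper triangle (k=1 drops the diagonal of 1.0 values),
    # then return the (row, column) pair with the largest absolute correlation.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs.abs().idxmax()


# Hypothetical toy frame in which x and y move almost together.
toy_xyz = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8.5], "z": [4, 1, 3, 2]})
best = strongest_pair(toy_xyz.corr())
print(best)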

"""**Histograms**
A histogram shows how often the values of each feature fall within each range (bin).
"""

fig, axes = plt.subplots(2, 2, figsize=(16,9))
axes[0,0].set_title("Distribution of Sepal Width")
axes[0,0].hist(iris_df['sepal_width'], bins=5);
axes[0,1].set_title("Distribution of Sepal Length")
axes[0,1].hist(iris_df['sepal_length'], bins=7);
axes[1,0].set_title("Distribution of Petal Width")
axes[1,0].hist(iris_df['petal_width'], bins=5);
axes[1,1].set_title("Distribution of Petal Length")
axes[1,1].hist(iris_df['petal_length'], bins=6);
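
"""The counts behind a histogram can be inspected directly with `np.histogram`, which returns the number of values per bin and the bin edges. A minimal sketch: `bin_counts` is a hypothetical wrapper (not in the original notebook) and `toy_values` is made-up data."""

import numpy as np


def bin_counts(values, bins):
    # Split the value range into `bins` equal-width bins and count the
    # values that fall into each one.
    counts, edges = np.histogram(values, bins=bins)
    return counts, edges


# Hypothetical toy values; on the real data the call would be
# bin_counts(iris_df["sepal_width"], bins=5).
toy_values = [1, 2, 2, 3, 3, 3, 4]
counts, edges = bin_counts(toy_values, bins=3)
print(counts, edges)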

"""**Univariate Analysis of our columns**"""

sns.FacetGrid(iris_df, hue="species", height=5).map(sns.histplot, "petal_width", kde=True, stat="density").add_legend();

"""Setosa is easily separable, while versicolor and virginica partly overlap."""

sns.FacetGrid(iris_df, hue="species", height=5).map(sns.histplot, "petal_length", kde=True, stat="density").add_legend();
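
"""Separability seen in the distribution plots can be checked by comparing the value range of a feature per species: if one species' maximum lies below another's minimum, a single threshold separates them. A minimal sketch: `species_ranges` is a hypothetical helper (not in the original notebook) and `toy_lengths` is made-up data."""

import pandas as pd


def species_ranges(df, feature, target):
    # Minimum and maximum of one feature within each class.
    return df.groupby(target)[feature].agg(["min", "max"])


# Hypothetical toy frame; on the real data the call would be
# species_ranges(iris_df, "petal_length", "species").
toy_lengths = pd.DataFrame({
    "petal_length": [1.3, 1.7, 3.9, 5.0, 4.8, 6.1],
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
})
ranges = species_ranges(toy_lengths, "petal_length", "species")
print(ranges)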

"""Again we see that on the basis of petal length setosa is separable while the other two are still overlapping."""

sns.FacetGrid(iris_df, hue="species", height=5).map(sns.histplot, "sepal_length", kde=True, stat="density").add_legend();

"""We see it is quite tough to separate the species on the basis of sepal_length alone."""

sns.FacetGrid(iris_df, hue="species", height=5).map(sns.histplot, "sepal_width", kde=True, stat="density").add_legend();

"""The overlap between species is even stronger for sepal_width.

**Box Plot**
"""

fig, axes = plt.subplots(2, 2, figsize=(16,9))
sns.boxplot(y="petal_width", x="species", data=iris_df, orient='v', ax=axes[0, 0])
sns.boxplot(y="petal_length", x="species", data=iris_df, orient='v', ax=axes[0, 1])
sns.boxplot(y="sepal_length", x="species", data=iris_df, orient='v', ax=axes[1, 0])
# The fourth panel shows sepal_width (the original repeated sepal_length)
sns.boxplot(y="sepal_width", x="species", data=iris_df, orient='v', ax=axes[1, 1])
plt.show()

"""1. The box plots show that setosa usually has smaller features, with a few outliers.
2. Versicolor has intermediate feature values.
3. Virginica has the largest feature lengths and widths compared to the others.

**Violin Plot**
"""

fig, axes = plt.subplots(2, 2, figsize=(16,9))
sns.violinplot(y="petal_width", x="species", data=iris_df, orient='v', ax=axes[0, 0])
sns.violinplot(y="petal_length", x="species", data=iris_df, orient='v', ax=axes[0, 1])
sns.violinplot(y="sepal_length", x="species", data=iris_df, orient='v', ax=axes[1, 0])
# The fourth panel shows sepal_width (the original repeated sepal_length)
sns.violinplot(y="sepal_width", x="species", data=iris_df, orient='v', ax=axes[1, 1])
plt.show()
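
"""The outliers noted in the box plots can be listed numerically. With default settings, seaborn box plots draw points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers; the sketch below applies the same rule. `iqr_outliers` is a hypothetical helper (not in the original notebook) and `toy_series` is made-up data."""

import pandas as pd


def iqr_outliers(series, whis=1.5):
    # Values further than whis * IQR below the first quartile or above
    # the third quartile are returned as outliers.
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return series[(series < lower) | (series > upper)]


# Hypothetical toy series; on the real data a per-species call would be
# iris_df.groupby("species")["sepal_width"].apply(iqr_outliers).
toy_series = pd.Series([1, 2, 3, 4, 5, 100])
outliers = iqr_outliers(toy_series)
print(outliers)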

"""#**Conclusion**


1. The dataset is balanced, i.e. each of the three species has the same number of records.
2. There are four numerical columns and one categorical column, which is the target column.
3. Petal width and petal length are strongly correlated.
4. Setosa is the most easily distinguishable species because of its small feature sizes.
5. Versicolor and virginica overlap and are sometimes hard to separate, although versicolor usually has intermediate feature sizes and virginica larger ones.


"""