Data visualization for beginners – Part 1
This is a series of blogs dedicated to different data visualization techniques used in various domains of machine learning. Data Visualization is a critical step for building a powerful and efficient machine learning model. It helps us to better understand the data, generate better insights for feature engineering, and, finally, make better decisions during modeling and training of the model.
For this blog, we will use the seaborn and matplotlib libraries to generate the visualizations. Matplotlib is a MATLAB-like plotting framework for Python, while seaborn is a Python visualization library built on top of matplotlib; it provides a high-level interface for producing statistical graphics. In this blog, we will explore different statistical graphical techniques that can help us effectively interpret and understand the data. Although every plot built with seaborn can also be built with matplotlib directly, we usually prefer seaborn because of its ability to handle DataFrames.
We will start by importing the two libraries. Here is the guide to installing the matplotlib library and seaborn library. (Note that I’ll be using matplotlib and seaborn libraries interchangeably depending on the plot.)
### Importing the necessary libraries
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Simple Plot
Let’s begin with a simple line plot, which is used to plot the relationship or dependence of one variable on another. Say we have two variables ‘x’ and ‘y’ with the following values:
x = np.array([ 0, 0.53, 1.05, 1.58, 2.11, 2.63, 3.16, 3.68, 4.21,
4.74, 5.26, 5.79, 6.32, 6.84])
y = np.array([ 0, 0.51, 0.87, 1. , 0.86, 0.49, -0.02, -0.51, -0.88,
-1. , -0.85, -0.47, 0.04, 0.53])
To plot the relationship between the two variables, we can simply call the plot function.
### Creating a figure to plot the graph.
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel('X data')
ax.set_ylabel('Y data')
ax.set_title('Relationship between variables X and Y')
plt.show() # display the graph
### If %matplotlib inline has been invoked, the plot is rendered inline automatically and plt.show() is optional.
Here, we can see that the variables ‘x’ and ‘y’ have a sinusoidal relationship. Generally, the .plot() function is used to visualize the mathematical relationship between variables.
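Incidentally, the sample values above are consistent with y = sin(x). A quick numeric check (a sketch, assuming the listed values were rounded to two decimal places) confirms this:

```python
import numpy as np

x = np.array([0, 0.53, 1.05, 1.58, 2.11, 2.63, 3.16, 3.68, 4.21,
              4.74, 5.26, 5.79, 6.32, 6.84])
y = np.array([0, 0.51, 0.87, 1., 0.86, 0.49, -0.02, -0.51, -0.88,
              -1., -0.85, -0.47, 0.04, 0.53])

# Maximum deviation between y and sin(x) is within rounding error
print(np.max(np.abs(np.sin(x) - y)))
```

This is why the line plot traces one full period of a sine wave.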
Histogram
A histogram is one of the most frequently used data visualization techniques in machine learning. It represents the distribution of a continuous variable by dividing its range into intervals called ‘bins’ and counting the observations that fall into each one. It is used to inspect the underlying frequency distribution (e.g., normal distribution), outliers, skewness, etc.
Let’s assume some data ‘x’ and analyze its distribution and other related features.
### Let 'x' be the data with 1000 random points.
x = np.random.randn(1000)
Let’s plot a histogram to analyze the distribution of ‘x’.
plt.hist(x)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Distribution of the variable x')
plt.show()
The above plot shows a roughly normal (bell-shaped) distribution, i.e., the variable ‘x’ is approximately normally distributed; in this particular sample it appears slightly negatively skewed. We usually tune the ‘bins’ parameter to produce a distribution with smooth boundaries. For example, if we set the number of ‘bins’ too low, say bins=5, then most of the values accumulate in the same few intervals, producing a distribution that is hard to interpret.
plt.hist(x, bins=5)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Distribution of the variable x')
plt.show()
Similarly, if we increase the number of ‘bins’ to a very high value, say bins=1000, most bins hold only one or two values, and the distribution looks noisy and random.
plt.hist(x, bins=1000)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Distribution of the variable x')
plt.show()
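Under the hood, plt.hist uses the same binning as np.histogram, so we can inspect the counts directly. A small sketch (the variable names here are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

# np.histogram returns the per-bin counts and the bin edges plt.hist would draw
counts_5, edges_5 = np.histogram(x, bins=5)
counts_50, edges_50 = np.histogram(x, bins=50)

# Every sample falls into exactly one bin, whatever the bin count
print(counts_5.sum(), counts_50.sum())
```

With 5 bins, most of the 1,000 samples pile into the central intervals; with 50, the bell shape is much easier to see.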
Kernel Density Estimation
Before we dive into understanding KDE, let’s understand what parametric and non-parametric data are.
Parametric Data: When the data is assumed to have been drawn from a particular distribution and some parametric test can be applied to it
Non-Parametric Data: When we have no knowledge about the population and the underlying distribution
Kernel density estimation (KDE) is a non-parametric way of representing the probability distribution function of a random variable. It is used when a parametric distribution doesn’t make much sense for the data and you want to avoid making strong assumptions about it.
The kernel density estimator is the estimated pdf of a random variable. It is defined as
f̂(t) = (1 / (n·h)) · Σᵢ K((t − xᵢ) / h), summing over the n observations xᵢ,
where K is the kernel (a non-negative function that integrates to one, often the standard normal pdf) and h > 0 is the bandwidth, a smoothing parameter.
Similar to a histogram, a KDE plot shows the values of the variable along one axis and the estimated density along the other.
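To make the estimator concrete, here is a minimal hand-rolled sketch of a Gaussian KDE, assuming a rule-of-thumb bandwidth (for illustration only; seaborn chooses its own bandwidth internally):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
n = len(x)

# Rule-of-thumb bandwidth for roughly Gaussian data
h = 1.06 * x.std(ddof=1) * n ** (-1 / 5)

grid = np.linspace(-4, 4, 400)
# f_hat(t) = (1/(n*h)) * sum_i K((t - x_i)/h), with K the standard normal pdf
kernel = np.exp(-0.5 * ((grid[:, None] - x) / h) ** 2) / np.sqrt(2 * np.pi)
density = kernel.sum(axis=1) / (n * h)

# A valid density is non-negative and integrates to approximately 1
print((density * (grid[1] - grid[0])).sum())
```

Plotting `density` against `grid` reproduces the smooth curve that sns.kdeplot draws.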
### We will use the seaborn library to plot KDE.
### Let's assume random data stored in variable 'x'.
fig, ax = plt.subplots()
### Generating random data
x = np.random.rand(200)
### In seaborn versions older than 0.11, use shade=True instead of fill=True.
sns.kdeplot(x, fill=True, ax=ax)
plt.show()
Distplot combines the functions of the histogram and the KDE plot into one figure.
### Generating a random sample
x = np.random.random_sample(1000)
### Plotting the distplot
sns.distplot(x, bins=20)
So, the distplot function plots the histogram and the KDE for the sample data in the same figure. You can tune its parameters to display only the histogram, only the KDE, or both. Distplot comes in handy when you want to visualize how close your assumption about the distribution of the data is to the actual distribution. (Note that in recent seaborn versions, distplot has been deprecated in favor of histplot and displot.)
Scatter Plot
Scatter plots are used to determine the relationship between two variables. They show how much one variable is affected by another. It is the most commonly used data visualization technique and helps in drawing useful insights when comparing two variables. The relationship between two variables is called correlation. If the data points fit a line or curve with a positive slope, then the two variables are said to show positive correlation. If the line or curve has a negative slope, then the variables are said to have a negative correlation.
A perfect positive correlation has a value of 1 and a perfect negative correlation has a value of -1. The closer the value is to 1 or -1, the stronger the relationship between the variables. The closer the value is to 0, the weaker the correlation.
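These correlation values can be computed with np.corrcoef, which returns the Pearson correlation matrix. A quick sketch:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_pos = np.corrcoef(a, 2 * a + 1)[0, 1]   # perfect positive linear relation
r_neg = np.corrcoef(a, -3 * a + 4)[0, 1]  # perfect negative linear relation

print(r_pos, r_neg)  # 1.0 and -1.0 (up to floating-point error)
```

Any exact linear relationship with a positive slope gives r = 1, and with a negative slope gives r = -1.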
For our example, let’s define three variables ‘x’, ‘y’, and ‘z’, where ‘x’ and ‘z’ are randomly generated data and ‘y’ is defined as y = x * (z + x).
We will use a scatter plot to find the relationship between the variables ‘x’ and ‘y’.
### Let's define the variables we want to find the relationship between.
x = np.random.rand(500)
z = np.random.rand(500)
### Defining the variable 'y'
y = x * (z + x)
fig, ax = plt.subplots()
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_title('Scatter plot between X and Y')
plt.scatter(x, y, marker='.')
plt.show()
From the figure above, we can see that the data points lie close together, and a curve fitted through them would have a positive slope. Therefore, we can infer a strong positive correlation between the variables ‘x’ and ‘y’.
We can also see that the curve that best fits the points is quadratic in nature, which can be confirmed by looking at the definition of the variable ‘y’.
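We can confirm the quadratic shape numerically with np.polyfit (a sketch, with a fixed seed of my own choosing): since y = x·(z + x) = x² + x·z and z averages 0.5, the fitted coefficients should come out close to 1, 0.5, and 0.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.random(500)
z = rng.random(500)
y = x * (z + x)

# Fit y ~ c2*x**2 + c1*x + c0
c2, c1, c0 = np.polyfit(x, y, deg=2)
print(round(c2, 2), round(c1, 2), round(c0, 2))
```

The randomness in ‘z’ only perturbs the estimates slightly; the quadratic term dominates.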
Joint Plot
Jointplot is specific to the seaborn library. It can be used to quickly visualize and analyze the relationship between two variables and to describe their individual distributions on the same plot.
Let’s start with using joint plot for producing the scatter plot.
### Defining the data.
mean, covar = [0, 1], [[1, 0], [0, 50]]
### Drawing random samples from a multivariate normal distribution.
### Two random variables are created, each containing 500 values, with the given mean and covariance.
data = np.random.multivariate_normal(mean, covar, 500)
### Storing the variables in a dataframe.
df = pd.DataFrame(data=data, columns=['X', 'Y'])
### Joint plot between X and Y
sns.jointplot(x='X', y='Y', data=df, kind='scatter')
plt.show()
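As a sanity check (a sketch with a fixed seed, not part of the original post), the sample mean and covariance of the drawn data should be close to the parameters we asked for:

```python
import numpy as np

mean, covar = [0, 1], [[1, 0], [0, 50]]
rng = np.random.default_rng(0)
data = rng.multivariate_normal(mean, covar, 500)

print(data.mean(axis=0))  # roughly [0, 1]
print(np.cov(data.T))     # roughly [[1, 0], [0, 50]]
```

The large variance of Y (50) relative to X (1) is what stretches the joint plot vertically.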
Next, we can use the joint plot to find the best line or curve that fits the plot.
sns.jointplot(x='X', y='Y', data=df, kind='reg')
plt.show()
Apart from this, jointplot can also be used to draw a KDE (kind='kde'), a hexbin plot (kind='hex'), or a residual plot (kind='resid').
PairPlot
We can use a scatter plot to visualize the relationship between two variables. But what if the dataset has more than two variables, as is quite often the case? Visualizing each variable’s relationship with every other variable then becomes a tedious task.
The seaborn pairplot function does this for us in just one line of code. It plots multiple pairwise bivariate (two-variable) distributions in a dataset: it creates a matrix of plots showing the relationship for each pair of columns, and it draws a univariate distribution for each variable on the diagonal axes.
### Loading a dataset from the sklearn toy datasets
from sklearn.datasets import load_linnerud
### Loading the data
linnerud_data = load_linnerud()
### Extracting the column data
data = linnerud_data.data
Sklearn returns the data as a NumPy array rather than a DataFrame, so we convert it into a dataframe.
### Creating a dataframe
data = pd.DataFrame(data=data, columns=linnerud_data.feature_names)
### Plotting a pairplot
sns.pairplot(data=data)
So, in the graph above, we can see each variable’s relationship with every other variable, and thus infer which variables are most correlated.
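To quantify what the pairplot shows, you can compute the correlation matrix with DataFrame.corr(). A small sketch using synthetic data standing in for the Linnerud columns (the strong and weak relationships below are constructed by hand, purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'Chins': rng.random(100)})
df['Situps'] = 2 * df['Chins'] + 0.1 * rng.random(100)  # strongly related
df['Jumps'] = rng.random(100)                           # unrelated noise

# Pearson correlation for every pair of columns
print(df.corr().round(2))
```

Pairs whose scatter panels look like tight lines in the pairplot will have correlations near ±1 here, while cloud-like panels sit near 0.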
Conclusion
Visualizations play an important role in data analysis and exploration. In this blog, we introduced different kinds of plots used for analyzing continuous variables. Next week, we will explore the various data visualization techniques that can be applied to categorical variables or variables with discrete values. Until then, I encourage you to download the iris dataset, or any other dataset of your choice, and apply and explore the techniques covered in this blog.
Have anything to say? Feel free to comment below for any questions, suggestions, and discussions related to this article. Till then, Sayōnara.