The best way to learn a new skill is by doing it!
This article is meant to help R users enhance their set of skills and learn Python for data science (from scratch). After all, R and Python are the most important programming languages a data scientist must know.
Python is a supremely powerful, multi-purpose programming language. It has grown phenomenally in the last few years. It is used for web development, game development, and now data analysis / machine learning. Data analysis and machine learning are a relatively new branch of python's ecosystem.
For a beginner in data science, learning python for data analysis can be really painful. Why?
Try Googling “learn python,” and you'll get tons of tutorials meant only for learning python for web development. How do you find your way then?
In this tutorial, we'll explore the basics of python for performing data manipulation tasks. Alongside, we'll also look at how you do the same in R. This parallel comparison will help you relate the tasks you do in R to how you do them in python! And in the end, we'll take up a data set and practice our newly acquired python skills.
Note: This article is best suited for people who have a basic knowledge of R language.
No doubt, R is tremendously great at what it does. In fact, it was originally designed for doing statistical computing and manipulations. Its incredible community support allows a beginner to learn R quickly.
But, python is catching up fast. Established companies and startups have embraced python at a much larger scale compared to R.
According to indeed.com, from January 2016 to November 2016 the number of job postings seeking “machine learning python” grew much faster (approx. 123%) than those seeking “machine learning R”. Do you know why?
These programming languages understand the complexity of a data set based on its variables and data types. Yes! Let's say you have a data set with one million rows and 50 columns. How would these programming languages understand the data?
Basically, both R and Python have pre-defined data types. The dependent and independent variables get classified among these data types. And, based on the data type, the interpreter allots memory for use. Python supports the following data types:

1. Integers – whole numbers. In R, these are stored as the integer type. (R uses the bit64 package to read hexadecimal values.)
2. Floats – decimal numbers. In R, these are stored as the numeric type.
3. Strings – text values. In R, these are stored as the factor type or the character type.
4. Booleans – logical values. There exists a tiny difference between Boolean values in R and python: in R, they are stored as TRUE and FALSE; in python, they are stored as True and False. Mind the letter case.

Since R is a statistical computing language, all the functions to manipulate data and read variables are available inherently. Python, on the other hand, pulls all its data analysis / manipulation / visualization functions from external libraries. Python has several libraries for data manipulation and machine learning. The most important ones are Numpy, Pandas, Matplotlib, and Scikit-learn.
In a way, python for a data scientist is largely about mastering the libraries stated above. However, there are many more advanced libraries which people have started using. Therefore, for practical purposes you should remember the following things:
1. Python's basic container is the list. It can be multidimensional. It can contain data of the same or multiple classes; in case of multiple classes, the coercion effect takes place.
2. To create a data frame, R uses the data.frame function and python uses the DataFrame function from the pandas library.
3. To create a matrix, R uses the matrix function. In python, we use the numpy.column_stack function.

Until here, I hope you've understood the basics of data types and data structures in R and Python. Now, let's start working with them!
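Before we install anything, you can already check these type mappings from the python side. A minimal sketch using only the built-in type() function:

```python
# Inspect python's basic data types with the built-in type() function
print(type(7))         # int
print(type(7.5))       # float
print(type("seven"))   # str
print(type(True))      # bool — note the spelling: True, not TRUE as in R
```

The R counterparts would be class(7L), class(7.5), class("seven"), and class(TRUE).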
Let's use the knowledge gained in the previous section and understand its practical implications. But before that, you should install python using anaconda's jupyter notebook (previously called IPython notebook). You can download it here. Also, you can download other python IDEs for data analysis. I hope you already have R Studio installed on your laptop.
In R, lists are created using the base list
function:
my_list <- list('monday','specter',24,TRUE)
typeof(my_list)
[1] "list"
In Python, lists are created using square brackets:
my_list = ['monday','specter',24,True]
type(my_list)
list
You can get the same output using the pandas library as well. In pandas, such a one-dimensional labelled array is called a Series. To load pandas in python, write:
#importing pandas library as pd notation (you can use any notation)
import pandas as pd
pd_list = pd.Series(my_list)
pd_list
0     monday
1    specter
2         24
3       True
dtype: object
The numbers (0,1,2,3) denote array indexing. Did you notice anything? Python is a zero-based indexing language, whereas indexing in R starts from 1. Let’s proceed and understand the difference between list subsetting in R and Python.
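Before moving on to subsetting, a tiny sketch of the indexing difference, using the same my_list as above accessed the python way:

```python
my_list = ['monday', 'specter', 24, True]

# python counts from 0, so index 0 is the FIRST element
print(my_list[0])    # 'monday' — in R, my_list[[1]] would give this
# negative indices count backwards from the end (no direct R equivalent)
print(my_list[-1])   # True
```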
#create a list
new_list <- list(roll_number = 1:10, Start_Name = LETTERS[1:10])
Think of new_list as a train. This train has two coaches named roll_number and Start_Name. In each of these coaches, there are 10 people. So, in list subsetting, we can extract the coaches, the people sitting in the coaches, etc.
#extract first coach information
new_list[1] #or
new_list['roll_number']

$roll_number
[1] 1 2 3 4 5 6 7 8 9 10
#extract only people sitting in first coach
new_list[[1]] #or
new_list$roll_number

[1] 1 2 3 4 5 6 7 8 9 10
If you check the type of new_list[1], you'll find that it's a list, whereas new_list[[1]] is an integer vector. Similarly, in python, you can extract list components like this:
#create a new list
new_list = pd.Series({'Roll_number' : list(range(1,10)), 'Start_Name' : list(map(chr, range(65,70)))})
Roll_number [1, 2, 3, 4, 5, 6, 7, 8, 9]
Start_Name [A, B, C, D, E]
dtype: object
#extracting first coach
new_list[['Roll_number']] #or
new_list[[0]]

Roll_number    [1, 2, 3, 4, 5, 6, 7, 8, 9]
dtype: object
#extract people sitting in first coach
new_list['Roll_number'] #or
new_list.Roll_number

[1, 2, 3, 4, 5, 6, 7, 8, 9]
There's a confusing difference in list indexing between R and Python. As you may have noticed, [[ ]] extracts the elements of a coach in R, whereas in python [[ ]] extracts the coach itself.
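To make the bracket difference concrete on the python side, here is a minimal sketch with a pandas Series (assuming pandas is installed): single brackets return the stored element itself, double brackets return a sub-Series containing it.

```python
import pandas as pd

s = pd.Series({'Roll_number': [1, 2, 3]})

# single [ ] pulls out the stored element (here, a plain python list)
print(type(s['Roll_number']))    # <class 'list'>
# double [[ ]] keeps the container — you get back a (sub-)Series
print(type(s[['Roll_number']]))  # <class 'pandas.core.series.Series'>
```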
A matrix is a 2D structure created by a combination of vectors (or arrays). Generally, a matrix contains elements of the same class. However, even if you mix up elements from different classes (string, boolean, numeric etc.), it will still work. The method of subsetting a matrix is quite similar except for the indexing number. To reiterate, python indexing starts with 0 and R indexing starts with 1.
In R, a matrix can be created as:
my_mat <- matrix(1:10,nrow = 5)
my_mat
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
Subsetting a matrix is really easy.
#to select first row
my_mat[1,]
#to select second column
my_mat[,2]
In Python, we’ll take the help of numpy arrays to create a matrix. Therefore, first we’ll load the numpy library.
import numpy as np
a=np.array(range(10,15))
b=np.array(range(20,25))
c=np.array(range(30,35))
my_mat = np.column_stack([a,b,c])
#to select first row
my_mat[0,]
#to select second column
my_mat[:,1]
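Running the snippet above, the resulting shape and slices are worth checking — a quick sketch:

```python
import numpy as np

a = np.array(range(10, 15))
b = np.array(range(20, 25))
c = np.array(range(30, 35))

# column_stack places each array as a column: 5 rows x 3 columns
my_mat = np.column_stack([a, b, c])
print(my_mat.shape)    # (5, 3)
print(my_mat[0, :])    # first row     -> [10 20 30]
print(my_mat[:, 1])    # second column -> [20 21 22 23 24]
```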
Data frames provide a much-needed skeleton to the loosely collected data from multiple sources. It's a spreadsheet-like structure that gives a data scientist a nice picture of how the data set looks. In R, we can create a data frame using the data.frame()
function:
data_set <- data.frame(Name = c("Sam","Paul","Tracy","Peter"),
Hair_Colour = c("Brown","White","Black","Black"),
Score = c(45,89,34,39))
So, we know that a dataframe is created from a collection of vectors (or lists). To create a data frame in python, we'll create a dictionary (a collection of arrays) and pass it to the DataFrame function from the pandas library.
data_set = pd.DataFrame({'Name' : ["Sam","Paul","Tracy","Peter"],
'Hair_Colour' : ["Brown","White","Black","Black"],
'Score' : [45,89,34,39]})
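A quick sanity check on the frame we just built — its shape and columns (note that older pandas versions sorted dictionary keys alphabetically, so your column order may differ):

```python
import pandas as pd

data_set = pd.DataFrame({'Name': ["Sam", "Paul", "Tracy", "Peter"],
                         'Hair_Colour': ["Brown", "White", "Black", "Black"],
                         'Score': [45, 89, 34, 39]})

print(data_set.shape)            # (4, 3): 4 rows, 3 columns
print(sorted(data_set.columns))  # ['Hair_Colour', 'Name', 'Score']
```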
Now, let’s look at the most crucial aspect of working with dataframe, i.e., subsetting. In fact, most of the data manipulation revolves around slicing and dicing a dataframe from every possible angle. Let’s look at the tasks one by one:
#select first column in R
data_set$Name # or
data_set[["Name"]] #or
data_set[1]
#select first column in Python
data_set['Name'] #or
data_set.Name #or
data_set.iloc[:,0]
#select multiple columns in R
data_set[c('Name','Hair_Colour')] #or
data_set[,c('Name','Hair_Colour')]
#select multiple columns in Python
data_set[['Name','Hair_Colour']] #or
data_set.loc[:,['Name','Hair_Colour']]
The .loc function is used for label-based indexing (its counterpart .iloc is used for position-based indexing).
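A minimal sketch of label-based vs position-based access on the same kind of frame, so the distinction is concrete:

```python
import pandas as pd

data_set = pd.DataFrame({'Name': ["Sam", "Paul", "Tracy", "Peter"],
                         'Score': [45, 89, 34, 39]})

# .loc takes labels: row label 0 (the default index), column label 'Name'
print(data_set.loc[0, 'Name'])   # 'Sam'
# .iloc takes integer positions: row 0, column 1
print(data_set.iloc[0, 1])       # 45
```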
Until here, we’ve understood the skeleton of data types, structures, and formats in R and Python. Let’s now take up a data set and explore various other aspects of exploring data in python.
The wonderful scikit-learn library contains an inbuilt repository of data sets. For practice, we'll be using the Boston housing data set. It's a popular data set used in data analysis.
#import libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
#store in a variable
boston = load_boston()
The variable boston is a dictionary. Just to refresh, a dictionary is a combination of key-value pairs. Let’s look at the key information:
boston.keys()
['data', 'feature_names', 'DESCR', 'target']
Now we know our required data set resides in the key data
. We also see that there is a separate key for feature names. It seems the data set won't have column names attached. Let's check the column names we are going to deal with.
print(boston['feature_names'])
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
Can you understand these names? Me neither. Now, let's check the data description and understand the significance of each variable.
print(boston['DESCR'])
This data set has 506 rows and 13 columns. It comprises various characteristics which help in determining the prices of houses in Boston (U.S.). Now, let’s create the dataframe and start exploring.
bos_data = pd.DataFrame(boston['data'])
Similar to R, python also has a head()
function to peek into data:
bos_data.head()
The output shows that the data set has no column names (as anticipated above). Attributing column names to a dataframe is easy.
bos_data.columns = boston['feature_names']
bos_data.head()
Just like R's dim() function, python has a shape attribute (bos_data.shape — note that it's an attribute, not a function) to check the dimensions of the data set. To get the statistical summary of the data set, we can write:
bos_data.describe()
It shows us column-wise statistical summary of the data. Let’s quickly explore other aspects of this data.
#get first 10 rows
bos_data.iloc[:10]
#select first 5 columns
bos_data.loc[:,'CRIM':'NOX'] #or
bos_data.iloc[:,:5]
#filter columns based on a condition
bos_data.query("CRIM > 0.05 & CHAS == 0")
#sample the data set
bos_data.sample(n=10)
#sort values - default is ascending
bos_data.sort_values(['CRIM']).head() #or
bos_data.sort_values(['CRIM'],ascending=False).head()
#rename a column
bos_data.rename(columns={'CRIM' : 'CRIM_NEW'})
#find mean of selected columns
bos_data[['ZN','RM']].mean()
#transform a numeric data into categorical
bos_data['ZN_Cat'] = pd.cut(bos_data['ZN'],bins=5,labels=['a','b','c','d','e'])
#calculate the mean age for each ZN_Cat category
bos_data.groupby('ZN_Cat')['AGE'].mean()
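The cut-then-groupby pattern above, sketched on toy data so the bins are easy to verify by hand (the values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'ZN':  [0, 20, 40, 60, 80],
                   'AGE': [30, 40, 50, 60, 70]})

# pd.cut slices ZN's range (0..80) into 5 equal-width bins labelled a..e
df['ZN_Cat'] = pd.cut(df['ZN'], bins=5, labels=['a', 'b', 'c', 'd', 'e'])

# each row lands in its own bin, so each per-bin mean equals that row's AGE
print(df.groupby('ZN_Cat')['AGE'].mean())
```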
In addition, python also allows us to create pivot tables. Yes! Just like MS Excel or any other spreadsheet software, you can create a pivot table and understand data more closely. Unfortunately, creating a pivot table in R is quite a complex process. In python, a pivot table requires row names, column names, and the value to be calculated. If we don't pass any column name, the result is just like what you would get using the groupby function. Therefore, let's create another categorical variable.
#create a new categorical variable
bos_data['NEW_AGE'] = pd.cut(bos_data['AGE'],bins=3,labels=['Young','Old','Very_Old'])
#create a pivot table calculating mean DIS per ZN_Cat and NEW_AGE combination
bos_data.pivot_table(values='DIS',index='ZN_Cat',columns='NEW_AGE',aggfunc='mean')
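The same pivot mechanics on a tiny hand-made frame, so each cell is easy to trace (the column names here are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({'zone':    ['A', 'A', 'B', 'B'],
                   'age_grp': ['Young', 'Old', 'Young', 'Old'],
                   'DIS':     [1.0, 2.0, 3.0, 4.0]})

# rows = zone, columns = age_grp, cells = mean DIS for that combination
pt = df.pivot_table(values='DIS', index='zone', columns='age_grp', aggfunc='mean')
print(pt)
```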
This was just the tip of the iceberg. Where to go next? Just like we used the Boston data, now you should work with the iris data. It is available in the sklearn.datasets repository. Try to explore it in depth. Remember, the more you practice and the more time you spend coding, the better you'll become.
While coding in python, I realized that there is not much difference in the amount of code you write, although some functions are shorter in R than in python. However, R has really awesome packages which handle big data quite conveniently. Do let me know if you wish to learn about them!
Overall, learning both the languages would give you enough confidence to handle any type of data set. In fact, the best part about learning python is its comprehensive documentation available on numpy, pandas, and scikit learn libraries, which are sufficient enough to help you overcome all initial obstacles.
In this article, we just touched the basics of python. There's a long way to go. Next week, we'll learn about data manipulation in python in detail. After that, we'll look into data visualization and the powerful machine learning libraries in python.
Do share your experience, suggestions, and questions below while practicing this tutorial!