How Can R Users Learn Python for Data Science?

Manish Saraswat

Author

11 mins read

January 12, 2017

Introduction

The best way to learn a new skill is by doing it!

This article is meant to help R users broaden their skills and learn Python for data science (from scratch). After all, R and Python are the most important programming languages a data scientist must know.

Python is a supremely powerful, multi-purpose programming language. It has grown phenomenally in the last few years. It is used for web development, game development, and now data analysis / machine learning. Data analysis and machine learning are relatively new branches in Python.

For a beginner in data science, learning Python for data analysis can be really painful. Why?

Try Googling “learn python,” and you’ll get tons of tutorials meant only for learning Python for web development. How can you find a way then?

In this tutorial, we’ll explore the basics of Python for performing data manipulation tasks. Alongside, we’ll also look at how you do the same thing in R. This parallel comparison will help you relate the tasks you do in R to how you do them in Python! In the end, we’ll take up a data set and practice our newly acquired Python skills.

Note: This article is best suited for people who have a basic knowledge of the R language.

Why learn Python (even if you already know R)
Understanding Data Types and Structures in Python vs. R
Writing Code in Python vs. R
Practicing Python on a Data Set

Why learn Python (even if you already know R)

No doubt, R is tremendously great at what it does. In fact, it was originally designed for statistical computing and data manipulation. Its incredible community support allows a beginner to learn R quickly.

But Python is catching up fast. Established companies and startups have embraced Python at a much larger scale compared to R.

[Figure: job posting trends, R machine learning vs. Python machine learning]

According to indeed.com (from Jan 2016 to November 2016), the number of job postings seeking “machine learning python” increased much faster (approx. 123%) than “machine learning in R” jobs. Do you know why? It is because:

- Python supports the entire spectrum of machine learning in a much better way.
- Python not only supports model building but also supports model deployment.
- Powerful deep learning libraries such as keras, convnet, theano, and tensorflow are better supported in Python than in R.
- You don’t need to juggle between several packages to locate a function in Python, unlike in R. Python has relatively fewer libraries, each containing all the functions a data scientist would need.

Understanding Data Types and Structures in Python vs. R

These programming languages understand the complexity of a data set based on its variables and data types. Yes! Let’s say you have a data set with one million rows and 50 columns. How would these programming languages understand the data?

Basically, both R and Python have pre-defined data types. The dependent and independent variables get classified among these data types. And, based on the data type, the interpreter allots memory for use. Python supports the following data types:

Numbers – It stores numeric values. These numeric values can be stored in 4 types: integer, long, float, and complex. Let’s understand them.
- Integer – It refers to whole numbers such as 10, 13, 91, 102, etc. It is the same as R’s integer type.
- Long – It refers to long integers of unlimited size, which can also be written in octal and hexadecimal notation. It is a separate type only in Python 2; Python 3 merges it into the integer type. In R, the bit64 package provides 64-bit integers.
- Float – It refers to decimal values such as 1.23, 9.89, etc. It is the same as R’s numeric type.
- Complex – It refers to complex numbers such as 2 + 3i, 5i, etc. (Python writes the imaginary unit as j, e.g., 2 + 3j). However, this data type is rarely found in data.
Boolean – It stores two values (True and False). In R, the equivalent is the logical type. There is a tiny difference between Boolean values in R and Python: in R they are written TRUE and FALSE, whereas in Python they are written True and False. Only the letter case differs.
Strings – It stores text (character) data such as “elephant,” “lotus,” etc. It is the same as R’s character type.
Lists – It is the same as R’s list data type. It is capable of storing values of multiple variable types such as string, integer, Boolean, etc.
Tuples – There is nothing like tuples in R. Think of a tuple as an R vector whose values can’t be changed; i.e., it is immutable.
Dictionary – It provides a structure which stores key : value pairs. In simple words, think of a key as a column name, and its value as the column values.
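
To make the list above concrete, here is a minimal sketch that can be pasted into a Python 3 session. The variable names and sample values are made up for illustration, and in Python 3 there is no separate long type, so it is not shown.

```python
# Illustrative values, one per data type described above (names are hypothetical)
samples = [
    10,                            # integer
    1.23,                          # float
    2 + 3j,                        # complex (Python uses j, not i)
    True,                          # Boolean
    "elephant",                    # string
    ['monday', 24, True],          # list: mixed types are allowed
    ('a', 'b', 'c'),               # tuple: like a list, but immutable
    {'Name': 'Sam', 'Score': 45},  # dictionary: key : value pairs
]

# Collect the name of each value's type
type_names = [type(value).__name__ for value in samples]
print(type_names)

# Tuples cannot be modified after creation; an attempt raises TypeError
fixed = ('a', 'b', 'c')
try:
    fixed[0] = 'z'
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True
print(tuple_is_immutable)
```

The built-in type() function plays roughly the role that class() or typeof() plays in R.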

Since R is a statistical computing language, all the functions for manipulating data and reading variables are available inherently. On the other hand, Python pulls all its data analysis / manipulation / visualization functions from external libraries. Python has several libraries for data manipulation and machine learning. The most important ones are:

Numpy – It is used for doing numerical computing in Python. It provides access to numerous mathematical functions such as linear algebra, statistics, etc. It is largely used to create arrays. In R, think of a NumPy array as a vector or array: it holds values of one class (numeric, string, or boolean), and if you pass values of multiple classes they are coerced to a common class. It can be unidimensional or multidimensional.
Scipy – It is used for doing scientific computing in python.
Matplotlib – It is used for doing data visualization in python. For R, we use the famous ggplot2 library.
Pandas – It is the powerhouse for doing data manipulation tasks. In R, we use packages like dplyr, data.table etc.
Scikit Learn – It is the powerhouse for implementing machine learning algorithms. In fact, it’s the best part about doing machine learning in python. It contains all the functions you would require for model building.

In a way, python for a data scientist is largely about mastering the libraries stated above. However, there are many more advanced libraries which people have started using. Therefore, for practical purposes you should remember the following things:

Array – This is similar to R’s vector or array. It can be multidimensional. It holds data of a single class; if you supply multiple classes, the coercion effect takes place.
List – This is similar to R’s list.
Data Frame – It’s a two-dimensional structure comprising several lists. R has a built-in function data.frame, and Python uses the DataFrame function from the pandas library.
Matrix – It’s a two (or multi) dimensional structure comprising values of the same class (values of multiple classes are coerced). Think of a matrix as a 2D-version of a vector. In R, we use the matrix function. In Python, we use the numpy.column_stack function.
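
The coercion effect mentioned for arrays can be seen in a short sketch. The sample values below are made up; the behaviour shown is what a recent NumPy version does when an array is built from values of different classes.

```python
import numpy as np

# Integer and float together: everything becomes float
numbers = np.array([1, 2.5])

# Number and string together: everything becomes a string
mixed = np.array([1, 'a'])

print(numbers.dtype)   # a floating-point dtype
print(mixed.tolist())  # the integer 1 is now the string '1'
```

This mirrors R, where c(1, "a") also yields a character vector.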

Until here, I hope you’ve understood the basics of data types and data structures in R and Python. Now, let’s start working with them!

Writing Code in Python vs. R

Let’s use the knowledge gained in the previous section and understand its practical implications. But before that, you should install Python through Anaconda, which includes Jupyter Notebook (previously called IPython Notebook). You can download it here. You can also download other Python IDEs for data analysis. I hope you already have RStudio installed on your laptop.

1. Creating Lists

In R, lists are created using the base list function:

my_list <- list ('monday','specter',24,TRUE)
typeof(my_list)
[1] "list"

In Python, lists are created using square brackets:

my_list = ['monday','specter',24,True]
type(my_list)
list

You can get the same output using the pandas library also. In pandas, the list-like one-dimensional structure is known as a Series. To load pandas in Python, write:

#importing pandas library as pd notation (you can use any notation)
import pandas as pd
pd_list = pd.Series(my_list)
pd_list

0     monday
1    specter
2         24
3       True

The numbers (0,1,2,3) are the index positions. Did you notice anything? Python is a zero-based indexing language, whereas indexing in R starts from 1. Let’s proceed and understand the difference between list subsetting in R and Python. First, in R:

#create a list
new_list <- list(roll_number = 1:10, Start_Name = LETTERS[1:10])

Think of new_list as a train. This train has two coaches named roll_number and Start_Name. In each of these coaches, there are 10 people. So, in list subsetting, we can extract the value of coaches, people sitting in the coaches, etc.

#extract first coach information
new_list[1] #or new_list['roll_number']
$roll_number
[1] 1 2 3 4 5 6 7 8 9 10

#extract only people sitting in first coach
new_list[[1]] #or new_list$roll_number
#[1] 1 2 3 4 5 6 7 8 9 10

If you check the type of new_list[1], you’ll find that it’s a list, whereas the type of new_list[[1]] is integer. Similarly, in Python, you can extract list components like this:

#create a new list (list() is needed in Python 3, where range and map are lazy)
new_list = pd.Series({'Roll_number' : list(range(1,10)),
'Start_Name' : list(map(chr, range(65,70)))})

Roll_number    [1, 2, 3, 4, 5, 6, 7, 8, 9]
Start_Name     [A, B, C, D, E]
dtype: object

#extracting first coach
new_list[['Roll_number']] #or new_list.iloc[[0]]
Roll_number    [1, 2, 3, 4, 5, 6, 7, 8, 9]
dtype: object

#extract people sitting in first coach
new_list['Roll_number'] #or new_list.Roll_number
[1, 2, 3, 4, 5, 6, 7, 8, 9]

There’s a confusing difference in list indexing between R and Python. As you may have noticed, [[ ]] extracts the elements of a coach in R, whereas [[ ]] extracts the coach itself in Python.
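
The train analogy can be checked with a small sketch. It assumes a recent pandas version under Python 3, and the names coaches, whole_coach, and passengers are made up for illustration.

```python
import pandas as pd

# A Series with two labelled elements, each holding a plain Python list
coaches = pd.Series({
    'Roll_number': list(range(1, 10)),
    'Start_Name': list(map(chr, range(65, 70))),
})

# Double brackets keep the container: the result is still a Series (the coach)
whole_coach = coaches[['Roll_number']]

# Single brackets return the stored object itself (the people in the coach)
passengers = coaches['Roll_number']

print(type(whole_coach).__name__, len(whole_coach))
print(type(passengers).__name__, passengers[0])  # zero-based: position 0 is the first person
```

Position-based access goes through .iloc, so coaches.iloc[0] returns the same list as coaches['Roll_number'].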

2. Matrix

A matrix is a 2D structure created by a combination of vectors (or arrays). Generally, a matrix contains elements of the same class. However, even if you mix up elements from different classes (string, boolean, numeric, etc.), it will still work, because the values are coerced to a common class. The method of subsetting a matrix is quite similar in both languages except for the indexing number. To reiterate, Python indexing starts with 0 and R indexing starts with 1.

In R, a matrix can be created as:

my_mat <- matrix(1:10,nrow = 5)
my_mat

     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

Subsetting a matrix is really easy.

#to select first row
my_mat[1,]

#to select second column
my_mat[,2]

In Python, we’ll take the help of numpy arrays to create a matrix. Therefore, first we’ll load the numpy library.

import numpy as np
a=np.array(range(10,15))
b=np.array(range(20,25))
c=np.array(range(30,35))
my_mat = np.column_stack([a,b,c])

#to select first row
my_mat[0,]

#to select second column
my_mat[:,1]
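
As a quick check of the snippet above, the following sketch rebuilds the same matrix and stores the results of each subsetting step. The names shape, first_row, and second_column are introduced only for illustration.

```python
import numpy as np

# Three 1D arrays become the three columns of the matrix
a = np.array(range(10, 15))
b = np.array(range(20, 25))
c = np.array(range(30, 35))
my_mat = np.column_stack([a, b, c])

shape = my_mat.shape          # (rows, columns), similar to dim() in R
first_row = my_mat[0, :]      # row at position 0
second_column = my_mat[:, 1]  # column at position 1

print(shape)
print(first_row)
print(second_column)
```

Note that the colon is required to select a whole column in NumPy, whereas R lets you leave the row position empty.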

3. Data Frames

Data frames provide a much-needed skeleton to loosely collected data from multiple sources. A data frame is a spreadsheet-like structure which gives a data scientist a nice picture of how the data set looks. In R, we can create a data frame using the data.frame() function:

data_set <- data.frame(Name = c("Sam","Paul","Tracy","Peter"),
Hair_Colour = c("Brown","White","Black","Black"),
Score = c(45,89,34,39))

So, we know that a data frame is created from a collection of vectors (or lists). To create a data frame in Python, we’ll create a dictionary (a collection of arrays) and pass the dictionary to the DataFrame function from the pandas library.

data_set = pd.DataFrame({'Name' : ["Sam","Paul","Tracy","Peter"],
'Hair_Colour' : ["Brown","White","Black","Black"],
'Score' : [45,89,34,39]})

Now, let’s look at the most crucial aspect of working with a data frame, i.e., subsetting. In fact, most data manipulation revolves around slicing and dicing a data frame from every possible angle. Let’s look at the tasks one by one:

#select first column in R
data_set$Name # or
data_set[["Name"]] #or
data_set[1]

#select first column in Python
data_set['Name'] #or
data_set.Name #or
data_set.iloc[:,0]

#select multiple columns in R
data_set[c('Name','Hair_Colour')] #or
data_set[,c('Name','Hair_Colour')]

#select multiple columns in Python
data_set[['Name','Hair_Colour']] #or
data_set.loc[:,['Name','Hair_Colour']]

The .loc indexer is used for label-based indexing, while .iloc is used for position-based indexing.
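
The selection styles above can be compared side by side in a short sketch. It assumes a recent pandas version, where columns keep the order of the dictionary keys; the names one_column, two_columns, and by_position are made up for illustration.

```python
import pandas as pd

data_set = pd.DataFrame({'Name': ["Sam", "Paul", "Tracy", "Peter"],
                         'Hair_Colour': ["Brown", "White", "Black", "Black"],
                         'Score': [45, 89, 34, 39]})

# A single label returns one column as a Series
one_column = data_set['Name']

# A list of labels returns a smaller data frame
two_columns = data_set.loc[:, ['Name', 'Hair_Colour']]

# Position-based selection: column at position 0
by_position = data_set.iloc[:, 0]

print(one_column.tolist())
print(two_columns.shape)
print(by_position.tolist())
```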

So far, we’ve understood the skeleton of data types, structures, and formats in R and Python. Let’s now take up a data set and look at other aspects of exploring data in Python.

Practicing Python on a Data Set

The wonderful scikit-learn library contains a built-in repository of data sets. For practice, we’ll be using the Boston housing data set. It’s a popular data set used in data analysis.

#import libraries
import numpy as np
import pandas as pd
#note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston

#store in a variable
boston = load_boston()

The variable boston is a dictionary-like object. Just to refresh, a dictionary is a combination of key-value pairs. Let’s look at the key information:

boston.keys()
['data', 'feature_names', 'DESCR', 'target']

Now we know our required data set resides in the key data. We also see that there is a separate key for feature names. I suppose the data set will not have column names attributed. Let’s check the column names we are going to deal with.

print(boston['feature_names'])
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']

Can you understand these names? Me neither. Now, let’s check the data description and understand the significance of each variable.

print(boston['DESCR'])

This data set has 506 rows and 13 columns. It comprises various characteristics which help in determining the prices of houses in Boston (U.S.). Now, let’s create the data frame and start exploring.

bos_data = pd.DataFrame(boston['data'])

Similar to R, Python also has a head() function to peek into data:

bos_data.head()

The output shows that the data set has no column names (as anticipated above). Attributing column names to a data frame is easy.

bos_data.columns = boston['feature_names']
bos_data.head()
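
The same two steps can be tried without the Boston data. The sketch below uses a small made-up NumPy array (the names raw, frame, first, and second are hypothetical) to show that a data frame built from a bare array gets integer column labels until names are assigned.

```python
import numpy as np
import pandas as pd

# A 3 x 2 array with no column names attached
raw = np.arange(6).reshape(3, 2)
frame = pd.DataFrame(raw)

# Without names, pandas labels the columns 0, 1, ...
default_columns = list(frame.columns)

# Assign names the same way as for bos_data
frame.columns = ['first', 'second']

print(default_columns)
print(list(frame.columns))
print(frame.shape)  # shape is an attribute, so no parentheses
```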

Just like R’s dim() function, pandas has the shape attribute (bos_data.shape, written without parentheses) to check the dimensions of the data set. To get the statistical summary of the data set, we can write:

bos_data.describe()

It shows us a column-wise statistical summary of the data. Let’s quickly explore other aspects of this data.

#get first 10 rows
bos_data.iloc[:10]

#select first 5 columns
bos_data.loc[:,'CRIM':'NOX'] #or
bos_data.iloc[:,:5]

#filter rows based on a condition
bos_data.query("CRIM > 0.05 & CHAS == 0")

#sample the data set
bos_data.sample(n=10)

#sort values - default is ascending
bos_data.sort_values(['CRIM']).head()
#or sort in descending order
bos_data.sort_values(['CRIM'],ascending=False).head()

#rename a column
bos_data.rename(columns={'CRIM' : 'CRIM_NEW'})

#find mean of selected columns
bos_data[['ZN','RM']].mean()

#transform a numeric data into categorical
bos_data['ZN_Cat'] = pd.cut(bos_data['ZN'],bins=5,labels=['a','b','c','d','e'])

#calculate the mean age for ZN_Cat variable
bos_data.groupby('ZN_Cat')['AGE'].mean()
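
Since load_boston may not be available in your scikit-learn version, here is a minimal sketch of a few of the operations above on a tiny made-up data frame. The values are invented for illustration and only the column names are borrowed from the Boston data.

```python
import pandas as pd

# Toy data: invented values, Boston-style column names
toy = pd.DataFrame({'CRIM': [0.02, 0.10, 0.07, 0.30],
                    'CHAS': [0, 0, 1, 0],
                    'ZN': [0.0, 25.0, 50.0, 100.0],
                    'AGE': [65.0, 78.0, 45.0, 90.0]})

# Keep only the rows that satisfy both conditions
filtered = toy.query("CRIM > 0.05 & CHAS == 0")

# Two rows with the highest CRIM values
top = toy.sort_values(['CRIM'], ascending=False).head(2)

# rename returns a new data frame; toy itself keeps its column names
renamed = toy.rename(columns={'CRIM': 'CRIM_NEW'})

# First two rows and first two columns, selected by position
corner = toy.iloc[:2, :2]

print(filtered)
print(top)
print(list(renamed.columns))
print(corner.shape)
```

To keep a renamed column, assign the result back, for example toy = toy.rename(columns={'CRIM': 'CRIM_NEW'}).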

In addition, Python also allows us to create pivot tables. Yes! Just like MS Excel or any other spreadsheet software, you can create a pivot table and understand data more closely. Unfortunately, creating a pivot table in R is quite a complex process. In Python, a pivot table requires row names, column names, and the value to be calculated. If we don’t pass any column name, the results would be just like what you would get using the groupby function. Therefore, let’s create another categorical variable.

#create a new categorical variable
bos_data['NEW_AGE'] = pd.cut(bos_data['AGE'],bins=3,labels=['Young','Old','Very_Old'])

#create a pivot table calculating mean DIS per ZN_Cat and NEW_AGE
bos_data.pivot_table(values='DIS',index='ZN_Cat',columns= 'NEW_AGE',aggfunc='mean')
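
The relationship between pd.cut, groupby, and pivot_table can be sketched on a tiny made-up data frame. The values are invented for illustration, two bins are used instead of five or three to keep the output small, and the labels low, high, Young, and Old are hypothetical.

```python
import pandas as pd

# Toy data: invented values, Boston-style column names
toy = pd.DataFrame({'ZN': [0, 10, 20, 30, 40, 50],
                    'AGE': [10, 80, 20, 90, 30, 70],
                    'DIS': [1.0, 2.0, 5.0, 4.0, 7.0, 6.0]})

# Turn two numeric columns into categories with equal-width bins
toy['ZN_Cat'] = pd.cut(toy['ZN'], bins=2, labels=['low', 'high'])
toy['NEW_AGE'] = pd.cut(toy['AGE'], bins=2, labels=['Young', 'Old'])

# groupby: one mean per ZN_Cat level
mean_age = toy.groupby('ZN_Cat')['AGE'].mean()

# pivot_table: one mean per combination of ZN_Cat (rows) and NEW_AGE (columns)
pivot = toy.pivot_table(values='DIS', index='ZN_Cat',
                        columns='NEW_AGE', aggfunc='mean')

print(mean_age)
print(pivot)
```

Depending on the pandas version, grouping on a categorical column may print a warning about the observed parameter; it does not change the result here because every category occurs in the data.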

This was just the tip of the iceberg. Where to go next? Just like we used the Boston data, now you should work with the iris data. It is available in the sklearn.datasets module. Try to explore it in depth. Remember, the more you practice and the more time you spend coding, the better you’ll become.

Summary

While coding in Python, I realized that there is not much difference in the amount of code you write, although some functions are shorter in R than in Python. However, R has really awesome packages which handle big data quite conveniently. Do let me know if you wish to learn about them!

Overall, learning both languages would give you enough confidence to handle any type of data set. In fact, the best part about learning Python is the comprehensive documentation available for the numpy, pandas, and scikit-learn libraries, which is sufficient to help you overcome all initial obstacles.

In this article, we just touched the basics of Python. There’s a long way to go. Next week, we’ll learn about data manipulation in Python in detail. After that, we’ll look into data visualization and the powerful machine learning library in Python.

Do share your experience, suggestions, and questions below while practicing this tutorial!
