FE581 – Topics: R [Python and R for the Modern Data Scientist]

Users often forget that Jupyter is short for “Julia, Python, and R” because it’s very Python-centric.

One last note: Python users refer to themselves as Pythonistas, which is a really cool name! There’s no real equivalent in R, and R users also don’t get a really cool animal, but that’s life when you’re a single-letter language. R users are typically called…wait for it…useRs! (Exclamation optional.) Indeed, the official annual conference is called useR! (exclamation obligatory), and the publisher Springer has an ongoing and very excellent series of books of the same name.

A Python:R Bilingual Dictionary


Package Management

Installing a single package


# R
install.packages('tidyverse')


# Python (run in a shell)
pip install pandas

Installing specific package versions


# R
devtools::install_version(
  "ggmap",
  version = "3.5.2"
)


# Python (run in a shell)
pip install pandas==1.1.0
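After pinning a version, it can help to confirm what is actually installed. The following is a minimal sketch using the standard library's `importlib.metadata` (Python 3.8+). The helper name `installed_version` is hypothetical and not part of the original notes.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string, or None if pandas is not installed
print(installed_version("pandas"))
```

Returning `None` rather than raising keeps the check usable in scripts that need to work whether or not the package is present.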

Installing multiple packages


# R
install.packages(c("sf", "ggmap"))


# Python (run in a shell)
pip install pandas scikit-learn seaborn
pip install -r requirements.txt

Loading packages


# Multiple calls to library()
library(MASS)
library(nlme)
library(psych)
library(sf)

# Install if not already available:
if (!require(readr)) {
  install.packages("readr")
  library(readr)
}

# Check, install if necessary, and
# load single or multiple packages:
pacman::p_load(MASS, nlme, psych, sf)


# Full package
import math

# Full package with alias
import pandas as pd

# Module
from sklearn import datasets

# Module with alias
import statsmodels.api as sm

# Function (for ordinary least squares regression)
from statsmodels.formula.api import ols
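Python has no direct counterpart to R's `require()` check. One way to approximate the "load if available" pattern is sketched below with the standard library's `importlib`. The helper name `load_if_available` is hypothetical, and unlike `pacman::p_load` this sketch only reports a missing package; it does not install it.

```python
import importlib
import importlib.util


def load_if_available(module_name):
    """Import and return a top-level module if it can be found, else None."""
    if importlib.util.find_spec(module_name) is None:
        print(f"{module_name} is not installed; try: pip install {module_name}")
        return None
    return importlib.import_module(module_name)


math_mod = load_if_available("math")
print(math_mod.sqrt(16))   # 4.0
```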

Assign Operators

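R has several assignment operators (`<-`, `<<-`, `->`, `->>`, and `=`), which can make R code look inconsistent; Python uses `=` for ordinary assignment.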
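On the Python side, the common assignment forms can be sketched as follows. This is an illustrative summary rather than an exhaustive list, and the variable names are arbitrary.

```python
# Plain assignment
x = 5

# Augmented assignment: update a name in place
x += 2          # x is now 7

# Tuple unpacking: assign several names at once
a, b = 1, 2
a, b = b, a     # swap without a temporary variable

# Assignment expression ("walrus", Python 3.8+): assign inside an expression
if (n := len([10, 20, 30])) > 2:
    print(n)    # prints 3
```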

Types

The four most common user-defined atomic-vector types in R:

Type	Data frame shorthand	Tibble shorthand	Description	Example
Logical	logi	`<lgl>`	Binary data	TRUE/FALSE, T/F, 1/0
Integer	int	`<int>`	Whole numbers from $-\infty$ to $\infty$	7, 9, 2, -4
Double	num	`<dbl>`	Real numbers from $-\infty$ to $\infty$	3.14, 2.78, 6.45
Character	chr	`<chr>`	All alphanumeric characters, including white spaces	“Apple”, “Dog”

The four most common user-defined types in Python:

Type	Shorthand	Description	Example
Boolean	`bool`	Binary data	True/False
Integer	`int`	Whole numbers from $-\infty$ to $\infty$	7, 9, 2, -4
Float	`float`	Real numbers from $-\infty$ to $\infty$	3.14, 2.78, 6.45
String	`str`	All alphanumeric characters, including white spaces	“Apple”, “Dog”
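The shorthand names in the Python table are what `type()` reports. A small sketch, with example values chosen to mirror the table:

```python
values = [True, 7, 3.14, "Apple"]

for value in values:
    # type(value).__name__ gives the shorthand used in the table
    print(repr(value), type(value).__name__)

# bool is a subclass of int, so True behaves like 1 in arithmetic
print(isinstance(True, int))   # True
print(True + True)             # 2
```

The last two lines parallel the `1/0` entry in the R table: logical values can stand in for integers in both languages.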

Arithmetic Operators

Common arithmetic operators:

Description	R Operator	Python Operator
Addition	`+`	`+`
Subtraction	`-`	`-`
Multiplication	`*`	`*`
Division (float)	`/`	`/`
Exponentiation	`^` or `**`	`**`
Integer Division (floor)	`%/%`	`//`
Modulus	`%%`	`%`
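The Python column can be checked directly. The sketch below also flags one trap for R users: `^` exists in Python but is bitwise XOR, not exponentiation.

```python
print(7 / 2)     # 3.5  (true division always returns a float)
print(7 // 2)    # 3    (floor division)
print(-7 // 2)   # -4   (floors toward negative infinity, like R's %/%)
print(7 % 2)     # 1    (modulus)
print(-7 % 2)    # 1    (result takes the sign of the divisor, like R's %%)
print(2 ** 3)    # 8    (exponentiation)
print(2 ^ 3)     # 1    (caution: ^ is bitwise XOR in Python, not a power)
```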

Attributes

Class attributes:


# List attributes
attributes(df)

# Access functions
dim(df)
names(df)
class(df)
comment(df)

# Add comment
comment(df) <- "new info"

# Add custom attribute
attr(df, "custom") <- "alt info"
attributes(df)$custom


# Definition of a class
class Food:
    name = 'toast'

# An instance of a class
breakfast = Food()

# An attribute of the class
# inherited by the instance
breakfast.name

# Setting an attribute
breakfast.name = 'simis'
# setattr(breakfast, 'name', 'simis')
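To show how class attributes and instance attributes interact, here is a small sketch that extends the `Food` example with the built-ins `setattr`, `getattr`, `hasattr`, and `vars`. The `lunch` instance, the `topping` attribute, and the values are illustrative assumptions.

```python
class Food:
    name = "toast"          # class attribute, shared by all instances


breakfast = Food()
lunch = Food()

# Setting an attribute on one instance shadows the class attribute
setattr(breakfast, "name", "eggs")
breakfast.topping = "salt"   # a new, instance-only attribute

print(breakfast.name)                    # eggs
print(lunch.name)                        # toast
print(vars(breakfast))                   # {'name': 'eggs', 'topping': 'salt'}
print(hasattr(lunch, "topping"))         # False
print(getattr(lunch, "topping", None))   # None
```

`vars()` plays a role loosely similar to R's `attributes()`: it lists what has been attached to this particular object.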

Keywords

Reserved words and keywords:

Reserved words (keywords) cannot be used as variable names.


?reserved
# if, else, repeat, while, function,
# for, in, next, break, TRUE, FALSE,
# NULL, Inf, NaN, NA, NA_integer_,
# NA_real_, NA_complex_, NA_character_,
# ... (..1, ..2, etc.)


# Python keywords
import keyword
print(keyword.kwlist)
## ['False', 'None', 'True', 'and',
## 'as', 'assert', 'async', 'await',
## 'break', 'class', 'continue', 'def',
## 'del', 'elif', 'else', 'except',
## 'finally', 'for', 'from', 'global',
## 'if', 'import', 'in', 'is', 'lambda',
## 'nonlocal', 'not', 'or', 'pass',
## 'raise', 'return', 'try', 'while',
## 'with', 'yield']
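A quick way to test a candidate name, and to see what happens when a reserved word is used as a variable, is sketched below. The candidate names are arbitrary examples.

```python
import keyword

for candidate in ["lambda", "class", "my_var", "TRUE"]:
    # iskeyword() reports whether the name is reserved in Python
    print(candidate, keyword.iskeyword(candidate))

# A reserved word cannot be assigned to; compiling such code raises SyntaxError
try:
    compile("class = 1", "<example>", "exec")
except SyntaxError as err:
    print("SyntaxError:", err.msg)
```

Note that `TRUE` is reserved in R but is an ordinary name in Python, where the Boolean constant is spelled `True`.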

Functions and Methods


# Basic definition
myFunc <- function(x, ...) {
  x * 10
}
myFunc(4)

40


# Multiple unnamed arguments
myFunc <- function(...) {
  sum(...)
}
myFunc(100, 200, 300)

600


# Simple definition
def my_func(x):
    return x * 10

my_func(4)


# Multiple unnamed (positional) arguments, collected into a tuple
def my_func(*x):
    return x[2]

my_func(100, 200, 300)

300


# Multiple named (keyword) arguments, collected into a dict
def my_func(**num):
    print("x: ", num['x'])
    print("y: ", num['y'])

my_func(x=40, y=100)

x: 40 y: 100
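The pieces above can be combined in one signature. The sketch below uses a hypothetical function `describe` (its name and parameters are illustrative, not from the notes) that takes a required argument, extra positional arguments, a default, and extra keyword arguments.

```python
def describe(first, *args, scale=1, **kwargs):
    """Combine a required argument, extra positionals, a default, and extra keywords."""
    total = (first + sum(args)) * scale
    return total, kwargs


total, extras = describe(100, 200, 300, scale=2, unit="km")
print(total)    # 1200
print(extras)   # {'unit': 'km'}
```

Roughly speaking, `*args` and `**kwargs` together cover what `...` does in an R function definition.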

Style and Naming Conventions

Style in R is generally more loosely defined than in Python. Nonetheless, see the Advanced R style guide by Hadley Wickham (CRC Press) or Google’s R Style Guide for suggestions.

For Python, see the PEP 8 style guide.

Analogous Data Storage Objects

Analogous Python objects for common R objects:

R Structure	Python Analogous Structure
Vector (one-dimensional, homogeneous)	`ndarray`, but also scalars, homogeneous `list` and `tuple`
Vector, `matrix` or `array`	NumPy n-dimensional array (`ndarray`)
Unnamed list (heterogeneous)	`list`
Named list (heterogeneous)	Dictionary (`dict`), but without positional indexing (insertion order is kept since Python 3.7)
Environment (named, but unordered elements)	Dictionary (`dict`)
Variable/column in a `data.frame`	Pandas Series (`pd.Series`)
Two-dimensional `data.frame`	Pandas data frame (`pd.DataFrame`)

Analogous R objects for common Python objects:

Python Structure	R Analogous Structure
Scalar	One-element vector
`list` (homogeneous)	Vector, but as if lacking vectorization
`list` (heterogeneous)	Unnamed list
`tuple` (immutable)	Vector, list as separated output from a function
Dictionary (`dict`), key-value pairs	Named list or, better, environment
NumPy n-dimensional array (`ndarray`)	Vector, matrix, or array
Pandas Series (`pd.Series`)	Vector, variable/column in a `data.frame`
Pandas data frame (`pd.DataFrame`)	Two-dimensional `data.frame`
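The phrase "as if lacking vectorization" is easiest to see in code. The sketch below uses only built-in types, with the distance values taken from the examples in these notes; the `point` tuple is an illustrative assumption.

```python
dist = [584, 1054, 653]

# Multiplying a list repeats it rather than scaling each element
print(dist * 2)                 # [584, 1054, 653, 584, 1054, 653]

# Element-wise work needs an explicit loop or comprehension
print([d * 2 for d in dist])    # [1168, 2108, 1306]

# A tuple is immutable: item assignment raises TypeError
point = (584, 1054)
try:
    point[0] = 0
except TypeError as err:
    print("TypeError:", err)
```

In R, `dist * 2` on a vector would scale each element, which is why the closer Python analogue for numeric work is a NumPy `ndarray` rather than a `list`.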


# Vectors
cities <- c("Istanbul", "Urumqi", "Almaty")
dist <- c(584, 1054, 653)

'Istanbul' 'Urumqi' 'Almaty'
584 1054 653


# Lists
cities = ['Istanbul', 'Berlin', 'Korla']
dist = [584, 1054, 653]

['Istanbul', 'Berlin', 'Korla']
[584, 1054, 653]

One-dimensional, heterogeneous key-value pairs (lists in R, dictionaries in Python):


# A list of data frames
cities <- list(
    Munich = data.frame(
        dist = 584,
        pop = 143275,
        country = "DE"
    ),
    Istanbul = data.frame(
        dist = 5584,
        pop = 1423275,
        country = "TR"
    ),
    Almaty = data.frame(
        dist = 84,
        pop = 275,
        country = "KZ"
    )
)



# As a list object
cities[1]



cities['Istanbul']



# As a data.frame object (table)
cities[[1]]

A data.frame: 1 × 3

dist	pop	country

584	143275	DE


cities$Almaty

A data.frame: 1 × 3

dist	pop	country

84	275	KZ


# A list of heterogeneous data
lm_list <- lm(weight ~ group, data = PlantGrowth)
length(lm_list)
names(lm_list)

13
'coefficients' 'residuals' 'effects' 'rank' 'fitted.values' 'assign' 'qr' 'df.residual' 'contrasts' 'xlevels' 'call' 'terms' 'model'


# Lists
city = ['Munich', 'Paris', 'Amsterdam']
dist = [584, 1054, 653]
pop = [1484226, 2175601, 1558755]
area = [310.43, 105.4, 219.32]
country = ['DE', 'FR', 'NL']


import numpy as np

# NumPy arrays
city_a = np.array(city)
city_a


array(['Munich', 'Paris', 'Amsterdam'], dtype='<U9')



pop_a = np.array(pop)


# Dictionaries
yy = {'city': ['Munich', 'Paris', 'Amsterdam'],
      'dist': [584, 1054, 653],
      'pop': [1484226, 2175601, 1558755],
      'area': [310.43, 105.4, 219.32],
      'country': ['DE', 'FR', 'NL']}


{'city': ['Munich', 'Paris', 'Amsterdam'],
 'dist': [584, 1054, 653],
 'pop': [1484226, 2175601, 1558755],
 'area': [310.43, 105.4, 219.32],
 'country': ['DE', 'FR', 'NL']}
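For comparison with the R list of data frames, where `cities$Almaty` pulls out one city's record, a dictionary can also be keyed by city. The nested layout below is an assumption made for illustration, with values taken from the Python lists above.

```python
cities = {
    "Munich": {"dist": 584, "pop": 1484226, "country": "DE"},
    "Paris": {"dist": 1054, "pop": 2175601, "country": "FR"},
    "Amsterdam": {"dist": 653, "pop": 1558755, "country": "NL"},
}

# Look up one record by key, similar in spirit to cities$Almaty in R
print(cities["Paris"])          # {'dist': 1054, 'pop': 2175601, 'country': 'FR'}

# Drill into a single field
print(cities["Paris"]["pop"])   # 2175601

# .get() returns a default instead of raising KeyError for a missing key
print(cities.get("Almaty", "not found"))   # not found

# Dictionaries are accessed by key, not by position
print(list(cities)[0])          # Munich
```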

Data Frames

Data Frames in Python:


# class pd.DataFrame
import pandas as pd

# From a dictionary, yy
yy_df = pd.DataFrame(yy)



# From lists
# names
list_names = ['city', 'dist', 'pop', 'area', 'country']

# columns are a list of lists
list_cols = [city, dist, pop, area, country]

['city', 'dist', 'pop', 'area', 'country']
[['Munich', 'Paris', 'Amsterdam'], [584, 1054, 653], [1484226, 2175601, 1558755], [310.43, 105.4, 219.32], ['DE', 'FR', 'NL']]


# A zipped list of tuples
zip_list = list(zip(list_cols, list_names))

[(['Munich', 'Paris', 'Amsterdam'], 'city'), ([584, 1054, 653], 'dist'), ([1484226, 2175601, 1558755], 'pop'), ([310.43, 105.4, 219.32], 'area'), (['DE', 'FR', 'NL'], 'country')]


zip_df = pd.DataFrame(zip_list)
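As written, `pd.DataFrame(zip_list)` appears to treat each `(values, name)` tuple as one row, giving five rows and two columns rather than one column per name. If one column per name is the goal, one option is to zip the names first and build a dictionary, which has the same shape as `yy` and can be passed to `pd.DataFrame` in the same way. The sketch below uses only built-ins and a shortened set of columns, so it is an illustration rather than the notes' own method.

```python
city = ['Munich', 'Paris', 'Amsterdam']
dist = [584, 1054, 653]
country = ['DE', 'FR', 'NL']

list_names = ['city', 'dist', 'country']
list_cols = [city, dist, country]

# Names first: each tuple is (column name, column values)
zip_dict = dict(zip(list_names, list_cols))
print(zip_dict)
```

The order matters because `dict()` uses the first element of each pair as the key, and lists cannot be dictionary keys.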



# Easier
# import pandas library
import pandas as pd

# init list of lists
list_rows = [['Munich', 584, 1484226, 310.43, 'DE'],
             ['Paris', 1054, 2175601, 105.40, 'FR'],
             ['Amsterdam', 653, 1558755, 219.32, 'NL']]

# Create the pandas data frame
df = pd.DataFrame(list_rows, columns=list_names)


Two-dimensional, heterogeneous, tabular data frames in R:


# class data.frame from vectors
cities_df <- data.frame(city = c("Munich", "Paris", "Amsterdam"),
                        dist = c(584, 1054, 653),
                        pop = c(1484226, 2175601, 1558755),
                        area = c(310.43, 105.4, 219.32),
                        country = c("DE", "FR", "NL"))


Multidimensional arrays:


# array
arr_r <- array(c(1:4,
                 seq(10, 40, 10),
                 seq(100, 400, 100)),
               dim = c(2, 2, 3))



rowSums(arr_r, dims = 2)



rowSums(arr_r, dims = 1)

444 666


colSums(arr_r, dims = 1)



colSums(arr_r, dims = 2)


Python:


arr = np.array(
    [[[1, 2],
      [3, 4]],
     [[10, 20],
      [30, 40]],
     [[100, 200],
      [300, 400]]]
)
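NumPy expresses the R `rowSums()`/`colSums()` idea through the `axis` argument of `sum()`. The sketch below is a rough counterpart under one assumption worth stating: the axis order differs between the two arrays (R's `dim = c(2, 2, 3)` versus NumPy's shape `(3, 2, 2)` here), so the axis numbers do not map one-to-one onto R's `dims`.

```python
import numpy as np

arr = np.array(
    [[[1, 2], [3, 4]],
     [[10, 20], [30, 40]],
     [[100, 200], [300, 400]]]
)

print(arr.shape)              # (3, 2, 2)

# Sum over the first axis: adds the three 2 x 2 layers element-wise,
# giving rows [111, 222] and [333, 444]
print(arr.sum(axis=0))

# Sum over everything except the last axis
print(arr.sum(axis=(0, 1)))   # [444 666]
```

The last result matches the `444 666` that `rowSums(arr_r, dims = 1)` returns for the R array above.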


Logical Expressions

Relational operators

Description	R Operator	Python Operator
Equivalency	`==`	`==`
Non-equivalency	`!=`	`!=`
Greater-than (or equal to)	`>` (`>=`)	`>` (`>=`)
Less-than (or equal to)	`<` (`<=`)	`<` (`<=`)
Negation	`!x`	`not x`


Python:


import numpy as np
a = np.array([23, 5, 7, 9, 12])
a > 10


array([ True, False, False, False,  True])

Logical operators

Description	R Operator	Python Operator
AND	`&`, `&&`	`&`, `and`
OR	`|`, `||`	`|`, `or`
WITHIN	`y %in% x`	`in`, `not in`
Identity	`identical()`	`is`, `is not`
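The Python entries in the table behave as follows on plain objects. This is a small sketch with arbitrary example lists; note in particular that `is` tests whether two names refer to the same object, which is stricter than comparing contents with `==`.

```python
x = [1, 2, 3]
y = [1, 2, 3]
z = x

print(x == y)       # True  (same contents)
print(x is y)       # False (two separate objects)
print(x is z)       # True  (two names for one object)

print(2 in x)       # True
print(5 not in x)   # True

# `and` / `or` work on single truth values and short-circuit
print(len(x) > 2 and x[0] == 1)   # True
print(not x)                      # False (a non-empty list is truthy)
```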


xx <- 1:10
xx == 6
xx != 6
xx >= 6
xx <- 1:6
# tails of a distribution
xx < 3 | xx > 4
# range in a distribution
xx > 3 & xx < 4
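A NumPy version of the same tail and range tests might look like the sketch below. The main difference from R is that each comparison needs its own parentheses, because `|` and `&` bind more tightly than `<` and `>` in Python. The names `tails` and `middle` are illustrative.

```python
import numpy as np

xx = np.arange(1, 7)          # 1, 2, ..., 6 (like 1:6 in R)

# Tails of a distribution
tails = (xx < 3) | (xx > 4)
print(tails)                  # [ True  True False False  True  True]

# Range: no integer lies strictly between 3 and 4, so every entry is False
middle = (xx > 3) & (xx < 4)
print(middle.any())           # False

# Element-wise negation uses ~ rather than `not`
print(xx[~tails])             # [3 4]
```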