FE581 – Topics: R [Python and R for the Modern Data Scientist]

A Python:R Bilingual Dictionary

Contents: Package Management, Installing a single package, Installing specific package versions, Installing multiple packages, Loading packages, Assign Operators, Types, Arithmetic Operators, Attributes, Keywords, Functions and Methods, Style and Naming Conventions, Analogous Data Storage Objects, Data Frames, Logical Expressions, Relational operators, Logical operators, R for Pythonistas

Users often forget that Jupyter is short for “Julia, Python, and R” because it’s very Python-centric.
One last note: Python users refer to themselves as Pythonistas, which is a really cool name! There’s no real equivalent in R, and they also don’t get a really cool animal, but that’s life when you’re a single-letter language. R users are typically called…wait for it…useRs! (Exclamation optional.) Indeed, the official annual conference is called useR! (exclamation obligatory), and the publisher Springer has an ongoing and very excellent series of books of the same name.
Installing a single package:

R:

    install.packages("tidyverse")

Python (from the command line):

    pip install pandas

Installing specific package versions:

R:

    devtools::install_version(
      "ggmap",
      version = "3.5.2"
    )

Python (from the command line):

    pip install pandas==1.1.0

Installing multiple packages:

R:

    install.packages(c("sf", "ggmap"))

Python (from the command line):

    pip install pandas scikit-learn seaborn
    pip install -r requirements.txt

Loading packages:

R:

    # Multiple calls to library()
    library(MASS)
    library(nlme)
    library(psych)
    library(sf)

    # Install if not already available:
    if (!require(readr)) {
      install.packages("readr")
      library(readr)
    }

    # Check, install if necessary, and
    # load single or multiple packages:
    pacman::p_load(MASS, nlme, psych, sf)

Python:

    # Full package
    import math

    # Full package with alias
    import pandas as pd

    # Module
    from sklearn import datasets

    # Module with alias
    import statsmodels.api as sm

    # Function (ols, for ordinary least squares regression)
    from statsmodels.formula.api import ols

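The R idiom `if (!require(readr))` checks whether a package is available before loading it. A rough Python counterpart can be sketched with the standard-library `importlib`. The helper name `is_installed` is hypothetical, and this sketch only reports availability; it does not install anything.

```python
import importlib
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# Import a module only if it is available, similar in spirit to require()
if is_installed("math"):
    math = importlib.import_module("math")

available = is_installed("math")
missing = is_installed("no_such_package_xyz")  # made-up name, assumed absent
print(available, missing)
```

If the check fails, installation would still happen outside the interpreter, for example with `pip install` as shown earlier.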
Assignment operators: R has several (`<-`, `<<-`, `->`, `->>`, and `=`), which can look untidy next to Python's single `=`.
The four most common user-defined atomic-vector types in R:
Type | Data frame shorthand | Tibble shorthand | Description | Example |
---|---|---|---|---|
Logical | logi | `<lgl>` | Binary data | TRUE/FALSE, T/F, 1/0 |
Integer | int | `<int>` | Whole numbers | 7, 9, 2, -4 |
Double | num | `<dbl>` | Real numbers | 3.14, 2.78, 6.45 |
Character | chr | `<chr>` | All alphanumeric characters, including white spaces | "Apple", "Dog" |
The four most common user-defined types in Python:
Type | Shorthand | Description | Example |
---|---|---|---|
Boolean | bool | Binary data | True/False |
Integer | int | Whole numbers | 7, 9, 2, -4 |
Float | float | Real numbers | 3.14, 2.78, 6.45 |
String | str | All alphanumeric characters, including white spaces | "Apple", "Dog" |
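
The four Python types in the table can be inspected with the built-in `type()`. The following is a minimal sketch; the variable names are illustrative only.

```python
values = [True, 7, 3.14, "Apple"]

# type(v).__name__ gives the shorthand used in the table above
type_names = [type(v).__name__ for v in values]
print(type_names)

# bool is a subclass of int, so True behaves like 1 in arithmetic,
# which echoes the 1/0 notation for logicals in the R table
true_plus_one = True + 1

# Python integers are not limited to a fixed width
big = 2 ** 100
print(true_plus_one, big)
```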
Common arithmetic operators:
Description | R Operator | Python Operator |
---|---|---|
Addition | + | + |
Subtraction | - | - |
Multiplication | * | * |
Division (float) | / | / |
Exponentiation | ^ or ** | ** |
Integer Division (floor) | %/% | // |
Modulus | %% | % |
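
The Python column of the table can be checked directly in an interpreter. This short sketch uses two arbitrary operands and also shows one trap for R users: in Python, `^` is bitwise XOR, not exponentiation.

```python
a, b = 7, 2

results = {
    "addition": a + b,
    "subtraction": a - b,
    "multiplication": a * b,
    "division": a / b,          # always returns a float
    "exponentiation": a ** b,
    "floor_division": a // b,
    "modulus": a % b,
}
print(results)

# The caret is bitwise XOR in Python: 0b111 ^ 0b010 == 0b101
xor = a ^ b

# Floor division rounds toward negative infinity,
# and the remainder takes the sign of the divisor
neg_floor = -7 // 2
neg_mod = -7 % 2
print(xor, neg_floor, neg_mod)
```

The behaviour for negative operands matches R's `%/%` and `%%`.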
Class attributes:
R:

    # df is any existing data frame
    # List attributes
    attributes(df)

    # Access functions
    dim(df)
    names(df)
    class(df)
    comment(df)

    # Add comment
    comment(df) <- "new info"

    # Add custom attribute
    attr(df, "custom") <- "alt info"
    attributes(df)$custom

Python:

    # Definition of a class
    class Food:
        name = 'toast'

    # An instance of a class
    breakfast = Food()

    # An attribute of the class
    # inherited by the instance
    breakfast.name

    # Setting an attribute
    breakfast.name = 'simis'
    # setattr(breakfast, 'name', 'simis')

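As a loose counterpart to R's `attributes()` and `attr()`, Python offers the built-ins `vars()`, `getattr()`, `setattr()`, and `hasattr()`. The sketch below reuses the `Food` class from above; the `custom` attribute is added purely for illustration.

```python
class Food:
    name = "toast"


breakfast = Food()

# Read an attribute by name; here it is inherited from the class
original = getattr(breakfast, "name")

# Set attributes by name, including one the class never declared
setattr(breakfast, "name", "simis")
setattr(breakfast, "custom", "alt info")

# vars() lists the attributes stored on the instance itself
instance_attributes = vars(breakfast)
print(instance_attributes)

# The class attribute is untouched by the instance-level assignment
class_name = Food.name
has_custom = hasattr(breakfast, "custom")
```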
Reserved words and keywords:
Reserved words (keywords) cannot be used as variable names.
R:

    ?reserved
    ## if, else, repeat, while, function,
    ## for, in, next, break, TRUE, FALSE,
    ## NULL, Inf, NaN, NA, NA_integer_,
    ## NA_real_, NA_complex_, NA_character_,
    ## ... (..1, ..2, etc.)

Python:

    # Python keywords
    import keyword
    print(keyword.kwlist)
    ## ['False', 'None', 'True', 'and',
    ## 'as', 'assert', 'async', 'await',
    ## 'break', 'class', 'continue', 'def',
    ## 'del', 'elif', 'else', 'except',
    ## 'finally', 'for', 'from', 'global',
    ## 'if', 'import', 'in', 'is', 'lambda',
    ## 'nonlocal', 'not', 'or', 'pass',
    ## 'raise', 'return', 'try', 'while',
    ## 'with', 'yield']

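Whether a candidate variable name collides with a keyword can be tested programmatically. The following is a small sketch using `keyword.iskeyword()` and `str.isidentifier()`; the helper `is_valid_name` and the sample names are hypothetical.

```python
import keyword


def is_valid_name(name):
    """True if name is a legal identifier and not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


candidates = ["data", "class", "my_var", "2fast", "lambda"]
checks = {name: is_valid_name(name) for name in candidates}
print(checks)
```

Names such as `class` and `lambda` fail because they are keywords, while `2fast` fails because identifiers cannot start with a digit.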
Functions and methods:

R:

    # Basic definition
    myFunc <- function(x, ...) {
      x * 10
    }
    myFunc(4)
    ## 40

    # Multiple unnamed arguments
    myFunc <- function(...) {
      sum(...)
    }
    myFunc(100, 200, 300)
    ## 600

Python:

    # Simple definition
    def my_func(x):
        return x * 10

    my_func(4)
    ## 40

    # Multiple positional arguments, collected as a tuple
    def my_func(*x):
        return x[2]

    my_func(100, 200, 300)
    ## 300

    # Multiple keyword arguments, collected as a dict
    def my_func(**num):
        print("x:", num['x'])
        print("y:", num['y'])

    my_func(x=40, y=100)
    ## x: 40
    ## y: 100

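The `*args` and `**kwargs` forms can be combined with a keyword argument that has a default, which comes close to an R function that takes `...` plus named arguments. This is only an illustrative sketch; `scaled_sum`, `factor`, and `labels` are made-up names.

```python
def scaled_sum(*values, factor=10, **labels):
    """Sum all positional arguments, scale the sum, and pass extras through."""
    total = sum(values) * factor
    return total, labels


# Only positional arguments: factor falls back to its default
total, labels = scaled_sum(100, 200, 300)

# factor is matched by name; any other keyword lands in the labels dict
total2, labels2 = scaled_sum(1, 2, factor=2, unit="km")

print(total, labels)
print(total2, labels2)
```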
Style in R is generally more loosely defined than in Python. Nonetheless, see the Advanced R style guide by Hadley Wickham (CRC Press) or Google’s R Style Guide for suggestions.
For Python, see the PEP 8 style guide.
Analogous Python objects for common R objects:
R Structure | Python Analogous Structure |
---|---|
Vector (one-dimensional, homogeneous) | `ndarray`, but also scalars, homogeneous `list` and `tuple` |
Vector, matrix or array | NumPy n-dimensional array (`ndarray`) |
Unnamed list (heterogeneous) | `list` |
Named list (heterogeneous) | Dictionary (`dict`); insertion-ordered since Python 3.7, but not indexable by position |
Environment (named, but unordered elements) | Dictionary (`dict`) |
Variable/column in a data.frame | Pandas Series (`pd.Series`) |
Two-dimensional data.frame | Pandas data frame (`pd.DataFrame`) |
Analogous R objects for common Python objects:
Python Structure | R Analogous Structure |
---|---|
scalar | Vector of length one |
list (homogeneous) | Vector, but as if lacking vectorization |
list (heterogeneous) | Unnamed list |
tuple (immutable) | Vector, or list as separate output from a function |
Dictionary (`dict`), key-value pairs | Named list or, better, environment |
NumPy n-dimensional array (ndarray) | Vector, matrix, or array |
Pandas Series | Vector, variable/column in a data.frame |
Pandas Data Frame | Two-dimensional data.frame |
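
Two properties mentioned in these tables are easy to verify: tuples are immutable, and a dictionary is accessed by key rather than by position. The sketch below uses small made-up records.

```python
# A tuple cannot be modified after creation
city = ("Munich", 584)
tuple_error = None
try:
    city[1] = 600
except TypeError as err:
    tuple_error = type(err).__name__

# A dict keeps insertion order (Python 3.7+) but has no positional index
munich = {"dist": 584, "pop": 1484226, "country": "DE"}
keys_in_order = list(munich)

positional_lookup_failed = False
try:
    munich[0]  # 0 is treated as a key, not as a position
except KeyError:
    positional_lookup_failed = True

print(tuple_error, keys_in_order, positional_lookup_failed)
```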
R:

    # Vectors
    cities <- c("Istanbul", "Urumqi", "Almaty")
    dist <- c(584, 1054, 653)
    cities
    ## 'Istanbul' 'Urumqi' 'Almaty'
    dist
    ## 584 1054 653

Python:

    # Lists
    cities = ['Istanbul', 'Berlin', 'Korla']
    dist = [584, 1054, 653]
    cities
    ## ['Istanbul', 'Berlin', 'Korla']
    dist
    ## [584, 1054, 653]

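Unlike an R vector, a Python list is not vectorized: multiplying a list repeats it instead of scaling each element. A minimal sketch of the difference, using the `dist` values from above:

```python
dist = [584, 1054, 653]

# The * operator repeats the list; it does not multiply the elements
repeated = dist * 2

# Element-wise work needs an explicit loop or comprehension
doubled = [d * 2 for d in dist]

print(repeated)
print(doubled)
```

A NumPy `ndarray` restores the element-wise behaviour that R users expect from `dist * 2`.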
One-dimensional, heterogeneous key-value pairs (lists in R, dictionaries in Python):

R:

    # A list of data frames
    cities <- list(
      Munich = data.frame(
        dist = 584,
        pop = 143275,
        country = "DE"
      ),
      Istanbul = data.frame(
        dist = 5584,
        pop = 1423275,
        country = "TR"
      ),
      Almaty = data.frame(
        dist = 84,
        pop = 275,
        country = "KZ"
      )
    )

Accessing list elements in R:

    # As a list object
    cities[1]
    cities['Istanbul']

    # As a data.frame object (table)
    cities[[1]]

A data.frame: 1 × 3
dist | pop | country |
---|---|---|
584 | 143275 | DE |
Accessing an element by name:

    cities$Almaty

A data.frame: 1 × 3
dist | pop | country |
---|---|---|
84 | 275 | KZ |
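
On the Python side, a comparable structure could be a dictionary of dictionaries. This sketch reuses the values from the R list above; it is one possible analog rather than the only one.

```python
cities = {
    "Munich": {"dist": 584, "pop": 143275, "country": "DE"},
    "Istanbul": {"dist": 5584, "pop": 1423275, "country": "TR"},
    "Almaty": {"dist": 84, "pop": 275, "country": "KZ"},
}

# Access by key, roughly like cities$Almaty or cities[["Almaty"]] in R
almaty = cities["Almaty"]

# There is no cities[1]; the first entry is reached through its key
first_key = next(iter(cities))
first = cities[first_key]

# Pull one field out of every record
countries = {name: record["country"] for name, record in cities.items()}
print(almaty, first_key, countries)
```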
A fitted model is also a list in R:

    # A list of heterogeneous data
    lm_list <- lm(weight ~ group, data = PlantGrowth)
    length(lm_list)
    ## 13
    names(lm_list)
    ## 'coefficients' 'residuals' 'effects' 'rank' 'fitted.values' 'assign' 'qr'
    ## 'df.residual' 'contrasts' 'xlevels' 'call' 'terms' 'model'

Python:

    # Lists
    city = ['Munich', 'Paris', 'Amsterdam']
    dist = [584, 1054, 653]
    pop = [1484226, 2175601, 1558755]
    area = [310.43, 105.4, 219.32]
    country = ['DE', 'FR', 'NL']

NumPy arrays:

    import numpy as np

    # NumPy arrays
    city_a = np.array(city)
    city_a
    ## array(['Munich', 'Paris', 'Amsterdam'], dtype='<U9')

    pop_a = np.array(pop)

Dictionaries:

    # Dictionaries
    yy = {'city': ['Munich', 'Paris', 'Amsterdam'],
          'dist': [584, 1054, 653],
          'pop': [1484226, 2175601, 1558755],
          'area': [310.43, 105.4, 219.32],
          'country': ['DE', 'FR', 'NL']}
    yy
    ## {'city': ['Munich', 'Paris', 'Amsterdam'],
    ##  'dist': [584, 1054, 653],
    ##  'pop': [1484226, 2175601, 1558755],
    ##  'area': [310.43, 105.4, 219.32],
    ##  'country': ['DE', 'FR', 'NL']}

Data Frames in Python:

    # class pd.DataFrame
    import pandas as pd

    # From a dictionary, yy
    yy_df = pd.DataFrame(yy)

    # From lists
    # names
    list_names = ['city', 'dist', 'pop', 'area', 'country']

    # columns are a list of lists
    list_cols = [city, dist, pop, area, country]

    list_names
    ## ['city', 'dist', 'pop', 'area', 'country']
    list_cols
    ## [['Munich', 'Paris', 'Amsterdam'], [584, 1054, 653], [1484226, 2175601, 1558755], [310.43, 105.4, 219.32], ['DE', 'FR', 'NL']]

Zipping names and columns, then converting to a data frame:

    # A zipped list of (name, column) tuples
    zip_list = list(zip(list_names, list_cols))
    zip_list
    ## [('city', ['Munich', 'Paris', 'Amsterdam']), ('dist', [584, 1054, 653]), ('pop', [1484226, 2175601, 1558755]), ('area', [310.43, 105.4, 219.32]), ('country', ['DE', 'FR', 'NL'])]

    # Names become dictionary keys, which become column labels
    zip_df = pd.DataFrame(dict(zip_list))

An easier route, from a list of rows:

    # Easier
    # import pandas library
    import pandas as pd

    # init list of lists
    list_rows = [['Munich', 584, 1484226, 310.43, 'DE'],
                 ['Paris', 1054, 2175601, 105.40, 'FR'],
                 ['Amsterdam', 653, 1558755, 219.32, 'NL']]

    # Create the pandas data frame
    df = pd.DataFrame(list_rows, columns=list_names)

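The row-wise and column-wise layouts used above hold the same data, and `zip(*...)` converts between them. The following sketch uses only the standard library and the values from this section; it shows the relationship between the layouts and is not a replacement for `pd.DataFrame`.

```python
list_names = ["city", "dist", "pop", "area", "country"]
list_rows = [
    ["Munich", 584, 1484226, 310.43, "DE"],
    ["Paris", 1054, 2175601, 105.40, "FR"],
    ["Amsterdam", 653, 1558755, 219.32, "NL"],
]

# Transpose rows into columns: zip(*rows) groups the i-th items together
columns = [list(col) for col in zip(*list_rows)]

# Pair each name with its column to get the column-oriented dictionary
yy = dict(zip(list_names, columns))

# Transposing the dictionary values recovers the original rows
rows_again = [list(row) for row in zip(*yy.values())]

print(yy["city"])
print(rows_again[0])
```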
Two-dimensional, heterogeneous, tabular data frames in R:

    # class data.frame from vectors
    cities_df <- data.frame(city = c("Munich", "Paris", "Amsterdam"),
                            dist = c(584, 1054, 653),
                            pop = c(1484226, 2175601, 1558755),
                            area = c(310.43, 105.4, 219.32),
                            country = c("DE", "FR", "NL"))

Multidimensional arrays:
R:

    # array
    arr_r <- array(c(1:4,
                     seq(10, 40, 10),
                     seq(100, 400, 100)),
                   dim = c(2, 2, 3))

    rowSums(arr_r, dims = 2)

    rowSums(arr_r, dims = 1)
    ## 444 666

    colSums(arr_r, dims = 1)

    colSums(arr_r, dims = 2)

Python:

    import numpy as np

    arr = np.array(
        [[[1, 2],
          [3, 4]],
         [[10, 20],
          [30, 40]],
         [[100, 200],
          [300, 400]]]
    )

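The R calls to `rowSums()` and `colSums()` have NumPy counterparts in `sum()` with an `axis` argument. Because R fills arrays in column-major order and the nested list above is row-major, the axes appear in reverse: R's first dimension corresponds to the last NumPy axis here. The sketch below assumes NumPy is installed and mirrors two of the R results.

```python
import numpy as np

arr = np.array(
    [[[1, 2],
      [3, 4]],
     [[10, 20],
      [30, 40]],
     [[100, 200],
      [300, 400]]]
)

# Shape is (3, 2, 2), whereas the R array has dim = c(2, 2, 3)
shape = arr.shape

# Like rowSums(arr_r, dims = 1): collapse everything except R's first dimension
like_rowsums_dims1 = arr.sum(axis=(0, 1))

# Like colSums(arr_r, dims = 2): collapse R's first two dimensions
like_colsums_dims2 = arr.sum(axis=(1, 2))

print(shape, like_rowsums_dims1, like_colsums_dims2)
```

The first result reproduces the `444 666` shown for `rowSums(arr_r, dims = 1)`.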
Relational operators:

Description | R Operator | Python Operator |
---|---|---|
Equivalency | == | == |
Non-equivalency | != | != |
Greater-than (or equal to) | > (>=) | > (>=) |
Lesser-than (or equal to) | < (<=) | < (<=) |
Negation | !x | not x |
Python:

    import numpy as np
    a = np.array([23, 5, 7, 9, 12])
    a > 10
    ## array([ True, False, False, False,  True])

Logical operators:

Description | R Operator | Python Operator |
---|---|---|
AND | `&`, `&&` | `&`, `and` |
OR | `\|`, `\|\|` | `\|`, `or` |
WITHIN | `y %in% x` | `in`, `not in` |
Identity | `identical()` | `is`, `is not` |
R:

    xx <- 1:10
    xx == 6
    xx != 6
    xx >= 6
    xx <- 1:6
    # tails of a distribution
    xx < 3 | xx > 4
    # range in a distribution
    xx > 3 & xx < 4

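The same element-wise logic can be written with NumPy, assuming it is installed. In this sketch the comparisons need parentheses because `&` and `|` bind more tightly than `<` and `>`, and the keywords `and` and `or` do not work element-wise on arrays. `np.isin()` stands in for R's `%in%`.

```python
import numpy as np

xx = np.arange(1, 7)  # 1, 2, ..., 6, like xx <- 1:6

# tails of a distribution
tails = (xx < 3) | (xx > 4)

# range in a distribution (no integer lies strictly between 3 and 4)
middle = (xx > 3) & (xx < 4)

# element-wise negation uses ~ rather than not
negated = ~(xx == 6)

# membership test, similar to xx %in% c(2, 5)
within = np.isin(xx, [2, 5])

# `or` asks for a single truth value, which is ambiguous for an array
or_failed = False
try:
    (xx < 3) or (xx > 4)
except ValueError:
    or_failed = True

print(tails, middle, negated, within, or_failed)
```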