I often switched between R and Python in different projects when I was a data scientist. Thus, after moving to higher education, I told students, instead of asking “which should I learn?” it would be better to master both. However, sometimes writing R and Python at the same time is confusing. To make the coding more convenient, I listed the commonly-used script in the two languages side by side for data wrangling, the most time-consuming step in analytics, as a quick reference.
First, let’s simulate a sample data set with some missing values using R. The dimension variables are Character and the measure variables are numeric.
set.seed(12)
N <- 1000
df <- data.frame(
dimension1 = sample(c("I", "II", "III"), N, replace = T),
dimension2 = sample(c("A", "B", "C"), N, replace = T),
measure1 = sample(1:10, N, replace = T),
measure2 = sample(1:10, N, replace = T),
stringsAsFactors=FALSE
)
df <- as.data.frame(lapply(df,
function(r)
r[sample(c(TRUE, NA),
prob = c(0.85, 0.15),
size = length(r),
replace = TRUE)]
),
stringsAsFactors=FALSE
)
head(df)
## dimension1 dimension2 measure1 measure2
## 1 <NA> C 2 7
## 2 II <NA> NA 2
## 3 <NA> A 9 7
## 4 <NA> C 10 NA
## 5 II B 1 10
## 6 I B 3 9
In order to share the data with Python, I save data in feather files.
library(feather)
write_feather(df, "sample_data.feather")
# row count
nrow(df)
## [1] 1000
# column count
ncol(df)
## [1] 4
# shape
dim(df)
## [1] 1000 4
import pandas as pd
pd.set_option('display.max_columns', None)
df = pd.read_feather("sample_data.feather")
# row count
df.shape[0]
## 1000
# column count
df.shape[1]
## 4
# shape
df.shape
## (1000, 4)
# column names
names(df)
## [1] "dimension1" "dimension2" "measure1" "measure2"
# data type by column
lapply(df, class)
## $dimension1
## [1] "character"
##
## $dimension2
## [1] "character"
##
## $measure1
## [1] "integer"
##
## $measure2
## [1] "integer"
# column names
list(df)
## ['dimension1', 'dimension2', 'measure1', 'measure2']
# data type by column
df.dtypes
## dimension1 object
## dimension2 object
## measure1 float64
## measure2 float64
## dtype: object
# count missing values by columns
colSums(is.na(df))
## dimension1 dimension2 measure1 measure2
## 177 158 147 169
# fill all missing values by 0
df %>% replace(is.na(.), 0) %>% head() # df[is.na(df)] <- 0 also works
## dimension1 dimension2 measure1 measure2
## 1 0 C 2 7
## 2 II 0 0 2
## 3 0 A 9 7
## 4 0 C 10 0
## 5 II B 1 10
## 6 I B 3 9
# fill missing values by columns
df %>% replace_na(list(dimension1 = "Unknown", measure1 = 0)) %>% head()
## dimension1 dimension2 measure1 measure2
## 1 Unknown C 2 7
## 2 II <NA> 0 2
## 3 Unknown A 9 7
## 4 Unknown C 10 NA
## 5 II B 1 10
## 6 I B 3 9
# count missing values by columns
df.isnull().sum()
## dimension1 177
## dimension2 158
## measure1 147
## measure2 169
## dtype: int64
# fill all missing values by 0
df.fillna(0).head()
## dimension1 dimension2 measure1 measure2
## 0 0 C 2.0 7.0
## 1 II 0 0.0 2.0
## 2 0 A 9.0 7.0
## 3 0 C 10.0 0.0
## 4 II B 1.0 10.0
# fill missing values by columns
df.fillna(value = {"dimension1": "Unknown", "measure1": 0}).head()
## dimension1 dimension2 measure1 measure2
## 0 Unknown C 2.0 7.0
## 1 II None 0.0 2.0
## 2 Unknown A 9.0 7.0
## 3 Unknown C 10.0 NaN
## 4 II B 1.0 10.0
# unique rows
df %>% distinct()
# unique rows by columns
df %>% distinct(dimension1, dimension2, .keep_all = T)
# unique rows
df.drop_duplicates()
# unique rows by columns
df.drop_duplicates(subset=('dimension1', 'dimension2'))
# frequency count by variable
table(df$dimension1)
##
## I II III
## 276 262 285
# contingency table
table(df$dimension1, df$dimension2)
##
## A B C
## I 94 65 72
## II 60 73 76
## III 85 90 76
# frequency count by variable
df.dimension1.value_counts()
## III 285
## I 276
## II 262
## Name: dimension1, dtype: int64
# contingency table
pd.crosstab(df.dimension1, df.dimension2)
## dimension2 A B C
## dimension1
## I 94 65 72
## II 60 73 76
## III 85 90 76
Let’s group the data by dimension1, and compute
# summarize the total, percent, ratio, and ratio relativity
df %>%
group_by(dimension1) %>%
summarise(
measure1_sum = sum(measure1, na.rm = T),
measure2_sum = sum(measure2, na.rm = T),
count = n()
) %>%
mutate(
measure1_sum_percent = measure1_sum/sum(measure1_sum),
ratio = measure1_sum/measure2_sum
) %>%
mutate(
ratio_relativity = ratio/(sum(measure1_sum)/sum(measure2_sum))
)
## # A tibble: 4 × 7
## dimension1 measure1_sum measure2_sum count measure1_sum_percent ratio
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 I 1316 1348 276 0.281 0.976
## 2 II 1172 1154 262 0.250 1.02
## 3 III 1325 1346 285 0.283 0.984
## 4 <NA> 868 817 177 0.185 1.06
## # … with 1 more variable: ratio_relativity <dbl>
# summarize the total, percent, ratio, and relativity
def oneway(g):
return(
pd.Series({
"measure1_sum": g.measure1.sum(),
"measure2_sum": g.measure2.sum(),
"count": len(g),
"measure1_sum_percent": g.measure1.sum()/df.measure1.sum(),
"ratio": g.measure1.sum()/g.measure2.sum(),
"ratio_relativity":
(g.measure1.sum()/g.measure2.sum())/
(df.measure1.sum()/df.measure2.sum())
})
)
df.groupby("dimension1", dropna=False).apply(oneway)
## measure1_sum measure2_sum count measure1_sum_percent ratio \
## dimension1
## I 1316.0 1348.0 276.0 0.281137 0.976261
## II 1172.0 1154.0 262.0 0.250374 1.015598
## III 1325.0 1346.0 285.0 0.283059 0.984398
## NaN 868.0 817.0 177.0 0.185430 1.062424
##
## ratio_relativity
## dimension1
## I 0.972924
## II 1.012127
## III 0.981033
## NaN 1.058792
Tidy data requires that each observation is a row. However, our sample data has two observations: measure1 and measure2. To clean it, we need to reshape the data to make it from a wide shape to a long shape. The new key variable defines the measure type, and the new values variable shows the values.
Note that data also can be reshaped from wide to long shape although it is not used that often. In this case, the key must be uniquely defined. You can Google “reshape from wide to long” if you are interested in it.
df_reshape <- df %>% gather(c("measure1", "measure2"),
key = "measure", value = "values")
head(df_reshape)
## dimension1 dimension2 measure values
## 1 <NA> C measure1 2
## 2 II <NA> measure1 NA
## 3 <NA> A measure1 9
## 4 <NA> C measure1 10
## 5 II B measure1 1
## 6 I B measure1 3
df_reshape = pd.melt(df,
id_vars=['dimension1', 'dimension2'],
value_vars=['measure1', 'measure2'])
df_reshape.head()
## dimension1 dimension2 variable value
## 0 None C measure1 2.0
## 1 II None measure1 NaN
## 2 None A measure1 9.0
## 3 None C measure1 10.0
## 4 II B measure1 1.0
To merge two data sets, let’s fill the missing values in dimention1 and dimention2, and build another data set using these two variables as the key, i.e., the row is uniquely defined by the two variables.
options(dplyr.summarise.inform = FALSE)
df <- df %>% replace_na(list(dimension1 = "Unknown", dimension2 = "Unknown"))
df2 <- df %>%
group_by(dimension1, dimension2) %>%
summarise(
measure3 = sum(measure1, na.rm = T),
measure4 = sum(measure2, na.rm = T)
) %>%
rename(dimension3 = dimension1, dimension4 = dimension2)
df2
## # A tibble: 16 × 4
## # Groups: dimension3 [4]
## dimension3 dimension4 measure3 measure4
## <chr> <chr> <int> <int>
## 1 I A 435 432
## 2 I B 295 295
## 3 I C 357 378
## 4 I Unknown 229 243
## 5 II A 301 221
## 6 II B 321 328
## 7 II C 329 356
## 8 II Unknown 221 249
## 9 III A 403 406
## 10 III B 429 454
## 11 III C 355 318
## 12 III Unknown 138 168
## 13 Unknown A 274 246
## 14 Unknown B 232 238
## 15 Unknown C 234 212
## 16 Unknown Unknown 128 121
write_feather(df, "sample_data_filledNa.feather")
write_feather(df2, "sample_data2.feather")
# left join tables by two key variables
left_join(df, df2,
by = c(
"dimension1" = "dimension3",
"dimension2" = "dimension4"
)
) %>% head()
## dimension1 dimension2 measure1 measure2 measure3 measure4
## 1 Unknown C 2 7 234 212
## 2 II Unknown NA 2 221 249
## 3 Unknown A 9 7 274 246
## 4 Unknown C 10 NA 234 212
## 5 II B 1 10 321 328
## 6 I B 3 9 295 295
import pandas as pd
df2 = pd.read_feather("sample_data2.feather")
df = pd.read_feather("sample_data_filledNa.feather")
# left join tables by two key variables
pd.merge(df, df2,
how="left",
left_on = ["dimension1", "dimension2"],
right_on =["dimension3", "dimension4"]
).head()
## dimension1 dimension2 measure1 measure2 dimension3 dimension4 measure3 \
## 0 Unknown C 2.0 7.0 Unknown C 234
## 1 II Unknown NaN 2.0 II Unknown 221
## 2 Unknown A 9.0 7.0 Unknown A 274
## 3 Unknown C 10.0 NaN Unknown C 234
## 4 II B 1.0 10.0 II B 321
##
## measure4
## 0 212
## 1 249
## 2 246
## 3 212
## 4 328