Data Manipulation the R way with Python

Daniel Boadzie
12 min read · Feb 13, 2023


Data manipulation is a critical aspect of data analysis, as it involves cleaning, transforming, and shaping data into a format that can be easily analyzed. The R and Python programming languages are widely used for data manipulation due to their versatility and powerful libraries. In this article, we will explore the techniques for data manipulation in both R and Python, and demonstrate how to integrate R’s way into Python for an even more robust data analysis process. Whether you are a seasoned data analyst or just starting out, this article will provide valuable insights into the world of data manipulation with R and Python.

In this article, we will use siuba, a Python package for data manipulation. siuba aims to replicate the functionality of R’s dplyr package, making it easy to process and analyze data with a similar syntax and familiar verbs. It also adds some features of its own: for example, it works natively with pandas DataFrames and supports parallelization.

By the end of this article, you will have a solid understanding of how to use siuba to perform data manipulation in Python, and will be able to apply these techniques to your own data analysis projects.

Getting Started with Siuba

Siuba is a data manipulation library for Python that aims to provide a similar interface to dplyr in R. It is a part of the ongoing effort to create a package in Python that replicates the user-friendly and efficient experience of working with R’s dplyr library for data manipulation.

Siuba Core verbs

arrange: Sort rows based on one or more columns.
count: Count observations by group.
distinct: Keep only distinct (unique) rows.
filter: Keep rows that match a condition.
head: Keep the first n rows of data.
mutate, transmute: Create or replace columns.
rename: Rename columns.
select: Keep, drop, or rename specific columns.
summarize: Calculate a single number per grouping.
group_by, ungroup: Specify groups for splitting rows of data.

Source: https://siuba.org/api/

To get started with Siuba, you will first need to install it using pip. You can do this by running the following command in your terminal (in a Jupyter notebook, prefix the command with !):

pip install siuba

Once siuba is installed, you can start using it by importing it, along with any other libraries you need such as pandas, into your Python script. To import Siuba, you can use the standard import statement, import siuba. Alternatively, you can use a wildcard import, from siuba import *, to import all functions and classes provided by Siuba. However, this approach is not recommended, as it can lead to naming conflicts and make it harder to track the origin of functions and classes used in your script. A safer option is to import only the names you need:

from siuba import _, select
import pandas as pd
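One concrete naming conflict is worth knowing about: Siuba exports a verb called filter, which shares its name with Python’s built-in filter(). The sketch below does not import Siuba at all. It defines a hypothetical stand-in function named filter, purely to show how the built-in gets shadowed inside a module and how it stays reachable through the standard builtins module.

```python
import builtins


def filter(rows, predicate):
    """Hypothetical stand-in for an imported verb that happens to be named `filter`."""
    return [row for row in rows if predicate(row)]


numbers = [1, 2, 3, 4]

# The module-level name now refers to the stand-in, with its own argument order.
evens = filter(numbers, lambda x: x % 2 == 0)

# The original built-in is still available under an explicit namespace.
odds = list(builtins.filter(lambda x: x % 2 == 1, numbers))

print(evens, odds)
```

The same reasoning applies to any name a wildcard import might overwrite, which is why explicit imports make the origin of each function easier to trace.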

You can then use Siuba to work with data in much the same way as you would with dplyr. For example, you can use the select() function to select specific columns from a dataframe:

from siuba import _, select, filter, head, group_by
from siuba.data import penguins
import pandas as pd

penguins >> select(_.species) >> head()

In the above code, the >> operator chains multiple Siuba operations together. It is similar to the %>% operator in R’s dplyr.
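To see why >> can behave like a pipe in Python, here is a minimal pure-Python sketch. It is not Siuba’s actual implementation: Verb, take, and keep_if are hypothetical names invented for illustration. The idea is that Python falls back to the right-hand operand’s __rrshift__ method when the left-hand object does not know how to handle >> itself.

```python
class Verb:
    """Hypothetical pipe-able wrapper: `data >> Verb(f)` calls f(data)."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Python calls this for `data >> verb` when type(data) has no
        # matching __rshift__ (lists, for example, do not define one).
        return self.func(data)


def take(n):
    """Keep the first n items, loosely analogous to head()."""
    return Verb(lambda rows: rows[:n])


def keep_if(predicate):
    """Keep items matching a predicate, loosely analogous to filter()."""
    return Verb(lambda rows: [r for r in rows if predicate(r)])


masses = [3750, 3800, 3250, 4675, 5700]
result = masses >> keep_if(lambda m: m > 3500) >> take(2)
print(result)  # [3750, 3800]
```

Because >> is left-associative, each step receives the output of the previous one, which is what makes the chain read top to bottom.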

In addition to the select() function, Siuba provides a variety of other functions for data manipulation, such as filter(), group_by(), summarize(), and many more. You can find the documentation and examples for these functions on Siuba’s official website.

5.2 Using Siuba

Using Siuba is straightforward: you apply its verbs to manipulate and analyze your data. Some examples follow.

5.2.1 select()

Selecting specific columns from a DataFrame: You can use the select() function to select specific columns from a DataFrame. For example:

# selecting columns
penguins >> select(_.bill_length_mm) >> head()

The tilde symbol (~) is used to negate or reverse the selection in Siuba. For example, if you want to select all columns except one, you can use the ~ symbol to negate the selection for that column.

penguins >> select(~_.bill_length_mm) >> head()  # select all columns except bill_length_mm

You can also select multiple columns with the following code:

(penguins >>
select(_.species, _["island" : "body_mass_g"] )
>> head()) # select multiple columns
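For readers who know pandas better than dplyr, the three selections above map roughly onto the following pandas calls. This is a sketch on a small hand-made DataFrame; the values are invented for illustration and are not taken from the penguins dataset.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up.
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "bill_length_mm": [39.1, 46.1, 46.5],
    "body_mass_g": [3750.0, 4500.0, 3500.0],
    "year": [2007, 2008, 2009],
})

# select(_.bill_length_mm): keep one column (the result is still a DataFrame)
one_col = df[["bill_length_mm"]]

# select(~_.bill_length_mm): keep everything except one column
all_but = df.drop(columns="bill_length_mm")

# select(_.species, _["island":"body_mass_g"]): a column plus a label slice.
# Label slices with .loc include both endpoints.
col_range = df.loc[:, "island":"body_mass_g"]
species_and_range = pd.concat([df[["species"]], col_range], axis=1)

print(list(species_and_range.columns))
```

The pandas version needs three different idioms (indexing, drop, and .loc plus concat), whereas the Siuba version expresses all three through select().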

The select() function can also be combined with vector functions such as n():

from siuba.dply.vector import n
penguins >> select(_.species) >> n(_) # count the values
# 344

5.2.2 filter()

Filtering rows based on a condition: You can use the filter() function to filter rows based on a condition. For example:

# filter rows by a condition
penguins >> filter(_.bill_depth_mm > 20) >> head()

You can also filter on multiple conditions by combining them with boolean operators (| for OR, & for AND). For example:

(penguins >>
filter((_.species == "Adelie")
| (_.body_mass_g > 4000))
>> head()
) # multiple conditions OR
(penguins >>
filter((_.species == "Adelie")
& (_.body_mass_g > 4000))
>> head()
) # multiple conditions AND
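The parentheses around each condition matter: in Python, | and & bind more tightly than comparisons such as == and >, so each comparison has to be wrapped before it is combined. The sketch below shows the same OR and AND logic written as plain pandas boolean masks, using a small invented DataFrame rather than the penguins data.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, 4300.0, 5000.0, 3500.0],
})

is_adelie = df["species"] == "Adelie"
is_heavy = df["body_mass_g"] > 4000

either = df[is_adelie | is_heavy]   # OR: rows 0, 1, 2
both = df[is_adelie & is_heavy]     # AND: row 1 only

print(len(either), len(both))  # 3 1
```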

The filter() verb can also be used on grouped data:

# filter on grouped data
penguins >> group_by(_.species) >> filter(_.sex == "male") >> head()

(grouped data frame)

5.2.3 mutate()

mutate() and transmute() are functions provided by Siuba that allow you to create or modify columns in a DataFrame.

# mutate columns (create columns or reassign them)
from siuba import mutate
penguins >> mutate(avg_body_mass = _.body_mass_g.mean()) >> head()

The transmute() function returns only the newly created or modified columns and discards the rest. In other words, the resulting DataFrame contains just the columns you computed. This differs from mutate(), which keeps the original columns alongside the new ones so that you can keep working with the full DataFrame.

# transmute -> return new column
from siuba import transmute

(penguins
>> transmute(bill_length_mult = _.bill_length_mm * 2 )
>> head()
)
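As a rough point of comparison, the difference between the two verbs can be sketched in plain pandas. The frame below is invented for illustration, and assign() is used here only as an approximate analogue of mutate(), not as a claim about how Siuba implements it.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "bill_length_mm": [39.1, 46.1],
})

# Like mutate(): a new frame with the original columns plus the new one.
mutated = df.assign(bill_length_mult=df["bill_length_mm"] * 2)

# Like transmute(): keep only the newly computed column.
transmuted = mutated[["bill_length_mult"]]

print(list(mutated.columns))
print(list(transmuted.columns))
print(list(df.columns))  # the input frame is unchanged
```

Neither result modifies df in place, which is why the output has to be assigned to a variable if you want to keep it.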

If you want the changes made by mutate() or transmute() to be permanent, you must reassign the modified DataFrame back to the original DataFrame variable.

# to make mutate permanent, reassign the result to a variable
penguin_copy = penguins.copy() # make a copy

# modify the data and save
penguin_copy = (penguin_copy
>> mutate(avg_body_mass = _.body_mass_g.mean().round(2))
)

# read the data
penguin_copy >> head()

5.2.4 summarize()

The summarize() function in Siuba aggregates one or more columns of a DataFrame into summary values. It lets you compute statistics such as the mean, median, sum, or count, and when combined with group_by() it computes them separately for each group.

# summarize to aggregate
from siuba import summarize

penguins >> summarize(body_median = _.body_mass_g.median())

Using it with grouped data looks like this:

from siuba import summarize

# group summary
(penguins
>> group_by(_.species)
>> summarize(body_count = _.body_mass_g.count())
>> head()
)

In this example, group_by() groups the data by the species column, and summarize() counts the non-missing body_mass_g values in each group, storing the result in a new column, body_count.
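For comparison, a grouped summary like this corresponds roughly to groupby() followed by agg() in pandas. The sketch below uses an invented DataFrame with one missing value to show that count ignores missing entries; the output column names body_count and body_median are chosen here for illustration.

```python
import pandas as pd

# Made-up values, including one missing measurement.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "body_mass_g": [3750.0, None, 5000.0, 5200.0, 4800.0],
})

summary = (
    df.groupby("species", as_index=False)
      .agg(body_count=("body_mass_g", "count"),    # non-missing values per group
           body_median=("body_mass_g", "median"))  # median ignores missing values
)
print(summary)
```

Here Adelie has two rows but a count of one, because one of its measurements is missing.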

5.2.5 group_by()

Grouping data and summarizing it: You can use the group_by() and summarize() functions to group data and summarize it.

# group by columns
(penguins
>> group_by(_.species, _.sex)
>> summarize(counts = n(_))
)
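Counting rows per combination of two grouping columns can be sketched in plain pandas as follows. The data is invented for illustration, and size() is used as an approximate analogue of counting the rows in each group.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo"],
    "sex": ["male", "female", "male", "female"],
})

counts = (
    df.groupby(["species", "sex"])
      .size()                       # rows per (species, sex) pair
      .reset_index(name="counts")   # back to a flat DataFrame
)
print(counts)
```

Only combinations that actually occur in the data appear in the result, so there is no row for male Gentoo penguins in this toy example.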

5.3 Using Siuba with other packages

One of the great things about Siuba is its compatibility with other libraries in the Python ecosystem. This means that it can be seamlessly integrated into your existing data analysis workflow and used in conjunction with other popular data analysis libraries such as pandas, matplotlib, and seaborn.

For example, you can use Siuba to clean and manipulate your data and then hand the result to pandas for further analysis, or preprocess the data with Siuba and pass it to matplotlib or seaborn for visualization. Here are some use cases:

penguins >> _.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 344 entries, 0 to 343
# Data columns (total 8 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 species 344 non-null object
# 1 island 344 non-null object
# 2 bill_length_mm 342 non-null float64
# 3 bill_depth_mm 342 non-null float64
# 4 flipper_length_mm 342 non-null float64
# 5 body_mass_g 342 non-null float64
# 6 sex 333 non-null object
# 7 year 344 non-null int64
# dtypes: float64(4), int64(1), object(3)
# memory usage: 21.6+ KB

The code penguins >> _.info() is a short way of calling the info() method on the penguins DataFrame through Siuba.

Here penguins is the example DataFrame, but it could be any DataFrame: one loaded from a CSV file, read from a database, or created programmatically.

As before, the >> operator pipes the DataFrame into the expression on its right, and the _ symbol is a placeholder that stands for that DataFrame.

The info() method is a built-in pandas method that summarizes a DataFrame: the number of rows and columns, the data type of each column, the non-null counts, and the memory usage.
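The placeholder idea can be illustrated with a toy class that records attribute lookups and calls, then replays them once real data is piped in. This is a simplified sketch, not Siuba’s implementation; the Placeholder class is hypothetical, and the example pipes plain strings rather than DataFrames to stay self-contained.

```python
class Placeholder:
    """Hypothetical deferred expression: records steps, replays them on real data."""

    def __init__(self, steps=()):
        self._steps = steps

    def __getattr__(self, name):
        # Record an attribute lookup such as `.upper`.
        return Placeholder(self._steps + (("attr", name),))

    def __call__(self, *args, **kwargs):
        # Record a call such as `.upper()` or `.count("a")`.
        return Placeholder(self._steps + (("call", args, kwargs),))

    def __rrshift__(self, data):
        # `data >> expr`: replay every recorded step against the real object.
        result = data
        for step in self._steps:
            if step[0] == "attr":
                result = getattr(result, step[1])
            else:
                result = result(*step[1], **step[2])
        return result


_ = Placeholder()

shouted = "adelie" >> _.upper()
n_a = "banana" >> _.count("a")
print(shouted, n_a)  # ADELIE 3
```

Nothing is evaluated when _.upper() is written; the work happens only when an object arrives on the left of >>.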

So the penguins >> _.info() code is equivalent to the following pandas code:

penguins.info()

Siuba delegates the actual data manipulation to pandas, so the returned values and side effects are the same as in pandas. In this case, you get the same summary as calling .info() directly on a pandas DataFrame.

Many users, especially those familiar with R, find Siuba’s syntax more intuitive than pandas. The >> piping operator mirrors dplyr’s %>%, and the _ placeholder stands in for the DataFrame, which tends to make chained operations less verbose and easier to read.

However, pandas has far more functionality and options. Siuba is a wrapper that simplifies common pandas usage with a more concise syntax. Because it is built on pandas, everything pandas offers remains available, so Siuba is not a replacement for pandas but a more convenient way of working with it.

Which one to use is ultimately a matter of preference and familiarity, so it is worth exploring both and choosing the one that best suits your needs.

Let’s see more usage examples.

5.3.0.1 With Pandas

# get the sum of na values
penguins >> _.isnull().sum()
# species 0
# island 0
# bill_length_mm 2
# bill_depth_mm 2
# flipper_length_mm 2
# body_mass_g 2
# sex 11
# year 0
# dtype: int64
penguins >> _.species.str.lower() # lowercase all species values
# 0 adelie
# 1 adelie
# 2 adelie
# 3 adelie
# 4 adelie
# ...
# 339 chinstrap
# 340 chinstrap
# 341 chinstrap
# 342 chinstrap
# 343 chinstrap
# Name: species, Length: 344, dtype: object
# plot correlations; numeric_only=True restricts corr() to the numeric columns
penguins >> _.corr(numeric_only=True).plot(kind="bar")
from siuba import count

# plot a group of species
(penguins
>> group_by(_.species)
>> count()
>> _.plot(kind="bar", xlabel="Species", ylabel="Counts")
)

5.3.0.2 With Seaborn

from siuba.siu import call
import seaborn as sns


(penguins
>> call(sns.barplot, x=_.species, y=_.species.index, data=_)
)

The from siuba.siu import call statement imports the call() function from Siuba. call() lets you call any function, including functions from external libraries such as seaborn, and pass the piped data to it as an argument.

It takes the function you want to call as its first argument; the remaining arguments are passed on to that function.

In this example, penguins is a DataFrame and sns.barplot is a function from the seaborn library. call() passes the piped DataFrame to sns.barplot.

The x and y arguments are set to _.species and _.species.index respectively, and the data argument receives the DataFrame itself, _.

This creates a bar plot with the penguin species on the x axis. Because y is the row index rather than a count, each bar shows the mean row index for that species (seaborn’s default estimator), so the plot demonstrates the call() mechanics more than it answers a question about the data.

So the complete code uses Siuba to pass the DataFrame to a seaborn function, with Siuba handling the data manipulation and seaborn the visualization.
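The general mechanism of handing piped data to an arbitrary function can be sketched in a few lines of pure Python. This is a simplified illustration and not how Siuba implements call(): the names pipe_call and HOLE are hypothetical, and unlike Siuba’s version this sketch only substitutes a bare marker rather than evaluating expressions such as _.species.

```python
class _Hole:
    """Hypothetical marker meaning 'put the piped data here'."""


HOLE = _Hole()


class pipe_call:
    """Hypothetical pipe-able function call: `data >> pipe_call(f, HOLE, ...)`."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __rrshift__(self, data):
        # Swap every HOLE marker for the piped data, then call the function.
        args = [data if a is HOLE else a for a in self.args]
        kwargs = {k: (data if v is HOLE else v) for k, v in self.kwargs.items()}
        return self.func(*args, **kwargs)


descending = [3, 1, 2] >> pipe_call(sorted, HOLE, reverse=True)
parts = "Adelie,Gentoo" >> pipe_call(str.split, HOLE, ",")
print(descending, parts)
```

The marker can appear as a positional or a keyword argument, which mirrors how the seaborn example passes the DataFrame through the data keyword.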

# create a heatmap of the correlations in the data
(penguins
>> call(sns.heatmap, annot=True, data=_.corr(numeric_only=True))
)

As these examples show, Siuba fits into an existing workflow built on pandas, matplotlib, and seaborn rather than replacing it.

Keep in mind that while Siuba offers a more readable syntax, it is not a replacement for pandas, which offers more comprehensive functionality. Also remember that wildcard imports such as from siuba import * can lead to naming conflicts and make it harder to track where functions come from; import the specific functions or modules you need instead.

Overall, Siuba is a useful tool for data analysis, and it’s worth exploring for those who want a more concise and readable way of working with data, especially those coming from R. It can be used to simplify and streamline data manipulation and analysis, making it easier to extract insights from data.

5.4 Summary

In this article, we delved into the world of data manipulation using the Siuba library. We learnt how to use Siuba to perform various data manipulation tasks such as filtering, selecting, grouping, and summarizing data, using a more concise and readable syntax than pandas. Siuba is definitely worth your time, so consider looking into it.
