Data Visualization the R way with Python
Data visualization is an essential tool for data scientists to communicate complex data insights to stakeholders. It is the process of creating graphical representations of data that help to reveal patterns, trends, and insights that might not be apparent from just looking at the raw data.
R and Python are two of the most popular languages for data science. R has long been known for its powerful and flexible visualization packages, such as ggplot2 and lattice, which provide a wide range of options for creating high-quality visualizations. Python, on the other hand, has emerged as a leading language for data science, with libraries like Matplotlib, Seaborn, and Plotly that offer a variety of visualization options.
In this article, we will explore the benefits of adopting the R way of visualizing data in Python using the lets-plot library. We will provide a step-by-step guide to creating visualizations using lets-plot. By the end of this article, data scientists will have a better understanding of how they can leverage the power of lets-plot in Python to create beautiful, interactive and informative visualizations.
The R way
When it comes to visualizing data, R is a popular choice among data scientists due to its vast collection of data visualization packages. One of the most powerful and flexible packages for data visualization in R is ggplot2, which is based on the grammar of graphics.
The grammar of graphics is a powerful system for constructing and describing data visualizations. It is a framework for thinking about and constructing graphics in a modular way. The grammar of graphics breaks down graphics into their fundamental components, allowing data scientists to create complex visualizations by combining these components.
The grammar of graphics consists of a set of rules for how to map data to aesthetics (such as color, size, and shape), how to layer geometric objects (such as points, lines, and bars), how to position elements on the plot (such as axes, titles, and legends), and how to add statistical transformations (such as smoothing lines and density curves) to the data.
By following these rules, data scientists can create a wide variety of visualizations that are both expressive and easy to understand. The grammar of graphics also makes it easy to customize visualizations to meet specific needs, and to create reusable code that can be applied to new datasets.
In ggplot2, the grammar of graphics is implemented through a set of functions and operators that allow data scientists to construct plots in a modular way. For example, the ggplot() function is used to initialize a new plot object, while the aes() function is used to map variables to aesthetics.
Other functions such as geom_point() and geom_line() are used to add geometric objects to the plot, while functions like scale_x_continuous() and scale_fill_gradient() are used to customize the appearance of the plot. The overall structure can be summarized as follows:
data + aesthetics + geom + scale + coord + facet + theme
Here’s a brief explanation of each element:
data: The data frame or data source to be plotted.
aesthetics: The mapping of variables in the data to visual properties of the plot, such as position, color, and size.
geom: The geometric object that represents the type of plot to be drawn, such as points, lines, bars, or histograms.
scale: The scaling function that maps data values to visual properties, such as color or size.
coord: The coordinate system that defines how the plot is spatially represented, such as Cartesian or polar coordinates.
facet: The specification of how to split the plot into multiple panels based on levels of one or more variables.
theme: The overall visual appearance of the plot, including background color, font, and spacing.
By combining these elements using the “+” symbol, you can create a customized plot that represents your data in a meaningful way.
Overall, the R way of visualizing data using the grammar of graphics provides a powerful and flexible system for creating expressive and informative visualizations.
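To make the role of the “+” symbol concrete, here is a minimal sketch in plain Python of how such a composition can work. The PlotSpec and Component classes are hypothetical and exist only for this illustration; they are not part of ggplot2 or lets-plot, and the real libraries do considerably more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One building block of a plot, e.g. a geom, a scale or a theme."""
    kind: str  # e.g. "geom", "scale", "theme"
    name: str  # e.g. "line", "x_continuous"


@dataclass(frozen=True)
class PlotSpec:
    """A plot description: data, aesthetic mapping and a tuple of components."""
    data: str
    mapping: dict
    components: tuple = ()

    def __add__(self, other: Component) -> "PlotSpec":
        # "+" does not mutate the plot; it returns a new spec with one more component.
        return PlotSpec(self.data, self.mapping, self.components + (other,))


base = PlotSpec(data="df", mapping={"x": "year", "y": "export"})
plot = (
    base
    + Component("geom", "line")
    + Component("scale", "x_continuous")
    + Component("theme", "minimal")
)

print([f"{c.kind}_{c.name}" for c in plot.components])
```

Because each “+” returns a new object, a base plot can be reused and extended in different directions, which is the same pattern the lets-plot examples later in this article rely on.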
Introduction to lets-plot
In the field of data science, visualizing data is a crucial step in understanding patterns and relationships within datasets. Lets-Plot, an open-source plotting library for Python, provides an easy-to-use and flexible solution for creating high-quality, interactive statistical visualizations. The library was created by JetBrains, the company known for developing popular IDEs like PyCharm.
NB: Lets-Plot is one of the implementations of The Grammar of Graphics in Python, but it is not the only one. Another popular library that implements The Grammar of Graphics in Python is plotnine.
Lets-Plot is based on the Grammar of Graphics, which is a system for constructing data graphics. The library’s API is largely inspired by the ggplot2 package in R, a popular tool for creating data visualizations. This ggplot2-like API makes it easy for R users to transition to Python while also providing Python users with a familiar way to create visualizations.
In this section, we will explore Lets-Plot and its features, including its ggplot2-like API, which makes it easy to create complex visualizations with a few lines of code. We will also discuss the importance of data visualization in statistical data analysis, highlighting how Lets-Plot can help data scientists gain insight into their data.
Lets-Plot offers the following features, as stated on its official website:
ggplot2-like API: A bridge between R (ggplot2) and Python data visualization.
Grouping Plots: GGBunch shows a collection of plots on one figure. Each plot in the collection can have an arbitrary location and size.
Suitable for Scientists and Developers: Works in computational notebooks (Jupyter, Datalore, Kaggle, Colab, Deepnote) and in JetBrains professional IDEs (PyCharm).
Customizable Tooltips: You can customize the content, value formatting and appearance of the tooltip for any geometry layer in your plot.
Kotlin API: R, Python, what’s next? Right. The Lets-Plot Kotlin API enables data visualization in JVM and Kotlin/JS applications as well as in scientific notebooks like Jupyter and Datalore.
Formatting: Lets-Plot supports formatting of numeric and date-time values in tooltips, legends, on the axes and in the text geometry layer.
Geospatial Visualization: Find spatial objects with the help of the powerful and easy-to-use Geocoding module. In case you already have a GeoDataFrame on hand, plot it straight away.
Sampling: Sampling is a special technique of data transformation, which helps to deal with large datasets and overplotting.
Interactive Maps: Interactive maps allow zooming and panning around your geospatial data with customizable vector or raster basemaps as a backdrop.
Export to SVG and HTML: The ggsave() function is an easy way to export a plot to a file in SVG or HTML format.
‘No Javascript’ and Offline Mode: In the ‘no javascript’ mode Lets-Plot generates plots as bare-bones SVG images. Plots in a notebook with the option offline=True will work without an Internet connection.
Now that we know what the lets-plot library has to offer, let’s get started by installing it:
!pip install lets-plot
To get started with lets-plot, we will use data from the book The Hitchhiker’s Guide to ggplot2. We will try to replicate some of the charts in the book using lets-plot.
import pandas as pd
from lets_plot import *
LetsPlot.setup_html()
df = pd.read_csv("data.csv")
df.head()
This code imports the pandas library under the alias pd. It then imports everything the lets_plot module exports, including the LetsPlot class, and calls LetsPlot.setup_html(), which configures the library for use in a Jupyter notebook.
Next, it loads a CSV file named “data.csv” into a Pandas DataFrame object named df. The DataFrame object can be thought of as a table of data, similar to a spreadsheet, with rows and columns.
This code prepares the environment for data visualization using Lets-Plot, including loading the necessary libraries and data.
Line Chart
Now if we check our dataframe, we will see the following:
Let’s create our first chart with lets_plot:
p = ggplot(df, aes(x="year", y="export", color="product")) + \
geom_line(size=1.5) + flavor_darcula() + \
theme(legend_position = "bottom", legend_direction = "horizontal", legend_title = element_blank())
p
The code above uses the ggplot2-like API of the Lets-Plot library to create a line plot from a pandas DataFrame df, with the x-axis representing the "year" column, the y-axis representing the "export" column and the color of the line representing the "product" column.
Breaking it down, the code can be explained as follows:
ggplot(df, aes(x="year", y="export", color="product")): creates the base plot with the x-axis representing the "year" column, the y-axis representing the "export" column and the color of the line representing the "product" column.
geom_line(size=1.5): adds a line layer to the plot with a line width of 1.5 units.
flavor_darcula(): sets the color scheme of the plot to the "Darcula" theme.
theme(legend_position = "bottom", legend_direction = "horizontal", legend_title = element_blank()): places the legend at the bottom of the plot and changes its orientation to horizontal. It also removes the title of the legend.
Finally, the + symbol is used to combine all the layers together to create the final plot, which is assigned to the variable p.
Notice that the chart is also interactive when you hover over the points.
We can take this even further and make the chart publication-ready by adding extra features such as custom axis breaks and titles.
import numpy as np
p1 = p + scale_x_continuous(breaks=np.arange(2006, 2016, 1))
p1
Then add labels:
p1 + labs(title = "Composition of Exports to China ($)",
subtitle = "Source: The Observatory of Economic Complexity") + \
labs(x = "Year", y = "USD million")
Area Chart
Next let’s explore our data further with an area chart:
ggplot(df, aes(y = "export", x = "year", fill = "product")) + geom_area() + flavor_high_contrast_light() + \
theme(legend_position = "bottom", legend_direction = "horizontal", legend_title = element_blank()) + \
scale_x_continuous(breaks=np.arange(2006, 2016, 1)) + \
labs(title = "Composition of Exports to China ($)",
subtitle = "Source: The Observatory of Economic Complexity") + \
labs(x = "Year", y = "USD million")
This code uses the ggplot2-like API of the Lets-Plot library to create an area plot from a pandas DataFrame df, with the x-axis representing the "year" column, the y-axis representing the "export" column, and the fill color of the area representing the "product" column.
Breaking it down, the code can be explained as follows:
ggplot(df, aes(y = "export", x = "year", fill = "product")): creates the base plot with the y-axis representing the "export" column, the x-axis representing the "year" column, and the fill color of the area representing the "product" column.
geom_area(): adds an area layer to the plot.
flavor_high_contrast_light(): sets the color scheme of the plot to the "High Contrast Light" theme.
theme(legend_position = "bottom", legend_direction = "horizontal", legend_title = element_blank()): places the legend at the bottom of the plot and changes its orientation to horizontal. It also removes the title of the legend.
scale_x_continuous(breaks=np.arange(2006, 2016, 1)): sets the x-axis to display tick marks at 1-year intervals from 2006 through 2015 (np.arange excludes the stop value, 2016).
labs(title = "Composition of Exports to China ($)", subtitle = "Source: The Observatory of Economic Complexity"): adds a title and subtitle to the plot.
labs(x = "Year", y = "USD million"): adds x and y axis labels to the plot.
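The tick positions produced by the breaks argument are easy to misread, so here is a small sketch of the half-open interval convention. It uses Python's built-in range rather than np.arange so that it runs without NumPy; both exclude the stop value for integer steps.

```python
# Half-open interval: the stop value (2016) is excluded, as with np.arange.
breaks = list(range(2006, 2016, 1))
print(breaks[0], breaks[-1], len(breaks))  # 2006 2015 10

# To include 2016 as a tick, move the stop one step further.
breaks_incl = list(range(2006, 2017, 1))
print(breaks_incl[-1])  # 2016
```

If the data actually contains a 2016 row, the second form is the one to pass to scale_x_continuous.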
Bar Chart
export_new = round(df.export/1000000000, 2)
(ggplot(df, aes(y = "export", x = "year", fill = "product")) +
    geom_bar(stat="identity", tooltips="none") +
    theme(legend_position = "bottom", legend_direction = "horizontal", legend_title = element_blank()) +
    scale_x_continuous(breaks=np.arange(2006, 2016, 1)) +
    geom_text(aes(label=export_new), position= position_stack(vjust = 0.5),
              label_format='{.1f}B', angle=45, color='#ffffff', vjust=8 ) +
    labs(title = "Composition of Exports to China ($)",
         subtitle = "Source: The Observatory of Economic Complexity") +
    labs(x = "Year", y = "USD million") + flavor_darcula() + ggsize(width=650, height=400)
)
This code first creates a new pandas Series called export_new, calculated by dividing the export column of df by 1000000000 and rounding to two decimal places. Note that the Series is kept as a separate variable and is not added to df as a column.
Then, the code creates a bar plot using ggplot with the export column on the y-axis and the year column on the x-axis, grouped and colored by the product column. The geom_bar function is used with stat="identity" to plot the actual values of export as bars. The tooltips="none" parameter is used to turn off tooltips when hovering over the bars.
The theme function is used to customize the legend position, direction, and title. The scale_x_continuous function is used to customize the x-axis tick marks.
The geom_text function is used to add labels to the bars with the values from export_new. With position_stack(vjust = 0.5), each label is placed at the middle of its stacked segment rather than at the top of the bar. The label_format parameter is used to format the labels with one decimal place and a "B" suffix. The angle parameter rotates the labels by 45 degrees, the color parameter sets the label text color to white, and the vjust parameter of geom_text adjusts the vertical alignment of the label text.
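As a rough illustration of what the labels end up reading, the sketch below applies the same scaling and one-decimal formatting in plain Python. The export values are hypothetical and not taken from the article's data, and Python's ".1f" is used here only as a stand-in for the '{.1f}B' pattern passed to label_format.

```python
# Hypothetical export values in US dollars (not from the article's dataset).
exports_usd = [1_234_000_000, 870_000_000, 15_460_000_000]

# Same scaling as export_new: divide by one billion, round to two decimals.
export_new = [round(v / 1_000_000_000, 2) for v in exports_usd]

# One decimal place plus a "B" suffix, mirroring label_format='{.1f}B'.
labels = [f"{v:.1f}B" for v in export_new]

print(export_new)  # [1.23, 0.87, 15.46]
print(labels)      # ['1.2B', '0.9B', '15.5B']
```

Since the labels carry only one decimal place, rounding export_new to two decimals beforehand has no visible effect on the chart.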
The labs function is used to add a title and subtitle to the plot, as well as labels for the x and y axes. The flavor_darcula() function is used to set a dark color scheme for the plot, and ggsize is used to adjust the plot size.
Scatter plot
The scatterplot helps us visualize relationships between the variables in our data set. To create a scatterplot we will use the penguins dataset that is available in siuba, a dplyr equivalent for Python:
from siuba.data import penguins
penguins.head()
Now let’s create our scatterplot with lets_plot:
(ggplot(penguins, aes("bill_length_mm", "flipper_length_mm", color="species"))
+ geom_point(size=4, alpha=0.7) + flavor_darcula() +
ggsize(width=700, height=400) +
theme(legend_position = "bottom", legend_direction = "horizontal") +
labs(title = "Penguins ",
subtitle = "Source: New York State Department of Conservation") +
labs(x = "Bill length (mm)", y = "Flipper length (mm)") )
We can also visualize the relationship between bill length and body mass using a regression line:
ggplot(penguins, aes(x ="bill_length_mm", y ="body_mass_g")) + \
geom_point(size=4, alpha=0.5) + geom_smooth()
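For readers curious about what a regression line represents, the following sketch computes an ordinary least-squares fit by hand in plain Python. It assumes that geom_smooth() fits a straight line by default, which is worth checking against the documentation of your installed version. The least_squares helper and the sample measurements are hypothetical and used for illustration only.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical measurements lying exactly on y = 100 * x - 500.
bill_length = [35.0, 40.0, 45.0, 50.0]
body_mass = [3000.0, 3500.0, 4000.0, 4500.0]

slope, intercept = least_squares(bill_length, body_mass)
print(slope, intercept)  # 100.0 -500.0
```

The slope says how many grams of body mass are associated with each additional millimetre of bill length in this made-up sample; the real penguins data will of course scatter around the fitted line.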
Multiple Charts
Another cool feature of lets-plot is the ability to create multiple charts on the same figure, making it easy to compare visualizations.
np.random.seed(42)
n = 100
x = np.arange(n)
y = np.random.normal(size=n)
w, h = 450, 180
p = ggplot({'x': x, 'y': y}, aes(x='x', y='y')) + ggsize(w, h)
bunch = GGBunch()
bunch.add_plot(p + geom_point(), 0, 0)
bunch.add_plot(p + geom_histogram(bins=3), w, 0)
bunch.add_plot(p + geom_line(), 0, h, 2*w, h)
bunch.show()
This code generates a GGBunch object that contains three separate plots on a single figure: two plots side by side in the top row and a third, double-width plot below them.
First, the NumPy seed is set to 42 to ensure reproducibility. A sample of 100 points from a normal distribution is created for the y variable, and an array of numbers from 0 to 99 is created for the x variable. The dimensions of each plot, w and h, are also defined.
Then, a ggplot object is created with ‘x’ as the x-axis and ‘y’ as the y-axis. The ggsize function is used to set the dimensions of the plot.
Next, the GGBunch
object is created and each plot is added to the bunch using the add_plot method. The first plot is a scatter plot of 'x' and 'y' using geom_point
. The second plot is a histogram of 'y' using geom_histogram with three bins. The third plot is a line plot of 'x' and 'y' using geom_line.
Finally, the show method is used to display the GGBunch object with all three plots on a single figure: the scatter plot and histogram in the top row and the line plot spanning the full width underneath.
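The placement arguments are easier to follow when written out as rectangles. The sketch below is plain Python and does not call lets-plot; it simply restates the offsets passed to add_plot() above, assuming the first two plots keep the w by h size set with ggsize, and derives the overall figure size from them.

```python
w, h = 450, 180

# (x, y, width, height) for each plot, mirroring the add_plot() calls above.
layout = {
    "points": (0, 0, w, h),        # top-left
    "histogram": (w, 0, w, h),     # top-right
    "line": (0, h, 2 * w, h),      # bottom row, double width
}

# The figure has to be large enough to contain every rectangle.
fig_width = max(x + width for x, _, width, _ in layout.values())
fig_height = max(y + height for _, y, _, height in layout.values())

print(fig_width, fig_height)  # 900 360
```

Changing an offset or size in one of the tuples shows immediately whether plots would overlap or leave a gap, which can save a few trial-and-error renders.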
Conclusion
In conclusion, we have explored how the Grammar of Graphics, a powerful data visualization framework, has been implemented in Python with Lets-Plot. With its ggplot2-like syntax, Lets-Plot offers a user-friendly and flexible way of creating complex and aesthetically pleasing visualizations. Its compatibility with Pandas and NumPy makes it easy to integrate into existing data analysis workflows. Furthermore, the interactive and dynamic nature of Lets-Plot charts enhances the user experience, allowing for deeper exploration and understanding of data. Overall, Lets-Plot provides Python users with a valuable tool for exploring and presenting their data in an efficient and visually compelling way.