AI-driven Data Analysis with Sketch

Daniel Boadzie
7 min read · Mar 15, 2023

Data analysis has become an essential process across various industries, as businesses seek to extract insights from the massive amounts of data they collect. With the advent of artificial intelligence (AI), there is now a tremendous opportunity to revolutionize the way we perform data analysis. AI-driven data analysis has the potential to provide new and advanced techniques to extract insights from data and automate specific aspects of the data analysis process.

In this article, we aim to explore the possibilities of AI-driven data analysis. Specifically, we will examine the benefits and challenges of this approach, explore the techniques involved in the process, and offer insights into its future possibilities. We will provide a comprehensive understanding of AI-driven data analysis, comparing it to traditional data analysis methods, and highlighting the advantages of using AI in data analysis.

To demonstrate the possibilities of AI-driven data analysis, we will use Sketch, an innovative code-writing assistant designed specifically for pandas users. Sketch simplifies and streamlines the data-analysis workflow by facilitating various tasks, such as data cataloging, data engineering, and data analysis. Using Sketch, we will explore the importance of data analysis in various industries and how AI can enhance the process.

What is AI-driven Data Analysis?

AI-driven data analysis is the application of artificial intelligence techniques, such as machine learning and natural language processing, to perform data analysis tasks. Traditional data analysis methods usually involve manually preparing and processing data, creating models and visualizations, and interpreting results. In contrast, AI-driven data analysis leverages algorithms that can automatically learn from data, identify patterns, and make predictions or decisions.

The use of AI in data analysis offers several advantages over traditional methods. For example, AI-driven data analysis can handle large and complex datasets with ease, often in a fraction of the time that traditional methods require. Additionally, AI algorithms can identify patterns and correlations that might be missed by human analysts, leading to more accurate insights and predictions. AI-driven data analysis also has the potential to automate certain aspects of the data analysis process, freeing up analysts to focus on higher-level tasks. However, it’s important to note that AI-driven data analysis is not a replacement for human expertise and critical thinking. Instead, it’s a tool that can complement and enhance the capabilities of human analysts.

Other benefits of AI-driven data analysis include the following:

  1. Increased accuracy and speed: AI-driven data analysis algorithms can process vast amounts of data much faster than traditional manual methods. Additionally, AI can provide more consistent results, as automating repetitive steps reduces the opportunity for human error in analysis.
  2. Enhanced decision-making capabilities: By leveraging the power of AI-driven data analysis, organizations can make better-informed decisions. AI can provide insights that were previously difficult or impossible to identify, helping organizations to make informed decisions that can have a significant impact on their bottom line.
  3. Identification of previously unseen patterns and insights: AI algorithms can detect patterns and trends in data that may not be obvious to humans. This can lead to the identification of previously unseen insights that can be used to drive innovation and improve processes.
  4. Reduced costs: With the automation of data analysis processes, organizations can save significant amounts of time and money. AI-driven data analysis can help to reduce the costs associated with manual data analysis, such as hiring additional staff, training, and infrastructure.
  5. Scalability: AI-driven data analysis can easily scale to accommodate large and complex datasets. This means that organizations can process and analyze data from multiple sources quickly and efficiently.
  6. Continuous learning: AI algorithms can continuously learn and adapt to new data, making them highly valuable for organizations looking to stay ahead of the curve. As data sources and types continue to evolve, AI-driven data analysis can provide organizations with the flexibility they need to stay competitive.

Introduction to Sketch

Sketch is an innovative AI code-writing assistant designed for pandas users that can effectively understand the context of their data. With its capability to comprehend the data's context, Sketch provides highly relevant suggestions to users. What's more, Sketch is extremely user-friendly, requiring no plugin installation in your IDE, and is ready for use in seconds.

One can install Sketch by simply running the command pip install sketch. Sketch is capable of simplifying and streamlining the data-analysis workflow by facilitating various tasks. These include general tagging, metadata generation, data cleaning, data masking, derived feature creation and extraction, data questions, and data visualization.

Sketch’s Natural Language interface can navigate effortlessly through the entire data stack landscape, providing answers to questions such as “Which columns are integer type?”

Let’s now install Sketch and use it for AI-driven data analysis:

!pip install sketch

Sketch provides three interfaces for interacting with it:

  1. The Ask interface, which allows users to ask natural language questions about their data and receive answers based on the summary statistics and description of the data. This interface is useful for getting a quick understanding of the data, as well as generating better column names and asking hypothetical questions.
  2. The Howto interface, which provides code-writing prompts that users can copy and paste to perform common data analysis tasks, such as cleaning and normalizing data, creating new features, plotting data, and building models. This interface is useful for users who may not have extensive programming experience, as it simplifies the process of generating code for common tasks.
  3. The Apply interface, which is more advanced and is designed for data generation tasks. This interface is built on top of the Lambdaprompt library and requires users to set up a free account with OpenAI and set an environment variable with their API key. With this interface, users can parse fields, generate new features, and more, using natural language prompts.
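For the Apply interface, the API key has to be visible to the Python process before Sketch is called. A minimal sketch of that setup follows. The variable names `OPENAI_API_KEY` and `SKETCH_USE_REMOTE_LAMBDAPROMPT` are assumptions based on Sketch's README at the time of writing, and the key shown is a placeholder, so check both against the version you install:

```python
import os

# Placeholder value: replace with your own key, ideally loaded from a
# secrets store rather than hard-coded in a notebook.
os.environ.setdefault("OPENAI_API_KEY", "sk-your-key-here")

# Assumed flag name: asks Sketch to call OpenAI directly with the key above
# instead of going through its hosted endpoint.
os.environ.setdefault("SKETCH_USE_REMOTE_LAMBDAPROMPT", "False")
```

Using `setdefault` keeps any value that is already exported in the shell, so the notebook does not silently overwrite a real key with the placeholder.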

To showcase the potential of Sketch, we will use the penguins dataset available in the popular siuba data manipulation library:

import sketch
from siuba import *
from siuba.data import penguins

penguins >> head()
   species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0   Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1   Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2   Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
3   Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN  2007
4   Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007

# check our data
penguins >> _.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   species            344 non-null    object
 1   island             344 non-null    object
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object
 7   year               344 non-null    int64
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB

.sketch.ask

Now, let’s see Sketch in action with the ask interface, which allows users to ask natural language questions about their data.

penguins.sketch.ask("which columns are strings")
Output: species, island, sex
penguins.sketch.ask("which columns are floats")
Output: bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g
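Answers like these are easy to cross-check with plain pandas. The sketch below is only an illustration: `sample` is a small hand-made stand-in with a few of the penguins columns, not the real dataset, and the dtype helpers from `pandas.api.types` are used to list the string and float columns.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_string_dtype

# Hypothetical stand-in frame; the values are for illustration only.
sample = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "bill_length_mm": [39.1, 46.1, None],
    "body_mass_g": [3750.0, 4500.0, 3500.0],
    "year": [2007, 2008, 2009],
})

# Collect column names by dtype family.
string_cols = [c for c in sample.columns if is_string_dtype(sample[c])]
float_cols = [c for c in sample.columns if is_float_dtype(sample[c])]

print("strings:", string_cols)
print("floats:", float_cols)
```

Running the same two list comprehensions against `penguins` gives a deterministic answer to compare with what Sketch returns.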

When you compare the output from Sketch with the dtypes reported by info() above, the answers match: the three object columns are listed as strings and the four float64 columns as floats. We can go further and ask some statistical questions:

penguins.sketch.ask("what is the median value of the body_mass column")
Output:  The median value of the body_mass column is 4050.0 g.
penguins.sketch.ask("which species of penguins is most represented without missing values")
Output:  Adelie penguins are the most represented species without missing values in the penguins dataframe.
penguins.sketch.ask("what is the number of values for adeline without NaNs")
Output: The number of values for Adelie without NaNs is 152.

Now let’s check the answers to the last two questions with pandas:

penguins.species.value_counts()
Output: 
Adelie 152
Gentoo 124
Chinstrap 68
Name: species, dtype: int64
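Note that `value_counts()` on `species` counts every row that has a species label, so on its own it does not address the "without NaNs" part of the questions. A stricter check drops rows with any missing value before counting. The sketch below shows the idea on a hypothetical stand-in frame (`sample` is hand-made, not the real data) and also computes the median directly:

```python
import pandas as pd

# Hypothetical stand-in frame with gaps in two columns.
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3750.0, None, 3250.0, 5000.0, 4500.0],
    "sex": ["male", None, "female", "male", None],
})

# Rows with a species label, regardless of gaps in other columns.
labelled = sample["species"].value_counts()

# Rows with no missing value in any column.
complete = sample.dropna()["species"].value_counts()

# Median body mass; pandas skips NaN by default.
median_mass = sample["body_mass_g"].median()

print(labelled, complete, median_mass, sep="\n")
```

Applying `penguins.dropna()["species"].value_counts()` and `penguins["body_mass_g"].median()` to the real data gives reference values for judging the Sketch answers.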

.sketch.howto

penguins.sketch.howto("how can i chart the species in plotly")
Output: 
import plotly.express as px

fig = px.bar(penguins, x='species', y='body_mass_g', color='sex', barmode='group')
fig.show()

If you have Plotly installed and run the generated code in a new cell, you will see a grouped bar chart of body mass by species, colored by sex.

.sketch.apply

The apply interface in Sketch lets you generate new data from a prompt template that is filled in for each row of your dataframe. For example, you can use it to create a new column in the dataframe:

penguins['species_size'] = penguins.sketch.apply("{{species}}_{{'large' if body_mass_g > 4000 else 'small'}}")

In the code above, we use the apply interface to generate a new column called species_size. We pass a template string to the apply method that references the species and body_mass_g columns. The if condition in the template string is used to determine if the penguin is large or small based on its body_mass_g. The resulting species_size column will have values such as Adelie_large, Gentoo_small, and so on.
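Because this particular rule is deterministic, it can also be written without a language model, which gives a reference to compare the `apply` output against. A minimal pandas sketch on a hypothetical stand-in frame (the 4000 g threshold is the one from the template above; `sample` is hand-made, not the real data):

```python
import pandas as pd

# Hypothetical stand-in frame; the last row has a missing body mass.
sample = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "body_mass_g": [3750.0, 5000.0, 4000.0, None],
})

# Strictly greater than 4000 g counts as large. A missing mass compares
# as False, so such rows land in "small" here; handle them separately
# if that is not what you want.
size = (sample["body_mass_g"] > 4000).map({True: "large", False: "small"})

# Join the species name and the size label with an underscore.
sample["species_size"] = sample["species"] + "_" + size

print(sample)
```

This version needs no API key and returns the same result on every run, at the cost of only covering rules that can be expressed directly in code.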

Tools like Sketch have a great deal of room to grow as they continue to evolve and improve through advancements in machine learning and natural language processing. With Sketch, the data analysis process is streamlined and simplified, which can lead to faster insights, provided the answers are checked against the data. As technology continues to progress, we can expect more innovative tools like Sketch to emerge, transforming the way we analyze and make decisions based on data. The future of data analysis looks bright.

Conclusion

In conclusion, the integration of AI into data analysis is changing the field, providing new and advanced techniques to extract insights from data and automate certain aspects of the data analysis process. Sketch is an innovative tool that offers a simplified and streamlined data analysis workflow by facilitating various tasks such as data cataloging, data engineering, and data analysis. Throughout this article, we have explored the benefits of AI-driven data analysis, compared it with traditional methods, and highlighted the advantages of using AI in data analysis. We have also looked at Sketch and demonstrated how it can be used to improve the data analysis process through its three interfaces: ask, howto, and apply. The possibilities for AI-driven data analysis are vast, and we have only just begun to tap into its potential. With tools like Sketch, we can look forward to more efficient and accurate data analysis, leading to better decision-making and valuable insights across a range of industries.

Daniel Boadzie

Data scientist | AI Engineer | Software Engineering | Trainer | Svelte Enthusiast. Find out more about me here https://www.linkedin.com/in/boadzie/