A Jupyter Notebook to Start (And Finish) Data Science Projects Faster
“Time-to-delivery” is one of the most important yet unsung metrics in data science. It’s tricky to track, but thinking about it conceptually can help us identify ways to (a) help data scientists work faster and (b) allow business stakeholders to get the information they need faster.
There are many ways you can reduce your analytics team’s time to insight. Today, we’ll focus on reducing the time to complete simple data science tasks.
The way we’ve approached it is by creating a Google Colab Quickstart Notebook. It provides a simple notebook setup and code snippets for common data science tasks. Specifically, we focus on:
- Notebook Setup
- Importing Data
- Profiling Data
- Visualizing Data
- Exporting Data
- Resources
We’re writing this post because the notebook contains a set of modular code snippets that help data scientists work faster.
Setting Up The Notebook
Below are code snippets often used in the header of a Google Colab notebook. You’ll want to import libraries and declare global variables here as well.
Authenticating Google Colab
This snippet connects the Colab instance to your Google profile, giving the instance the same authentication as your personal profile.
# Provide your credentials to the runtime
from google.colab import auth
auth.authenticate_user()
Loading Colab Settings
You can customize your dataframe displays, use cell magic, make bash calls, and more with a simple set of API calls.
# Colab includes the google.colab.data_table package that can be used
# to display large pandas dataframes as an interactive data table.
%load_ext google.colab.data_table
# If you want to return to classic dataframe view
%unload_ext google.colab.data_table
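Beyond the data table extension, a few other header settings come up in almost every notebook. The snippet below is a sketch of what we typically add: pandas display options, an example bash call, and the global project ID used by the BigQuery snippets later on (the project ID shown is a placeholder; swap in your own).
# Widen pandas dataframe displays.
import pandas as pd
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 200)
# Bash calls work by prefixing a command with "!".
!pip list | grep pandas
# Global variables used throughout the notebook.
# Placeholder value; replace with your own project ID.
GOOGLE_PROJECT_ID = 'my-google-project-id'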
Installing nbextensions
Notebook extensions, or nbextensions, are a helpful addition to Jupyter Notebooks, and we can add much of their functionality to Colab notebooks as well. This is particularly useful when trying to render HTML or interactive elements.
!jupyter contrib nbextension install --user
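Note that this command assumes the jupyter_contrib_nbextensions package is already present, which is not the case on a fresh Colab instance (see the StackOverflow thread in the resources below). A safe sequence is to install the package first:
# Install the nbextensions package, then register the extensions.
!pip install jupyter_contrib_nbextensions
!jupyter contrib nbextension install --user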
Loading Data
Inputting and outputting data is one of the most common tasks for a data scientist. However, setting up the connections to do so can be time-consuming. In our Quickstart Notebook we’ve included snippets for the following sources and destinations:
- From BigQuery to a dataframe
- From a dataframe to BigQuery
- From a dataframe to Google Sheets
- From a local directory to Colab
- From Colab to a local directory
- From Colab to Google Storage
See the notebook, External data: Local Files, Drive, Sheets, and Cloud Storage, for more examples of how to ingest and send data from Google Colab.
From BigQuery to a dataframe
In March 2023, the BigQuery team updated the BigQuery Client API to allow users to output a query directly as a dataframe. This removes the need to use pandas-gbq (a personal favorite for many years!).
from google.cloud import bigquery
import pandas as pd
# GOOGLE_PROJECT_ID is a global variable declared in the notebook header
client = bigquery.Client(project=GOOGLE_PROJECT_ID)
sql = """
SELECT name, SUM(number) as count
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
ORDER BY count DESC
LIMIT 10
"""
# Call the API
df_names = client.query(sql).to_dataframe()
# View the first 5 rows of the dataframe
df_names.head()
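As a lighter-weight alternative, the same client library ships a %%bigquery cell magic that writes query results straight into a dataframe. Below is a sketch; the project ID is a placeholder, and the magic must be the first line of its own cell.
%load_ext google.cloud.bigquery
Then, in a separate cell:
%%bigquery df_names --project my-google-project-id
SELECT name, SUM(number) as count
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
ORDER BY count DESC
LIMIT 10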
From a dataframe to BigQuery
It’s often helpful to load a dataframe back to BigQuery. For instance, we may want to record the outputs of a model back to our database. We can use the Google Cloud API to do so.
With the schema parameter of `bigquery.LoadJobConfig`, we can give specifics about each field in our schema, including its name, field type (e.g. STRING, FLOAT), mode (e.g. NULLABLE, REQUIRED), and description.
# Get (or create) a dataframe
# This dataset is available as a sample dataset in Google Colab
df = pd.read_csv('/content/sample_data/california_housing_test.csv')
# Specify table_id to the ID of the table to create.
table_id = "my-project-id.test.california_housing_test"
job_config = bigquery.LoadJobConfig(
    # Specify a (partial) schema. All columns are always written to the
    # table. The schema is used to assist in data type definitions.
    schema=[
        # Specify the type of columns whose type cannot be auto-detected.
        # Field definitions will be added to the BigQuery console.
        bigquery.SchemaField(
            "households",
            bigquery.enums.SqlTypeNames.INTEGER,
            description="The number of households in the geographic area"
        ),
        bigquery.SchemaField(
            "median_income",
            bigquery.enums.SqlTypeNames.FLOAT,
            description="The log of the median income"
        ),
    ],
    # Optionally, set the write disposition. BigQuery appends loaded rows
    # to an existing table by default, but with WRITE_TRUNCATE write
    # disposition it replaces the table with the loaded data.
    write_disposition="WRITE_TRUNCATE",
)
# Load dataframe to BigQuery
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result() # Wait for the job to complete.
# Get BigQuery table object
table = client.get_table(table_id)
print("Loaded {} rows and {} columns to {}".format(
table.num_rows, len(table.schema), table_id
)
)
Once the job finishes, the field descriptions we’ve provided appear alongside the table schema in the BigQuery console.
From a dataframe to Google Sheets
Loading data to Google Sheets can be helpful, especially within a business context. Many business stakeholders are comfortable doing analysis in Google Sheets (or Excel) but aren’t comfortable with Python.
We can bridge this technical gap by completing a portion of an analysis in Python and loading the relevant results into Google Sheets for the business stakeholders.
import gspread
from google.auth import default
creds, _ = default()
# Authorize gspread
gc = gspread.authorize(creds)
First, we get the spreadsheet object using the gspread API. We can do this via the spreadsheet ID, URL, or name. See each example below.
# Open a Google Sheets spreadsheet by ID
spreadsheet_id = '1VlIdq73MUBPSJdVcpZMGD1Dj-VuksY_CNH3r5DIe0cs'
spreadsheet = gc.open_by_key(spreadsheet_id)
# Open a Google Sheets spreadsheet by URL
spreadsheet_url = "https://docs.google.com/spreadsheets/d/{}".format(spreadsheet_id)
spreadsheet = gc.open_by_url(spreadsheet_url)
# Open a Google Sheets spreadsheet by name
spreadsheet = gc.open('Liquidator Lookout: product map features')
Next, we select a specific worksheet (or tab) in the spreadsheet. We can do this via its index or name. From there, we use get_all_values() to get the data out of the worksheet and load it to a dataframe.
# Get the Worksheet by its tab index in the Spreadsheet (zero-based)
worksheet = spreadsheet.get_worksheet(3)
# Get the Worksheet by name
# You can combine gc.open() with the line below to load the Worksheet directly.
worksheet = spreadsheet.worksheet('product_information')
# get_all_values gives a list of rows
rows = worksheet.get_all_values()
# Load rows (a list of lists) to a dataframe
df_products = pd.DataFrame(rows[1:], columns=rows[0])
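The snippets above pull data out of a Sheet. To go the other direction, as described at the top of this section, we can push a dataframe back to a worksheet with worksheet.update(). Below is a minimal sketch that reuses the worksheet object from above; gspread expects plain, JSON-serializable values, so cast to strings if your dataframe holds numpy types.
# Write the dataframe back to the worksheet, header row first.
values = [df_products.columns.tolist()] + df_products.astype(str).values.tolist()
worksheet.update(values=values, range_name='A1')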
From a local directory to Colab
Loading data from your local directory is simple using google.colab.files. This generates a pop-up that allows you to browse and select the files you’d like to upload.
from google.colab import files
# Plugin to load local files
uploaded = files.upload()
Once you upload a file or dataset, it will be available in the /content/ folder of your notebook instance.
From Colab to a local directory
We can then do the inverse — loading a dataset from Google Colab back to our local computer.
from google.colab import files
# Download the file to your local computer
files.download('/content/sample_data/california_housing_test.csv')
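Though not in the list above, the External data notebook we referenced also covers Google Drive, and mounting Drive from Colab is a single call worth keeping in the header:
from google.colab import drive
# Mount your Google Drive; files appear under /content/drive/MyDrive.
drive.mount('/content/drive')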
From Colab to Google Storage
Google Storage is a useful tool for data scientists, as it allows us to save unstructured data (e.g. text files, images, audio files) on a cloud server. This saves space on your computer and eases collaboration through a centralized data source.
There are two main ways we can load data to Google Storage — with gsutil or the Google Python API. Regardless of the method, we need to connect to Google Storage and create a bucket.
# Authenticate the user.
from google.colab import auth
auth.authenticate_user()
# Specify your Google Project ID.
project_id = 'my-google-project-id'
# Create a unique bucket name.
# The bucket name must be globally unique across all of Google Cloud Storage.
# Use uuid to create a universally unique ID.
import uuid
bucket_name = 'colab-sample-bucket-' + str(uuid.uuid1())
print(bucket_name)
From there, we can run the following set of commands to upload or download files using gsutil.
# Configure gsutil to use the project we specified above by using gcloud.
!gcloud config set project {project_id}
# Make a bucket to which we'll upload the file
!gsutil mb gs://{bucket_name}
# Copy the file to our new bucket
!gsutil cp /content/sample_data/mnist_test.csv gs://{bucket_name}/
# Dump the contents of our newly copied file to make sure everything worked
!gsutil cat gs://{bucket_name}/mnist_test.csv
And we can download files from GCS with the following command.
# Copy file from GCS to local directory
!gsutil cp gs://{bucket_name}/mnist_test.csv /content/mnist_test.csv
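The commands above cover the gsutil route. The Google Cloud Storage Python client offers the same operations from code; below is a minimal sketch that reuses the project_id and bucket_name variables defined earlier.
from google.cloud import storage
# Create a client for the project specified above.
storage_client = storage.Client(project=project_id)
# Get a handle to the bucket created above.
bucket = storage_client.bucket(bucket_name)
# Upload a local file to the bucket ...
blob = bucket.blob('mnist_test.csv')
blob.upload_from_filename('/content/sample_data/mnist_test.csv')
# ... and download it back to the Colab instance.
blob.download_to_filename('/content/mnist_test_copy.csv')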
You can see how to load data to or from GCS in the Quickstart Notebook.
Profiling Data
Once we get our data into our notebook, we need to get a high-level understanding of the data and account for any missing, inaccurate, or outlying values. This is where data profiling comes in.
Generally, we’re against a ‘one size fits all’ approach to EDA. However, these libraries are very useful for getting a baseline understanding of your data.
- pandas_profiling
- sweetviz
- AutoViz
In the Quickstart Notebook, we provide simple examples of data profiling for each of these libraries, along with several resources on data profiling.
pandas_profiling
# Install pandas-profiling library
!pip install -U pandas-profiling
# Install Jupyter's nbextension
# Allows other extensions to be added to this notebook
!jupyter nbextension enable --py widgetsnbextension
# Import libraries
import pandas as pd
from pandas_profiling import ProfileReport
# Run the profiling report
# (df_ohlc is assumed to be a dataframe of SPY OHLC price data loaded earlier in the notebook)
profile = ProfileReport(
    df_ohlc,
    title="SPY Data",
    html={'style': {'full_width': True}}
)
# Save the report to an HTML file in the Colab instance
profile.to_file(output_file="pandas_profiling.html")
# Display the pandas_profiling output
# You may not be able to view the pandas_profiling report inline,
# so use this code to render it (or any HTML file).
import IPython
IPython.display.HTML(filename='pandas_profiling.html')
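If writing and re-reading an HTML file feels clunky, pandas-profiling can also render the report inline; whether this displays correctly depends on your notebook environment.
# Render the report inline without writing an HTML file.
profile.to_notebook_iframe()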
sweetviz
Sweetviz enables data scientists and analysts to effortlessly generate comprehensive reports that provide insights into the structure, distributions, and relationships within their datasets.
# Install sweetviz library
!pip install sweetviz
# Import the sweetviz library
import sweetviz as sv
my_report = sv.analyze(df_ohlc)
my_report.show_html('sweetviz_report.html')
# Display the SweetViz output
import IPython
IPython.display.HTML(filename='sweetviz_report.html')
You can also use Sweetviz to compare two datasets. Here, we’re comparing two stock tickers. However, this is particularly useful if you want to compare training and test datasets.
# Get stock OHLC data using pandas-datareader
import pandas_datareader.data as web
df_aapl = web.DataReader('AAPL', 'stooq', start='01/01/2023', end='06/01/2023')
df_amzn = web.DataReader('AMZN', 'stooq', start='01/01/2023', end='06/01/2023')
import pandas as pd
import numpy as np
import sweetviz as sv
# Specify the two datasets and the reporting name,
# along with the target variable (optional).
my_report = sv.compare([df_aapl, 'Apple'],[df_amzn, 'Amazon'], 'Close')
my_report.show_html(filepath='sweetviz_comparison_report.html')
# Test/Train Example:
# my_report = sv.compare([df_train, 'Train'],[df_test, 'Test'], 'target_var')
import IPython
IPython.display.HTML(filename='/content/sweetviz_comparison_report.html')
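AutoViz
AutoViz, the third library in our list, generates a battery of charts from a single call. The snippet below is a sketch based on its documented entry point, again assuming the df_ohlc dataframe used above.
# Install and import AutoViz
!pip install autoviz
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
# Pass the dataframe directly via dfte; filename is left empty.
report = AV.AutoViz(filename="", dfte=df_ohlc, verbose=1)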
Visualizing Data
One of the main challenges with plotting in Jupyter is that customizing charts takes a LOT of code and time. Creating interactive charts is even more difficult.
We’ve focused on Pareto efficiency here: we include Colab’s chart suggestion feature as our go-to plotting solution. That said, there are several good data visualization notebooks for Colab built on other libraries:
- Data Visualization.ipynb by QuantUniversity
- Introduction to data visualization in Python live session_full.ipynb by DataCamp
Colab Chart Suggestions
As mentioned, Colab Chart Suggestions is a quick way to build charts. It uses matplotlib and seaborn (imported as sns) under the hood but abstracts away many of the more complex parts of those libraries.
The user flow for Colab Chart Suggestions is:
- Create a Pandas dataframe
- Click the ‘Chart Suggestions’ icon
- Click the chart you’d like to use
- Click ‘Accept’ to create a new code block that contains a function to output the plot
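Colab writes the generated plotting code into a new cell for you. For comparison, producing a similar chart by hand with seaborn looks roughly like the sketch below, using the df_names dataframe from the BigQuery example purely as an illustration (this is not the exact code Colab emits).
import matplotlib.pyplot as plt
import seaborn as sns
# A hand-written bar chart of the top baby names pulled from BigQuery earlier.
fig, ax = plt.subplots(figsize=(8, 4))
sns.barplot(data=df_names, x='name', y='count', ax=ax)
ax.set_title('Most common names, 1910-present')
plt.show()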
Resources (and Conclusion)
We use the modules above to quickly start data science projects. However, we use several other templates depending on the methods we use in the analysis. Generally, using templated code will allow your analytics team to start and finish their work faster.
On the shoulders of giants …
In both our Quickstart Notebook and this article, we’ve referenced several other notebooks and articles. We encourage you to review them as well. We’ve included some of our favorites below.
Importing & Exporting Data
Profiling Data
- Google Colab | Exploratory Data Analysis
- Medium: 3 Tools for Fast Data Profiling
- StackOverflow | Installing jupyter-contrib-nbextension on Google Colab
- StackOverflow | Pandas Profiling error within google colab
- Kaggle | EDA Using sweetviz and autoviz
Visualizing Data
Useful Notebooks
- Google Colab Notebook Library
- Welcome To Colaboratory.ipynb
- colab-tutorial.ipynb. The Python tutorial from Stanford’s CS231n class.
- Advanced Python Tutorial.ipynb
- bqml-online-prediction.ipynb. BigQueryML + Vertex AI are used for churn prediction from Google Analytics event-level data.