A Jupyter Notebook to Start (And Finish) Data Science Projects Faster

Gordon Silvera
9 min read · Sep 13, 2023


“Time-to-delivery” is one of the most important yet unsung metrics in data science. It’s tricky to track, but thinking about it conceptually can help us identify ways to (a) help data scientists work faster and (b) allow business stakeholders to get the information they need faster.

There are many ways you can reduce your analytics team’s time to insight. Today, we’ll focus on reducing the time to complete simple data science tasks.

The way we’ve approached it is by creating a Google Colab Quickstart Notebook. It provides a simple notebook setup and code snippets for common data science tasks. Specifically, we focus on:

  • Notebook Setup
  • Importing Data
  • Profiling Data
  • Visualizing Data
  • Exporting Data
  • Resources

We’re writing this post because the notebook provides a set of modular code snippets that will help data scientists work faster.

Setting Up The Notebook

Below are code snippets often used in the header of a Google Colab notebook. You’ll want to import libraries and declare global variables here as well.

Authenticating Google Colab

This snippet connects the Colab instance to your Google profile, giving the runtime the same permissions as your personal account.

# Provide your credentials to the runtime
from google.colab import auth
auth.authenticate_user()

Loading Colab Settings

You can customize your dataframe displays, use cell magic, make bash calls, and more with a few simple commands.

# Colab includes the google.colab.data_table package that can be used 
# to display large pandas dataframes as an interactive data table.
%load_ext google.colab.data_table

# If you want to return to classic dataframe view
%unload_ext google.colab.data_table
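Beyond the interactive data table, you can also tune pandas’ own display settings and run shell commands from any cell. A minimal sketch (the option values below are just examples):

import pandas as pd

# Widen pandas' default dataframe display (values are illustrative)
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:,.2f}'.format)

# Bash calls work from any cell by prefixing the command with "!"
!pip list | grep pandas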

Installing nbextensions

Notebook extensions, or nbextensions, are a helpful addition to Jupyter Notebooks, and we can add much of their functionality to Colab notebooks as well. This is particularly useful when trying to visualize HTML or interactive elements.

!pip install jupyter_contrib_nbextensions
!jupyter contrib nbextension install --user

Loading Data

Inputting and outputting data is one of the most common tasks for a data scientist. However, setting up the connections to do so can be time-consuming. In our Quickstart Notebook we’ve included snippets for the following sources and destinations:

  • From BigQuery to a dataframe
  • From a dataframe to BigQuery
  • From a dataframe to Google Sheets
  • From a local directory to Colab
  • From Colab to a local directory
  • From Colab to Google Storage

See the notebook, External data: Local Files, Drive, Sheets, and Cloud Storage, for more examples of how to ingest and send data from Google Colab.

From BigQuery to a dataframe

In March 2023, the BigQuery team updated the BigQuery Client API to allow users to output a query directly as a dataframe. This removes the need to use pandas-gbq (a personal favorite for many years!).

from google.cloud import bigquery
import pandas as pd

# GOOGLE_PROJECT_ID is a global variable declared in the notebook header
client = bigquery.Client(project=GOOGLE_PROJECT_ID)

sql = """
SELECT name, SUM(number) as count
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
ORDER BY count DESC
LIMIT 10
"""

# Call the API
df_names = client.query(sql).to_dataframe()

# View the first 5 rows of the dataframe
df_names.head()
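As a lighter-weight alternative, the google.cloud.bigquery library also ships a cell magic that runs a query and stores the result directly in a dataframe. A minimal sketch, run in its own cell first:

# Load the BigQuery cell magic (it ships with google.cloud.bigquery)
%load_ext google.cloud.bigquery

Then, in a separate cell (cell magics must be on the first line), run the query and store the result in a dataframe named df_names. The project ID below is a placeholder for your own project.

%%bigquery df_names --project my-google-project-id
SELECT name, SUM(number) AS count
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
ORDER BY count DESC
LIMIT 10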

From a dataframe to BigQuery

It’s often helpful to load a dataframe back to BigQuery. For instance, we may want to record the outputs of a model back to our database. We can use the Google Cloud API to do so.

With the schema parameter of `bigquery.LoadJobConfig`, we can provide specifics for each field: its name, field type (e.g. STRING, FLOAT), mode (e.g. NULLABLE, REQUIRED), and description.

# Get (or create) a dataframe
# This dataset is available as a sample dataset in Google Colab
df = pd.read_csv('/content/sample_data/california_housing_test.csv')

# Set table_id to the ID of the table to create.
table_id = "my-project-id.test.california_housing_test"

job_config = bigquery.LoadJobConfig(
    # Specify a (partial) schema. All columns are always written to the
    # table. The schema is used to assist in data type definitions.
    schema=[
        # Specify the type of columns whose type cannot be auto-detected.
        # Field descriptions will be added to the BigQuery console.
        bigquery.SchemaField(
            "households",
            bigquery.enums.SqlTypeNames.INTEGER,
            description="The number of households in the geographic area"
        ),
        bigquery.SchemaField(
            "median_income",
            bigquery.enums.SqlTypeNames.FLOAT,
            description="The log of the median income"
        ),
    ],
    # Optionally, set the write disposition. BigQuery appends loaded rows
    # to an existing table by default, but with WRITE_TRUNCATE write
    # disposition it replaces the table with the loaded data.
    write_disposition="WRITE_TRUNCATE",
)

# Load the dataframe to BigQuery
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()  # Wait for the job to complete.

# Get the BigQuery table object
table = client.get_table(table_id)
print("Loaded {} rows and {} columns to {}".format(
    table.num_rows, len(table.schema), table_id
))

As you’ll see below, the field descriptions we’ve provided are now in the BigQuery console.

Field descriptions are stored in the BigQuery console via the google.cloud.bigquery API.
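You can also confirm the descriptions without leaving the notebook; a quick check using the same BigQuery client from above:

# Print each field's name, type, and description for the loaded table
for field in client.get_table(table_id).schema:
    print(field.name, field.field_type, field.description)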

From a dataframe to Google Sheets

Loading data to Google Sheets can be helpful, especially within a business context. Many business stakeholders are comfortable doing analysis in Google Sheets (or Excel) but aren’t comfortable with Python.

We can easily bridge this technical gap by completing a portion of an analysis in Python and loading the relevant results in Google Sheets for the business stakeholders.

import gspread
from google.auth import default
creds, _ = default()

# Authorize gspread
gc = gspread.authorize(creds)

First, we get the spreadsheet object using the gspread API. We can do this via the spreadsheet ID, URL, or name. See each example below.

# Open a Google Sheets spreadsheet by ID
spreadsheet_id = '1VlIdq73MUBPSJdVcpZMGD1Dj-VuksY_CNH3r5DIe0cs'
spreadsheet = gc.open_by_key(spreadsheet_id)

# Open a Google Sheets spreadsheet by URL
spreadsheet_url = "https://docs.google.com/spreadsheets/d/{}".format(spreadsheet_id)
spreadsheet = gc.open_by_url(spreadsheet_url)

# Get the Google Sheets spreadsheet by name
spreadsheet = gc.open('Liquidator Lookout: product map features')

Next, we select a specific worksheet (or tab) in the spreadsheet. We can do this via its index or name. From there, we use get_all_values() to get the data out of the worksheet and load it to a dataframe.

# Get the Worksheet by Spreadsheet tab index
worksheet = spreadsheet.get_worksheet(3)

# Get the Worksheet by name
# You can combine gc.open() with the line below to load the Worksheet directly.
worksheet = spreadsheet.worksheet('product_information')

# get_all_values gives a list of rows
rows = worksheet.get_all_values()

# Load rows (a list of lists) to a dataframe
df_products = pd.DataFrame(rows[1:], columns=rows[0])
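Since this section is about sending a dataframe to Sheets, here is the write-back step as well: a minimal sketch that pushes a dataframe into a worksheet with gspread. The tab name 'output' is hypothetical, and the call assumes a reasonably recent gspread version.

# Select the tab that should hold the results (the 'output' tab is hypothetical)
worksheet_out = spreadsheet.worksheet('output')

# Clear the tab, then write the header row followed by the data rows
worksheet_out.clear()
worksheet_out.update([df_products.columns.tolist()] + df_products.values.tolist())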

From a local directory to Colab

Loading data from your local directory into Colab is simple using google.colab.files. This generates a pop-up that allows you to browse and select the files you’d like to upload.

from google.colab import files

# Plugin to load local files
uploaded = files.upload()

Once you load a file or dataset, it will be available in the content/ folder in your notebook instance (see below).

Find your loaded data in the Files tab in the /content folder.

From Colab to a local directory

We can then do the inverse — loading a dataset from Google Colab back to our local computer.

from google.colab import files

# Download the file to your local computer
files.download('/content/sample_data/california_housing_test.csv')

From Colab to Google Storage

Google Storage is a useful tool for data scientists, as it allows us to save unstructured data (e.g. text files, images, audio files) on a cloud server. This saves space on your computer and eases collaboration through a centralized data source.

There are two main ways we can load data to Google Storage: with gsutil or with the Google Cloud Python client (both are shown below). Regardless of the method, we need to connect to Google Storage and create a bucket.

# Authenticate the user.
from google.colab import auth
auth.authenticate_user()

# Specify your Google Project ID.
project_id = 'my-google-project-id'

# Create a unique bucket name.
# The bucket name must be globally unique across all of Google Cloud Storage.
# Use uuid to create a universally unique ID.
import uuid
bucket_name = 'colab-sample-bucket-' + str(uuid.uuid1())
print(bucket_name)

From there, we can run the following set of commands to upload or download files using gsutil.

# Configure gsutil to use the project we specified above by using gcloud.
!gcloud config set project {project_id}

# Make a bucket to which we'll upload the file
!gsutil mb gs://{bucket_name}

# Copy the file to our new bucket
!gsutil cp /content/sample_data/mnist_test.csv gs://{bucket_name}/

# Dump the contents of our newly copied file to make sure everything worked
!gsutil cat gs://{bucket_name}/mnist_test.csv

And we can download files from GCS with the following command.

# Copy file from GCS to local directory
!gsutil cp gs://{bucket_name}/mnist_test.csv /content/mnist_test.csv
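The Python client covers the same operations without shelling out to gsutil. A minimal sketch using google.cloud.storage, reusing the project_id and bucket_name defined above:

from google.cloud import storage

# Connect to the project and the bucket created above
storage_client = storage.Client(project=project_id)
bucket = storage_client.bucket(bucket_name)

# Upload the sample file to the bucket
blob = bucket.blob('mnist_test.csv')
blob.upload_from_filename('/content/sample_data/mnist_test.csv')

# Download it back to the Colab filesystem
blob.download_to_filename('/content/mnist_test_copy.csv')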

You can see how to load data to or from GCS in the Quickstart Notebook.

Profiling Data

Once we get our data into our notebook, we need to get a high-level understanding of the data and account for any missing, inaccurate, or outlying values. This is where data profiling comes in.

Generally, I’m against a ‘one size fits all’ approach to EDA. However, these libraries are very useful when getting a base understanding of your data.

  • pandas_profiling
  • sweetviz
  • AutoViz

In the Quickstart Notebook, we provide simple examples of data profiling for each of these libraries, along with several resources on data profiling.

pandas_profiling

# Install pandas-profiling library
!pip install -U pandas-profiling

# Install Jupyter's nbextension
# Allows other extensions to be added to this notebook
!jupyter nbextension enable --py widgetsnbextension

# Import libraries
import pandas as pd
from pandas_profiling import ProfileReport

# Run the profiling report
# df_ohlc is a dataframe of SPY OHLC prices loaded earlier in the notebook
profile = ProfileReport(
    df_ohlc,
    title="SPY Data",
    html={'style': {'full_width': True}}
)

# Load to colab file
profile.to_file(output_file="pandas_profiling.html")

# Display the pandas_profiling output
# You may not be able to view the pandas_profiling output inline.
# Therefore, use this code to view it (or any HTML file).
import IPython
IPython.display.HTML(filename='pandas_profiling.html')

sweetviz

Sweetviz enables data scientists and analysts to effortlessly generate comprehensive reports that provide insights into the structure, distributions, and relationships within their datasets.

# Install sweetviz library
!pip install sweetviz

# Import the sweetviz library
import sweetviz as sv

my_report = sv.analyze(df_ohlc)
my_report.show_html('sweetviz_report.html')

# Display the SweetViz output
import IPython
IPython.display.HTML(filename='sweetviz_report.html')

You can also use Sweetviz to compare two datasets. Here, we’re comparing two stock tickers. However, this is particularly useful if you want to compare training and test datasets.

# Get stock OHLC data using pandas_datareader
from pandas_datareader import data as web
df_aapl = web.DataReader('AAPL', 'stooq', start='01/01/2023', end='06/01/2023')
df_amzn = web.DataReader('AMZN', 'stooq', start='01/01/2023', end='06/01/2023')

import pandas as pd
import numpy as np
import sweetviz as sv

# Specify the two datasets and the reporting name,
# along with the target variable (optional).
my_report = sv.compare([df_aapl, 'Apple'],[df_amzn, 'Amazon'], 'Close')
my_report.show_html(filepath='sweetviz_comparison_report.html')

# Test/Train Example:
# my_report = sv.compare([df_train, 'Train'],[df_test, 'Test'], 'target_var')

import IPython
IPython.display.HTML(filename='/content/sweetviz_comparison_report.html')
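AutoViz

The Quickstart Notebook also covers AutoViz, the third library in the list above. A minimal sketch, assuming the df_ohlc dataframe from the pandas_profiling example and default column limits:

# Install and import AutoViz
!pip install autoviz

from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()

# Profile the dataframe; pass an empty filename since the data is already in memory
report = AV.AutoViz(
    filename="",
    dfte=df_ohlc,
    depVar="",
    verbose=1,
    chart_format="png"
)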

Visualizing Data

One of the main challenges with plotting in Jupyter is that customizing charts takes a LOT of code and time. Creating interactive charts is even more difficult.

We’ve focused on Pareto efficiency here: we include Colab’s chart suggestion feature as our go-to plotting solution. However, Colab also ships several good data visualization notebooks covering other libraries.

Colab Chart Suggestions

As mentioned, Colab Chart Suggestions is a quick way to build charts. It uses matplotlib and Seaborn under the hood but abstracts away many of the more complex parts of using these libraries.

The user flow for Colab Chart Suggestions is:

  1. Create a Pandas dataframe
  2. Click the ‘Chart Suggestions’ icon
  3. Click the chart you’d like to use
  4. Click ‘Accept’ to create a new code block that contains a function to output the plot
Google Colab makes it fast and easy to plot charts in your Jupyter Notebook
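The generated code is ordinary matplotlib/seaborn, so you can edit it afterwards. A representative sketch of the kind of helper Colab produces, using the df_names dataframe from the BigQuery example (the exact function Colab emits will differ):

from matplotlib import pyplot as plt
import seaborn as sns

def plot_name_counts(df):
    """Bar chart of the top names, similar to what Chart Suggestions generates."""
    fig, ax = plt.subplots(figsize=(10, 5))
    sns.barplot(data=df, x='name', y='count', ax=ax)
    ax.set_title('Top 10 names, 1910 to present')
    plt.show()

plot_name_counts(df_names)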

Resources (and Conclusion)

We use the modules above to quickly start data science projects. However, we use several other templates depending on the methods we use in the analysis. Generally, using templated code will allow your analytics team to start and finish their work faster.

On the shoulders of giants …

In both our Quickstart Notebook and this article, we’ve referenced several other notebooks and articles. We encourage you to review them as well. We’ve included some of our favorites below.

Importing & Exporting Data

Profiling Data

Visualizing Data

Useful Notebooks

Written by Gordon Silvera

We help startups and scaleups become data-driven. Get a data scientist on-demand, or advice on analytical data stacks. See more at www.thedatastrategist.com.
