The Top 10 Python Functions used by Data Scientists

We analyzed 200K+ IPython notebooks to find out.

Gordon Silvera
5 min read · Aug 24, 2022

“The Most Python” report attempts to provide a simple view of how Python is used for data science. We did this by analyzing 200K+ IPython (aka Jupyter) notebooks and 2M+ StackOverflow questions, which resulted in:

  1. A list of the most-used Python functions in IPython notebooks
  2. For each function, the most common questions from StackOverflow
  3. For each function, code samples from the most popular notebooks on Github

See the full report here or check out the top 10 functions below.

See the full, interactive “The Most Python” report.

Methodology

Github makes code from open-source-licensed repositories available in BigQuery’s public datasets. We filtered this data to “.ipynb” files and, after much regex and data manipulation, we parsed the data into notebook cells, functions, and libraries. From this, we have a (mostly) clean connection from Github repositories → file paths → files (i.e. notebooks) → notebook cells → code, functions, or libraries. See the code here.
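To give a feel for the parsing step, here is a minimal sketch of how function calls could be counted in a single notebook. It is not the code behind the report: the sample notebook, the CALL_PATTERN regex, and the count_function_calls helper are all hypothetical, and a simple regex like this will also pick up method calls and calls that appear inside comments or strings.

```python
import json
import re
from collections import Counter

# Hypothetical, minimal notebook in the .ipynb JSON layout (illustration only)
notebook_json = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Demo"]},
        {"cell_type": "code", "source": ["import numpy as np\n", "x = np.zeros(3)\n"]},
        {"cell_type": "code", "source": ["print(len(x))"]},
    ]
})

# Matches a (possibly dotted) name followed by "(", e.g. "np.zeros(" or "print("
CALL_PATTERN = re.compile(r"([A-Za-z_][\w.]*)\s*\(")

def count_function_calls(raw_notebook):
    """Count function-call names across the code cells of one notebook."""
    counts = Counter()
    for cell in json.loads(raw_notebook).get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # "source" may be stored as a list of lines or as a single string
        code = "".join(source) if isinstance(source, list) else source
        counts.update(CALL_PATTERN.findall(code))
    return counts

call_counts = count_function_calls(notebook_json)
print(call_counts.most_common())
```

Summing these counters across many notebooks would give a ranking similar in spirit to the one below.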

Top 10 Most-Used Python Functions

We’ve collected a list of the top 10 most-used Python functions in Jupyter notebooks. This is based on 200K+ open-source Jupyter notebooks on Github. Then we cross-referenced this data with 2M+ StackOverflow questions to identify the most common questions about the most common Python functions.

10) np.zeros

Produces a Numpy array of zeros. This is particularly useful when creating vectors in TensorFlow or for other machine learning applications. See more in the Numpy docs or see how np.zeros and TensorFlow are used in sentiment analysis, with code from a Udacity Deep Learning course.
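As a quick illustration (our own sketch, not taken from the notebooks we analyzed), np.zeros takes a shape and an optional dtype:

```python
import numpy as np

# A 1-D vector of three zeros; the default dtype is float64
zeros_1d = np.zeros(3)

# A 2x4 matrix of integer zeros; the shape is passed as a tuple
zeros_2d = np.zeros((2, 4), dtype=np.int32)

print(zeros_1d.shape, zeros_1d.dtype)
print(zeros_2d.shape, zeros_2d.dtype)
```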

StackOverflow’s Most Common Questions

9) float

Converts a string or integer to a floating-point number. float() appears in 9.5% of notebooks and is used 19 times per notebook on average. See more about float() at W3Schools.

StackOverflow’s Most Common Questions

  • How to limit floats to two decimal points.
  • How to check if a number is a float.
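Both questions have short answers. The snippet below is our own sketch with made-up values; it shows the conversion itself, two ways to limit a float to two decimal places, and an isinstance check.

```python
# Convert a string and an integer to floats
price = float("3.14159")
count = float(7)

# Limit to two decimal places: round() returns a number, format() returns a string
rounded = round(price, 2)
label = "{:.2f}".format(price)

# Check whether a value is a float
price_is_float = isinstance(price, float)
seven_is_float = isinstance(7, float)

print(rounded, label, price_is_float, seven_is_float)
```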

8) __init__

When we create a Python object by instantiating a class, __init__ initializes the data stored in the object. From there, we can apply methods (functions that belong to the class) to work with the data in our object. See more about __init__ in the docs from GeeksforGeeks.

The example below shows how we use __init__ in a simple Python class.

# Sample class with init method
class Person:

    # init method or constructor
    def __init__(self, name):
        self.name = name

    # user-defined method
    def say_hello(self):
        print('Hello, my name is', self.name)

p = Person('Roger')
p.say_hello()

StackOverflow’s Most Common Questions

7) np.array

Numpy’s array function outputs an n-dimensional array, based on the inputs specified. Numpy arrays are often used in machine learning, as they allow for smaller data storage and faster processing [i] than Python data objects.
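Here is a small sketch of our own (the values are made up) showing np.array building a 2-D array from nested lists, followed by two vectorized operations:

```python
import numpy as np

# Build a 2x3 array from a nested Python list
matrix = np.array([[1, 2, 3], [4, 5, 6]])

# Arithmetic applies element-wise, with no explicit loop
doubled = matrix * 2

# Sum down each column (axis 0)
col_sums = matrix.sum(axis=0)

print(matrix.shape, matrix.ndim)
print(doubled)
print(col_sums)
```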

StackOverflow’s Most Common Questions

6) format

The format function provides a simple way to print strings dynamically. See examples below or in “The Most Python” report. Or see more about the format function from w3 schools.

import numpy as np
x, y = np.full(4, 1.0), np.full(4, 2.0)
# Call .format() on the string itself, then print the result
print("{} + {} = {}".format(x, y, x + y))

Here the data scientist is using .format() to build filepath references and display the first image within each subfolder.

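import os
from IPython.display import Image, display  # assumed imports, not shown in the source notebook

root = 'notMNIST_small'
for letter in ['A', 'B', 'C', 'D', 'E', 'F']:
    path = os.listdir('{}/{}'.format(root, letter))[0]  # first file in the subfolder
    display(Image('{}/{}/{}'.format(root, letter, path)))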

StackOverflow’s Most Common Questions

5) int

The int() function converts an input into an integer. This input can be a string, number, or bytes object. It’s used in 15% of notebooks. Read more from Programiz.com.
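A few conversions, as a sketch of our own with made-up values. Note that int() truncates floats toward zero rather than rounding, and accepts an optional base when parsing strings.

```python
# From a string of digits
from_string = int("42")

# From floats: the fractional part is dropped (truncation toward zero)
from_float = int(3.99)
from_negative = int(-3.99)

# From a string in another base (here, binary)
from_binary = int("1010", 2)

# From a bytes object containing digits
from_bytes = int(b"17")

print(from_string, from_float, from_negative, from_binary, from_bytes)
```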

StackOverflow’s Most Common Questions

4) str

The str() function converts inputs into a string. It is found in 14% of notebooks and is used an average of 25 times per notebook. The example below converts a set of dates into strings to be included in filenames.

for row, item in publications.iterrows():
    md_filename = str(item.pub_date) + "-" + item.url_slug + ".md"
    html_filename = str(item.pub_date) + "-" + item.url_slug
    year = item.pub_date[:4]

StackOverflow’s Most Common Questions

3) range

range() returns a sequence of evenly spaced integers (in Python 3, a lazy range object rather than a list). It takes the inputs start, stop, and step, where:

  • Start is the first number in the sequence, 0 by default
  • Stop is the number at which the sequence ends; stop itself is not included
  • Step is the amount added at each step, 1 by default

range() is commonly used in for loops, which is partly why it’s found in 36% of notebooks. Here’s an application of range in a for loop:

# Yield (x, y) batches of batch_size items, dropping any incomplete final batch
def get_batches(x, y, batch_size=100):
    n_batches = len(x) // batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]

    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
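To see how start, stop, and step interact, here is a short sketch of our own. Wrapping range() in list() makes the generated numbers visible.

```python
# Only stop given: starts at 0 and ends before 5
first_five = list(range(5))

# start, stop, and step: begin at 2, add 3 each time, end before 10
every_third = list(range(2, 10, 3))

# A negative step counts down; stop (0) is still excluded
countdown = list(range(5, 0, -1))

print(first_five)
print(every_third)
print(countdown)
```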

StackOverflow’s Most Common Questions

2) len

len() returns the length of a string, list, dataframe, or any other Python object that defines a length. Needless to say, this is commonly used in IPython notebooks, as data scientists regularly check the size of the data they’re transforming. 38% of notebooks use len().

See an example below, where we use len() to filter for non-empty values in a list, and use it again to print summary information about the output list.

# Get all index values with non-empty reviews
non_zero_idx = [
    ii for ii, review in enumerate(reviews_ints) if len(review) != 0]
# Print the number of non-empty reviews
print(len(non_zero_idx))
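len() works on any object that implements the __len__ method, including our own classes. The sketch below is ours: the Batch class is hypothetical and exists only to show the mechanism.

```python
# Built-in containers report their number of items
word_length = len("data")
list_length = len([1, 2, 3])
dict_length = len({"a": 1, "b": 2})

# Hypothetical class: defining __len__ lets len() work on its instances
class Batch:
    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

batch_length = len(Batch([10, 20, 30, 40]))

print(word_length, list_length, dict_length, batch_length)
```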

1) print

No other function is used as often as print() in IPython notebooks. This makes sense, as data scientists output information at the end of nearly every cell. One-third of notebooks use print() and it’s used 31 times per notebook on average.

import numpy as np 
x, y = np.full(4, 1.0), np.full(4, 2.0)
print("{} + {} = {}".format(x, y, x + y))
# Print using list comprehension
print([x + y for x, y in zip([1.0] * 4, [2.0] * 4)])

In the following example, the data scientist uses print to provide status updates during a data load.

# Requires TensorFlow 1.x; the tensorflow.examples.tutorials module is not included in TensorFlow 2.x
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

print('Getting MNIST Dataset...')
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
print('Data Extracted.')

StackOverflow’s Most Common Questions

View “The Most Python” report on Python Functions

Check out “The Most Python” report to see 100+ of the most-used Python functions, along with common questions and code samples for each.

Written by Gordon Silvera

We help startups and scaleups become data-driven. Get a data scientist on-demand, or advice on analytical data stacks. See more at www.thedatastrategist.com.
