The Top 10 Python Functions used by Data Scientists
We analyzed 200K+ IPython notebooks to find out.
“The Most Python” report attempts to provide a simple view of how Python is used for data science. We did this by analyzing 200K+ IPython (aka Jupyter) notebooks and 2M+ StackOverflow questions, which resulted in:
- A list of the most-used Python functions in IPython notebooks
- For each function, the most common questions from StackOverflow
- For each function, code samples from the most popular notebooks on Github
See the full report here or check out the top 10 functions below.
Methodology
Github makes code from open-source-licensed repositories available in BigQuery’s public datasets. We filtered this data to “.ipynb” files and, after much regex and data manipulation, parsed it into notebook cells, functions, and libraries. From this, we have a (mostly) clean connection from Github repositories → file paths → files (i.e. notebooks) → notebook cells → code, functions, or libraries. See the code here.
Top 10 Most-Used Python Functions
We’ve collected a list of the top 10 most-used Python functions in Jupyter notebooks. This is based on 200K+ open-source Jupyter notebooks on Github. Then we cross-referenced this data with 2M+ StackOverflow questions to identify the most common questions about the most common Python functions.
10) np.zeros
Produces a Numpy array of zeros. This is particularly useful when creating vectors in TensorFlow or for other machine learning applications. See more in the Numpy docs or see how np.zeros and TensorFlow are used in sentiment analysis, with code from a Udacity Deep Learning course.
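As a minimal sketch of typical usage (the variable names are illustrative, not taken from the analyzed notebooks): the shape is passed as a single value or tuple, and an optional dtype follows. Passing the dimensions as separate arguments makes NumPy read the second one as a data type, which is a likely source of the "data type not understood" question below.

```python
import numpy as np

# A 1-D vector of five zeros; the default dtype is float64
vec = np.zeros(5)

# A 2-D array: the shape goes in as one tuple, the dtype is optional
grid = np.zeros((2, 3), dtype=int)

print(vec)
print(grid.shape)
```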
StackOverflow’s Most Common Questions
- My data type is not understood when using np.zeros.
- How do I delete a row in a numpy array which contains a zero?
9) float
Converts a string or integer to a floating-point number. Float is used in 9.5% of notebooks and is used 19 times per notebook on average. See more about float() at w3 schools.
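A short sketch of the conversions described above, with illustrative values. It also touches on the two questions listed below: limiting a float to two decimal places and checking whether a value is a float.

```python
# Convert a string and an integer to floats
price = float("3.14159")
count = float(7)

# Two ways to limit to two decimals: numeric rounding vs. text formatting
rounded = round(price, 2)
label = "{:.2f}".format(price)

# Check whether a value is a float
is_float = isinstance(price, float)

print(price, count, rounded, label, is_float)
```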
StackOverflow’s Most Common Questions
- How to limit floats to two decimal points.
- How to check if a number is a float.
8) __init__
When we create a Python object by instantiating (calling) a class, __init__ initializes the data stored in the object. From there, we can apply methods (aka functions) to transform the data in our object. See more about __init__ in the docs from GeeksforGeeks.
The example below shows how we use __init__ in a simple Python class.
# Sample class with init method
class Person:
    # init method or constructor
    def __init__(self, name):
        self.name = name

    # user-defined method
    def say_hello(self):
        print('Hello, my name is', self.name)

p = Person('Roger')
p.say_hello()
StackOverflow’s Most Common Questions
- How to fix “Attempted relative import in non-package” even with __init__.py.
- How to return a value from __init__ in Python?
7) np.array
Numpy’s array function outputs an n-dimensional array, based on the inputs specified. Numpy arrays are often used in machine learning, as they allow for smaller data storage and faster processing [i] than Python data objects.
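A brief sketch of np.array with made-up values. It builds an array from nested lists, applies an element-wise operation, and shows one possible approach to two of the questions listed below (concatenating one-dimensional arrays and finding the indices of the N largest values).

```python
import numpy as np

# Build a 2 x 3 array from nested Python lists
matrix = np.array([[1, 2, 3], [4, 5, 6]])

# Arithmetic applies element-wise, with no explicit loop
doubled = matrix * 2

# Concatenate two one-dimensional arrays
a = np.array([1, 2])
b = np.array([3, 4])
joined = np.concatenate([a, b])

# Indices of the 2 largest values, largest first
values = np.array([10, 50, 30, 40])
top2 = np.argsort(values)[-2:][::-1]

print(matrix.shape, joined, top2)
```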
StackOverflow’s Most Common Questions
- How do I convert a PIL Image into a NumPy array?
- How do I get indices of N maximum values in a NumPy array?
- Concatenating two one-dimensional NumPy arrays.
6) format
The format function provides a simple way to print strings dynamically. See examples below or in “The Most Python” report. Or see more about the format function from w3 schools.
import numpy as np
x, y = np.full(4, 1.0), np.full(4, 2.0)
print("{} + {} = {}".format(x, y, x + y))
Here the data scientist is using .format() to build filepath references and display the first image within each subfolder.
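import os
from IPython.display import Image, display
# Show the first image in each letter's subfolder
root = 'notMNIST_small'
for letter in ['A', 'B', 'C', 'D', 'E', 'F']:
    first_file = os.listdir('{}/{}'.format(root, letter))[0]
    display(Image('{}/{}/{}'.format(root, letter, first_file)))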
StackOverflow’s Most Common Questions
- How do I print curly-brace characters in a string while using .format?
- How to print a string at a fixed width?
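The two questions above can be sketched with str.format's standard syntax; the strings here are illustrative. Doubled braces produce a literal brace, and a format spec after the colon sets alignment and width.

```python
# Doubled braces are emitted literally; the inner {} is still a placeholder
literal = "{{{}}}".format("key")

# Left-align text in a field 10 characters wide
padded = "{:<10}|".format("abc")

# Right-align a number in 6 characters with two decimals
right = "{:>6.2f}".format(3.14159)

print(literal)
print(padded)
print(right)
```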
5) int
The function int() converts an input into an integer. This input can be a string, number, or bytes object. It’s used in 15% of notebooks. Read more from Programiz.com.
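A minimal sketch of the conversions int() supports, using illustrative values. It also shows one way to handle the questions listed below: converting a list of strings and producing a hex string.

```python
# Strings, floats, and bytes can all be converted
from_string = int("42")
from_float = int(3.99)      # truncates toward zero, does not round
from_bytes = int(b"42")

# An optional second argument gives the base of a string input
from_hex = int("ff", 16)

# Convert every string in a list
numbers = [int(s) for s in ["1", "2", "3"]]

# Going the other way: an int to a hex string
hex_string = hex(255)

print(from_string, from_float, from_bytes, from_hex, numbers, hex_string)
```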
StackOverflow’s Most Common Questions
- Convert all strings in a list to int.
- How can I read inputs as numbers?
- How to convert an int to a hex string?
4) str
The function str() converts inputs into a string. It is found in 14% of notebooks and is used an average of 25 times per notebook. The example below converts a set of dates into strings to be included in filenames.
for row, item in publications.iterrows():
    md_filename = str(item.pub_date) + "-" + item.url_slug + ".md"
    html_filename = str(item.pub_date) + "-" + item.url_slug
    year = item.pub_date[:4]
StackOverflow’s Most Common Questions
3) range
range() returns a sequence of evenly spaced integers (in Python 3, a lazy range object rather than a list). It takes the inputs start, stop, and step, where:
- Start is the first number in the sequence, 0 by default
- Stop is the number at which the sequence ends; the stop value itself is not included
- Step is the amount added at each step, 1 by default
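The three inputs above can be sketched as follows. Wrapping range in list() is only there to make the generated numbers visible.

```python
# Only stop given: start defaults to 0, step to 1
first_five = list(range(5))

# start, stop, step: the stop value (10) is excluded
evens = list(range(0, 10, 2))

# A negative step counts down
countdown = list(range(5, 0, -1))

print(first_five)
print(evens)
print(countdown)
```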
range() is commonly used in for loops, which is partly why it’s found in 36% of notebooks. Here’s an application of range in a for loop:
# Yield (x, y) batches of batch_size, dropping any incomplete final batch
def get_batches(x, y, batch_size=100):
    n_batches = len(x) // batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
StackOverflow’s Most Common Questions
2) len
len() returns the length of a string, list, dataframe, or any other sized Python object. Unsurprisingly, it is commonly used in IPython notebooks, as data scientists regularly check the size of the data they’re transforming. 38% of notebooks use len().
See an example below, where we use len() to filter for non-empty values in a list, and use it again to print the size of the output list.
# Get all index values with non-empty reviews
non_zero_idx = [
    ii for ii, review in enumerate(reviews_ints) if len(review) != 0]
# Print the number of non-empty reviews
print(len(non_zero_idx))
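For a self-contained view of what len() reports on different object types, here is a small sketch with made-up data. Note that for a multi-dimensional NumPy array, len() gives the size of the first dimension only, not the total number of elements.

```python
import numpy as np

word_len = len("notebook")          # number of characters
list_len = len([1, 2, 3])           # number of items
dict_len = len({"a": 1, "b": 2})    # number of keys

arr = np.zeros((4, 3))
rows = len(arr)      # size of the first dimension
total = arr.size     # total number of elements

print(word_len, list_len, dict_len, rows, total)
```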
1) print
No other function is used as often as print() in IPython notebooks. This makes sense, as data scientists output information at the end of nearly every cell. One-third of notebooks use print() and it’s used 31 times per notebook on average.
import numpy as np
x, y = np.full(4, 1.0), np.full(4, 2.0)
print("{} + {} = {}".format(x, y, x + y))
# Print using list comprehension
print([x + y for x, y in zip([1.0] * 4, [2.0] * 4)])
In the following example, the data scientist uses print to provide status updates during a data load.
# Note: tensorflow.examples.tutorials is a TensorFlow 1.x module
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

print('Getting MNIST Dataset...')
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
print('Data Extracted.')
StackOverflow’s Most Common Questions
View “The Most Python” report on Python Functions
Check out “The Most Python” report to see 100+ of the most-used Python functions, along with common questions and code samples for each.