4 statistical processes that every data scientist should know

Gordon Silvera
6 min read · Jul 7, 2018

The depth and variety of skills that fit under the analytics umbrella are extensive. Different roles — such as strategic analysts, digital analysts, data scientists, data engineers — require distinct skillsets and varying levels of technical expertise. However, a handful of statistical processes are so common that every analyst should be acquainted with them. Further, it’s beneficial to know how to code these in at least one programming language (or if not, in Excel).

Below are 4 of the most common and versatile statistical methods used in business, along with examples and educational sources.

1. T-Tests (aka A/B Testing)

Overview. A/B testing is likely the most commonly used statistical tool in digital marketing and software development. A t-test is a statistical process that determines whether two normally distributed groups (or “samples”) are significantly different from one another. Student’s t-test measures differences in volume metrics (e.g. daily revenue, delivery time, customer demand), whereas binomial tests measure differences in conversion metrics or binary outcomes (e.g. click-through rate, win percentages, product malfunction rates). The true power of t-tests lies not in the math behind them, but in the strategic value a business gains by continually running A/B tests to push products or processes to ever-improving levels.
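
Both flavors can be run in a few lines of Python. Here is a minimal sketch using SciPy and synthetic data (the sample sizes, means, and conversion counts are hypothetical): a two-sample t-test for a volume metric, and a chi-squared test on a 2×2 contingency table as a stand-in for the binomial/proportions comparison of conversion rates.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Volume metric (e.g. daily revenue): two-sample t-test
revenue_a = rng.normal(loc=100, scale=15, size=200)  # control
revenue_b = rng.normal(loc=105, scale=15, size=200)  # variant
t_stat, p_value = stats.ttest_ind(revenue_a, revenue_b)
print(f"t-test p-value: {p_value:.4f}")

# Conversion metric (e.g. click-through rate): compare two proportions
# via a chi-squared test on the 2x2 table of (converted, not converted)
conversions = np.array([[120, 880],   # variant A
                        [150, 850]])  # variant B
chi2, p_conv, dof, _ = stats.chi2_contingency(conversions)
print(f"conversion-test p-value: {p_conv:.4f}")
```

If the p-value falls below your chosen threshold (commonly 0.05), the difference between the two variants is considered statistically significant.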

Example. On-site optimization software such as Optimizely uses T-tests to determine whether a new version of a webpage outperforms its incumbent. For a fun, interactive example of A/B Testing, check out Which Test Won.

Where to Learn It

  • Duke’s Coursera course, Inferential Statistics, taught by Dr. Mine Cetinkaya-Rundel, is an excellent introduction to statistical methods.
  • A useful resource to conduct your own t-tests is getdatadriven.com’s significance calculator.

2. Linear Regressions

Overview. This statistical process predicts the value of a particular outcome variable based on a number of input variables, and attaches a level of confidence to that prediction. When using a simple form of linear regression, both the model’s predictions and its coefficients are valuable. Coefficients of linear models can generally be interpreted as “all else equal, a unit increase in the explanatory variable leads to a Y change in the outcome variable.” The relationships between the outcome variable and the explanatory variables (quantified by these coefficients) are central to marketing mix modeling, stock analysis, and many other business applications.

Example. Predicting Hotel Occupancy. A hotel operator wants to determine how many rooms it will sell on a given day, 2 or more months in advance. The analyst can use historical data to train a model that predicts the number of rooms sold on a given day (the dependent or outcome variable). The historical data (i.e. the independent or input variables) may include:

  • Days until the reservation date
  • Number of rooms sold to date
  • Number of rooms sold “same time last year” or week
  • The property’s price relative to its competitive set
  • Citywide occupancy levels
  • The day of the week (this would be a seasonality variable)
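
The hotel model above can be sketched with scikit-learn. Everything here is synthetic and hypothetical: the variable names mirror the bullet list, and the “true” coefficients are made up so the model has something to recover. The point is the interpretation loop at the end: each fitted coefficient reads as the expected change in rooms sold per unit change in that input, all else equal.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500

# Hypothetical inputs, one row per (stay date, booking snapshot)
days_out = rng.integers(60, 180, size=n)        # days until the reservation date
sold_to_date = rng.integers(0, 80, size=n)      # rooms already sold
sold_last_year = rng.integers(50, 150, size=n)  # rooms sold same time last year
rel_price = rng.normal(1.0, 0.1, size=n)        # price vs. competitive set

# Synthetic ground truth: rooms ultimately sold on the stay date
rooms_sold = (0.9 * sold_to_date + 0.5 * sold_last_year
              - 0.2 * days_out - 40 * (rel_price - 1)
              + rng.normal(0, 5, size=n))

X = np.column_stack([days_out, sold_to_date, sold_last_year, rel_price])
model = LinearRegression().fit(X, rooms_sold)

# "All else equal" interpretation of each coefficient
for name, coef in zip(["days_out", "sold_to_date", "sold_last_year", "rel_price"],
                      model.coef_):
    print(f"{name:>15}: {coef:+.2f}")
```

In practice the analyst would train on real historical bookings and then call `model.predict` on upcoming stay dates to forecast demand.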

Where to Learn It

  • Coursera’s Data Science Specialization from Johns Hopkins has a helpful course called Regression Models. It walks students through the conceptual components of regression models, as well as teaches students how to build models in R.
  • Coursera’s Strategic Analytics Specialization — taught by ESSEC Business School and Accenture — is an excellent way to learn how statistics can be applied in a variety of business situations. In Week 3 of the course, Foundations of Strategic Business Analytics, students learn how to build linear regressions using R.
  • Coursera’s Machine Learning course from Stanford provides in depth lessons on regression modeling and other advanced statistical processes. Week 2 focuses on linear regressions. Note, however, that this course uses Octave (or Matlab) for programming, a language that isn’t commonly used in corporate analytics. Also, this course is less geared towards business than the other courses mentioned in this article.

3. Logistic Regressions

Overview. Logistic regressions are conceptually similar to linear regressions; however, this model predicts the likelihood that a certain occurrence will happen (or, said differently, the likelihood of a success among a sample of trials). These “events” or “successes” could include:

  • A customer converting (or purchasing) after visiting a website
  • A product being improperly manufactured on a factory line
  • A salesperson closing a sale
  • A baseball player hitting when at bat

Because outcome variables in these models are binomially distributed (rather than normally distributed as in linear regressions), the underlying math and interpretations of the coefficients are slightly more challenging. Nonetheless, logistic regression models can provide value in a myriad of business applications.

Example. Predicting Customer Purchases. By connecting click stream, email, and POS data, an e-commerce business can predict whether a user will make an online purchase within a particular time period. In this scenario, analysts can use a logistic regression to determine the likelihood any user will make a purchase on their site within the next month. From there, they can predict users’ purchase likelihood on an ongoing basis (a process called “scoring”), allowing the business to better target users for promotions and marketing.
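
A minimal sketch of that scoring workflow, again with scikit-learn and fully synthetic data (the features and their effect sizes are hypothetical stand-ins for real clickstream, email, and POS signals):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 1000

# Hypothetical per-user features built from clickstream/email/POS data
sessions_30d = rng.poisson(3, size=n)        # site visits, last 30 days
emails_opened = rng.integers(0, 10, size=n)  # marketing emails opened
past_purchases = rng.poisson(1, size=n)      # lifetime purchase count

# Synthetic labels: purchased within the next month (1) or not (0)
logit = -3 + 0.4 * sessions_30d + 0.2 * emails_opened + 0.6 * past_purchases
purchased = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([sessions_30d, emails_opened, past_purchases])
model = LogisticRegression().fit(X, purchased)

# "Scoring": predicted purchase probability for a new user
new_user = [[5, 4, 2]]  # 5 sessions, 4 emails opened, 2 past purchases
score = model.predict_proba(new_user)[0, 1]
print(f"purchase likelihood: {score:.1%}")
```

Rerunning `predict_proba` over the whole user base on a schedule gives the ongoing scores the marketing team can segment and target against.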

Where to Learn It

  • Week 4 of Coursera’s Regression Models course, by Johns Hopkins, reviews Logistic Regression models in R.
  • edX’s Predictive Analytics course, available through IIMB, teaches linear and logistic regression using SPSS and SAS. (Disclaimer: I have not taken this course.)
  • Week 3 of Stanford’s Machine Learning course focuses on logistic regressions.

4. Cluster Modeling

Overview. Clustering is a process that allows an analyst to group items based on collective similarities across a number of variables. The end result of this process is a pre-specified number of groups (or clusters) that have distinct properties. The two most common statistical models that perform clustering are hierarchical clustering and k-means clustering.

One of the interesting differences with clustering (vis-a-vis the other processes we’ve discussed) is that there is a considerable amount of “art” to this process. Depending on the business goals, analysts decide how many clusters to form and what variables to input. Therefore, the resulting clusters can be very different even if the same data set is used. Also, the analyst must describe/characterize clusters based on how each cluster’s aggregate metrics differ from other clusters.

Example. Understanding the types of shopping trips customers make. For this approach, analysts would calculate a variety of metrics, grouped by each trip a customer takes (i.e. summing metrics by trip_id). Some of these metrics may include:

  • Total spend
  • Number of products purchased
  • Distinct number of categories purchased
  • Number of items purchased by category (e.g. Beverages, Canned Goods, Frozen Foods, Meat)

After running the clustering model, some (hypothetical) clusters may include the following groups. Note that these are sample descriptions; the model output would contain each trip_id with its corresponding cluster number (1 to 5). The analyst would then aggregate the input metrics by cluster number to decide what to call each cluster.

  1. Grab and Go: a trip with very few items from “impulse purchase” categories (e.g. alcohol, candy, personal care)
  2. Pantry Reload: a high-value trip that contains several categories
  3. Bare Necessities: a small basket that includes staple categories (e.g. dairy, bread, fruit)
  4. Dinner for One: a mid-sized basket with an over-index in frozen goods (or other staple products for bachelors)
  5. Special Occasion: a trip (regardless of the basket size) that contains purchases from non-conventional categories (e.g. furniture, toys)
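
The full loop — cluster, then characterize — can be sketched with k-means in scikit-learn. The trip metrics here are randomly generated placeholders, and the choice of 5 clusters simply mirrors the hypothetical trip types above; with real data the analyst would experiment with both the inputs and the cluster count.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Hypothetical per-trip metrics (one row per trip_id):
# total spend, number of products, distinct categories purchased
trips = np.column_stack([
    rng.gamma(shape=2, scale=30, size=2000),  # total spend ($)
    rng.poisson(8, size=2000),                # number of products
    rng.integers(1, 12, size=2000),           # distinct categories
])

# Scale first so spend (in dollars) doesn't dominate the distance metric
X = StandardScaler().fit_transform(trips)

# The analyst chooses the number of clusters; 5 matches the trip types above
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Characterize each cluster by its average raw metrics, then name it
for k in range(5):
    mask = kmeans.labels_ == k
    avg_spend, avg_items, avg_cats = trips[mask].mean(axis=0)
    print(f"cluster {k}: n={mask.sum():4d}  "
          f"spend=${avg_spend:.0f}  items={avg_items:.1f}  cats={avg_cats:.1f}")
```

The printed per-cluster averages are what the analyst studies to assign labels like “Pantry Reload” or “Grab and Go” — the “art” half of the process the overview describes.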

Where to Learn It

  • Week 1 of Strategic Business Analytics shows how clustering can be performed in R, as well as how it can be applied in business situations.
  • Week 8 of Stanford’s Machine Learning course provides instruction on cluster analysis in Octave. This is a more technically in-depth lesson than what is taught in the Strategic Business Analytics course.


Gordon Silvera

We help startups and scaleups become data-driven. Get a data scientist on-demand, or advice on analytical data stacks. See more at www.thedatastrategist.com.