Visualization of the 1854 London Cholera Outbreak

This post attempts to visualize the 1854 London Cholera Outbreak using data collected by Dr. John Snow and provided in the HistData R package. By visualizing his data in 1854, Dr. Snow was able to identify cholera as a waterborne disease and bring the outbreak to an end. This dataset and analysis speak to the power of geospatial data and its importance in decision making.

What caused the Challenger disaster?

The motivation for this blog is to examine the reasons behind the explosion of the US Space Shuttle Challenger on 28 January 1986. The night before the launch, a decision had to be made regarding launch safety, and engineers recommended that the launch be postponed if the temperature at launch was below freezing, as this adversely impacted the integrity of the O-rings, a key component in the field joints. The engineers' advice was ignored and disaster ensued. Let's dive in!

Regression in R

My latest publicly available R notebook created in IBM's Data Science Experience is here!  This notebook provides a tutorial on:

Fitting and interpreting linear models;
Evaluating model assumptions; and
Selecting among competing models.

I hope you enjoy this notebook.  Please feel free to share and let me know your thoughts.

My latest notebook: Regression in R h/t — Venky Rao (@VRaoRao) October 15, 2017

Coefficient of Alienation

If you thought the coefficient of alienation referred to the hostility I receive from my family as I update my blog on a Saturday afternoon, I would not fault you too much.  However, this is a blog about predictive analytics which is based on Statistics.  So let's keep that in mind as we understand what the "Coefficient of Alienation" means.

Apart from being one of the coolest sounding Statistical terms, the Coefficient of Alienation measures the proportion of variation in the outcome not “explained” by the variables on the right-hand side of a simple linear regression (ordinary least squares) equation.
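To make the definition above concrete, here is a minimal Python sketch that fits a simple linear regression by ordinary least squares and returns the share of variation in the outcome left unexplained (residual sum of squares divided by total sum of squares, which is the same as one minus R-squared). The function name and the small dataset are made up for illustration. Note that some references use the square root of this quantity instead; the sketch follows the definition given in this post.

```python
def coefficient_of_alienation(x, y):
    """Proportion of variation in y not explained by a simple OLS fit on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Ordinary least squares estimates for slope and intercept
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Unexplained (residual) variation and total variation
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)

    return ss_res / ss_tot  # equals 1 - R^2


# Illustrative data only (not from any real study)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(coefficient_of_alienation(x, y))
```

A value near 0 means the fitted line accounts for almost all of the variation in the outcome; a value near 1 means it accounts for almost none.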

The Coefficient of Alienation is also known as the Coefficient of Non-Determination since the formula for calculating it is:

Coefficient of Alienation = 1 - R^2, where R^2 is the Coefficient of Determination.

And now before my personal (and non-Statistical) Coefficient of Alienation reaches the point of no return, I will bring this post to an end.

Homoscedasticity and heteroscedasticity

Homoscedasticity and heteroscedasticity - two of the scariest sounding terms in all of Statistics!  So what do they mean?

When one calculates the variance or standard deviation of a dataset of random variables, one assumes that the variance is constant across the entire population.  This assumption is homoscedasticity.  The opposite of this assumption is heteroscedasticity.

In other words, a collection of random variables is heteroscedastic if there are sub-populations within the dataset that have different variances from others.  Another way of describing homoscedasticity is constant variance, and another way of describing heteroscedasticity is variable variance.
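As a rough illustration of sub-populations with different variances, the Python sketch below simulates data in which the spread of the outcome grows with the predictor, then compares the variance of the noise in two halves of the data. Everything here is an assumption made for the example: the variable names, the straight-line relationship, the noise model, and the split point are hypothetical and not taken from any real dataset.

```python
import random
import statistics


def simulate_incomes(n=2000, seed=0):
    """Simulate (age, income) pairs whose noise grows with age (hypothetical model)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        age = rng.uniform(20, 60)
        # Noise standard deviation increases with age: heteroscedastic by construction
        noise = rng.gauss(0, 0.5 * (age - 18))
        income = 20 + 1.0 * age + noise
        data.append((age, income))
    return data


def residual_variance_by_group(data, split_age=40):
    """Variance of deviations from the known trend line, for younger and older groups."""
    young = [income - (20 + age) for age, income in data if age < split_age]
    old = [income - (20 + age) for age, income in data if age >= split_age]
    return statistics.pvariance(young), statistics.pvariance(old)


var_young, var_old = residual_variance_by_group(simulate_incomes())
print("variance (age < 40): ", round(var_young, 1))
print("variance (age >= 40):", round(var_old, 1))
```

If the data were homoscedastic, the two variances would be roughly equal. Here the older group shows a much larger variance because the simulation was built that way.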

Jeremy J Taylor in his blog provides a great example of a distribution that is heteroscedastic.  In his example, the independent variable is "age" and the dependent variable is "income".  The example discusses how incomes start to vary more as age …

Standard Deviation versus Absolute Mean Deviation

Among the first things that any student of statistics learns are two popular descriptive statistics: mean and standard deviation.

Has the approach to calculating Standard Deviation ever got you wondering about the need to square the distances from the mean in order to remove negatives instead of just using the average of the absolute values to eliminate negatives?  Well, you are certainly not alone.

As it turns out, squaring the distances from the mean, averaging them, and then taking the square root to arrive at the Standard Deviation of a distribution is more a result of convention than anything else.  In fact, there is a measure called the Absolute Mean Deviation that does not square the distances from the mean to eliminate negative values.  Instead, it just takes the absolute values of the differences from the mean and calculates the average of those values to determine deviation from the mean.
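The two calculations described above can be put side by side in a short Python sketch. The function names and the sample data are made up for illustration, and both functions use the population form (dividing by the number of values); the sample standard deviation would divide by one less than that.

```python
import math


def standard_deviation(values):
    """Square the distances from the mean, average them, then take the square root."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def absolute_mean_deviation(values):
    """Average the absolute distances from the mean (no squaring involved)."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


# Illustrative data only
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_deviation(data))       # 2.0
print(absolute_mean_deviation(data))  # 1.5
```

Both measures remove the negative signs, but squaring gives more weight to values far from the mean, which is why the Standard Deviation comes out larger here.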

The convention of course is to use Standard Deviation in most case…