cells:
- code: |
%matplotlib inline
metadata:
slideshow:
slide_type: skip
- markdown: |
# Data Analysis: Simple Statistics with `numpy`
### Prabhu Ramachandran
### The FOSSEE Python group &
### Department of Aerospace Engineering
### IIT Bombay
metadata:
slideshow:
slide_type: slide
- markdown: |
## Introduction
- Already exposed to `numpy`
- Provides convenient statistical functions
metadata:
slideshow:
slide_type: slide
- markdown: |
- `np.mean`, `np.std`, etc.
- `np.random.random` etc.
- `np.percentile`
metadata:
slideshow:
slide_type: fragment
- markdown: |
## Simple NumPy functions
- `mean`, `std`, `var`, `median`
metadata:
slideshow:
slide_type: slide
- code: |
import numpy as np
data = [1.0, 4.5, 2.3, -0.5, 0.5, 2.8]
metadata:
slideshow:
slide_type: fragment
- code: |
np.mean(data)
- code: |
np.median(data)
- code: |
np.std(data)
- code: |
np.var(data)
- markdown: |
## Degrees of freedom?
- Sample standard deviation: $S^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1}$
metadata:
slideshow:
slide_type: slide
- markdown: |
- Use the `ddof` keyword argument (defaults to zero)
- Denominator is `n - ddof`
metadata:
slideshow:
slide_type: fragment
- code: |
# ddof defaults to zero
np.std(data, ddof=1)
- markdown: |
## Multi-dimensional data
- What if you have a multi-dimensional array?
metadata:
slideshow:
slide_type: slide
- code: |
md = np.arange(16)
md.shape = 4, 4
- code: |
md
- code: |
np.mean(md)
metadata:
slideshow:
slide_type: slide
- code: |
np.std(md)
- code: |
np.mean(md, axis=0)
- code: |
np.mean(md, axis=1)
- markdown: |
## Not-a-Number: `NaN`
- Part of the number system: `np.nan`
- Like `inf`: `np.inf`
- `nan`: Used to denote missing values in data
metadata:
slideshow:
slide_type: slide
- code: |
np.nan + 1
metadata:
slideshow:
slide_type: fragment
- code: |
data = [1.0, 2.1, np.nan, 3.0]
metadata:
slideshow:
slide_type: fragment
- code: |
np.mean(data), np.std(data)
- markdown: |
## Dealing with Nans?
- Use `np.nanmean, np.nanmedian, np.nanstd` etc.
metadata:
slideshow:
slide_type: slide
- code: |
np.nanmean(data)
- code: |
np.nanstd(data)
- code: |
# Try `np.nan`
np.nan
- markdown: |
## Pseudo Random Numbers
- `np.random.*`
metadata:
slideshow:
slide_type: slide
- code: |
data = np.random.random(5)
metadata:
slideshow:
slide_type: fragment
- code: |
x = np.random.random((3, 3))
x.shape
- code: |
# randint(low, high, size)
np.random.randint(-5, 10, size=5)
- code: |
# loc: mean, scale: std-dev
np.random.normal(loc=0.0, scale=1.0, size=5)
- markdown: |
- `size` keyword argument to specify shape
- markdown: |
## Other distributions
- Many univariate distributions
- A few multi-variate distributions
- Draw samples from these distributions
metadata:
slideshow:
slide_type: slide
- code: |
np.random?
- markdown: |
## Some exploration of the random variables
- Let us plot a few of these distributions
metadata:
slideshow:
slide_type: slide
- code: |
data = np.random.normal(size=1000)
- code: |
from matplotlib import pyplot as plt
plt.hist(data);
- code: |
data = np.random.normal(size=20)
metadata:
slideshow:
slide_type: slide
- code: |
plt.hist(data);
- code: |
plt.hist(data, bins=6);
- code: |
data = np.random.normal(size=10000)
metadata:
slideshow:
slide_type: slide
- code: |
plt.hist(data, normed=True);
- code: |
plt.hist(data, cumulative=True);
metadata:
slideshow:
slide_type: slide
- code: |
plt.hist(data, bins=50, normed=True, cumulative=True);
- code: |
data = np.random.poisson(lam=0.5, size=10000)
ax1 = plt.subplot(1, 2, 1)
plt.ylabel('PDF')
ax1.hist(data, normed=True)
ax2 = plt.subplot(1, 2, 2)
ax2.hist(data, cumulative=True, normed=True);
metadata:
slideshow:
slide_type: slide
- markdown: |
## Subplots
- `plt.subplot(nrows, ncols, plot_number)`
- `plot_number` starts from 1
- Axes returned can be used like `plt`
metadata:
slideshow:
slide_type: slide
- code: |
for i in range(1, 5):
ax = plt.subplot(2, 2, i)
ax.text(0.0, 0.5, 'plot number %d' % i)
metadata:
slideshow:
slide_type: slide
- markdown: |
## $\chi^2$ distributions
metadata:
slideshow:
slide_type: slide
- code: |
data = np.random.chisquare(7, size=10000)
ax1 = plt.subplot(1, 2, 1)
ax1.hist(data, bins=50, normed=True)
ax2 = plt.subplot(1, 2, 2)
ax2.hist(data, bins=50, normed=True, cumulative=True);
- markdown: |
## Repeatable random numbers
- `np.random.xxx` gives different results each time
- Use `np.random.seed` to make this deterministic
metadata:
slideshow:
slide_type: slide
- code: |
np.random.seed(27)
metadata:
slideshow:
slide_type: fragment
- code: |
np.random.random()
- code: |
np.random.seed(27)
np.random.random()
- markdown: |
## Computing percentiles
- Use `np.percentile`
- Or `np.nanpercentile`
metadata:
slideshow:
slide_type: slide
- code: |
data = np.random.normal(loc=10, scale=2, size=1000)
np.percentile(data, 50)
metadata:
slideshow:
slide_type: fragment
- code: |
np.median(data)
- code: |
np.percentile(data, [25, 50, 75])
- markdown: |
## Some useful tools
- For computational work
- `np.random.shuffle`
- `np.random.permutation`
- `np.random.choice`
metadata:
slideshow:
slide_type: slide
- code: |
data = np.random.randint(0, 100, size=5)
metadata:
slideshow:
slide_type: fragment
- code: |
np.random.shuffle(data)
- code: |
np.random.permutation(10)
- code: |
np.random.permutation(data)
- code: |
data = np.random.permutation(5)
metadata:
slideshow:
slide_type: slide
- code: |
np.choice(data)
metadata:
slideshow:
slide_type: fragment
- code: |
np.random.choice(data, size=5)
metadata:
slideshow:
slide_type: fragment
- code: |
np.random.choice(data, size=10)
metadata:
slideshow:
slide_type: fragment
- code: |
np.random.choice(data, size=4, replace=False)
metadata:
slideshow:
slide_type: slide
- code: |
# Won't work!
np.random.choice(data, size=10, replace=False)
- markdown: |
## Summary
- Basic `numpy` stats functions
- Random number generators
- Plotting histograms and subplots
- Odds and ends
metadata:
slideshow:
slide_type: slide