cells: - code: | %matplotlib inline metadata: slideshow: slide_type: skip - markdown: | # Data Analysis: Simple Statistics with `numpy` ### Prabhu Ramachandran ### The FOSSEE Python group & ### Department of Aerospace Engineering ### IIT Bombay metadata: slideshow: slide_type: slide - markdown: | ## Introduction - Already exposed to `numpy` - Provides convenient statistical functions
metadata: slideshow: slide_type: slide - markdown: | - `np.mean`, `np.std`, etc. - `np.random.random` etc. - `np.percentile` metadata: slideshow: slide_type: fragment - markdown: | ## Simple NumPy functions - `mean`, `std`, `var`, `median` metadata: slideshow: slide_type: slide - code: | import numpy as np data = [1.0, 4.5, 2.3, -0.5, 0.5, 2.8] metadata: slideshow: slide_type: fragment - code: | np.mean(data) - code: | np.median(data) - code: | np.std(data) - code: | np.var(data) - markdown: | ## Degrees of freedom? - Sample variance: $S^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1}$ (the sample standard deviation is $S$) metadata: slideshow: slide_type: slide - markdown: | - Use the `ddof` keyword argument (defaults to zero) - Denominator is `n - ddof` metadata: slideshow: slide_type: fragment - code: | # ddof defaults to zero np.std(data, ddof=1) - markdown: | ## Multi-dimensional data - What if you have a multi-dimensional array? metadata: slideshow: slide_type: slide - code: | md = np.arange(16) md.shape = 4, 4 - code: | md - code: | np.mean(md) metadata: slideshow: slide_type: slide - code: | np.std(md) - code: | np.mean(md, axis=0) - code: | np.mean(md, axis=1) - markdown: | ## Not-a-Number: `NaN` - Part of the floating-point number system: `np.nan` - Like `inf`: `np.inf` - `nan`: Used to denote missing values in data metadata: slideshow: slide_type: slide - code: | np.nan + 1 metadata: slideshow: slide_type: fragment - code: | data = [1.0, 2.1, np.nan, 3.0] metadata: slideshow: slide_type: fragment - code: | np.mean(data), np.std(data) - markdown: | ## Dealing with NaNs? - Use `np.nanmean`, `np.nanmedian`, `np.nanstd` etc. 
metadata: slideshow: slide_type: slide - code: | np.nanmean(data) - code: | np.nanstd(data) - code: | # Try `np.nan` np.nan - markdown: | ## Pseudo Random Numbers - `np.random.*` metadata: slideshow: slide_type: slide - code: | data = np.random.random(5) metadata: slideshow: slide_type: fragment - code: | x = np.random.random((3, 3)) x.shape - code: | # randint(low, high, size) np.random.randint(-5, 10, size=5) - code: | # loc: mean, scale: std-dev np.random.normal(loc=0.0, scale=1.0, size=5) - markdown: | - `size` keyword argument to specify shape - markdown: | ## Other distributions - Many univariate distributions - A few multivariate distributions - Draw samples from these distributions metadata: slideshow: slide_type: slide - code: | np.random? - markdown: | ## Some exploration of the random variables - Let us plot a few of these distributions metadata: slideshow: slide_type: slide - code: | data = np.random.normal(size=1000) - code: | from matplotlib import pyplot as plt plt.hist(data); - code: | data = np.random.normal(size=20) metadata: slideshow: slide_type: slide - code: | plt.hist(data); - code: | plt.hist(data, bins=6); - code: | data = np.random.normal(size=10000) metadata: slideshow: slide_type: slide - code: | plt.hist(data, density=True); - code: | plt.hist(data, cumulative=True); metadata: slideshow: slide_type: slide - code: | plt.hist(data, bins=50, density=True, cumulative=True); - code: | data = np.random.poisson(lam=0.5, size=10000) ax1 = plt.subplot(1, 2, 1) plt.ylabel('PDF') ax1.hist(data, density=True) ax2 = plt.subplot(1, 2, 2) ax2.hist(data, cumulative=True, density=True); metadata: slideshow: slide_type: slide - markdown: | ## Subplots - `plt.subplot(nrows, ncols, plot_number)` - `plot_number` starts from 1 - Axes returned can be used like `plt` metadata: slideshow: slide_type: slide - code: | for i in range(1, 5): ax = plt.subplot(2, 2, i) ax.text(0.0, 0.5, 'plot number %d' % i) metadata: slideshow: slide_type: slide - markdown: | ## 
$\chi^2$ distributions metadata: slideshow: slide_type: slide - code: | data = np.random.chisquare(7, size=10000) ax1 = plt.subplot(1, 2, 1) ax1.hist(data, bins=50, density=True) ax2 = plt.subplot(1, 2, 2) ax2.hist(data, bins=50, density=True, cumulative=True); - markdown: | ## Repeatable random numbers - `np.random.xxx` gives different results each time - Use `np.random.seed` to make this deterministic metadata: slideshow: slide_type: slide - code: | np.random.seed(27) metadata: slideshow: slide_type: fragment - code: | np.random.random() - code: | np.random.seed(27) np.random.random() - markdown: | ## Computing percentiles - Use `np.percentile` - Or `np.nanpercentile` metadata: slideshow: slide_type: slide - code: | data = np.random.normal(loc=10, scale=2, size=1000) np.percentile(data, 50) metadata: slideshow: slide_type: fragment - code: | np.median(data) - code: | np.percentile(data, [25, 50, 75]) - markdown: | ## Some useful tools - For computational work - `np.random.shuffle` - `np.random.permutation` - `np.random.choice` metadata: slideshow: slide_type: slide - code: | data = np.random.randint(0, 100, size=5) metadata: slideshow: slide_type: fragment - code: | np.random.shuffle(data) - code: | np.random.permutation(10) - code: | np.random.permutation(data) - code: | data = np.random.permutation(5) metadata: slideshow: slide_type: slide - code: | np.random.choice(data) metadata: slideshow: slide_type: fragment - code: | np.random.choice(data, size=5) metadata: slideshow: slide_type: fragment - code: | np.random.choice(data, size=10) metadata: slideshow: slide_type: fragment - code: | np.random.choice(data, size=4, replace=False) metadata: slideshow: slide_type: slide - code: | # Won't work! Cannot draw 10 items from 5 without replacement np.random.choice(data, size=10, replace=False) - markdown: | ## Summary - Basic `numpy` stats functions - Random number generators - Plotting histograms and subplots - Odds and ends metadata: slideshow: slide_type: slide
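- markdown: |
## Recap sketch

The summary items can be tied together in one short script. This is a minimal sketch, not part of the original slides. It assumes NumPy 1.17 or later and uses `np.random.default_rng`, which these slides do not cover, as a locally seeded alternative to `np.random.seed`. The names `rng`, `rng2`, `with_gap` and `same` are illustrative only.

```python
import numpy as np

# Assumption: NumPy >= 1.17, which provides np.random.default_rng.
# A Generator carries its own seed instead of the global np.random state.
rng = np.random.default_rng(27)
data = rng.normal(loc=10, scale=2, size=1000)

# Population vs. sample standard deviation: the denominator is n - ddof.
pop_std = np.std(data)             # ddof=0, denominator n
sample_std = np.std(data, ddof=1)  # ddof=1, denominator n - 1

# The 50th percentile is the median.
q25, q50, q75 = np.percentile(data, [25, 50, 75])

# A single NaN poisons np.mean; the nan* variants skip missing values.
with_gap = data.copy()
with_gap[0] = np.nan
plain_mean = np.mean(with_gap)
clean_mean = np.nanmean(with_gap)

# Re-creating the generator with the same seed repeats the same draws.
rng2 = np.random.default_rng(27)
same = rng2.normal(loc=10, scale=2, size=1000)

print("sample std > population std:", sample_std > pop_std)
print("median equals 50th percentile:", np.isclose(q50, np.median(data)))
print("mean with NaN is NaN:", np.isnan(plain_mean))
print("same seed, same draws:", np.array_equal(data, same))
```

Keeping the seed on a generator object rather than in global state means that other code calling `np.random.*` does not disturb the sequence.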