field | value | date |
---|---|---|
author | Prabhu Ramachandran | 2018-06-09 23:54:51 +0530 |
committer | Prabhu Ramachandran | 2018-06-09 23:54:51 +0530 |
commit | 12e10ca44dc9e977553d0b9b394d41158857dc72 | |
tree | a575471fe6b17e4646c9e9f3e372fcec201085d7 /data_analysis/02_numpy_stats.ipyml | |
parent | c05978bd1d5464988300eead1c1d8af98f732a42 | |
Adding some initial content for data analysis.
Diffstat (limited to 'data_analysis/02_numpy_stats.ipyml')
-rw-r--r-- | data_analysis/02_numpy_stats.ipyml | 441 |
1 files changed, 441 insertions, 0 deletions
diff --git a/data_analysis/02_numpy_stats.ipyml b/data_analysis/02_numpy_stats.ipyml
new file mode 100644
index 0000000..a412271
--- /dev/null
+++ b/data_analysis/02_numpy_stats.ipyml
@@ -0,0 +1,441 @@
+cells:
+
+- code: |
+    %matplotlib inline
+
+  metadata:
+    slideshow:
+      slide_type: skip
+
+- markdown: |
+    # Data Analysis: Simple Statistics with `numpy`
+
+    ### Prabhu Ramachandran
+    ### The FOSSEE Python group &
+    ### Department of Aerospace Engineering
+    ### IIT Bombay
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- markdown: |
+    ## Introduction
+
+    - Already exposed to `numpy`
+    - Provides convenient statistical functions
+
+    <br/>
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- markdown: |
+    - `np.mean`, `np.std`, etc.
+    - `np.random.random` etc.
+    - `np.percentile`
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- markdown: |
+    ## Simple NumPy functions
+
+    - `mean`, `std`, `var`, `median`
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    import numpy as np
+    data = [1.0, 4.5, 2.3, -0.5, 0.5, 2.8]
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.mean(data)
+
+- code: |
+    np.median(data)
+
+- code: |
+    np.std(data)
+
+- code: |
+    np.var(data)
+
+- markdown: |
+    ## Degrees of freedom?
+
+    - Sample variance: $S^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1}$; its square root $S$ is the sample standard deviation
+
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- markdown: |
+    - Use the `ddof` keyword argument (defaults to zero)
+    - Denominator is `n - ddof`
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    # ddof defaults to zero; ddof=1 gives the sample standard deviation
+    np.std(data, ddof=1)
+
+
+- markdown: |
+    ## Multi-dimensional data
+
+    - What if you have a multi-dimensional array?
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    md = np.arange(16)
+    md.shape = 4, 4
+
+- code: |
+    md
+
+- code: |
+    np.mean(md)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    np.std(md)
+
+- code: |
+    np.mean(md, axis=0)
+
+- code: |
+    np.mean(md, axis=1)
+
+- markdown: |
+    ## Not-a-Number: `NaN`
+
+    - Part of the floating-point number system: `np.nan`
+    - Like `inf`: `np.inf`
+    - `nan`: Used to denote missing values in data
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    np.nan + 1
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    data = [1.0, 2.1, np.nan, 3.0]
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.mean(data), np.std(data)
+
+- markdown: |
+    ## Dealing with NaNs?
+
+    - Use `np.nanmean`, `np.nanmedian`, `np.nanstd` etc.
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    np.nanmean(data)
+
+- code: |
+    np.nanstd(data)
+
+- markdown: |
+    - Do `np.nan<TAB>` to see more
+
+
+- markdown: |
+    ## Pseudo Random Numbers
+
+    - `np.random.random` etc.
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    data = np.random.random(5)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    x = np.random.random((3, 3))
+    x.shape
+
+- code: |
+    # randint(low, high, size)
+    np.random.randint(-5, 10, size=5)
+
+- code: |
+    # loc: mean, scale: std-dev
+    np.random.normal(loc=0.0, scale=1.0, size=5)
+
+- markdown: |
+    - `size` keyword argument to specify shape
+
+- markdown: |
+    ## Other distributions
+
+    - Many univariate distributions
+    - A few multi-variate distributions
+    - Draw samples from these distributions
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    np.random?
+
+- markdown: |
+    ## Some exploration of the random variables
+
+    - Let us plot a few of these distributions
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    data = np.random.normal(size=1000)
+
+- code: |
+    from matplotlib import pyplot as plt
+    plt.hist(data)
+
+
+- code: |
+    data = np.random.normal(size=20)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    plt.hist(data)
+
+- code: |
+    plt.hist(data, bins=6)
+
+
+- code: |
+    data = np.random.normal(size=10000)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    plt.hist(data, density=True)
+
+- code: |
+    plt.hist(data, cumulative=True)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    data = np.random.poisson(lam=0.5, size=10000)
+    ax1 = plt.subplot(1, 2, 1)
+    ax1.hist(data, density=True)
+    ax2 = plt.subplot(1, 2, 2)
+    ax2.hist(data, cumulative=True)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- markdown: |
+    ## Subplots
+
+    - `plt.subplot(nrows, ncols, plot_number)`
+    - `plot_number` starts from 1
+    - Axes returned can be used like `plt`
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    for i in range(1, 5):
+        ax = plt.subplot(2, 2, i)
+        ax.text(0.0, 0.5, 'plot number %d' % i)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    data = np.random.chisquare(7, size=10000)
+    ax1 = plt.subplot(1, 2, 1)
+    ax1.hist(data, density=True)
+    ax2 = plt.subplot(1, 2, 2)
+    ax2.hist(data, cumulative=True)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+
+- markdown: |
+    ## Repeatable random numbers
+
+    - `np.random.xxx` gives different results each time
+
+    - Use `np.random.seed` to make this deterministic
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    np.random.seed(27)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.random.random()
+
+- code: |
+    np.random.seed(27)
+    np.random.random()
+
+
+- markdown: |
+    ## Computing percentiles
+
+    - Use `np.percentile`
+    - Or `np.nanpercentile`
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    data = np.random.normal(loc=10, scale=2, size=1000)
+    np.percentile(data, 50)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.median(data)
+
+- code: |
+    np.percentile(data, [25, 50, 75])
+
+
+- markdown: |
+    ## Some useful tools
+
+    - For computational work
+    - `np.random.shuffle`
+    - `np.random.permutation`
+    - `np.random.choice`
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+
+- code: |
+    data = np.random.randint(0, 100, size=5)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.random.shuffle(data)
+
+- code: |
+    np.random.permutation(10)
+
+- code: |
+    np.random.permutation(data)
+
+
+- code: |
+    data = np.random.permutation(5)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    np.random.choice(data)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.random.choice(data, size=5)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.random.choice(data, size=10)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.random.choice(data, size=4, replace=False)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    # Won't work: cannot draw 10 unique items from 5 (raises ValueError)
+    np.random.choice(data, size=10, replace=False)
+
+
+- markdown: |
+    ## Summary
+
+    - Basic `numpy` stats functions
+    - Random number generators
+    - Plotting histograms and subplots
+    - Odds and ends
+
+  metadata:
+    slideshow:
+      slide_type: slide
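The notebook added in this commit shows its cells without their outputs. As a rough companion, and not part of the commit itself, the sketch below checks a few of the behaviours the slides describe: the `n - ddof` denominator, per-axis means, NaN-aware reductions, seeding, and sampling without replacement. The variable names (`with_nan`, `original`, `too_large_failed`, and so on) are made up for illustration; the NumPy calls are the legacy `np.random` functions used in the slides.

```python
import numpy as np

data = np.array([1.0, 4.5, 2.3, -0.5, 0.5, 2.8])

# ddof: the denominator of the variance is n - ddof.
n = data.size
sq_dev = (data - data.mean()) ** 2
pop_std = np.sqrt(sq_dev.sum() / n)           # matches np.std(data)
sample_std = np.sqrt(sq_dev.sum() / (n - 1))  # matches np.std(data, ddof=1)
print(np.isclose(np.std(data), pop_std),
      np.isclose(np.std(data, ddof=1), sample_std))

# axis: axis=0 reduces down the rows (one value per column),
# axis=1 reduces across the columns (one value per row).
md = np.arange(16).reshape(4, 4)
col_means = np.mean(md, axis=0)
row_means = np.mean(md, axis=1)
print(col_means, row_means)

# NaN propagates through the plain reductions; the nan* variants skip it.
with_nan = np.array([1.0, 2.1, np.nan, 3.0])
print(np.mean(with_nan), np.nanmean(with_nan))

# Re-seeding the global generator replays the same sequence.
np.random.seed(27)
first = np.random.random()
np.random.seed(27)
second = np.random.random()
print(first == second)

# permutation returns a shuffled copy; shuffle works in place.
original = np.arange(5)
perm = np.random.permutation(original)
shuffled = original.copy()
np.random.shuffle(shuffled)

# Sampling without replacement cannot exceed the population size.
sample = np.random.choice(original, size=4, replace=False)
try:
    np.random.choice(original, size=10, replace=False)
    too_large_failed = False
except ValueError:
    too_large_failed = True
print(sample, too_large_failed)
```

The plotting cells are left out here so that the sketch depends only on NumPy.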