author | Prabhu Ramachandran | 2018-06-09 23:54:51 +0530 |
---|---|---|
committer | Prabhu Ramachandran | 2018-06-09 23:54:51 +0530 |
commit | 12e10ca44dc9e977553d0b9b394d41158857dc72 (patch) | |
tree | a575471fe6b17e4646c9e9f3e372fcec201085d7 /data_analysis | |
parent | c05978bd1d5464988300eead1c1d8af98f732a42 (diff) | |
Adding some initial content for data analysis.
Diffstat (limited to 'data_analysis')
-rw-r--r-- | data_analysis/01_intro.ipyml | 124 |
-rw-r--r-- | data_analysis/02_numpy_stats.ipyml | 441 |
-rw-r--r-- | data_analysis/README.md | 67 |
-rw-r--r-- | data_analysis/rise.css | 16 |
4 files changed, 648 insertions, 0 deletions
diff --git a/data_analysis/01_intro.ipyml b/data_analysis/01_intro.ipyml
new file mode 100644
index 0000000..39d258e
--- /dev/null
+++ b/data_analysis/01_intro.ipyml
@@ -0,0 +1,124 @@
+cells:
+
+- markdown: |
+    # Introduction to Data Analysis with Python
+
+    ### Prabhu Ramachandran
+    ### The FOSSEE Python group &
+    ### Department of Aerospace Engineering
+    ### IIT Bombay
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- markdown: |
+    ## Introduction
+
+    - A world of data!
+
+    - Can we use data to drive decisions and form opinions?
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+
+- markdown: |
+    ## Real data is not perfect
+
+    - Partial information
+    - Uncertainty
+    - Errors
+
+    <br/>
+    <br/>
+
+    - Important to check and clean data
+
+  metadata:
+    slideshow:
+      slide_type: subslide
+
+
+- markdown: |
+    ## Statistical approach
+
+    - Data collection
+
+    <br/>
+    <br/>
+
+    - Visualization
+    - Inference
+    - Modeling
+    - Prediction
+
+  metadata:
+    slideshow:
+      slide_type: subslide
+
+
+- markdown: |
+    ## Importance of computers
+
+    - Datasets are large
+    - Easy to process on the computer
+    - Simulation!
+
+  metadata:
+    slideshow:
+      slide_type: subslide
+
+- markdown: |
+    ## This course
+
+    - Use Python for data analysis
+    - Exposes you to the basic tools available
+    - Does not teach you statistics!
+    - Will point out resources for this
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+
+- markdown: |
+    ## Pre-requisites
+
+    - Basic Python programming
+    - NumPy
+    - Python 3.x, Jupyter, scipy, matplotlib, pandas, statsmodels
+
+    - Mathematics (12th grade)
+    - Introduction to statistics
+
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- markdown: |
+    ## Tools and Topics
+
+    - Simple statistics with `numpy`
+    - Statistical plots with `matplotlib`
+    - Random variables with `scipy.stats`
+    - Using `pandas` for data ingestion and analysis
+    - Introduction to `statsmodel` for regression
+
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- markdown: |
+    ## Summary
+
+    - Introduction to data analysis
+    - Pre-requisites for this course
+    - Tools covered
+
+  metadata:
+    slideshow:
+      slide_type: slide
diff --git a/data_analysis/02_numpy_stats.ipyml b/data_analysis/02_numpy_stats.ipyml
new file mode 100644
index 0000000..a412271
--- /dev/null
+++ b/data_analysis/02_numpy_stats.ipyml
@@ -0,0 +1,441 @@
+cells:
+
+- code: |
+    %matplotlib inline
+
+  metadata:
+    slideshow:
+      slide_type: skip
+
+- markdown: |
+    # Data Analysis: Simple Statistics with `numpy`
+
+    ### Prabhu Ramachandran
+    ### The FOSSEE Python group &
+    ### Department of Aerospace Engineering
+    ### IIT Bombay
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- markdown: |
+    ## Introduction
+
+    - Already exposed to `numpy`
+    - Provides convenient statistical functions
+
+    <br/>
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- markdown: |
+    - `np.mean`, `np.std`, etc.
+    - `np.random.random` etc.
+    - `np.percentile`
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- markdown: |
+    ## Simple NumPy functions
+
+    - `mean`, `std`, `var`, `median`
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    import numpy as np
+    data = [1.0, 4.5, 2.3, -0.5, 0.5, 2.8]
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.mean(data)
+
+- code: |
+    np.median(data)
+
+- code: |
+    np.std(data)
+
+- code: |
+    np.var(data)
+
+- markdown: |
+    ## Degrees of freedom?
+
+    - Sample standard deviation: $S^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1}$
+
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- markdown: |
+    - Use the `ddof` keyword argument (defaults to zero)
+    - Denominator is `n - ddof`
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    # ddof defaults to zero
+    np.std(data, ddof=1)
+
+
+- markdown: |
+    ## Multi-dimensional data
+
+    - What if you have a multi-dimensional array?
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    md = np.arange(16)
+    md.shape = 4, 4
+
+- code: |
+    md
+
+- code: |
+    np.mean(md)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    np.std(md)
+
+- code: |
+    np.mean(md, axis=0)
+
+- code: |
+    np.mean(md, axis=1)
+
+- markdown: |
+    ## Not-a-Number: `NaN`
+
+    - Part of the number system: `np.nan`
+    - Like `inf`: `np.inf`
+    - `nan`: Used to denote missing values in data
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    np.nan + 1
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    data = [1.0, 2.1, np.nan, 3.0]
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.mean(data), np.std(data)
+
+- markdown: |
+    ## Dealing with Nans?
+
+    - Use `np.nanmean, np.nanmedian, np.nanstd` etc.
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    np.nanmean(data)
+
+- code: |
+    np.nanstd(data)
+
+- markdown: |
+    - Do `np.nan<TAB>` to see more
+
+
+- markdown: |
+    ## Pseudo Random Numbers
+
+    - `np.random.random` etc.
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    data = np.random.random(5)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    x = np.random.random((3, 3))
+    x.shape
+
+- code: |
+    # randint(low, high, size)
+    np.random.randint(-5, 10, size=5)
+
+- code: |
+    # loc: mean, scale: std-dev
+    np.random.normal(loc=0.0, scale=1.0, size=5)
+
+- markdown: |
+    - `size` keyword argument to specify shape
+
+- markdown: |
+    ## Other distributions
+
+    - Many univariate distributions
+    - A few multi-variate distributions
+    - Draw samples from these distributions
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    np.random?
+
+- markdown: |
+    ## Some exploration of the random variables
+
+    - Let us plot a few of these distributions
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    data = np.random.normal(size=1000)
+
+- code: |
+    from matplotlib import pyplot as plt
+    plt.hist(data)
+
+
+- code: |
+    data = np.random.normal(size=20)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    plt.hist(data)
+
+- code: |
+    plt.hist(data, bins=6)
+
+
+- code: |
+    data = np.random.normal(size=10000)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    plt.hist(data, normed=True)
+
+- code: |
+    plt.hist(data, cumulative=True)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    data = np.random.poisson(lam=0.5, size=10000)
+    ax1 = plt.subplot(1, 2, 1)
+    ax1.hist(data, normed=True)
+    ax2 = plt.subplot(1, 2, 2)
+    ax2.hist(data, cumulative=True)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- markdown: |
+    ## Subplots
+
+    - `plt.subplot(nrows, ncols, plot_number)`
+    - `plot_number` starts from 1
+    - Axes returned can be used like `plt`
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    for i in range(1, 5):
+        ax = plt.subplot(2, 2, i)
+        ax.text(0.0, 0.5, 'plot number %d' % i)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    data = np.random.chisquare(7, size=10000)
+    ax1 = plt.subplot(1, 2, 1)
+    ax1.hist(data, normed=True)
+    ax2 = plt.subplot(1, 2, 2)
+    ax2.hist(data, cumulative=True)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+
+- markdown: |
+    ## Repeatable random numbers
+
+    - `np.random.xxx` gives different results each time
+
+    - Use `np.random.seed` to make this deterministic
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    np.random.seed(27)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.random.random()
+
+- code: |
+    np.random.seed(27)
+    np.random.random()
+
+
+- markdown: |
+    ## Computing percentiles
+
+    - Use `np.percentile`
+    - Or `np.nanpercentile`
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    data = np.random.normal(loc=10, scale=2, size=1000)
+    np.percentile(data, 50)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.median(data)
+
+- code: |
+    np.percentile(data, [25, 50, 75])
+
+
+- markdown: |
+    ## Some useful tools
+
+    - For computational work
+    - `np.shuffle`
+    - `np.permutation`
+    - `np.choice`
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+
+- code: |
+    data = np.random.randint(0, 100, size=5)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.shuffle(data)
+
+- code: |
+    np.permutation(10)
+
+- code: |
+    np.permutation(data)
+
+
+- code: |
+    data = np.random.permutation(5)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    np.choice(data)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.choice(data, size=5)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.choice(data, size=10)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.choice(data, size=4, replace=False)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    # Won't work!
+    np.choice(data, size=10, replace=False)
+
+
+- markdown: |
+    ## Summary
+
+    - Basic `numpy` stats functions
+    - Random number generators
+    - Plotting histograms and subplots
+    - Odds and ends
+
+  metadata:
+    slideshow:
+      slide_type: slide
diff --git a/data_analysis/README.md b/data_analysis/README.md
new file mode 100644
index 0000000..a00b95d
--- /dev/null
+++ b/data_analysis/README.md
@@ -0,0 +1,67 @@
+# Introduction to data analysis with Python
+
+This material covers a short course on using Python for data analysis.
+
+The material assumes that the student is aware of basic mathematics and
+statistics. While doing statistical analysis it always helps to know
+statistics fairly well. We will attempt to provide some links to freely
+available material that covers some of these basics.
+
+An excellent book on doing statistical analysis with Python is Allen Downey's
+Think Stats book which is freely available. The material is not a traditional
+approach to statistics but will get you thinking for sure.
+
+The emphasis of this course is to expose the student to the various libraries
+and tools available in Python so they can embark on their own data analysis.
+There is a lot of material already available. We will attempt to provide the
+attendees links to some useful material.
+
+## Pre-requisites
+
+- Students should have completed the basic Python programming material.
+- One should have a Python 3.x installation with the following packages:
+    - IPython, scipy, matplotlib
+    - pandas, statsmodels
+- Use a reasonable editor, Canopy will work.
+- If one desires a more advanced editor, I suggest VS Code
+  (https://code.visualstudio.com/) which is free, open source, and very
+  powerful.
+- Knowledge of basic statistics.
+
+## Contents
+
+* Introduction
+
+* Simple statistics with `numpy`
+    * Basic stats functions, mean, std etc.
+    * Percentiles
+    * Random numbers: normal, random, choice, shuffle
+
+* Statistical plots
+    * hist
+    * boxplot
+    * scatter
+    * pie chart
+
+* Using `scipy.stats`
+    * pdf
+    * cdf
+    * rvs
+
+* Using `pandas`
+    * Quick introduction
+    * Categorical vs numerical data
+    * Data frames
+    * Basic operations
+    * String operations
+    * simple plots
+    * Groupby
+    * Pivot
+    * Maps
+    * pdvega
+
+* Using `statsmodel`
+    * regression
+    * anova
+
+* Seaborn?
diff --git a/data_analysis/rise.css b/data_analysis/rise.css
new file mode 100644
index 0000000..0846e50
--- /dev/null
+++ b/data_analysis/rise.css
@@ -0,0 +1,16 @@
+/* Customizations for RISE slides
+ */
+
+/* Increase font size for the code cells.
+*/
+
+div.cell.code_cell {
+  font-size: 150%;
+}
+
+/* Tables were rendered out as tiny values
+   since the font-size was set to 12px somehow.
+*/
+.rendered_html table {
+  font-size: 100%
+}
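
Review note on `02_numpy_stats.ipyml`: the "Some useful tools" cells call `np.shuffle`, `np.permutation` and `np.choice`, but NumPy has no top-level functions with those names. The sketch below assumes the intended functions are the ones in the `np.random` namespace. It also uses `density=True` in place of `normed=True`, since Matplotlib removed the `normed` keyword of `hist` in version 3.1; `np.histogram` stands in for `plt.hist` here so the block needs only NumPy. Variable names such as `shuffled` and `oversample_failed` are illustrative and not part of the commit.

```python
import numpy as np

# Fixed seed so the sketch is repeatable, as on the
# "Repeatable random numbers" slide.
np.random.seed(27)

data = np.random.randint(0, 100, size=5)

# shuffle works in place and returns None; permutation returns a shuffled copy.
shuffled = data.copy()
np.random.shuffle(shuffled)
perm = np.random.permutation(data)

# choice samples with replacement by default, so size may exceed len(data).
with_repl = np.random.choice(data, size=10)
without_repl = np.random.choice(data, size=4, replace=False)

# Asking for more items than exist without replacement raises ValueError,
# which is what the "Won't work!" cell is meant to demonstrate.
try:
    np.random.choice(data, size=10, replace=False)
    oversample_failed = False
except ValueError:
    oversample_failed = True

# ddof=1 changes the denominator from n to n - 1.
sample = [1.0, 4.5, 2.3, -0.5, 0.5, 2.8]
n = len(sample)
ratio = np.std(sample, ddof=1) / np.std(sample)  # sqrt(n / (n - 1))

# density=True normalises the histogram so that it integrates to one.
heights, edges = np.histogram(np.random.normal(size=10000), density=True)
area = float(np.sum(heights * np.diff(edges)))
```

One more detail for the "Degrees of freedom?" slide: the formula shown there defines the sample variance $S^2$, and `np.std(data, ddof=1)` returns its square root.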