Adding some initial content for data analysis.

author: Prabhu Ramachandran 2018-06-09 23:54:51 +0530
committer: Prabhu Ramachandran 2018-06-09 23:54:51 +0530
commit: 12e10ca44dc9e977553d0b9b394d41158857dc72 (patch)
tree: a575471fe6b17e4646c9e9f3e372fcec201085d7 /data_analysis
parent: c05978bd1d5464988300eead1c1d8af98f732a42 (diff)
download: python-workshops-12e10ca44dc9e977553d0b9b394d41158857dc72.tar.gz
python-workshops-12e10ca44dc9e977553d0b9b394d41158857dc72.tar.bz2
python-workshops-12e10ca44dc9e977553d0b9b394d41158857dc72.zip
4 files changed, 648 insertions, 0 deletions
diff --git a/data_analysis/01_intro.ipyml b/data_analysis/01_intro.ipyml
new file mode 100644
index 0000000..39d258e
--- /dev/null
+++ b/data_analysis/01_intro.ipyml
@@ -0,0 +1,124 @@
+cells:
+
+- markdown: |
+    # Introduction to Data Analysis with Python
+
+    ### Prabhu Ramachandran
+    ### The FOSSEE Python group &
+    ### Department of Aerospace Engineering
+    ### IIT Bombay
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- markdown: |
+    ## Introduction
+
+    - A world of data!
+
+    - Can we use data to drive decisions and form opinions?
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+
+- markdown: |
+    ## Real data is not perfect
+
+    - Partial information
+    - Uncertainty
+    - Errors
+
+    <br/>
+    <br/>
+
+    - Important to check and clean data
+
+  metadata:
+    slideshow:
+      slide_type: subslide
+
+
+- markdown: |
+    ## Statistical approach
+
+    - Data collection
+
+    <br/>
+    <br/>
+
+    - Visualization
+    - Inference
+    - Modeling
+    - Prediction
+
+  metadata:
+    slideshow:
+      slide_type: subslide
+
+
+- markdown: |
+    ## Importance of computers
+
+    - Datasets are large
+    - Easy to process on the computer
+    - Simulation!
+
+  metadata:
+    slideshow:
+      slide_type: subslide
+
+- markdown: |
+    ## This course
+
+    - Use Python for data analysis
+    - Exposes you to the basic tools available
+    - Does not teach you statistics!
+    - Will point out resources for this
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+
+- markdown: |
+    ## Pre-requisites
+
+    - Basic Python programming
+    - NumPy
+    - Python 3.x, Jupyter, scipy, matplotlib, pandas, statsmodels
+
+    - Mathematics (12th grade)
+    - Introduction to statistics
+
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- markdown: |
+    ## Tools and Topics
+
+    - Simple statistics with `numpy`
+    - Statistical plots with `matplotlib`
+    - Random variables with `scipy.stats`
+    - Using `pandas` for data ingestion and analysis
+    - Introduction to `statsmodels` for regression
+
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- markdown: |
+    ## Summary
+
+    - Introduction to data analysis
+    - Pre-requisites for this course
+    - Tools covered
+
+  metadata:
+    slideshow:
+      slide_type: slide
diff --git a/data_analysis/02_numpy_stats.ipyml b/data_analysis/02_numpy_stats.ipyml
new file mode 100644
index 0000000..a412271
--- /dev/null
+++ b/data_analysis/02_numpy_stats.ipyml
@@ -0,0 +1,441 @@
+cells:
+
+- code: |
+    %matplotlib inline
+
+  metadata:
+    slideshow:
+      slide_type: skip
+
+- markdown: |
+    # Data Analysis: Simple Statistics with `numpy`
+
+    ### Prabhu Ramachandran
+    ### The FOSSEE Python group &
+    ### Department of Aerospace Engineering
+    ### IIT Bombay
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- markdown: |
+    ## Introduction
+
+    - Already exposed to `numpy`
+    - Provides convenient statistical functions
+
+    <br/>
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- markdown: |
+    - `np.mean`, `np.std`, etc.
+    - `np.random.random` etc.
+    - `np.percentile`
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- markdown: |
+    ## Simple NumPy functions
+
+    - `mean`, `std`, `var`, `median`
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    import numpy as np
+    data = [1.0, 4.5, 2.3, -0.5, 0.5, 2.8]
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.mean(data)
+
+- code: |
+    np.median(data)
+
+- code: |
+    np.std(data)
+
+- code: |
+    np.var(data)
+
+- markdown: |
+    ## Degrees of freedom?
+
+    - Sample variance:   $S^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1}$; the sample standard deviation is $S$
+
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- markdown: |
+    - Use the `ddof` keyword argument (defaults to zero)
+    - Denominator is `n - ddof`
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    # ddof defaults to zero
+    np.std(data, ddof=1)
+
+
+- markdown: |
+    ## Multi-dimensional data
+
+    - What if you have a multi-dimensional array?
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    md = np.arange(16)
+    md.shape = 4, 4
+
+- code: |
+    md
+
+- code: |
+    np.mean(md)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    np.std(md)
+
+- code: |
+    np.mean(md, axis=0)
+
+- code: |
+    np.mean(md, axis=1)
+
+- markdown: |
+    ## Not-a-Number: `NaN`
+
+    - Part of the number system: `np.nan`
+    - Like `inf`: `np.inf`
+    - `nan`: Used to denote missing values in data
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    np.nan + 1
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    data = [1.0, 2.1, np.nan, 3.0]
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.mean(data), np.std(data)
+
+- markdown: |
+    ## Dealing with NaNs
+
+    - Use `np.nanmean, np.nanmedian, np.nanstd` etc.
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    np.nanmean(data)
+
+- code: |
+    np.nanstd(data)
+
+- markdown: |
+    - Do `np.nan<TAB>` to see more
+
+
+- markdown: |
+    ## Pseudo Random Numbers
+
+    - `np.random.random` etc.
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    data = np.random.random(5)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    x = np.random.random((3, 3))
+    x.shape
+
+- code: |
+    # randint(low, high, size)
+    np.random.randint(-5, 10, size=5)
+
+- code: |
+    # loc: mean, scale: std-dev
+    np.random.normal(loc=0.0, scale=1.0, size=5)
+
+- markdown: |
+    - `size` keyword argument to specify shape
+
+- markdown: |
+    ## Other distributions
+
+    - Many univariate distributions
+    - A few multi-variate distributions
+    - Draw samples from these distributions
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    np.random?
+
+- markdown: |
+    ## Some exploration of the random variables
+
+    - Let us plot a few of these distributions
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    data = np.random.normal(size=1000)
+
+- code: |
+    from matplotlib import pyplot as plt
+    plt.hist(data)
+
+
+- code: |
+    data = np.random.normal(size=20)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    plt.hist(data)
+
+- code: |
+    plt.hist(data, bins=6)
+
+
+- code: |
+    data = np.random.normal(size=10000)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    plt.hist(data, density=True)
+
+- code: |
+    plt.hist(data, cumulative=True)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    data = np.random.poisson(lam=0.5, size=10000)
+    ax1 = plt.subplot(1, 2, 1)
+    ax1.hist(data, density=True)
+    ax2 = plt.subplot(1, 2, 2)
+    ax2.hist(data, cumulative=True)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- markdown: |
+    ## Subplots
+
+    - `plt.subplot(nrows, ncols, plot_number)`
+    - `plot_number` starts from 1
+    - Axes returned can be used like `plt`
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    for i in range(1, 5):
+        ax = plt.subplot(2, 2, i)
+        ax.text(0.0, 0.5, 'plot number %d' % i)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    data = np.random.chisquare(7, size=10000)
+    ax1 = plt.subplot(1, 2, 1)
+    ax1.hist(data, density=True)
+    ax2 = plt.subplot(1, 2, 2)
+    ax2.hist(data, cumulative=True)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+
+- markdown: |
+    ## Repeatable random numbers
+
+    - `np.random.xxx` gives different results each time
+
+    - Use `np.random.seed` to make this deterministic
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    np.random.seed(27)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.random.random()
+
+- code: |
+    np.random.seed(27)
+    np.random.random()
+
+
+- markdown: |
+    ## Computing percentiles
+
+    - Use `np.percentile`
+    - Or `np.nanpercentile`
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    data = np.random.normal(loc=10, scale=2, size=1000)
+    np.percentile(data, 50)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.median(data)
+
+- code: |
+    np.percentile(data, [25, 50, 75])
+
+
+- markdown: |
+    ## Some useful tools
+
+    - For computational work
+    - `np.random.shuffle`
+    - `np.random.permutation`
+    - `np.random.choice`
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+
+- code: |
+    data = np.random.randint(0, 100, size=5)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.random.shuffle(data)  # shuffles in place and returns None
+
+- code: |
+    np.random.permutation(10)
+
+- code: |
+    np.random.permutation(data)
+
+
+- code: |
+    data = np.random.permutation(5)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    np.random.choice(data)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.random.choice(data, size=5)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.random.choice(data, size=10)
+
+  metadata:
+    slideshow:
+      slide_type: fragment
+
+- code: |
+    np.random.choice(data, size=4, replace=False)
+
+  metadata:
+    slideshow:
+      slide_type: slide
+
+- code: |
+    # Won't work: cannot draw 10 items from 5 without replacement (ValueError)
+    np.random.choice(data, size=10, replace=False)
+
+
+- markdown: |
+    ## Summary
+
+    - Basic `numpy` stats functions
+    - Random number generators
+    - Plotting histograms and subplots
+    - Odds and ends
+
+  metadata:
+    slideshow:
+      slide_type: slide
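
Note on the slides above: the following is a minimal, self-contained sketch of the main points in `02_numpy_stats.ipyml`, not part of the commit itself. It reuses the data values from the slides; the variable names (`pop_var`, `with_gap`, `pool`, and so on) are illustrative assumptions. It uses the legacy global `np.random` interface, as the slides do.

```python
import numpy as np

data = np.array([1.0, 4.5, 2.3, -0.5, 0.5, 2.8])
n = len(data)

# Population (ddof=0) vs. sample (ddof=1) variance: the denominator is n - ddof.
pop_var = np.var(data)
sample_var = np.var(data, ddof=1)
manual_sample_var = np.sum((data - data.mean()) ** 2) / (n - 1)

# NaN propagates through the plain functions; the nan* variants skip it.
with_gap = np.array([1.0, 2.1, np.nan, 3.0])
plain_mean = np.mean(with_gap)      # evaluates to nan
clean_mean = np.nanmean(with_gap)   # mean of 1.0, 2.1 and 3.0 only

# Re-seeding the global generator reproduces the same draws.
np.random.seed(27)
first = np.random.random(3)
np.random.seed(27)
second = np.random.random(3)

# The 50th percentile is the median.
sample = np.random.normal(loc=10, scale=2, size=1000)
p50 = np.percentile(sample, 50)

# Sampling without replacement cannot draw more items than the pool holds.
pool = np.random.permutation(5)
try:
    np.random.choice(pool, size=10, replace=False)
    choice_failed = False
except ValueError:
    choice_failed = True
```

Here `sample_var` should agree with `manual_sample_var`, `first` and `second` should be identical arrays, and `choice_failed` should end up `True`.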
diff --git a/data_analysis/README.md b/data_analysis/README.md
new file mode 100644
index 0000000..a00b95d
--- /dev/null
+++ b/data_analysis/README.md
@@ -0,0 +1,67 @@
+# Introduction to data analysis with Python
+
+This material covers a short course on using Python for data analysis.
+
+The material assumes that the student is familiar with basic mathematics and
+statistics. When doing statistical analysis, it always helps to know
+statistics fairly well. We will attempt to provide some links to freely
+available material that covers some of these basics.
+
+An excellent book on statistical analysis with Python is Allen Downey's
+Think Stats, which is freely available. It does not take a traditional
+approach to statistics, but it will certainly get you thinking.
+
+The emphasis of this course is to expose the student to the various libraries
+and tools available in Python so they can embark on their own data analysis.
+There is a lot of material already available. We will attempt to provide the
+attendees links to some useful material.
+
+## Pre-requisites
+
+- Students should have completed the basic Python programming material.
+- One should have a Python 3.x installation with the following packages:
+    - IPython, scipy, matplotlib
+    - pandas, statsmodels
+- Use a reasonable editor; Canopy will work.
+- If one desires a more advanced editor, I suggest VS Code
+  (https://code.visualstudio.com/) which is free, open source, and very
+  powerful.
+- Knowledge of basic statistics.
+
+## Contents
+
+* Introduction
+
+* Simple statistics with `numpy`
+    * Basic stats functions, mean, std etc.
+    * Percentiles
+    * Random numbers: normal, random, choice, shuffle
+
+* Statistical plots
+    * hist
+    * boxplot
+    * scatter
+    * pie chart
+
+* Using `scipy.stats`
+    * pdf
+    * cdf
+    * rvs
+
+* Using `pandas`
+    * Quick introduction
+    * Categorical vs numerical data
+    * Data frames
+    * Basic operations
+    * String operations
+    * simple plots
+    * Groupby
+    * Pivot
+    * Maps
+    * pdvega
+
+* Using `statsmodels`
+    * regression
+    * anova
+
+* Seaborn?
diff --git a/data_analysis/rise.css b/data_analysis/rise.css
new file mode 100644
index 0000000..0846e50
--- /dev/null
+++ b/data_analysis/rise.css
@@ -0,0 +1,16 @@
+/* Customizations for RISE slides
+ */
+
+/* Increase font size for the code cells.
+*/
+
+div.cell.code_cell {
+    font-size: 150%;
+}
+
+/* Tables were rendered out as tiny values
+ since the font-size was set to 12px somehow.
+*/
+.rendered_html table {
+    font-size: 100%;
+}