author: Prabhu Ramachandran, 2018-06-09 23:54:51 +0530
committer: Prabhu Ramachandran, 2018-06-09 23:54:51 +0530
commit: 12e10ca44dc9e977553d0b9b394d41158857dc72
tree: a575471fe6b17e4646c9e9f3e372fcec201085d7 /data_analysis
parent: c05978bd1d5464988300eead1c1d8af98f732a42
Adding some initial content for data analysis.
Diffstat (limited to 'data_analysis')
-rw-r--r-- data_analysis/01_intro.ipyml 124
-rw-r--r-- data_analysis/02_numpy_stats.ipyml 441
-rw-r--r-- data_analysis/README.md 67
-rw-r--r-- data_analysis/rise.css 16
4 files changed, 648 insertions, 0 deletions
diff --git a/data_analysis/01_intro.ipyml b/data_analysis/01_intro.ipyml
new file mode 100644
index 0000000..39d258e
--- /dev/null
+++ b/data_analysis/01_intro.ipyml
@@ -0,0 +1,124 @@
+cells:
+
+- markdown: |
+ # Introduction to Data Analysis with Python
+
+ ### Prabhu Ramachandran
+ ### The FOSSEE Python group &
+ ### Department of Aerospace Engineering
+ ### IIT Bombay
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- markdown: |
+ ## Introduction
+
+ - A world of data!
+
+ - Can we use data to drive decisions and form opinions?
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+
+- markdown: |
+ ## Real data is not perfect
+
+ - Partial information
+ - Uncertainty
+ - Errors
+
+ <br/>
+ <br/>
+
+ - Important to check and clean data
+
+ metadata:
+ slideshow:
+ slide_type: subslide
+
+
+- markdown: |
+ ## Statistical approach
+
+ - Data collection
+
+ <br/>
+ <br/>
+
+ - Visualization
+ - Inference
+ - Modeling
+ - Prediction
+
+ metadata:
+ slideshow:
+ slide_type: subslide
+
+
+- markdown: |
+ ## Importance of computers
+
+ - Datasets are large
+ - Easy to process on the computer
+ - Simulation!
+
+ metadata:
+ slideshow:
+ slide_type: subslide
+
+- markdown: |
+ ## This course
+
+ - Use Python for data analysis
+ - Exposes you to the basic tools available
+ - Does not teach you statistics!
+ - Will point out resources for this
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+
+- markdown: |
+ ## Pre-requisites
+
+ - Basic Python programming
+ - NumPy
+ - Python 3.x, Jupyter, scipy, matplotlib, pandas, statsmodels
+
+ - Mathematics (12th grade)
+ - Introduction to statistics
+
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- markdown: |
+ ## Tools and Topics
+
+ - Simple statistics with `numpy`
+ - Statistical plots with `matplotlib`
+ - Random variables with `scipy.stats`
+ - Using `pandas` for data ingestion and analysis
+ - Introduction to `statsmodels` for regression
+
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- markdown: |
+ ## Summary
+
+ - Introduction to data analysis
+ - Pre-requisites for this course
+ - Tools covered
+
+ metadata:
+ slideshow:
+ slide_type: slide
diff --git a/data_analysis/02_numpy_stats.ipyml b/data_analysis/02_numpy_stats.ipyml
new file mode 100644
index 0000000..a412271
--- /dev/null
+++ b/data_analysis/02_numpy_stats.ipyml
@@ -0,0 +1,441 @@
+cells:
+
+- code: |
+ %matplotlib inline
+
+ metadata:
+ slideshow:
+ slide_type: skip
+
+- markdown: |
+ # Data Analysis: Simple Statistics with `numpy`
+
+ ### Prabhu Ramachandran
+ ### The FOSSEE Python group &
+ ### Department of Aerospace Engineering
+ ### IIT Bombay
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- markdown: |
+ ## Introduction
+
+ - Already exposed to `numpy`
+ - Provides convenient statistical functions
+
+ <br/>
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- markdown: |
+ - `np.mean`, `np.std`, etc.
+ - `np.random.random` etc.
+ - `np.percentile`
+
+ metadata:
+ slideshow:
+ slide_type: fragment
+
+- markdown: |
+ ## Simple NumPy functions
+
+ - `mean`, `std`, `var`, `median`
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- code: |
+ import numpy as np
+ data = [1.0, 4.5, 2.3, -0.5, 0.5, 2.8]
+
+ metadata:
+ slideshow:
+ slide_type: fragment
+
+- code: |
+ np.mean(data)
+
+- code: |
+ np.median(data)
+
+- code: |
+ np.std(data)
+
+- code: |
+ np.var(data)
+
+- markdown: |
+ ## Degrees of freedom?
+
+ - Sample variance: $S^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1}$ (the sample standard deviation is $S$)
+
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- markdown: |
+ - Use the `ddof` keyword argument (defaults to zero)
+ - Denominator is `n - ddof`
+
+ metadata:
+ slideshow:
+ slide_type: fragment
+
+- code: |
+ # ddof defaults to zero
+ np.std(data, ddof=1)
+
+
+- markdown: |
+ ## Multi-dimensional data
+
+ - What if you have a multi-dimensional array?
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- code: |
+ md = np.arange(16)
+ md = md.reshape(4, 4)
+
+- code: |
+ md
+
+- code: |
+ np.mean(md)
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- code: |
+ np.std(md)
+
+- code: |
+ np.mean(md, axis=0)
+
+- code: |
+ np.mean(md, axis=1)
+
+- markdown: |
+ ## Not-a-Number: `NaN`
+
+ - A special floating-point value (IEEE 754): `np.nan`
+ - Like `inf`: `np.inf`
+ - `nan`: Used to denote missing values in data
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- code: |
+ np.nan + 1
+
+ metadata:
+ slideshow:
+ slide_type: fragment
+
+- code: |
+ data = [1.0, 2.1, np.nan, 3.0]
+
+ metadata:
+ slideshow:
+ slide_type: fragment
+
+- code: |
+ np.mean(data), np.std(data)
+
+- markdown: |
+ ## Dealing with NaNs?
+
+ - Use `np.nanmean, np.nanmedian, np.nanstd` etc.
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- code: |
+ np.nanmean(data)
+
+- code: |
+ np.nanstd(data)
+
+- markdown: |
+ - Do `np.nan<TAB>` to see more
+
+
+- markdown: |
+ ## Pseudo Random Numbers
+
+ - `np.random.random` etc.
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- code: |
+ data = np.random.random(5)
+
+ metadata:
+ slideshow:
+ slide_type: fragment
+
+- code: |
+ x = np.random.random((3, 3))
+ x.shape
+
+- code: |
+ # randint(low, high, size): integers in [low, high), high excluded
+ np.random.randint(-5, 10, size=5)
+
+- code: |
+ # loc: mean, scale: std-dev
+ np.random.normal(loc=0.0, scale=1.0, size=5)
+
+- markdown: |
+ - `size` keyword argument to specify shape
+
+- markdown: |
+ ## Other distributions
+
+ - Many univariate distributions
+ - A few multi-variate distributions
+ - Draw samples from these distributions
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- code: |
+ np.random?
+
+- markdown: |
+ ## Some exploration of the random variables
+
+ - Let us plot a few of these distributions
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- code: |
+ data = np.random.normal(size=1000)
+
+- code: |
+ from matplotlib import pyplot as plt
+ plt.hist(data)
+
+
+- code: |
+ data = np.random.normal(size=20)
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- code: |
+ plt.hist(data)
+
+- code: |
+ plt.hist(data, bins=6)
+
+
+- code: |
+ data = np.random.normal(size=10000)
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- code: |
+ plt.hist(data, density=True)  # `normed` was removed in Matplotlib 3.1
+
+- code: |
+ plt.hist(data, cumulative=True)
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- code: |
+ data = np.random.poisson(lam=0.5, size=10000)
+ ax1 = plt.subplot(1, 2, 1)
+ ax1.hist(data, density=True)
+ ax2 = plt.subplot(1, 2, 2)
+ ax2.hist(data, cumulative=True)
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- markdown: |
+ ## Subplots
+
+ - `plt.subplot(nrows, ncols, plot_number)`
+ - `plot_number` starts from 1
+ - Axes returned can be used like `plt`
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- code: |
+ for i in range(1, 5):
+ ax = plt.subplot(2, 2, i)
+ ax.text(0.0, 0.5, 'plot number %d' % i)
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- code: |
+ data = np.random.chisquare(7, size=10000)
+ ax1 = plt.subplot(1, 2, 1)
+ ax1.hist(data, density=True)
+ ax2 = plt.subplot(1, 2, 2)
+ ax2.hist(data, cumulative=True)
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+
+- markdown: |
+ ## Repeatable random numbers
+
+ - `np.random.xxx` gives different results each time
+
+ - Use `np.random.seed` to make this deterministic
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- code: |
+ np.random.seed(27)
+
+ metadata:
+ slideshow:
+ slide_type: fragment
+
+- code: |
+ np.random.random()
+
+- code: |
+ np.random.seed(27)
+ np.random.random()
+
+
+- markdown: |
+ ## Computing percentiles
+
+ - Use `np.percentile`
+ - Or `np.nanpercentile`
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- code: |
+ data = np.random.normal(loc=10, scale=2, size=1000)
+ np.percentile(data, 50)
+
+ metadata:
+ slideshow:
+ slide_type: fragment
+
+- code: |
+ np.median(data)
+
+- code: |
+ np.percentile(data, [25, 50, 75])
+
+
+- markdown: |
+ ## Some useful tools
+
+ - For computational work
+ - `np.random.shuffle`
+ - `np.random.permutation`
+ - `np.random.choice`
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+
+- code: |
+ data = np.random.randint(0, 100, size=5)
+
+ metadata:
+ slideshow:
+ slide_type: fragment
+
+- code: |
+ np.random.shuffle(data)  # shuffles in place, returns None
+
+- code: |
+ np.random.permutation(10)
+
+- code: |
+ np.random.permutation(data)
+
+
+- code: |
+ data = np.random.permutation(5)
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- code: |
+ np.random.choice(data)
+
+ metadata:
+ slideshow:
+ slide_type: fragment
+
+- code: |
+ np.random.choice(data, size=5)
+
+ metadata:
+ slideshow:
+ slide_type: fragment
+
+- code: |
+ np.random.choice(data, size=10)
+
+ metadata:
+ slideshow:
+ slide_type: fragment
+
+- code: |
+ np.random.choice(data, size=4, replace=False)
+
+ metadata:
+ slideshow:
+ slide_type: slide
+
+- code: |
+ # Won't work: raises ValueError, since size=10 exceeds len(data) when replace=False
+ np.random.choice(data, size=10, replace=False)
+
+
+- markdown: |
+ ## Summary
+
+ - Basic `numpy` stats functions
+ - Random number generators
+ - Plotting histograms and subplots
+ - Odds and ends
+
+ metadata:
+ slideshow:
+ slide_type: slide
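Note on `02_numpy_stats.ipyml`: the slides above cover `ddof`, the `nan*` functions, seeding, and the sampling helpers. A minimal sketch that gathers these in one place is given below. It is not part of the commit; it assumes only that `numpy` is installed and uses the legacy `np.random` functions that the slides use. The variable names are illustrative.

```python
import numpy as np

# Same sample as in the slides.
data = np.array([1.0, 4.5, 2.3, -0.5, 0.5, 2.8])

# ddof=0 (the default) divides by n; ddof=1 divides by n - 1.
n = len(data)
pop_var = np.var(data)             # population variance
sample_var = np.var(data, ddof=1)  # sample variance
# The two differ exactly by the factor n / (n - 1).
assert np.isclose(sample_var, pop_var * n / (n - 1))

# NaN propagates through the plain functions; the nan* variants skip it.
with_nan = np.array([1.0, 2.1, np.nan, 3.0])
plain_mean = np.mean(with_nan)     # nan
clean_mean = np.nanmean(with_nan)  # mean of 1.0, 2.1 and 3.0

# Re-seeding makes the legacy np.random functions repeatable.
np.random.seed(27)
first = np.random.random()
np.random.seed(27)
second = np.random.random()        # same value as `first`

# The sampling helpers live in np.random, not in the top-level np namespace.
values = np.arange(5)
picked = np.random.choice(values, size=4, replace=False)  # 4 distinct items
shuffled = np.random.permutation(values)                  # shuffled copy
```

`np.random.permutation` returns a shuffled copy, whereas `np.random.shuffle` reorders its argument in place and returns `None`, so the two are not interchangeable in an expression.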
diff --git a/data_analysis/README.md b/data_analysis/README.md
new file mode 100644
index 0000000..a00b95d
--- /dev/null
+++ b/data_analysis/README.md
@@ -0,0 +1,67 @@
+# Introduction to data analysis with Python
+
+This material covers a short course on using Python for data analysis.
+
+The material assumes that the student is familiar with basic mathematics and
+statistics. When doing statistical analysis, it always helps to know
+statistics fairly well. We will attempt to provide some links to freely
+available material that covers some of these basics.
+
+An excellent book on statistical analysis with Python is Allen Downey's
+Think Stats, which is freely available. It does not take a traditional
+approach to statistics, but it will certainly get you thinking.
+
+The emphasis of this course is on exposing the student to the various libraries
+and tools available in Python so they can embark on their own data analysis.
+There is a lot of material already available. We will attempt to provide
+attendees with links to some useful material.
+
+## Pre-requisites
+
+- Students should have completed the basic Python programming material.
+- One should have a Python 3.x installation with the following packages:
+ - IPython, scipy, matplotlib
+ - pandas, statsmodels
+- Use a reasonable editor; Canopy will work.
+- If one desires a more advanced editor, I suggest VS Code
+ (https://code.visualstudio.com/) which is free, open source, and very
+ powerful.
+- Knowledge of basic statistics.
+
+## Contents
+
+* Introduction
+
+* Simple statistics with `numpy`
+ * Basic stats functions, mean, std etc.
+ * Percentiles
+ * Random numbers: normal, random, choice, shuffle
+
+* Statistical plots
+ * hist
+ * boxplot
+ * scatter
+ * pie chart
+
+* Using `scipy.stats`
+ * pdf
+ * cdf
+ * rvs
+
+* Using `pandas`
+ * Quick introduction
+ * Categorical vs numerical data
+ * Data frames
+ * Basic operations
+ * String operations
+ * simple plots
+ * Groupby
+ * Pivot
+ * Maps
+ * pdvega
+
+* Using `statsmodels`
+ * regression
+ * anova
+
+* Seaborn?
diff --git a/data_analysis/rise.css b/data_analysis/rise.css
new file mode 100644
index 0000000..0846e50
--- /dev/null
+++ b/data_analysis/rise.css
@@ -0,0 +1,16 @@
+/* Customizations for RISE slides
+ */
+
+/* Increase font size for the code cells.
+*/
+
+div.cell.code_cell {
+ font-size: 150%;
+}
+
+/* Tables were rendered out as tiny values
+ since the font-size was set to 12px somehow.
+*/
+.rendered_html table {
+ font-size: 100%;
+}