## Getting familiar with Pandas

In [None]:
## Pandas 
# there are several ways to change a column in a dataframe
# A short intro to pandas https://pandas.pydata.org/pandas-docs/stable/10min.html

import pandas as pd
import numpy as np
import random

In [None]:
# First step: make the dataframe
dates = pd.date_range('20130101', '20140101') #366
data = pd.DataFrame(np.random.randn(366,4), index=dates, columns=list('ABCD'))

### Exercise 1.1: Inspect the dataframe with the following commands: head(), tail(), describe().

In [None]:
# Solution


### Exercise 1.2: The index is a time series, and pandas has a built-in command for resampling dataframes (documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html). Use resample to get the median every 2 days and save the result as a new dataframe.
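As a rough illustration of how `resample` bins a time-indexed frame, the sketch below uses a tiny toy frame. The names `toy` and `toy_2d` are made up for this example and are not part of the exercise data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 6 consecutive days, one column holding 0.0 .. 5.0
toy = pd.DataFrame(
    {"A": np.arange(6, dtype=float)},
    index=pd.date_range("2013-01-01", periods=6, freq="D"),
)

# Group the rows into 2-day bins and take the median of each bin
toy_2d = toy.resample("2D").median()
print(toy_2d)
```

Each bin is labelled with its first day, so six daily rows collapse into three rows.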

In [None]:
#Solution: 


### Exercise 1.3: Inspect the new dataframe to see the difference in size compared to the initial dataframe.

In [None]:
#Solution: 


### Exercise 1.4:  Write your new dataframe to a csv file.

In [None]:
#Solution: 


### Exercise 1.5: Merge the two dataframes. There are several ways to do this, see also https://pandas.pydata.org/pandas-docs/stable/merging.html.
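A minimal sketch of three common ways to combine frames on their index is shown here. The frames `left` and `right` and the column name `A_median` are hypothetical stand-ins, not the exercise dataframes.

```python
import pandas as pd

# Hypothetical toy frames that share part of a date index
left = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0]},
    index=pd.date_range("2013-01-01", periods=3, freq="D"),
)
right = pd.DataFrame(
    {"A_median": [1.5, 3.0]},
    index=pd.to_datetime(["2013-01-01", "2013-01-03"]),
)

# Left join on the index: keeps every row of `left`, fills gaps with NaN
joined = left.join(right, how="left")

# The same kind of join written with merge, matching on both indexes
merged = pd.merge(left, right, left_index=True, right_index=True, how="left")

# Place the frames side by side; the default outer join keeps the union of the indexes
side_by_side = pd.concat([left, right], axis=1)
```

Which variant fits best depends on whether rows missing from one frame should be kept (`how="left"` or `"outer"`) or dropped (`how="inner"`).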

In [None]:
# Solution


### Exercise 1.6: There are several ways to perform operations on dataframe columns. The dataframe has several columns containing negative values. For this exercise, find the negative values in one column and create a new column holding their absolute values, first using a list comprehension and then using a lambda function. You can use the %timeit magic to see whether the two approaches differ in speed.
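The two approaches can be sketched on a toy column as follows. The frame `toy` and the new column names are assumptions made for this illustration; the vectorised `abs()` call is included only as a point of comparison.

```python
import pandas as pd

# Hypothetical toy frame with a mix of negative and non-negative values
toy = pd.DataFrame({"A": [-1.5, 2.0, -0.25, 0.0]})

# Boolean mask selecting the rows where column A is negative
negatives = toy[toy["A"] < 0]

# Method 1: list comprehension over the column values
toy["A_abs_lc"] = [abs(x) for x in toy["A"]]

# Method 2: lambda function applied element-wise
toy["A_abs_lambda"] = toy["A"].apply(lambda x: abs(x))

# For comparison: the vectorised built-in method
toy["A_abs_vec"] = toy["A"].abs()
```

In a notebook cell, a line such as `%timeit toy["A"].apply(lambda x: abs(x))` reports the average run time of that statement, which makes the approaches easy to compare on a larger column.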

In [None]:
# Solution
# method 1: list comprehension


In [None]:
# Solution
# method 2: lambda function


## Supervised learning using scikit-learn - Classification of MNIST data

### Exercise 2.1: Download the digit ('MNIST original') dataset from mldata.org, a public repository for machine learning data. Divide the data into a training set and a test set. Please use 1/7 for training and the rest for testing.

Hint: The sklearn.datasets package can download datasets directly from the repository with the function sklearn.datasets.fetch_mldata. Note that fetch_mldata was removed in scikit-learn 0.22 and mldata.org is no longer online; on recent versions, sklearn.datasets.fetch_openml('mnist_784') retrieves the same data from OpenML. Generate the training and test sets by importing train_test_split from sklearn.model_selection.

 

In [None]:
# Solution

import sklearn
from sklearn.datasets import fetch_openml  # replaces fetch_mldata, which was removed in scikit-learn 0.22

# Download the MNIST original dataset

from sklearn.model_selection import train_test_split

# Split the images into training and testing


### Exercise 2.2: The performance of many machine learning algorithms depends on the scale of the input features. Typically, you need to scale the features in your data before applying any algorithm. Normalize the data and plot some random images from the dataset.

Hint: Use StandardScaler from sklearn.preprocessing to help you standardize the dataset’s features onto unit scale (mean = 0 and variance = 1)
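The fit-on-train, transform-both pattern can be sketched on random data as below. The arrays `X_train_toy` and `X_test_toy` are hypothetical stand-ins for the real training and test images.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the training and test feature matrices
X_train_toy = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
X_test_toy = rng.normal(loc=5.0, scale=3.0, size=(50, 4))

scaler = StandardScaler()
# Learn the per-feature mean and standard deviation from the training data only
scaler.fit(X_train_toy)

# Reuse the same statistics for both sets
X_train_scaled = scaler.transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step. As a consequence, the scaled test features are only approximately centered.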



In [None]:
# Solution

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Fit on training set only

# Apply transform to both the training set and the test set


In [None]:
# Solution (Visualization)
%matplotlib inline

import matplotlib.pyplot as plt


### Exercise 2.3: Logistic regression is one of the simplest linear classification algorithms. Fit a logistic regression model to the training images. Compute the accuracy of the classifier on the test images, and the time needed to train the model.

Hint: Use LogisticRegression from sklearn.linear_model. To increase speed, set the solver to 'lbfgs' (this is already the default from scikit-learn 0.22 onward).
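A minimal sketch of the fit, time, and score steps is given here. It assumes the small 8x8 digits dataset bundled with scikit-learn as a quick stand-in for MNIST. The split ratio and `max_iter=1000` are illustrative choices, not values taken from the exercise.

```python
from time import time

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled 8x8 digits set stands in for MNIST
X_toy, y_toy = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

# Standardize with statistics learned on the training split only
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

tic = time()
clf = LogisticRegression(solver="lbfgs", max_iter=1000)
clf.fit(X_tr, y_tr)
toc = time()

# score() returns the mean accuracy on the given data
accuracy = clf.score(X_te, y_te)
print("Accuracy: %.3f, training time: %.2f seconds" % (accuracy, toc - tic))
```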


In [None]:
# Solution

from time import time
from sklearn.linear_model import LogisticRegression

tic = time()
# Fit a logistic regression model

# Compute the classification score

toc = time()
print('The total time is %s seconds ' % (toc-tic))


### Exercise 2.4: Apply Principal Component Analysis (PCA) to the training signals by keeping only (a) 25%, (b) 75%, and (c) 95% of the energy (explained variance). For each of the three cases, output the number of principal components required. Then, plot the cumulative explained variance against the number of PCA components. Finally, choose a random image from the dataset and show its approximation with the PCA components.

Hint: For computing the Cumulative Explained Variance over PCA use:
```
pca.explained_variance_ratio_.cumsum()

```
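One way to turn the cumulative ratio into a component count is sketched below. It again assumes the bundled 8x8 digits set rather than MNIST, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumption: the bundled 8x8 digits set stands in for MNIST
X_toy = load_digits().data

# Keep all components so the full variance spectrum is available
pca = PCA().fit(X_toy)
cumulative = pca.explained_variance_ratio_.cumsum()

# Smallest number of components whose cumulative ratio reaches each threshold
for threshold in (0.25, 0.75, 0.95):
    n_components = int(np.searchsorted(cumulative, threshold) + 1)
    print(f"{threshold:.0%} of the variance: {n_components} components")

# A float in (0, 1) asks PCA to choose the number of components itself
pca_95 = PCA(n_components=0.95).fit(X_toy)

# Approximate one image from its leading components
approx = pca_95.inverse_transform(pca_95.transform(X_toy[:1]))
```

Reshaping `approx` to the image dimensions (8x8 here, 28x28 for MNIST) and passing it to `plt.imshow` shows the reconstruction next to the original.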

In [None]:
# Solution 
from sklearn.decomposition import PCA

# Fit a PCA model

# Compute the number of PCA components


In [None]:
# Plot the Cumulative Explained Variance over PCA


In [None]:
# Choose a random image from the dataset, and show its approximation with the PCA components

plt.figure(figsize=(8,4));

# Original Image
plt.subplot(1, 2, 1);

# Approximation
plt.subplot(1, 2, 2);


### Exercise 2.5: Fit a logistic regression model to the approximation of the training images with 95% of explained variance. Compute the accuracy of the classifier and the time needed to train the model. Compare it to the one obtained in 2.3. What do you observe? 


In [None]:
# Solution

tic = time()

# Fit a logistic regression model on the PCA coefficients

toc = time()

print('The total time is %s seconds' % (toc-tic))


## Unsupervised learning with sklearn.cluster.KMeans()

###  Exercise 3.1: Generate a set of 6 isotropic Gaussian blobs, with 1000 samples each. Each sample should have 60 features. 

Hint: Use the sklearn.datasets.make_blobs to generate the data
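For orientation, a scaled-down call might look like the following. The sizes (3 blobs, 50 samples each, 5 features) and the fixed `random_state` are deliberately smaller, hypothetical values, not the ones the exercise asks for.

```python
from sklearn.datasets import make_blobs

# Scaled-down illustration: 150 samples split evenly over 3 centers, 5 features each
X_toy, y_toy = make_blobs(n_samples=150, centers=3, n_features=5, random_state=0)

print(X_toy.shape)  # feature matrix: one row per sample
print(y_toy[:10])   # integer index of the blob each sample was drawn from
```

When `n_samples` is an integer, it is the total number of points, divided equally among the centers.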

In [None]:
# Solution

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate the data


### Exercise 3.2: Apply PCA to the generated data. Store the first two principal components and their cluster index in a new dataframe. Visualize the 6 blobs based only on these two components.

In [None]:
# Solution

# Fit PCA to the data

# Generate a new dataframe and store the first two principal components and the true cluster index

# Visualize the data by plotting their representation on the two principal components (x and y axis)


### Exercise 3.3: Set the number of clusters to 6 and apply k-means clustering to the data. Compute the accuracy score between the true labels and the ones estimated by the k-means algorithm.
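Because k-means numbers its clusters arbitrarily, the predicted indices cannot be compared to the true labels directly. The sketch below shows the relabelling idea on a smaller toy problem; the `_toy` names, the 3 clusters, and `n_init=10` are illustrative assumptions.

```python
import numpy as np
from scipy.stats import mode
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score

# Hypothetical smaller problem: 3 blobs, 300 samples, 5 features
X_toy, y_true_toy = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
y_kmeans_toy = kmeans.fit_predict(X_toy)

# Map each cluster to the most common true label among its members
labels = np.zeros_like(y_true_toy)
for i in range(3):
    mask = (y_kmeans_toy == i)
    labels[mask] = mode(y_true_toy[mask])[0]

acc = accuracy_score(y_true_toy, labels)
print("Accuracy after relabelling: %.3f" % acc)
```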

In [None]:
# Solution

from sklearn.cluster import KMeans

# Fit a KMeans model to the data

from scipy.stats import mode

# Uncomment this part to compute the accuracy score
#  y_true: the true cluster index
#  y_kmeans: the cluster index assigned by Kmeans

"""
labels = np.zeros_like(y_true)
for i in range(6):
    mask = (y_kmeans == i)
    labels[mask] = mode(y_true[mask])[0]
    
from sklearn.metrics import accuracy_score
accuracy_score(y_true, labels)
"""


### Exercise 3.4: Do the same by clustering the data using only the first 2 principal components. What do you observe?

In [None]:
# Solution

# Fit a Kmeans model to the first 2 PCA coefficients of the data

# Uncomment this part to compute the accuracy score
# y_true: the true cluster index
# y_kmeans: the cluster index assigned by Kmeans

"""
labels = np.zeros_like(y_true)
for i in range(6):
    mask = (y_kmeans == i)
    labels[mask] = mode(y_true[mask])[0]
    
accuracy_score(y_true, labels)
"""
