
Lab problems Spring 2021


Fix the titles and numbering for all the labs

Lab 2: wrong variable names for ANES 2016

**Exercise 2** To make sure you understand the loc, iloc, and drop functions, try selecting the columns V4002 to V4008 with only the first 3 rows:
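For reference, one way that selection might look (a sketch assuming the lab's DataFrame is named `df` and the columns are contiguous; adjust names as needed):

```python
# .loc selects by label and includes both endpoints,
# so this keeps columns V4002 through V4008 for the first 3 rows
subset = df.loc[df.index[:3], 'V4002':'V4008']

# an equivalent positional version with .iloc
subset = df.iloc[:3, df.columns.get_loc('V4002'):df.columns.get_loc('V4008') + 1]
```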

[in progress] Lab 4: fix the lab on probability and bootstrapping: it uses the ANES 2016 Pilot Study rather than ANES 2016, so it must be rewritten to use ANES 2016

Lab 5: we have to fix this lab: not only was the jury selection problem a bit problematic (Wilson did fix it, but in a somewhat difficult way; I guess he followed Adhikari and DeNero?), but the t-test part was wrong. I made a solution for it on the Datahub in my own directory and pushed it to this repo (Large n Solutions_jon_), and there is additional explanation from Wilson in the Sp20 directory (the filename is Large_n_jon).

Lab 7: the Mapbox Bright tiles are no longer available without API access, so probably use "OpenStreetMap" tiles instead. James Weichert from the class suggested the following.

Most built-in tiles (map styles) for folium have been deprecated, so if you want more fancy styles for your map, you'll have to import the styles yourself (it's not too much work).

A list of custom styles can be found here: http://leaflet-extras.github.io/leaflet-providers/preview/

To import the tile you want, copy the URL generated for the tile you select on the website above (it starts with 'https' and ends with '.png'). Paste this as a string into the "tiles" argument of the folium map in Python. However, folium will also want attribution for the custom styling, so you need to pass a string into the "attr" argument (this can literally be any string).

The full code for generating a map with custom styles is:

folium.Map([x_coord, y_coord], tiles="https...png", attr="Insert attribution here")
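For example, a runnable version using the standard OpenStreetMap tile URL (the coordinates here are Berkeley and are just placeholders; any provider URL from the preview page works the same way):

```python
import folium

m = folium.Map(
    location=[37.87, -122.27],  # [latitude, longitude]
    tiles="https://tile.openstreetmap.org/{z}/{x}/{y}.png",
    attr="Map data © OpenStreetMap contributors",  # folium requires some attribution string
    zoom_start=13,
)
m  # in a notebook, this displays the map inline
```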


Lab 11 Math in SciPy
in the lab solutions:
syntax error in "integrate a normal distribution":
count, bins, ignored = plt.hist(s, 30, normed=True)
should be
count, bins, ignored = plt.hist(s, 30, density=True)
math error right at the end in
Using the cdf, integrate the normal distribution from -0.5 to 0.5
norm.cdf(0.5)
should read
norm.cdf(0.5)-norm.cdf(-0.5)
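Putting the two corrected lines together (a standalone sketch; `s` here is a hypothetical draw, whereas the lab builds its sample earlier in the notebook):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

s = np.random.normal(loc=0, scale=1, size=1000)       # stand-in for the lab's sample
count, bins, ignored = plt.hist(s, 30, density=True)  # 'normed' no longer exists in matplotlib

# integrate the standard normal from -0.5 to 0.5 via the CDF
print(norm.cdf(0.5) - norm.cdf(-0.5))                  # ≈ 0.383
```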

Lab 22 Word Embedding Models
change in gensim requires change in syntax
instead of model.similarity('president','leadership') use model.wv.similarity('korea','nuclear')
instead of model.doesnt_match(['president', 'violent', 'leadership']) use model.wv.doesnt_match(['kenyatta', 'kimoon', 'kofi'])
and so on throughout, including changes to the model parameter syntax; Ilya is going to fix this in Sp2021, I think
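For reference, a minimal sketch of the gensim 4.x style calls (toy sentences here, not the lab's corpus; the old parameter names `size` and `iter` became `vector_size` and `epochs`):

```python
from gensim.models import Word2Vec

# toy tokenized corpus standing in for the lab's documents
sentences = [["korea", "nuclear", "talks"],
             ["president", "leadership", "summit"],
             ["korea", "president", "summit"]]

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=20)

# word-level queries now go through the .wv (KeyedVectors) attribute
print(model.wv.similarity("korea", "nuclear"))
print(model.wv.doesnt_match(["korea", "nuclear", "president"]))
```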
Lab 13 Model Selection
fix the lab title to match the lab number (do this for all the labs!)
fix the labels for the scatterplots to match the model that is being graphed (now they all say they're OLS)
no reason to normalize in the OLS regression model constructor; do we need to use StandardScaler for it? it does not appear so, and yet scikit-learn says that Ridge and Lasso both assume standardized data
it looks like the calculations in the lab solutions use standardized data without using StandardScaler
in my own lab notebook, where I use StandardScaler, I get the same answers as the solution notebook for the standardized features
but different answers for the unstandardized features
so, somehow, scikit-learn is using the standardized features by default in the solution notebook
but I don't know how this could be

Lab 12 Regression: similar to Lab 13, the OLS input is normalized but the inputs for Ridge and Lasso are not; add StandardScaler for OLS, Ridge, and Lasso to be consistent with Lab 13 (see the sketch below).
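A sketch of what consistent standardization could look like for all three models (synthetic data and placeholder alphas, not the lab's variables):

```python
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# synthetic stand-in for the lab's feature matrix and target
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# wrap each estimator in a pipeline so the features are scaled the same way everywhere
models = {
    "OLS": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, model.score(X, y))
```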

Lab 15 Text Preprocessing
the student version of the lab displays the parsed sentence tree using pretty, but it is very hard to read
the solution version tries to use draw, but it cannot render the tree without being told where to display it (the $DISPLAY environment variable)
it may be better to see whether we can get the draw method in nltk to work, or just leave that cell out of both the student and solution versions
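If we keep the cell, one alternative worth trying (a sketch with a made-up parse; the lab builds its tree from the parser output) is pretty_print(), which renders an ASCII tree in the notebook output and avoids the Tk/$DISPLAY requirement of draw():

```python
from nltk.tree import Tree

# made-up bracketed parse standing in for the lab's parser output
t = Tree.fromstring("(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))")

t.pretty_print()  # ASCII rendering in the cell output; no display server needed
```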

Lab 16: Bag of Words
fix the syntax for the stopwords when using the CountVectorizer constructor function
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english')
dtm = cv.fit_transform(tokens_list)
I think scikit-learn just changed the syntax
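A runnable version with toy documents (not the lab's corpus; CountVectorizer expects an iterable of raw document strings):

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = ["the cat sat on the mat", "the dog ate my homework"]  # stand-in corpus

cv = CountVectorizer(stop_words='english')   # built-in English stop word list
dtm = cv.fit_transform(docs)                 # sparse document-term matrix

# one row per document, one column per remaining term
# (use cv.get_feature_names() on scikit-learn versions before 1.0)
print(pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out()))
```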

Lab 19 TF-IDF
the labels for the confusion matrices for the SVM classifier are wrong (they are copied and pasted from "multinomial logistic regression")

Lab 21 Neural Networks
fitting the model takes more than 1GB of memory, so given the current quota on Datahub the kernel dies every time
we need to either get more memory or add a preface to the lab telling students to zip and download the lab folder and run the lab locally
added a Colab version, which lets students 1) use a GPU and 2) avoid local TensorFlow dependency issues
Lab 25 Ensemble Methods
we search for the best set of parameters for the random forest model but then we don't use them in the model
also, there should be an explanation that 'n_estimators' is the number of trees in the forest
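A sketch of how the searched parameters could actually be used (toy data and a hypothetical grid, not the lab's settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# toy data standing in for the lab's features and labels
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# n_estimators = number of trees in the forest
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

# use the tuned model instead of refitting a default random forest
best_rf = search.best_estimator_   # already refit on the full data with best_params_
print(search.best_params_)
```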


Ilya thinks we should have a BERT lab in the text part of the class, which is true given the currency of BERT
https://github.com/VincentK1991/BERT_summarization_1
