From 32f3f04f6605e7d8c0b292d132cc5509c1ee452b Mon Sep 17 00:00:00 2001
From: Suraj Rampure

Winter 2023 Final Exam

What type of test is being proposed above?
( ) Hypothesis test
( ) Permutation test
The DataFrame sat contains one row for most combinations of "Year" and "State", where "Year" ranges between 2005 and 2015 and "State" is one of the 50 states (not including the District of Columbia).

The other columns are as follows:
Problem 2.4
@@ -888,12 +891,12 @@
Problem 4

We decide to help Nicole extract citation numbers from papers. Consider the string s below.

s = '''
In DSC 10 [3], you learned about babypandas, a strict subset
of pandas [15][4]. It was designed [5] to provide programming
beginners [3][91] just enough syntax to be able to perform
meaningful tabular data analysis [8] without getting lost in
100s of details.
'''
Consider the following four extracted lists. For each expression below, select the list it evaluates to, or select "None of the above."

list1 = ['3', '15', '4', '5', '3', '91', '8']
list2 = ['10', '3', '15', '4', '5', '3', '91', '8', '100']
list3 = ['[3]', '[15]', '[4]', '[5]', '[3]', '[91]', '[8]']
list5 = ['1', '0', '3', '1', '5', '4', '5', '3',
         '9', '1', '8', '1', '0', '0']
Answer: list2

This regex pattern, \d+, matches one or more digits anywhere in the string. It doesn't concern itself with the context of the digits, whether they are inside brackets or not. As a result, it extracts all sequences of digits in s, including '10', '3', '15', '4', '5', '3', '91', '8', and '100', which together form list2. This is because \d+ greedily matches all contiguous digits, capturing both the citation numbers and any other numbers present in the text.
Answer: list5

This pattern, [\d+], is slightly misleading because the square brackets are used to define a character class, and the plus sign inside is treated as a literal character, not as a quantifier. However, since there are no plus signs in s, this detail does not affect the outcome. The character class matches any digit, so this pattern effectively matches individual digits throughout the string, resulting in list5. This list contains every single digit found in s, separated into individual string elements.
Answer: list1

This pattern is specifically designed to match digits that are enclosed in square brackets. The pattern \[(\d+)\] looks for a sequence of one or more digits (\d+) inside square brackets. The parentheses capture only the digits as a group, excluding the brackets from the result. Therefore, it extracts just the citation numbers as they appear in s, matching list1 exactly. This method is precise for extracting citation numbers from text formatted in this bracketed style.
Answer: list3

Similar to the previous explanation, but with a key difference: the entire pattern of digits within square brackets is captured, including the brackets themselves. The pattern (\[\d+\]) searches for sequences of digits surrounded by square brackets, and the parentheses around the entire pattern ensure that the match includes the brackets. This results in list3, which contains all the citation markers found in s, with the brackets preserved to clearly denote them as citations.
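These four answers can be checked directly with re.findall; here is a minimal sketch using the list names defined above:

```py
import re

# s as defined in the problem statement above.
s = '''
In DSC 10 [3], you learned about babypandas, a strict subset
of pandas [15][4]. It was designed [5] to provide programming
beginners [3][91] just enough syntax to be able to perform
meaningful tabular data analysis [8] without getting lost in
100s of details.
'''

print(re.findall(r'\d+', s))        # ['10', '3', '15', '4', '5', '3', '91', '8', '100']  -> list2
print(re.findall(r'[\d+]', s))      # every individual digit character                    -> list5
print(re.findall(r'\[(\d+)\]', s))  # ['3', '15', '4', '5', '3', '91', '8']               -> list1
print(re.findall(r'(\[\d+\])', s))  # ['[3]', '[15]', '[4]', '[5]', '[3]', '[91]', '[8]'] -> list3
```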
Problem 5

We download a score report and parse it with BeautifulSoup. Suppose soup is a BeautifulSoup object instantiated using the following HTML document.
<college>Your score is ready!</college>

<sat verbal="ready" math="ready">
  Your percentiles are as follows:
  <scorelist listtype="percentiles">
    <scorerow kind="verbal" subkind="per">
      Verbal: <scorenum>84</scorenum>
    </scorerow>
    <scorerow kind="math" subkind="per">
      Math: <scorenum>99</scorenum>
    </scorerow>
  </scorelist>
  And your actual scores are as follows:
  <scorelist listtype="scores">
    <scorerow kind="verbal"> Verbal: <scorenum>680</scorenum> </scorerow>
    <scorerow kind="math"> Math: <scorenum>800</scorenum> </scorerow>
  </scorelist>
</sat>
Which of the following expressions evaluate to "verbal"? Select all that apply.
soup.find("scorerow").get("kind")
soup.find("sat").get("ready")
Answer: Option 1, Option 3, Option 4
Correct options:

Option 1 finds the first <scorerow> element and retrieves its "kind" attribute, which is "verbal" for the first <scorerow> encountered in the HTML document.

Option 3 finds the first <scorerow> tag, retrieves its text ("Verbal: 84"), splits this text by ":", and takes the first element of the resulting list ("Verbal"), converting it to lowercase to match "verbal".

Option 4 collects the "kind" attributes for all <scorerow> elements. The second to last (-2) element in this list corresponds to the "kind" attribute of the first <scorerow> in the second <scorelist> tag, which is also "verbal".
Incorrect options:

Option 2 tries to retrieve the "ready" attribute from the <sat> tag, which does not exist as an attribute (in this document, "ready" is an attribute value, not an attribute name).

The other incorrect option tries to retrieve the "kind" attribute from a <scorelist> tag, but <scorelist> does not have a "kind" attribute.
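As a quick check, here is a minimal sketch that parses the document above and evaluates the expressions discussed. Here html_doc is a hypothetical variable name holding the HTML shown above, and the last line is just one way to write the option described in the solution, not necessarily the exam's exact wording:

```py
from bs4 import BeautifulSoup

# html_doc holds the HTML document shown above (hypothetical variable name).
html_doc = """
<college>Your score is ready!</college>

<sat verbal="ready" math="ready">
  Your percentiles are as follows:
  <scorelist listtype="percentiles">
    <scorerow kind="verbal" subkind="per"> Verbal: <scorenum>84</scorenum> </scorerow>
    <scorerow kind="math" subkind="per"> Math: <scorenum>99</scorenum> </scorerow>
  </scorelist>
  And your actual scores are as follows:
  <scorelist listtype="scores">
    <scorerow kind="verbal"> Verbal: <scorenum>680</scorenum> </scorerow>
    <scorerow kind="math"> Math: <scorenum>800</scorenum> </scorerow>
  </scorelist>
</sat>
"""
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.find("scorerow").get("kind"))   # 'verbal'
print(soup.find("sat").get("ready"))       # None: "ready" is a value, not an attribute name
# One expression matching the description of Option 4 (illustrative, not the exam's wording):
print([row.get("kind") for row in soup.find_all("scorerow")][-2])   # 'verbal'
```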
Consider the following function.
def summer(tree):
if isinstance(tree, list):
@@ -1138,6 +1193,20 @@
Answer: a: "scorelist"; b: "scorelist", attrs={"listtype": "scores"}; c: "scorerow", attrs={"kind": "math"}

soup.find("scorelist") selects the first <scorelist> tag, which includes both the verbal and math percentiles (84 and 99). The function summer(tree) sums these values to get 183.

The second blank selects the <scorelist> tag with listtype="scores", which contains the actual scores of verbal (680) and math (800). The function sums these to get 1480.

The third blank selects all <scorerow> elements with kind="math", capturing both the percentile (99) and the actual score (800). Since tree is now a list, summer(tree) iterates through each <scorerow> in the list, summing their <scorenum> values to reach 899.
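These three sums can be sanity-checked without the full summer implementation. The sketch below reuses the soup object built in the earlier sketch and a small hypothetical helper (total_scorenum is made up, not the exam's function):

```py
# Hypothetical helper (not the exam's summer function): add up every <scorenum>
# value inside a tag, or inside each tag of a list of tags.
def total_scorenum(tree):
    if isinstance(tree, list):
        return sum(total_scorenum(t) for t in tree)
    return sum(int(num.text) for num in tree.find_all("scorenum"))

print(total_scorenum(soup.find("scorelist")))                                # 84 + 99 = 183
print(total_scorenum(soup.find("scorelist", attrs={"listtype": "scores"})))  # 680 + 800 = 1480
print(total_scorenum(soup.find_all("scorerow", attrs={"kind": "math"})))     # 99 + 800 = 899
```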
Consider the following list of tokens.
tokens = ["is", "the", "college", "board", "the", "board", "of", "college"]
Recall, a uniform language model is one in which each
unique token has the same chance of being sampled.
Suppose we instantiate a uniform language model on tokens.

The probability of the sentence "the college board is", that is, P(\text{the college board is}), is of the form \frac{1}{a^b}, where a and b are both positive integers.

What are a and b?
Answer: a = 5, b = 4

In a uniform language model, each unique token has the same chance of being sampled. Given the list of tokens, there are 5 unique tokens: ["is", "the", "college", "board", "of"]. The probability of sampling any one token is \frac{1}{5}. For a sentence of 4 tokens ("the college board is"), the probability is \frac{1}{5^4}, because each token is sampled independently. Thus, a = 5 and b = 4.
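A quick sketch verifying this count and probability (not part of the exam):

```py
from fractions import Fraction

tokens = ["is", "the", "college", "board", "the", "board", "of", "college"]

n_unique = len(set(tokens))       # 5 unique tokens
p = Fraction(1, n_unique) ** 4    # each of the 4 words in "the college board is" has probability 1/5
print(n_unique, p)                # 5 1/625
```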
Answer: (c, d) = (2, 9) or (8, 3)

In a unigram language model, the probability of sampling a token is proportional to its frequency in the token list. The frequencies are: "is" = 1, "the" = 2, "college" = 2, "board" = 2, "of" = 1, out of 8 tokens total. The sentence "the college board is" therefore has per-word probabilities \frac{2}{8}, \frac{2}{8}, \frac{2}{8}, and \frac{1}{8}, and the combined probability is \frac{2}{8} \cdot \frac{2}{8} \cdot \frac{2}{8} \cdot \frac{1}{8} = \frac{8}{4096} = \frac{1}{512} = \frac{1}{2^9}, or, equivalently, \frac{1}{8^3} since 512 = 8^3. Therefore, c = 2 and d = 9, or c = 8 and d = 3, depending on how you write the fraction.
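The same kind of sketch for the unigram model, using exact fractions:

```py
from collections import Counter
from fractions import Fraction

tokens = ["is", "the", "college", "board", "the", "board", "of", "college"]
counts = Counter(tokens)          # {'the': 2, 'college': 2, 'board': 2, 'is': 1, 'of': 1}
n = len(tokens)                   # 8 tokens in total

p = Fraction(1)
for word in ["the", "college", "board", "is"]:
    p *= Fraction(counts[word], n)    # (2/8) * (2/8) * (2/8) * (1/8)

print(p, p == Fraction(1, 2**9), p == Fraction(1, 8**3))   # 1/512 True True
```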
Answer: Sentence 4

A bigram model looks at the probability of a word given the previous word. Sentence 4, "the college board of college", likely has higher probabilities for its bigrams ("the college", "college board", "board of", "of college") based on the original list of tokens, which contains all of these pairs. This reasoning assumes that the given pairs appear more frequently, or are more probable in sequence, than the pairs in the other sentences.
Answer: Yes

In the context of TF-IDF, if a word appears in every sentence, its inverse document frequency (IDF) part is \log\left(\frac{5}{5}\right) = \log(1) = 0, making the TF-IDF score 0 for that word across all documents. Since "the" appears in all five sentences, its IDF is zero, leading to a column of zeros in the TF-IDF matrix for "the".
Answer: Sentence 4

The word "college" likely has the highest TF-IDF in Sentence 4 because it appears less frequently across all sentences and is relatively more important (i.e., has a higher term frequency) in Sentence 4 than in the other sentences where it appears. TF-IDF rewards words that are unique to a document but penalizes those that are common across all documents.
Answer: the smallest

The DF-ITF score is lower for terms that are more unique (appear in fewer documents) and have a higher count in the document they appear in. A smaller DF-ITF indicates that a term is both important within a specific document and distinctive across the corpus. Therefore, the term with the smallest DF-ITF in a document is considered the best summary for that document, as it balances document-specific significance with corpus-wide uniqueness.
Problem 7.1
For which of the following options is the above statement not guaranteed to be true?
Note: Treat sat as our training set.
Option 1:
a = (sat['Math'] > sat['Verbal']).mean()
b = 0.5

Option 2:
a = (sat['Math'] - sat['Verbal']).mean()
b = 0

Option 3:
a = (sat['Math'] - sat['Verbal'] > 0).mean()
b = 0.5

Option 4:
a = ((sat['Math'] / sat['Verbal']) > 1).mean() - 0.5
b = 0
Option 1
Option 2
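For intuition, here is a small check on made-up data (toy_sat is hypothetical, not the exam's sat DataFrame) showing that Options 1 and 3 compute the same proportion, while Option 2 computes a mean point difference, which is a different kind of quantity:

```py
import pandas as pd

# Made-up data; not the actual sat DataFrame.
toy_sat = pd.DataFrame({"Math": [600, 520, 580, 610], "Verbal": [590, 530, 580, 500]})

a1 = (toy_sat["Math"] > toy_sat["Verbal"]).mean()        # proportion of rows with Math > Verbal
a3 = (toy_sat["Math"] - toy_sat["Verbal"] > 0).mean()    # the same proportion, written differently
a2 = (toy_sat["Math"] - toy_sat["Verbal"]).mean()        # mean point difference, not a proportion
a4 = ((toy_sat["Math"] / toy_sat["Verbal"]) > 1).mean() - 0.5   # a proportion, shifted down by 0.5

print(a1, a3, a2, a4)   # 0.5 0.5 27.5 0.0
```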
"medium"
, or "high"
. Since we can’t use
strings as features in a model, we decide to encode these strings using
the following Pipeline
:
# Note: The FunctionTransformer is only needed to change the result
# of the OneHotEncoder from a "sparse" matrix to a regular matrix
# so that it can be used with StandardScaler;
# it doesn't change anything mathematically.
pl = Pipeline([
    ("ohe", OneHotEncoder(drop="first")),
    ("ft", FunctionTransformer(lambda X: X.toarray())),
    ("ss", StandardScaler())
])
After calling pl.fit(lunch_props), pl.transform(lunch_props) evaluates to the following array:
array([[ 1.29099445, -0.37796447],
       [-0.77459667, -0.37796447],
       [-0.77459667, -0.37796447],
       [-0.77459667,  2.64575131],
       [ 1.29099445, -0.37796447],
       [ 1.29099445, -0.37796447],
       [-0.77459667, -0.37796447],
       [-0.77459667, -0.37796447]])
and pl.named_steps["ohe"].get_feature_names() evaluates to the following array:

array(["x0_low", "x0_med"], dtype=object)
Fill in the blanks: Given the above information, we can conclude that lunch_props has (a) value(s) equal to "low", (b) value(s) equal to
diff --git a/problems/wi23-final/wi23-final-data-info.md b/problems/wi23-final/wi23-final-data-info.md
index 70e7c97..bd0fa13 100644
--- a/problems/wi23-final/wi23-final-data-info.md
+++ b/problems/wi23-final/wi23-final-data-info.md
@@ -1,4 +1,4 @@
-The DataFrame `sat` contains one row for **most** combinations of `"Year"` and `"State"`, where `"Year"`ranges between `2005` and `2015` and `"State"` is one of the 50 states (not including the District of Columbia).
+The DataFrame `sat` contains one row for **most** combinations of `"Year"` and `"State"`, where `"Year"` ranges between `2005` and `2015` and `"State"` is one of the 50 states (not including the District of Columbia).
The other columns are as follows:
diff --git a/problems/wi23-final/wi23-final-q02.md b/problems/wi23-final/wi23-final-q02.md
index 00d0058..74b6743 100644
--- a/problems/wi23-final/wi23-final-q02.md
+++ b/problems/wi23-final/wi23-final-q02.md
@@ -91,8 +91,9 @@ The DataFrame `scores_2015`, shown in its entirety below, contains the verbal se