diff --git a/docs/wi23-final/index.html b/docs/wi23-final/index.html index 4a6e06e..5e15b48 100644 --- a/docs/wi23-final/index.html +++ b/docs/wi23-final/index.html @@ -139,7 +139,7 @@
The DataFrame sat contains one row for most combinations of "Year" and "State", where "Year" ranges between 2005 and 2015 and "State" is one of the 50 states (not including the District of Columbia).

The other columns are as follows:
@@ -703,7 +703,10 @@

What type of test is being proposed above?

Hypothesis test
Permutation test

Consider the string s below.

s = '''
In DSC 10 [3], you learned about babypandas, a strict subset
of pandas [15][4]. It was designed [5] to provide programming
beginners [3][91] just enough syntax to be able to perform
meaningful tabular data analysis [8] without getting lost in
100s of details.
'''
We decide to help Nicole extract citation numbers from papers. Consider the following four extracted lists.
Problem 4

list1 = ['3', '15', '4', '5', '3', '91', '8']
list2 = ['10', '3', '15', '4', '5', '3', '91', '8', '100']
list3 = ['[3]', '[15]', '[4]', '[5]', '[3]', '[91]', '[8]']
list5 = ['1', '0', '3', '1', '5', '4', '5', '3',
         '9', '1', '8', '1', '0', '0']
For each expression below, select the list it evaluates to, or select “None of the above.”
@@ -930,6 +933,13 @@

Answer: list2

The regex pattern \d+ matches one or more digits anywhere in the string. It does not care about the context of the digits, i.e., whether they are inside brackets or not. As a result, it extracts every contiguous run of digits in s, including '10', '3', '15', '4', '5', '3', '91', '8', and '100', which together form list2. Because \d+ matches greedily, it captures each full run of digits as a single token, picking up both the citation numbers and any other numbers present in the text.
Answer: list5

This pattern [\d+] is slightly misleading: the square brackets define a character class, and the plus sign inside it is treated as a literal character, not as a quantifier. However, since there are no plus signs in s, this detail does not affect the outcome. The character class matches any single digit, so this pattern effectively matches individual digits throughout the string, resulting in list5, which contains every single digit found in s as a separate string element.
Answer: list1

This pattern is specifically designed to match digits that are enclosed in square brackets. The \[(\d+)\] pattern looks for a sequence of one or more digits \d+ inside square brackets. The parentheses capture only the digits as a group, excluding the brackets from the result. Therefore, it extracts just the citation numbers as they appear in s, matching list1 exactly. This method is precise for extracting citation numbers from text formatted in this style.
Answer: list3

Similar to the previous pattern, but with a key difference: the entire pattern of digits within square brackets is captured, including the brackets themselves. The pattern \[\d+\] searches for sequences of digits surrounded by square brackets, and the parentheses around the entire pattern ensure that the match includes the brackets. This results in list3, which contains all the citation markers found in s, preserving the brackets to clearly denote them as citations.
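As a sanity check, all four patterns discussed above can be run directly on s with re.findall. A quick sketch (s is the string from the problem):

```python
import re

s = '''In DSC 10 [3], you learned about babypandas, a strict subset
of pandas [15][4]. It was designed [5] to provide programming
beginners [3][91] just enough syntax to be able to perform
meaningful tabular data analysis [8] without getting lost in
100s of details.'''

all_runs = re.findall(r'\d+', s)         # every contiguous run of digits
single_digits = re.findall(r'[\d+]', s)  # character class: one digit at a time
citations = re.findall(r'\[(\d+)\]', s)  # digits inside brackets, brackets dropped
bracketed = re.findall(r'(\[\d+\])', s)  # bracketed citations, brackets kept

print(all_runs)   # ['10', '3', '15', '4', '5', '3', '91', '8', '100']
print(citations)  # ['3', '15', '4', '5', '3', '91', '8']
print(bracketed)  # ['[3]', '[15]', '[4]', '[5]', '[3]', '[91]', '[8]']
```

Note how the capturing group placement alone decides whether the brackets survive into the output.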
Suppose soup
is a BeautifulSoup object instantiated
using the following HTML document.
<college>Your score is ready!</college>
-
-<sat verbal="ready" math="ready">
-
- Your percentiles are as follows:<scorelist listtype="percentiles">
- <scorerow kind="verbal" subkind="per">
- <scorenum>84</scorenum>
- Verbal: </scorerow>
- <scorerow kind="math" subkind="per">
- <scorenum>99</scorenum>
- Math: </scorerow>
- </scorelist>
-
- And your actual scores are as follows:<scorelist listtype="scores">
- <scorerow kind="verbal">
- <scorenum>680</scorenum>
- Verbal: </scorerow>
- <scorerow kind="math">
- <scorenum>800</scorenum>
- Math: </scorerow>
- </scorelist>
- </sat>
<college>Your score is ready!</college>
+
+<sat verbal="ready" math="ready">
+ Your percentiles are as follows:
+ <scorelist listtype="percentiles">
+ <scorerow kind="verbal" subkind="per">
+ Verbal: <scorenum>84</scorenum>
+ </scorerow>
+ <scorerow kind="math" subkind="per">
+ Math: <scorenum>99</scorenum>
+ </scorerow>
+ </scorelist>
+ And your actual scores are as follows:
+ <scorelist listtype="scores">
+ <scorerow kind="verbal"> Verbal: <scorenum>680</scorenum> </scorerow>
+ <scorerow kind="math"> Math: <scorenum>800</scorenum> </scorerow>
+ </scorelist>
+</sat>
Which of the following expressions evaluate to "verbal"? Select all that apply.
soup.find("scorerow").get("kind")
soup.find("sat").get("ready")
Answer: Option 1, Option 3, Option 4
Correct options:

Option 1 finds the first <scorerow> element and retrieves its "kind" attribute, which is "verbal" for the first <scorerow> encountered in the HTML document.

Option 3 finds the first <scorerow> tag, retrieves its text ("Verbal: 84"), splits this text by ":", and takes the first element of the resulting list ("Verbal"), converting it to lowercase to match "verbal".

Option 4 collects the "kind" attributes for all <scorerow> elements. The second-to-last (-2) element in this list corresponds to the "kind" attribute of the first <scorerow> in the second <scorelist> tag, which is also "verbal".

Incorrect options:

Option 2 tries to retrieve a "ready" attribute from the <sat> tag, but "ready" is an attribute value there, not an attribute name, so it does not exist as an attribute.

The other incorrect option tries to retrieve the "kind" attribute from a <scorelist> tag, but <scorelist> does not have a "kind" attribute.

Consider the following function.
def summer(tree):
if isinstance(tree, list):
@@ -1138,6 +1193,20 @@
Answer: a: "scorelist"; b: "scorelist", attrs={"listtype": "scores"}; c: "scorerow", attrs={"kind": "math"}

soup.find("scorelist") selects the first <scorelist> tag, which includes both the verbal and math percentiles (84 and 99). The function summer(tree) sums these values to get 183.

This selects the <scorelist> tag with listtype="scores", which contains the actual scores of verbal (680) and math (800). The function sums these to get 1480.

This selects all <scorerow> elements with kind="math", capturing both the percentile (99) and the actual score (800). Since tree is now a list, summer(tree) iterates through each <scorerow> in the list, summing their <scorenum> values to reach 899.
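These lookups can be reproduced with a short sketch. The HTML below is the corrected document from the problem; since the body of summer isn't fully shown here, its behavior is approximated by summing <scorenum> values directly:

```python
from bs4 import BeautifulSoup

html = '''<college>Your score is ready!</college>
<sat verbal="ready" math="ready">
  Your percentiles are as follows:
  <scorelist listtype="percentiles">
    <scorerow kind="verbal" subkind="per"> Verbal: <scorenum>84</scorenum> </scorerow>
    <scorerow kind="math" subkind="per"> Math: <scorenum>99</scorenum> </scorerow>
  </scorelist>
  And your actual scores are as follows:
  <scorelist listtype="scores">
    <scorerow kind="verbal"> Verbal: <scorenum>680</scorenum> </scorerow>
    <scorerow kind="math"> Math: <scorenum>800</scorenum> </scorerow>
  </scorelist>
</sat>'''

soup = BeautifulSoup(html, "html.parser")

soup.find("scorerow").get("kind")  # 'verbal': attribute of the first <scorerow>
soup.find("sat").get("ready")      # None: "ready" is an attribute value, not a name

# "kind" attributes of all four rows; index -2 is the first row of the second list.
kinds = [row.get("kind") for row in soup.find_all("scorerow")]

# Summing <scorenum> values, as summer(tree) does:
percentile_total = sum(int(n.text) for n in
                       soup.find("scorelist").find_all("scorenum"))  # 84 + 99 = 183
actual_total = sum(int(n.text) for n in
                   soup.find("scorelist", attrs={"listtype": "scores"})
                       .find_all("scorenum"))                        # 680 + 800 = 1480
```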
Consider the following list of tokens.
tokens = ["is", "the", "college", "board", "the", "board", "of", "college"]
Recall, a uniform language model is one in which each
unique token has the same chance of being sampled.
Suppose we instantiate a uniform language model on tokens.
The probability of the sentence "the college board is", that is, P(\text{the college board is}), is of the form \frac{1}{a^b}, where a and b are both positive integers.

What are a and b?
Answer: a = 5, b = 4

In a uniform language model, each unique token has the same chance of being sampled. Given the list of tokens, there are 5 unique tokens: ["is", "the", "college", "board", "of"]. The probability of sampling any one token is \frac{1}{5}. For a sentence of 4 tokens ("the college board is"), the probability is \frac{1}{5^4}, because each token is sampled independently. Thus, a = 5 and b = 4.

Answer: (c, d) = (2, 9) or (8, 3)

In a unigram language model, the probability of sampling a token is proportional to its frequency in the token list. The frequencies are: "is" = 1, "the" = 2, "college" = 2, "board" = 2, "of" = 1, for a total of 8 tokens. The words of the sentence "the college board is" therefore have probabilities \frac{2}{8}, \frac{2}{8}, \frac{2}{8}, and \frac{1}{8}, respectively. The combined probability is \frac{2}{8} \cdot \frac{2}{8} \cdot \frac{2}{8} \cdot \frac{1}{8} = \frac{8}{4096} = \frac{1}{512} = \frac{1}{2^9} or, equivalently, \frac{1}{8^3}, since 512 = 8^3. Therefore, c = 2 and d = 9, or c = 8 and d = 3, depending on how you represent the fraction.

Answer: Sentence 4

A bigram model looks at the probability of a word given the previous word. Sentence 4, "the college board of college", likely has higher probabilities for its bigrams ("the college", "college board", "board of", "of college") based on the original list of tokens, which contains all of these pairs. This reasoning assumes that the given pairs appear more frequently, or are more probable in sequence, than the pairs in the other sentences.

Answer: Yes

In the context of TF-IDF, if a word appears in every document, its inverse document frequency (IDF) is \log\left(\frac{5}{5}\right) = 0, making the TF-IDF score 0 for that word across all documents. Since "the" appears in all five sentences, its IDF is zero, leading to a column of zeros in the TF-IDF matrix for "the".

Answer: Sentence 4

The word "college" likely has the highest TF-IDF in Sentence 4 because it appears less frequently across all sentences and is relatively more important (i.e., has a higher term frequency) in Sentence 4 than in the other sentences where it appears. TF-IDF rewards words that are distinctive to a document but penalizes those that are common across all documents.

Answer: the smallest

The DF-ITF score is lower for terms that are more unique (appear in fewer documents) and that have a higher count in the document they appear in. A smaller DF-ITF indicates that a term is both important within a specific document and distinctive across the corpus. Therefore, the term with the smallest DF-ITF in a document is considered the best summary for that document, as it balances document-specific significance with corpus-wide uniqueness.
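The probability computations above can be checked with a short sketch (the token list is the one from the problem; "the" appearing in all five documents is taken from the TF-IDF discussion):

```python
import math
from collections import Counter

tokens = ["is", "the", "college", "board", "the", "board", "of", "college"]
sentence = ["the", "college", "board", "is"]

# Uniform model: each unique token is equally likely.
n_unique = len(set(tokens))                  # 5 unique tokens
p_uniform = (1 / n_unique) ** len(sentence)  # 1 / 5**4

# Unigram model: each token's probability is its relative frequency.
counts = Counter(tokens)  # the: 2, college: 2, board: 2, is: 1, of: 1
p_unigram = 1.0
for tok in sentence:
    p_unigram *= counts[tok] / len(tokens)
# (2/8) * (2/8) * (2/8) * (1/8) = 8/4096 = 1/512 = 1/2**9 = 1/8**3

# IDF of a word that appears in all 5 documents (log base doesn't matter):
idf_the = math.log(5 / 5)  # 0.0, hence a column of zeros for "the"
```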
Problem 7.1
In which of the following options is the above statement not guaranteed to be true?

Note: Treat sat as our training set.
Option 1:

a = (sat['Math'] > sat['Verbal']).mean()
b = 0.5

Option 2:

a = (sat['Math'] - sat['Verbal']).mean()
b = 0

Option 3:

a = (sat['Math'] - sat['Verbal'] > 0).mean()
b = 0.5

Option 4:

a = ((sat['Math'] / sat['Verbal']) > 1).mean() - 0.5
b = 0
Option 1
Option 2
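The expressions in the options are tightly related: (sat['Math'] > sat['Verbal']) and (sat['Math'] - sat['Verbal'] > 0) are the same boolean mask, and (as long as Verbal scores are positive) the ratio comparison in Option 4 produces that same mask shifted down by the constant 0.5. A quick sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical miniature version of the sat DataFrame.
sat = pd.DataFrame({"Math": [600, 500, 700], "Verbal": [550, 520, 700]})

a1 = (sat["Math"] > sat["Verbal"]).mean()              # Option 1's a
a3 = (sat["Math"] - sat["Verbal"] > 0).mean()          # Option 3's a: same mask as Option 1
a4 = ((sat["Math"] / sat["Verbal"]) > 1).mean() - 0.5  # Option 4's a: same mask, minus 0.5
```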
"medium"
, or "high"
. Since we can’t use
strings as features in a model, we decide to encode these strings using
the following Pipeline
:
-# Note: The FunctionTransformer is only needed to change the result
-# of the OneHotEncoder from a "sparse" matrix to a regular matrix
-# so that it can be used with StandardScaler;
-# it doesn't change anything mathematically.
-= Pipeline([
- pl "ohe", OneHotEncoder(drop="first")),
- ("ft", FunctionTransformer(lambda X: X.toarray())),
- ("ss", StandardScaler())
- ( ])
# Note: The FunctionTransformer is only needed to change the result
+# of the OneHotEncoder from a "sparse" matrix to a regular matrix
+# so that it can be used with StandardScaler;
+# it doesn't change anything mathematically.
+= Pipeline([
+ pl "ohe", OneHotEncoder(drop="first")),
+ ("ft", FunctionTransformer(lambda X: X.toarray())),
+ ("ss", StandardScaler())
+ ( ])
After calling pl.fit(lunch_props)
,
pl.transform(lunch_props)
evaluates to the following
array:
array([[ 1.29099445, -0.37796447],
       [-0.77459667, -0.37796447],
       [-0.77459667, -0.37796447],
       [-0.77459667,  2.64575131],
       [ 1.29099445, -0.37796447],
       [ 1.29099445, -0.37796447],
       [-0.77459667, -0.37796447],
       [-0.77459667, -0.37796447]])
and pl.named_steps["ohe"].get_feature_names()
evaluates
to the following array:
"x0_low", "x0_med"], dtype=object) array([
"x0_low", "x0_med"], dtype=object) array([
Fill in the blanks: Given the above information, we can conclude that
lunch_props
has (a) value(s) equal to
"low"
, (b) value(s) equal to
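The transformed array pins down the composition of lunch_props: the "x0_low" column standardizes to 1.29099445 for three of the eight rows, and "x0_med" standardizes to 2.64575131 for exactly one row. A hypothetical reconstruction (the actual lunch_props isn't shown, so the category spellings and row order below are assumptions chosen to reproduce the array):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Assumed data: 3 "low", 1 "med", and 4 "high" values.
lunch_props = [["low"], ["high"], ["high"], ["med"],
               ["low"], ["low"], ["high"], ["high"]]

pl = Pipeline([
    ("ohe", OneHotEncoder(drop="first")),                # drops "high", the alphabetically first category
    ("ft", FunctionTransformer(lambda X: X.toarray())),  # sparse -> dense for StandardScaler
    ("ss", StandardScaler()),
])

out = pl.fit_transform(lunch_props)
# out[0] is [ 1.29099445, -0.37796447] and out[3] is [-0.77459667, 2.64575131].
# Newer versions of sklearn spell the feature-name accessor get_feature_names_out().
```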
diff --git a/problems/wi23-final/wi23-final-data-info.md b/problems/wi23-final/wi23-final-data-info.md
index 70e7c97..bd0fa13 100644
--- a/problems/wi23-final/wi23-final-data-info.md
+++ b/problems/wi23-final/wi23-final-data-info.md
@@ -1,4 +1,4 @@
-The DataFrame `sat` contains one row for **most** combinations of `"Year"` and `"State"`, where `"Year"`ranges between `2005` and `2015` and `"State"` is one of the 50 states (not including the District of Columbia).
+The DataFrame `sat` contains one row for **most** combinations of `"Year"` and `"State"`, where `"Year"` ranges between `2005` and `2015` and `"State"` is one of the 50 states (not including the District of Columbia).
The other columns are as follows:
diff --git a/problems/wi23-final/wi23-final-q02.md b/problems/wi23-final/wi23-final-q02.md
index 00d0058..74b6743 100644
--- a/problems/wi23-final/wi23-final-q02.md
+++ b/problems/wi23-final/wi23-final-q02.md
@@ -91,8 +91,9 @@ The DataFrame `scores_2015`, shown in its entirety below, contains the verbal se