Exploring Semantic Spaces - Fundamentals #31
Dependency parsing and word embedding: If I understand correctly, "local context" is defined using the collection of 5-grams, i.e., the sequences of five words that appear frequently enough. I wonder whether dependency parsing can improve word embeddings, since we may have many long sentences in which the subject and the (in)direct objects are far apart. Can we have trans-sentence dependency parsing? This may be useful when we are interested in transitivity in text: if A does something to B in one sentence and B does something to C in another, it would be useful to retain the information that A indirectly affects C. I imagine this could be useful in a history project where one wants to study the factors that indirectly caused, say, the French Revolution.
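As a rough illustration of the chaining idea, here is a minimal sketch using spaCy's dependency parser (assuming `en_core_web_sm` is installed); the sentences and the greedy one-triple-per-sentence extraction are toy assumptions, not a full cross-sentence parser:

```python
# A minimal sketch of chaining dependency relations across sentences.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("France invaded Austria. Austria pressured Prussia.")

# Collect one (subject, verb, object) triple per sentence.
triples = []
for sent in doc.sents:
    subj = verb = obj = None
    for tok in sent:
        if tok.dep_ == "nsubj":
            subj, verb = tok.text, tok.head.text
        elif tok.dep_ == "dobj":
            obj = tok.text
    if subj and verb and obj:
        triples.append((subj, verb, obj))

# Chain triples whose object is another triple's subject:
# if A acts on B and B acts on C, record that A indirectly affects C.
for (a, _, b) in triples:
    for (b2, _, c) in triples:
        if b == b2 and a != c:
            print(f"{a} indirectly affects {c}")
```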
For "Dependency Parsing" This chapter introduces an interesting parsing algorithm called the "Graph-Based Dependency Parsing" (pp.17), which encodes the search space as directed graphs to apply the graph theories. It drives me to think about if we also could apply the vector space methods for the parsing, that is, to increase the dimensions of the graph-based parsing and parse based on the vectors? It may have some difficulties, as additional to what we do in constructing the semantic space, we should also consider the POS of words in sentences, so I'm not sure whether it is feasible. For "Vector Semantics and Embeddings" My question is for the section of "Embeddings and Historical Semantics", on the visualization of Figure 6.14 (pp.24). It is mentioned that:
As the figure put the target words from different time period into one visualized figure, I wonder how to project a word from the historic semantic space into the current semantic space, especially without changing the relative position of the grey context words? It will make sense if the figure is jointed by figures from different time, but it seems not by reading the descriptions (quoted above). Also, should we read Chapter 6, "Vector Semantics and Embeddings", instead of Chapter 15 and 16? Chapter 15 is about dependency parsing, and Chp16 about Logical Representations of sentence meaning, which seem irrelevant to this week's topic. |
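On the alignment question, the usual answer in the diachronic-embedding literature (e.g., Hamilton et al. 2016, whose visualization Figure 6.14 resembles) is to train separate spaces per period and then rotate one onto the other with an orthogonal Procrustes alignment over the shared vocabulary. A minimal numpy sketch, with made-up matrices standing in for real embeddings:

```python
# Orthogonal Procrustes alignment: rotate one embedding space onto
# another without distorting distances. Rows are vectors for the same
# shared vocabulary in the two time periods (toy data here).
import numpy as np

def align(old_emb, new_emb):
    # SVD of the cross-covariance gives the optimal rotation.
    u, _, vt = np.linalg.svd(old_emb.T @ new_emb)
    return old_emb @ (u @ vt)

rng = np.random.default_rng(0)
new = rng.normal(size=(4, 3))            # 4 shared words, 3 dimensions
theta = 0.5
rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                [np.sin(theta),  np.cos(theta), 0],
                [0, 0, 1]])
old = new @ rot                          # "historical" space: a rotated copy
print(np.allclose(align(old, new), new)) # True: spaces now share coordinates
```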
I think word2vec is awesome and I really look forward to trying it out in my final project. I was wondering what a good way would be to deal with a large set of very short documents (perhaps tweets, or even shorter). For example, you can have many documents that are only 4-5 words long, so most of the words do not have enough context words to fill their "window". In addition, what happens to words at the very beginning and end of documents? Do they just use fewer context words?
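On the boundary question: word2vec's window is a maximum, not a requirement, so words near the start or end of a document simply use fewer context words. A minimal gensim sketch (toy corpus; the hyperparameters are illustrative only):

```python
# A minimal sketch, assuming gensim is installed: at document edges the
# effective context window just shrinks, so short texts still train,
# though the resulting vectors may be noisy with this little data.
from gensim.models import Word2Vec

docs = [                                  # tweet-length "documents"
    ["great", "coffee", "this", "morning"],
    ["coffee", "again"],
    ["morning", "run", "then", "coffee"],
]

model = Word2Vec(docs, vector_size=50, window=5,
                 min_count=1, sg=1, epochs=50)
print(model.wv.most_similar("coffee", topn=2))
```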
It is surprising that neural networks can be applied to NLP as well! My question is about the hidden layers. How should we choose them, and how should we interpret them in an NLP model?
I also read Chapter 6, as mentioned by @timqzhang. My first question is similar to @wanitchayap's. In my coding practice, I did not really see the difference between first-order co-occurrence and second-order co-occurrence; I wonder how we could use word2vec to capture this distinction. My second question concerns Figure 6.14 on page 117. Since my project may use this method to study how words change over time, I am wondering how we could build separate embedding spaces with different models and combine them into one figure for visualization.
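On the first question, a toy numpy sketch of the distinction (with made-up counts): first-order association means two words co-occur directly, while second-order association means they occur with similar neighbors. Word2vec similarities behave more like the second-order kind, since the model learns from each word's contexts:

```python
# Hypothetical co-occurrence counts, not word2vec itself.
import numpy as np

words = ["wrote", "said", "book"]
counts = np.array([          # rows: targets; columns: context counts
    [0, 0, 9],               # "wrote" appears next to "book" often
    [0, 0, 8],               # "said" also appears next to "book"
    [9, 8, 0],
])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(counts[0, 1])                  # 0: no first-order association
print(cosine(counts[0], counts[1]))  # ~1.0: strong second-order association
```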
Tuning weights for gradient descent or any other algorithm is often a trial-and-error process for newcomers to neural networks. What rules do senior engineers follow to best adjust the hyperparameters?
I would like to thank @WMhYang for mentioning the methodology in Chapter 6 for investigating dynamic change in word usage, which gives me a hint beyond the dynamic topic modeling we learned last week. For this week's reading, I noticed that it is necessary to generate training data for transition-based dependency parsing. I wonder what the "appropriate" size of the training set is in order to obtain a reliable model. Would the algorithm be robust if we could not provide enough data?
Related to @harryx113's question: Chapter 7 mentions that we use an optimization algorithm like gradient descent to train our neural network. In a previous attempt at homework for another class, I basically tried different optimization algorithms and chose the one that gave me the most 'appropriate' result. I know this is probably not the ideal way to train a model. What factors should we consider when picking the best optimization algorithm for a neural network?
Chapter 7 discussed neural networks and deep learning specifically. From what I understand, we have little insight into how the hidden layers of a neural network interact with one another to produce the result; we basically only see the input and the output. I am wondering a bit more about all of those layers and exactly how they interact to produce accurate predictions.
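To make the hidden layer concrete, here is a tiny numpy forward pass with random stand-in weights (not a trained model): the intermediate activations are perfectly observable numbers; the interpretability difficulty is that individual hidden units carry no fixed human-readable label:

```python
# A one-hidden-layer forward pass with random placeholder weights.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)            # input features
W1 = rng.normal(size=(4, 3))      # input -> hidden weights
W2 = rng.normal(size=(3, 1))      # hidden -> output weights

hidden = np.maximum(0, x @ W1)    # ReLU hidden layer: fully inspectable...
output = hidden @ W2              # ...but its units have no fixed meaning
print("hidden activations:", hidden)
print("output:", output)
```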
It is interesting to learn more about word2vec through this approach. I am wondering whether we can use this model for topic extraction and for training a classifier along with other features. Also, what are the similarities and differences between this representation and the hierarchical clustering mentioned in the article?
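One common way word2vec feeds topic-like extraction is to cluster the learned vectors; the sketch below uses k-means over a toy corpus (corpus, dimensions, and cluster count are all illustrative). Hierarchical clustering, by contrast, would build a tree over the same vectors rather than a flat partition:

```python
# Cluster word2vec vectors into crude "topics" with k-means.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

sentences = [["cat", "dog", "pet"], ["dog", "cat", "vet"],
             ["stock", "market", "trade"], ["market", "stock", "price"]]
model = Word2Vec(sentences, vector_size=20, min_count=1, epochs=100, seed=0)

vocab = model.wv.index_to_key
vectors = np.array([model.wv[w] for w in vocab])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for k in range(2):
    print(k, [w for w, lab in zip(vocab, kmeans.labels_) if lab == k])
```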
Is there a rule of thumb for choosing between word2vec's two architectures, CBOW and skip-gram? For instance, would some kinds of corpora suit skip-gram better? Also, can we talk more about the details of de-biasing? It is mentioned in both readings, but many details are lacking.
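Both architectures are exposed in gensim through a single flag, so comparing them on your own corpus is cheap; the word2vec authors' rough guidance is that skip-gram does better on small corpora and rare words, while CBOW is faster and works well on frequent words. A minimal sketch with toy sentences:

```python
# CBOW vs. skip-gram in gensim: only the sg flag differs.
from gensim.models import Word2Vec

sentences = [["the", "king", "rules"], ["the", "queen", "rules"]]

cbow = Word2Vec(sentences, sg=0, vector_size=50, min_count=1)      # CBOW
skipgram = Word2Vec(sentences, sg=1, vector_size=50, min_count=1)  # skip-gram
print(cbow.wv.similarity("king", "queen"),
      skipgram.wv.similarity("king", "queen"))
```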
In the GloVe word embedding model, focusing on word co-occurrence probabilities might lead to bias. For example, if the corpus of analysis were feminist literature, the algorithm might find similar patterns of gender bias by analyzing co-occurrence. Except in this case, the context matters, as these co-occurrences are likely due to a constant and comprehensive critique of the bias itself. Is there a way to modify the algorithm to prevent or correct for this bias?
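This does not settle the context-sensitivity concern, but for reference, one published correction is the projection step of "hard debiasing" (Bolukbasi et al. 2016): estimate a bias direction and subtract each vector's component along it. A sketch with hypothetical vectors:

```python
# Remove a vector's component along an estimated bias direction.
import numpy as np

def remove_component(v, direction):
    direction = direction / np.linalg.norm(direction)
    return v - (v @ direction) * direction

bias_dir = np.array([1.0, 0.0, 0.0])    # e.g., a normalized he - she axis
profession = np.array([0.6, 0.3, 0.7])  # hypothetical "nurse" vector

debiased = remove_component(profession, bias_dir)
print(debiased @ bias_dir)  # 0.0: no remaining component along the bias axis
```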
Word embedding is a feasible tool that is not hard to train and contains rich meaning. As a response to @nwrim: training word embeddings on short texts can be tricky because the results can be unstable, and I think initializing with pre-trained vectors such as the Google News embeddings could help your training converge quickly and also help you compare changes in embeddings. My question is also about the stability of embeddings: is there any common validation method used in empirical studies to show that an embedding actually makes sense?
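One common intrinsic validation is to correlate model similarities with human similarity judgments such as WordSim-353; gensim bundles that evaluation file with its test utilities. A sketch using a small pretrained model from gensim's downloader (requires network access for the first download):

```python
# Correlate model similarities with human judgments (WordSim-353).
import gensim.downloader as api
from gensim.test.utils import datapath

wv = api.load("glove-wiki-gigaword-50")   # small pretrained vectors
pearson, spearman, oov = wv.evaluate_word_pairs(datapath("wordsim353.tsv"))
print(f"Spearman with human judgments: {spearman[0]:.2f}, OOV: {oov:.0f}%")
```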
Post questions here for one or more of our fundamentals readings:
Jurafsky, Daniel and James H. Martin. 2015. Speech and Language Processing. Chapters 15-16 (“Vector Semantics”, “Semantics with Dense Vectors”)