incorrect dimension with HybridTupleEmbedding #3

Open
yefanTao opened this issue Dec 1, 2022 · 0 comments
yefanTao commented Dec 1, 2022

HybridTupleEmbedding produces a dimension mismatch when it trains the CTT model.
I think the cause is that HybridTupleEmbedding uses autoencoder_embedding_model for tuple embedding, so at line 171 of tuple_embedding_models.py the embedding matrix has the size of hidden_dimensions (150 by default). The trainer defined at line 311, however, still sets the CTT model's input size to input_dimension (300 by default).

The data was downloaded from https://pages.cs.wisc.edu/~anhai/data1/deepmatcher_data/Textual/Abt-Buy/exp_data/. Use the code below to reproduce the error:

import pandas as pd
from deep_blocker import DeepBlocker
from tuple_embedding_models import AutoEncoderTupleEmbedding, CTTTupleEmbedding, HybridTupleEmbedding
from vector_pairing_models import ExactTopKVectorPairing
import blocking_utils

cols_to_block = ['name', 'description', 'price']

left_df = pd.read_csv("~/public_data/anhai/Textual/Abt-Buy/tableA.csv")
right_df = pd.read_csv("~/public_data/anhai/Textual/Abt-Buy/tableB.csv")

tuple_embedding_model = HybridTupleEmbedding()
topK_vector_pairing_model = ExactTopKVectorPairing(K=20)
db = DeepBlocker(tuple_embedding_model, topK_vector_pairing_model)

candidate_set_df = db.block_datasets(left_df, right_df, cols_to_block)
golden_df = pd.read_csv("~/public_data/anhai/Textual/Abt-Buy/test.csv")
golden_df = golden_df[golden_df['label']==1]
print(blocking_utils.compute_blocking_statistics(candidate_set_df, golden_df, left_df, right_df))
print(candidate_set_df.shape)

Error:

RuntimeError                              Traceback (most recent call last)
Cell In [2], line 15
     12 topK_vector_pairing_model = ExactTopKVectorPairing(K=20)
     13 db = DeepBlocker(tuple_embedding_model, topK_vector_pairing_model)
---> 15 candidate_set_df = db.block_datasets(left_df, right_df, cols_to_block)
     16 golden_df = pd.read_csv("/mnt/efs-write/share/public_data/anhai/Textual/Abt-Buy/test.csv")
     17 golden_df = golden_df[golden_df['label']==1]

File ~/blocking/DeepBlocker/deep_blocker.py:58, in DeepBlocker.block_datasets(self, left_df, right_df, cols_to_block)
     56 print("Performing pre-processing for tuple embeddings ")
     57 all_merged_text = pd.concat([self.left_df["_merged_text"], self.right_df["_merged_text"]], ignore_index=True)
---> 58 self.tuple_embedding_model.preprocess(all_merged_text)
     60 print("Obtaining tuple embeddings for left table")
     61 self.left_tuple_embeddings = self.tuple_embedding_model.get_tuple_embedding(self.left_df["_merged_text"])

File ~/blocking/DeepBlocker/tuple_embedding_models.py:314, in HybridTupleEmbedding.preprocess(self, list_of_tuples)
    312 trainer = dl_models.CTTModelTrainer (self.input_dimension, self.hidden_dimensions)
    313 #trainer = dl_models.CTTModelTrainer (self.hidden_dimensions[-1], self.hidden_dimensions)
--> 314 self.ctt_model = trainer.train(self.left_embedding_matrix, self.right_embedding_matrix, self.label_list,
    315         num_epochs=NUM_EPOCHS, batch_size=BATCH_SIZE)

File ~/blocking/DeepBlocker/dl_models.py:168, in CTTModelTrainer.train(self, left_embedding_matrix, right_embedding_matrix, labels, num_epochs, batch_size)
    166 label = label.to(self.device)
    167 optimizer.zero_grad()
--> 168 output = self.model(left, right)
    169 loss = loss_function(output, label)
    170 loss.backward()
RuntimeError: mat1 and mat2 shapes cannot be multiplied (256x150 and 300x300)

To fix it, I changed line 311 in tuple_embedding_models.py from
trainer = dl_models.CTTModelTrainer (self.input_dimension, self.hidden_dimensions)
to
trainer = dl_models.CTTModelTrainer (self.hidden_dimensions[-1], self.hidden_dimensions)

With this change the code runs, but the resulting network structure might not be optimal. Let me know if I got anything wrong.
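
The shape arithmetic behind the error and the fix can be sketched without any framework. This is only an illustration under stated assumptions: it assumes the CTT model starts with a linear layer of shape (input dimension x hidden_dimensions[0]), and hidden_dimensions = (300, 150) is a hypothetical value chosen because it matches the shapes in the traceback. The helper names first_layer_shape and can_multiply are made up for this sketch and are not part of DeepBlocker.

```python
# Framework-free sketch of the shape check behind the RuntimeError.
# Assumption: the first CTT layer has shape (input_dimension x hidden_dimensions[0]).
# HIDDEN_DIMENSIONS below is hypothetical, picked to match the traceback shapes.


def first_layer_shape(input_dimension, hidden_dimensions):
    """Return (rows, cols) of the first weight matrix, as the error reports it."""
    return (input_dimension, hidden_dimensions[0])


def can_multiply(mat1_shape, mat2_shape):
    """An (n x k) matrix can be multiplied by a (k2 x m) matrix only if k == k2."""
    return mat1_shape[1] == mat2_shape[0]


BATCH_SIZE = 256                # rows of mat1 in the traceback
EMBEDDING_DIM = 150             # width of the autoencoder embeddings
INPUT_DIMENSION = 300           # default input_dimension, per the issue
HIDDEN_DIMENSIONS = (300, 150)  # hypothetical, see the assumption above

batch_shape = (BATCH_SIZE, EMBEDDING_DIM)

# Original call: CTTModelTrainer(self.input_dimension, self.hidden_dimensions)
original = first_layer_shape(INPUT_DIMENSION, HIDDEN_DIMENSIONS)

# Patched call: CTTModelTrainer(self.hidden_dimensions[-1], self.hidden_dimensions)
patched = first_layer_shape(HIDDEN_DIMENSIONS[-1], HIDDEN_DIMENSIONS)

print(batch_shape, original, can_multiply(batch_shape, original))
print(batch_shape, patched, can_multiply(batch_shape, patched))
```

Under these assumptions the original configuration pairs a 256x150 batch with a 300x300 weight, which cannot be multiplied, while the patched one pairs it with a 150x300 weight, which can. The patched first layer would then widen the 150-dimensional input to 300 before narrowing again, which may be why the resulting structure is not necessarily optimal.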
