incorrect dimension with HybridTupleEmbedding #3

yefanTao · 2022-12-01T23:57:35Z

The dimension is incorrect when HybridTupleEmbedding is used with the CTT model.
I think this is because HybridTupleEmbedding uses autoencoder_embedding_model for the tuple embedding, so at line 171 of tuple_embedding_models.py the embedding_matrix has hidden_dimensions columns (150 by default). However, the trainer defined at line 311 still sets the CTT model's input size to input_dimension (300 by default).
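
The mismatch can be checked in isolation with plain shape arithmetic. This is only a sketch, not DeepBlocker code: `matmul_shape` is a hypothetical helper, and the sizes are the ones reported in the RuntimeError in this issue (batch of 256, 150-wide embeddings, a first layer built for 300-wide inputs).

```python
def matmul_shape(left, right):
    """Return the shape of a 2-D matrix product, or raise if the inner sizes differ."""
    rows, inner_left = left
    inner_right, cols = right
    if inner_left != inner_right:
        raise ValueError(
            "mat1 and mat2 shapes cannot be multiplied "
            f"({rows}x{inner_left} and {inner_right}x{cols})"
        )
    return (rows, cols)


BATCH_SIZE = 256      # rows per batch, as in the error message
AE_OUTPUT_DIM = 150   # width of the autoencoder embeddings (hidden dimension)
CTT_INPUT_DIM = 300   # input width the CTT trainer was constructed with

# A first layer built for 300-wide inputs cannot consume 150-wide embeddings.
try:
    matmul_shape((BATCH_SIZE, AE_OUTPUT_DIM), (CTT_INPUT_DIM, 300))
except ValueError as err:
    mismatch_message = str(err)

# Building the first layer for 150-wide inputs makes the product well defined.
fixed_shape = matmul_shape((BATCH_SIZE, AE_OUTPUT_DIM), (AE_OUTPUT_DIM, 300))
```

The failing case reproduces the wording of the reported error, which suggests the CTT model's first layer is being fed autoencoder outputs rather than raw 300-wide embeddings.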

The data was downloaded from https://pages.cs.wisc.edu/~anhai/data1/deepmatcher_data/Textual/Abt-Buy/exp_data/. Use the code below to reproduce the error:

```python
import pandas as pd
from deep_blocker import DeepBlocker
from tuple_embedding_models import AutoEncoderTupleEmbedding, CTTTupleEmbedding, HybridTupleEmbedding
from vector_pairing_models import ExactTopKVectorPairing
import blocking_utils
cols_to_block = ['name', 'description', 'price']

left_df = pd.read_csv("~/public_data/anhai/Textual/Abt-Buy/tableA.csv")
right_df = pd.read_csv("~/public_data/anhai/Textual/Abt-Buy/tableB.csv")

tuple_embedding_model = HybridTupleEmbedding()
topK_vector_pairing_model = ExactTopKVectorPairing(K=20)
db = DeepBlocker(tuple_embedding_model, topK_vector_pairing_model)

candidate_set_df = db.block_datasets(left_df, right_df, cols_to_block)
golden_df = pd.read_csv("~/public_data/anhai/Textual/Abt-Buy/test.csv")
golden_df = golden_df[golden_df['label'] == 1]
print(blocking_utils.compute_blocking_statistics(candidate_set_df, golden_df, left_df, right_df))
print(candidate_set_df.shape)
```

Error:

```
RuntimeError                              Traceback (most recent call last)
Cell In [2], line 15
     12 topK_vector_pairing_model = ExactTopKVectorPairing(K=20)
     13 db = DeepBlocker(tuple_embedding_model, topK_vector_pairing_model)
---> 15 candidate_set_df = db.block_datasets(left_df, right_df, cols_to_block)
     16 golden_df = pd.read_csv("/mnt/efs-write/share/public_data/anhai/Textual/Abt-Buy/test.csv")
     17 golden_df = golden_df[golden_df['label']==1]

File ~/blocking/DeepBlocker/deep_blocker.py:58, in DeepBlocker.block_datasets(self, left_df, right_df, cols_to_block)
     56 print("Performing pre-processing for tuple embeddings ")
     57 all_merged_text = pd.concat([self.left_df["_merged_text"], self.right_df["_merged_text"]], ignore_index=True)
---> 58 self.tuple_embedding_model.preprocess(all_merged_text)
     60 print("Obtaining tuple embeddings for left table")
     61 self.left_tuple_embeddings = self.tuple_embedding_model.get_tuple_embedding(self.left_df["_merged_text"])

File ~/blocking/DeepBlocker/tuple_embedding_models.py:314, in HybridTupleEmbedding.preprocess(self, list_of_tuples)
    312 trainer = dl_models.CTTModelTrainer (self.input_dimension, self.hidden_dimensions)
    313 #trainer = dl_models.CTTModelTrainer (self.hidden_dimensions[-1], self.hidden_dimensions)
--> 314 self.ctt_model = trainer.train(self.left_embedding_matrix, self.right_embedding_matrix, self.label_list,
    315         num_epochs=NUM_EPOCHS, batch_size=BATCH_SIZE)

File ~/blocking/DeepBlocker/dl_models.py:168, in CTTModelTrainer.train(self, left_embedding_matrix, right_embedding_matrix, labels, num_epochs, batch_size)
    166 label = label.to(self.device)
    167 optimizer.zero_grad()
--> 168 output = self.model(left, right)
    169 loss = loss_function(output, label)
    170 loss.backward()
RuntimeError: mat1 and mat2 shapes cannot be multiplied (256x150 and 300x300)
```

To fix it, I changed line 311 in tuple_embedding_models.py from
trainer = dl_models.CTTModelTrainer (self.input_dimension, self.hidden_dimensions)
to
trainer = dl_models.CTTModelTrainer (self.hidden_dimensions[-1], self.hidden_dimensions)

With this change the code runs, but the resulting network structure might not be optimal. Let me know if I got anything wrong.
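
To see what the change does to the layer sizes, here is a rough sketch. It assumes the CTT model stacks fully connected layers sized input, then each entry of hidden_dimensions, which is a guess about dl_models.py and not something shown in this issue. `ctt_layer_shapes` is a hypothetical helper, and the tuple (300, 150) is an illustrative value chosen to match the 300x300 weight in the error message and the 150 default mentioned above.

```python
def ctt_layer_shapes(input_dimension, hidden_dimensions):
    """List (in_features, out_features) pairs for a stack of fully connected layers."""
    sizes = [input_dimension, *hidden_dimensions]
    return list(zip(sizes[:-1], sizes[1:]))


INPUT_DIMENSION = 300            # raw embedding width (default, per this issue)
HIDDEN_DIMENSIONS = (300, 150)   # assumed for illustration; last entry is the autoencoder output width

# Original call: the first layer expects 300-wide inputs but receives 150-wide ones.
original = ctt_layer_shapes(INPUT_DIMENSION, HIDDEN_DIMENSIONS)

# Patched call: the first layer matches the 150-wide autoencoder output.
patched = ctt_layer_shapes(HIDDEN_DIMENSIONS[-1], HIDDEN_DIMENSIONS)
```

Under these assumptions the patched network widens the 150-wide embeddings to 300 and then narrows them back to 150, which may be why the resulting structure is not ideal.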
