The select_best_length logic in build_input.py appears to be wrong #2

Open
kknd21988 opened this issue Feb 20, 2019 · 0 comments

# Original max_length selection method; the logic is flawed
# for i in len_dict:
#     rate = i[1] / all_sent
#     cover_rate += rate
#     if cover_rate >= limit_ratio:
#         max_length = i[0]
#         break

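Analysis: len_dict is a list of sentence-length frequency counts, e.g. [(15, 3700), (12, 2800), (8, 500), ..., (20, 30)], where each element is (sentence length, frequency). Judging by this example, the list is ordered by frequency rather than by length.
With the logic above, once 3700 + 2800 + 500 exceeds 95% of the total count, max_length is set to 8. That is the bug: the sentences of length 12 and 15 that were counted toward the coverage are longer than 8, so a max_length of 8 does not actually cover them.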
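To make the failure concrete, here is a small reproduction. It is only a sketch: the list keeps just the four explicit entries from the example above (the elided middle entries are left out), and `original_select` and `true_coverage` are hypothetical helper names, not functions from build_input.py.

```python
# Hypothetical data: only the four explicit entries from the example,
# kept in frequency order as in the issue.
len_dict = [(15, 3700), (12, 2800), (8, 500), (20, 30)]
all_sent = sum(count for _, count in len_dict)
limit_ratio = 0.95


def original_select(len_dict, all_sent, limit_ratio):
    """Mirror of the commented-out loop: accumulate in list order."""
    cover_rate = 0.0
    max_length = None
    for length, count in len_dict:
        cover_rate += count / all_sent
        if cover_rate >= limit_ratio:
            max_length = length
            break
    return max_length


def true_coverage(len_dict, all_sent, max_length):
    """Fraction of sentences whose length is <= max_length."""
    return sum(c for length, c in len_dict if length <= max_length) / all_sent


chosen = original_select(len_dict, all_sent, limit_ratio)
actual = true_coverage(len_dict, all_sent, chosen)
print(chosen, round(actual, 4))
```

On this data the loop stops at length 8, while only about 7% of the sentences are that short, far from the intended 95%.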

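Suggested fix: sort len_dict by sentence length in ascending order, then accumulate coverage from the shortest length upward until it reaches limit_ratio: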

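# Sort by sentence length (ascending) so coverage accumulates from the shortest sentences up
temp = sorted(len_dict, key=lambda x: x[0])
for i in temp:
    rate = i[1] / all_sent  # fraction of sentences with length i[0]
    cover_rate += rate
    if cover_rate >= limit_ratio:
        max_length = i[0]  # shortest length that covers at least limit_ratio of sentences
        break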
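For anyone who wants to try the fix in isolation, the same idea can be wrapped in a self-contained function. This is a sketch under assumptions: `pick_max_length` is a hypothetical name and signature (the real select_best_length in build_input.py may take different arguments), and it compares integer counts against `limit_ratio * total` instead of summing float ratios, so that `limit_ratio=1.0` still terminates on the longest length.

```python
def pick_max_length(len_dict, limit_ratio=0.95):
    """Return the smallest length L such that sentences with
    length <= L make up at least limit_ratio of all sentences.

    len_dict: list of (sentence_length, frequency) pairs in any order.
    Hypothetical helper, not the repository's own function.
    """
    total = sum(count for _, count in len_dict)
    if total == 0:
        raise ValueError("len_dict contains no sentences")

    covered = 0
    # Walk lengths from shortest to longest, accumulating raw counts.
    for length, count in sorted(len_dict, key=lambda pair: pair[0]):
        covered += count
        if covered >= limit_ratio * total:
            return length

    # Only reached if limit_ratio > 1; fall back to the longest length.
    return max(length for length, _ in len_dict)


# Hypothetical data: the four explicit entries from the issue's example.
sample = [(15, 3700), (12, 2800), (8, 500), (20, 30)]
print(pick_max_length(sample, 0.95))
print(pick_max_length(sample, 1.0))
```

With the sample data, a 95% limit yields 15 (lengths 8, 12 and 15 together cover 7000 of 7030 sentences), and a 100% limit yields 20.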