Detruncate the sentences #3

Open
Mte90 opened this issue Mar 21, 2020 · 2 comments

@Mte90

Mte90 commented Mar 21, 2020

I downloaded a few books, but I see that a sentence is often split across several lines.
So I was wondering if there is a way to rebuild the original sentence, so that each sentence starts with an uppercase letter and ends with a period (or a question mark, etc.).
Example:

Savio di terraferma alla scrittura e le magistrature
Le armi nel loro complesso, il governo ed
il riparto difensivo e territoriale.

to

Savio di terraferma alla scrittura e le magistrature
Le armi nel loro complesso, il governo ed il riparto difensivo e territoriale. 
@coreybobco

I had forked this repo to change that, along with one other thing I needed for my use case (making the replacement of deletions with "[deleted]" optional). I have just made a pull request with both changes. My fork is currently available here: github.com/coreybobco/gutenberg_cleaner

@Mte90
Author

Mte90 commented Mar 22, 2020

I saw the PR, but I think it will generate text like:

Savio di terraferma alla scrittura e le magistrature Le armi nel loro complesso, il governo ed il riparto difensivo e territoriale. 

Instead, a line that starts with an uppercase letter should stay on its own line. I was trying to write an algorithm for that, but I think it is possible to do it with a regex.
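For illustration, here is a minimal sketch of what such a regex could look like in Python. It is not code from gutenberg_cleaner or from the pull request; the function name `detruncate` is made up for this example. It assumes that a line break should be removed only when the previous line does not end in `.`, `!`, or `?` and the next line starts with a lowercase letter. A sentence that wraps right before a capitalized word (a proper noun, for instance) would be left split, so this is a heuristic rather than a complete solution.

```python
import re

# A line break (with optional surrounding spaces/tabs) that is:
#   - not preceded by sentence-ending punctuation or whitespace, and
#   - followed by a letter, captured so its case can be inspected.
# [^\W\d_] matches any Unicode letter, so accented letters are covered.
_LINE_BREAK = re.compile(r"(?<![.!?\s])[ \t]*\n[ \t]*(?=([^\W\d_]))")


def detruncate(text: str) -> str:
    """Join a line to the previous one when it starts with a lowercase letter.

    Hypothetical helper for illustration only; not part of gutenberg_cleaner.
    """

    def join_if_lowercase(match: re.Match) -> str:
        # Lowercase continuation: replace the break with a single space.
        # Otherwise keep the break exactly as it was.
        return " " if match.group(1).islower() else match.group(0)

    return _LINE_BREAK.sub(join_if_lowercase, text)


if __name__ == "__main__":
    sample = (
        "Savio di terraferma alla scrittura e le magistrature\n"
        "Le armi nel loro complesso, il governo ed\n"
        "il riparto difensivo e territoriale."
    )
    print(detruncate(sample))
```

On the example from the first comment, this keeps the line starting with "Le armi" on its own line and joins only the "il riparto ..." continuation onto it. Blank lines between paragraphs are left untouched, because the pattern requires a letter immediately after the break.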
