Provide the possibility to split data into activity-based equally-sized windows #49

Closed
bockthom opened this issue Jul 28, 2017 · 2 comments

Comments

@bockthom
Collaborator

bockthom commented Jul 28, 2017

Currently, we can split networks activity-based either by specifying the number of edges per network or by specifying the number of windows.

However, we do not have this possibility for data-based splitting: there, we can only specify the number of commits or e-mails, but not the number of windows.

So, I suggest implementing a function that computes the activity amount from the number of windows wanted. Example:

## Compute the activity amount per window for the wanted number of windows.
get.size.of.equally.sized.windows <- function(input.size, number.windows.wanted) {
  size <- ceiling(input.size / number.windows.wanted)
  return(size)
}

In the case of activity-based network splitting, input.size is the overall number of edges.
In the case of activity-based data splitting, input.size is the overall number of commits or e-mails.

So, both functions split.data.activity.based and split.network.activity.based should provide a parameter number.windows, and both should call the function get.size.of.equally.sized.windows defined above when number.windows is given.
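
To illustrate the idea, here is a minimal sketch of how a split function could derive the activity amount from number.windows. The function split.activity.sketch and its parameter items are hypothetical stand-ins for illustration only; they are not the actual signatures of split.data.activity.based or split.network.activity.based.

```r
## Helper as proposed in this issue.
get.size.of.equally.sized.windows <- function(input.size, number.windows.wanted) {
    return(ceiling(input.size / number.windows.wanted))
}

## Hypothetical stand-in for the activity-based split functions: 'items' is any
## vector of activities (commits, e-mails, or edges) in chronological order.
split.activity.sketch <- function(items, activity.amount = NULL, number.windows = NULL) {
    ## if the number of windows is given, derive the activity amount from it
    if (!is.null(number.windows)) {
        activity.amount <- get.size.of.equally.sized.windows(length(items), number.windows)
    }
    stopifnot(!is.null(activity.amount))

    ## assign each item to a window of 'activity.amount' consecutive items
    window.ids <- ceiling(seq_along(items) / activity.amount)
    return(unname(split(items, window.ids)))
}

## 10 activities, 3 windows wanted: the window size is ceiling(10 / 3) = 4
windows <- split.activity.sketch(1:10, number.windows = 3)
window.sizes <- lengths(windows)
```

Because the size is rounded up, the last window may be smaller than the others, and in some cases fewer windows than requested result (for example, 9 activities and 4 windows wanted give a size of 3 and thus only 3 windows).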

[In addition, one could think of providing a function for determining equally sized windows also for time-based splitting. In that case, there is no difference between network-based and data-based time-based splitting -- we only need the very first and very last date in the data source to determine a time period for equally-sized windows given by the number of windows wanted. However, this will only make sense after #38 is closed.]

@clhunsen
Collaborator

For time-based splitting, we just need the code at the following line to use the argument length.out instead of by: https://github.com/se-passau/codeface-extraction-r/blob/v2.2/util-split.R#L539.
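
A minimal sketch of the length.out idea in base R, assuming the linked code builds its bins with seq. The dates and variable names below are made up for illustration; in practice, the start and end would be the very first and very last date in the data source.

```r
## Hypothetical date range (assumption for illustration only).
start.date <- as.POSIXct("2017-01-01 00:00:00", tz = "UTC")
end.date <- as.POSIXct("2017-01-31 00:00:00", tz = "UTC")
number.windows <- 3

## n windows need n + 1 bin boundaries; 'length.out' replaces 'by' here
bins <- seq(start.date, end.date, length.out = number.windows + 1)

## length of each resulting window in days
window.lengths <- as.numeric(diff(bins), units = "days")
```

With length.out, the first and last boundary coincide with the start and end date, so all windows have the same length and none is cut short.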

@clhunsen
Collaborator

clhunsen commented Feb 23, 2018

With the changes in PR #96, the section is now here: https://github.com/clhunsen/codeface-extraction-r/blob/d421807df819ca2173421401dc9a198954b68774/util-split.R#L773

Also, the specific change needs to take place here: https://github.com/clhunsen/codeface-extraction-r/blob/d421807df819ca2173421401dc9a198954b68774/util-misc.R#L174
Just compute a lubridate::period from start and end date, then divide it by the number of windows (which becomes another parameter).
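
Roughly, the computation could look like the following sketch. As an assumption to keep it dependency-free, it uses base R difftime as a stand-in for lubridate::period; the function name get.time.period.for.windows and the dates are hypothetical.

```r
## Hypothetical helper: length of one window in seconds, given the date range
## and the number of windows (base R 'difftime' instead of a lubridate period).
get.time.period.for.windows <- function(start.date, end.date, number.windows) {
    total.secs <- as.numeric(difftime(end.date, start.date, units = "secs"))
    return(total.secs / number.windows)
}

## Made-up date range spanning 28 days (assumption for illustration only).
start.date <- as.POSIXct("2018-01-01 00:00:00", tz = "UTC")
end.date <- as.POSIXct("2018-01-29 00:00:00", tz = "UTC")

## 28 days split into 4 windows: 7 days (604800 seconds) per window
period.secs <- get.time.period.for.windows(start.date, end.date, 4)

## the resulting period can then be passed on as 'by' (numeric 'by' is in seconds)
bins <- seq(start.date, end.date, by = period.secs)
```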

@bockthom bockthom modified the milestones: v3.1, v3.2 Feb 27, 2018
@clhunsen clhunsen modified the milestones: v3.2, v3.3 May 2, 2018
@clhunsen clhunsen modified the milestones: v3.3, Future, v3.4 Aug 8, 2018
@bockthom bockthom mentioned this issue Oct 22, 2018
fehnkera pushed a commit to fehnkera/coronet that referenced this issue Sep 23, 2020
Currently, we have the possibility to split data in a time-based manner
by specifying a time period or specific bins. However, we do not have
the possibility to specify the number of windows. With this commit, we
add this functionality.

To implement this functionality, the functions 'split.data.time.based'
and 'split.get.bins.time.based' both get a new parameter
'number.windows' and the function 'generate.date.sequence' a parameter
'length.out'.

Additionally, adjust function documentation appropriately.

This fixes se-sic#49.

Signed-off-by: Claus Hunsen <[email protected]>
fehnkera pushed a commit to fehnkera/coronet that referenced this issue Sep 23, 2020
Currently, we have the possibility to split data in a time-based manner
by specifying a time period, specific bins, or the number of resulting
time windows. However, we do not have the latter possibility for network
splitting. With this commit, we add this functionality.

To implement this functionality, the functions
'split.network.time.based' and 'split.networks.time.based' both get a
new parameter to handle this functionality. Additionally, streamline
other 'split.*' functions for uniformity regarding the 'number.windows'
and 'sliding.window' parameters.

Finally, adjust function documentation appropriately.

This is a follow-up for commit 40974ba.
Hopefully, this really fixes se-sic#49.
Thanks to @bockthom for pointing this out in his review on PR se-sic#140.

Signed-off-by: Claus Hunsen <[email protected]>