Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change commit filtering and network building regarding the untracked files and base artifact #149

Merged
merged 31 commits into from
Jan 15, 2019
Merged

Conversation

jkronaw
Copy link

@jkronaw jkronaw commented Dec 10, 2018

Prerequisites

  • I adhere to the coding conventions (described here) in my code.
  • I have updated the copyright headers of the files I have modified.
  • I have written appropriate commit messages, i.e., I have recorded the goal, the need, the needed changes, and the location of my code modifications for each commit. This includes also, e.g., referencing to relevant issues.
  • I have put signed-off tags in all commits.
  • I have updated the changelog file NEWS.md appropriately.
  • The pull request is opened against the branch dev.

Description

See changelog.

Changelog

  • In addition to the ProjectConf parameter artifact.filter.base, which configured wether the base artifact should be
    included in the get.commits.filtered method, there is now the similiar parameter filter.untracked.files which does
    the same thing for untracked files (11428d9)
  • Edges are not being constructed in the author network between authors that only modify untracked files. For authors
    it can be configured if the edges should be created or not using the new NetworkConf parameter base.artifact.edges
    (c60c2f6)
  • Rename empty artifact to untracked files

https://github.com/SCPhantom/codeface-extraction-r/pull/new/jakob-updates

Jakob Kronawitter added 10 commits November 29, 2018 17:51
The get.commits.raw function was removed. Instead, the function get.commits
should be used from now on.

Signed-off-by: Jakob Kronawitter <[email protected]>
The artifact kind filtering which filters the commits.list file and only keeps
the commits which have the correct artifact.type (configured in the ProjectConf
class) has been moved to the get.commits method of the ProjectData class.
Previously this functionality was in the get.commits.filtered method.

Signed-off-by: Jakob Kronawitter <[email protected]>
In the case of a valid commits.list file with at least one commit line the
read.commits function returns a data.frame with 16 columns containing all the
commits read from the file. If the commits.list file is empty, however, it
previously returned an empty data.frame with no columns. This has now been
adjusted to return an empty data.frame with all the columns (16 columns), which
should save a lot of additional if-else case distinctions later on because now
the shape of the returned data.frame by the read.commits function is always the
same.

Signed-off-by: Jakob Kronawitter <[email protected]>
This major commit merges the two old methods get.commits.filtered and
get.commits.filtered.empty of the ProjectData class into one new method again
called get.commits.filtered. Similiarly, the filter.commits.empty and
filter.commits methods were merged into one new filter.commits method which now
takes filter.untracked.files and artifact.filter.base as paramaters which then
control how the filtering is performed.

The filter.untracked.files parameter was added to the ProjectConf which now
controls - just like the artifact.filter.base parameter - which commits should
be filtered out when calling the get.commits.filtered method.

If you want to call the get.commits.filtered with other paramaters (not the ones
that are configured in the ProjectConf) then one can call the
get.commits.filtered.uncached version of this method. As the name implies, this
method is not taking advantage of caching and should thus not be used too often.

In the course of revamping these methods it only took a minor effort to rename
the empty artifact to a more speaking identifier, namely, "untracked files".
Thus, this renaming was also performed in this commit.

Signed-off-by: Jakob Kronawitter <[email protected]>
The new get.commits method includes filtering by artifact kind. Two testcases
depended on this and thus have now been adjusted accordingly. 10 test cases of
the test-split.R are still not working.

Signed-off-by: Jakob Kronawitter <[email protected]>
The test cases were adapted to two of the new changes in the network library.
The first one is the fact that the get.commits method now removes either
'Feature' or 'FeatureExpression' commits. The second one was the change that
there are no dummy data.frames anymore (with zero columns and rows). Instead
there are empty data.frames when there no data exists (with columns but zero
rows). One mistake was made during creation of these. The empty data.frames
that are created did not contain any data type informtion (all columns defaulted
to the 'logical' data.type). If this is not wanted there now exists a new helper
method which also takes care of data types.

Signed-off-by: Jakob Kronawitter <[email protected]>
Previously, when an author network was created and the untracked files artifact
and the base artifact were included, edges have been created among the untracked
files artifact and among the base artifact. This was now changed so that there
are no edges created among untracked files at any time. For the base artifact it
can be configured via the new base.artifact.edges parameter in the NetworkConf.

Signed-off-by: Jakob Kronawitter <[email protected]>
Signed-off-by: Jakob Kronawitter <[email protected]>
Copy link
Collaborator

@clhunsen clhunsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for your contribution, @SCPhantom! This is a really good start for a very huge change in the network library. Nice work so far!

Except for some minor comments, I have three major issues with the current state:

  • We need documentation on the new configuration options in the README file.
  • For more consistency, the base artifact for the file artifact needs to be added to BASE.ARTIFACTS, so we can use such constants in the code (see the corresponding comments).
  • The construction of co-change-based author networks looks good basically, but I suppose we can improve the code heavily (I added some extensive notes in the corresponding comment).

As I raise questions in my comments, please do not hesitate to answer and ask questions yourself. Maybe, I just forgot about something that we agreed on earlier. 😉

Anyway, we should discuss some details in a face-to-face meeting on Friday; @bockthom, @ecklbarb, you are also invited. I marked all comments with issues that are subject for debate using the emoji 🌳. Additionally, we could also discuss who is carrying out which suggested/needed changes.

util-read.R Outdated Show resolved Hide resolved
util-read.R Outdated Show resolved Hide resolved
util-read.R Outdated Show resolved Hide resolved
util-data.R Show resolved Hide resolved
util-data.R Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
util-conf.R Outdated Show resolved Hide resolved
util-conf.R Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
util-data.R Outdated Show resolved Hide resolved
util-data.R Outdated Show resolved Hide resolved
util-data.R Outdated Show resolved Hide resolved
util-data.R Show resolved Hide resolved
util-read.R Outdated Show resolved Hide resolved
util-misc.R Outdated Show resolved Hide resolved
util-read.R Outdated Show resolved Hide resolved
util-data.R Outdated Show resolved Hide resolved
util-data.R Outdated Show resolved Hide resolved
showcase.R Show resolved Hide resolved
Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, @SCPhantom.

There are still some minor issues to fix (mostly documentation), but looks good now.

In addition, I have added some further comments for discussing aspects which we did not yet have discussed (at least not discussed during my presence).

Sorry for spamming you with that much mails. I should have used one review instead of that much single comments...

Jakob Kronawitter and others added 4 commits December 15, 2018 20:23
The global constant 'UNTRACKED.FILE' is added to avoid reusage of the same
string 'untracked.file' all the time. In addition minor adjustments are made to
the documentation.

Signed-off-by: Jakob Kronawitter <[email protected]>
In recent scenarios and in perspective of up-coming changes, the default
behavior of 'Conf' objects upon initialization and update:

1) The default values are *not* automatically checked against the
allowed values anymore. This is mainly disabled to avoid confusion of
users. The constructor of the class 'Conf' is adapted accordingly.
2) When updating a configuration value, the program execution is now
stopped (using 'stop') on failure. Previously, the respective update has
been ignored while issuing a warning. This change helps preventing
confusion and analysis errors early in an analysis script. Accordingly,
the parameter 'stop.on.error' to all update methods is removed.

Furthermore, the code is streamlined, such that the super-constructor is
called from both subclasses 'NetworkConf' and 'ProjectConf'. Some log
statements are added/adjusted, too.

Signed-off-by: Claus Hunsen <[email protected]>
When a network contains no edges but more than one node, all the nodes get
combined. To fix this, the respecting data frame, which contains the nodes,
has to be transposed.

This fixes #150.

Reported-by: Jakob Kronawitter <[email protected]>
Signed-off-by: Thomas Bock <[email protected]>
The edge creation process which does not draw any edges among authors of
untracked files and - if configured in the 'ProjectConf' - does also not draw
any edges among the base artifact authors is being reworked since the old way
of achieving this was rather uninituitive and complicated.

Signed-off-by: Jakob Kronawitter <[email protected]>
util-data.R Outdated Show resolved Hide resolved
util-networks.R Outdated Show resolved Hide resolved
util-networks.R Outdated Show resolved Hide resolved
For commits to untracked files the artifact column has previously been the
copied file column (for example when looking at the commit data returned by
'get.commits'). However this should only be the case when considering file level
analysis (e.g. 'artifact == file' in the 'ProjectConf').
This commit changes this to the correct behaviour. So for 'artifact == function'
and 'artifact == feature' the artifact column now only contains the empty string
for untracked files. To avoid hardcoding this empty string in every affected
place a global constant called 'UNTRACKED.FILE.EMPTY.ARTIFACT' was added.

Signed-off-by: Jakob Kronawitter <[email protected]>
@bockthom
Copy link
Collaborator

As @clhunsen asked for my opinion on some ideas, I will add some comments now:

  1. First, rename the function get.empty.dataframe to create.empty.data.frame to stay more consistent. Second, introduce functions such as create.empty.commit.list and create.empty.mail.list that call create.empty.data.frame internally. This way, we would have these calls consistently anywhere and would not bother the developer with the data-columns types when needed. This addition would just be convenient. 😉

I definitely agree with this. Sounds like a good idea.

  1. When constructing artifact networks, we get the empty artifact as a vertex when using artifact == feature (and only there for now, if I performed the test correctly).

    To fix this coherently, we need to add commit data on untracked files to the test suite... I will post details in an issue that we need to solve in a consequent PR.

I am not sure if I understand this problem correctly. Does this only affect the test suite? However, we can discuss that in #153.

  1. A general question that came to me while reviewing: Should the methods ProjectData$group.* use the method $get.commits or the method $get.commits.raw? Should we be able to use either one if wanted (e.g., by adding another data.source "commits.filtered")? Otherwise, this needs to be documented somewhere. 😉

I am a little bit confused now. get.commits.raw does not exist any more. Is the intended question whether to use $get.commits or $get.commits.filtered ?

As I remember our discussions taking place on the last Fridays and Mondays correctly, then we have agreed on using get.commits.filtered for the grouping functions. But I am not sure if I remember that correctly... In the end, we should make sure, that this is in compliance with the intended network construction afterwards...

Jakob Kronawitter added 3 commits December 20, 2018 16:01
Signed-off-by: Jakob Kronawitter <[email protected]>
The following four functions are introduced to make empty dataframe creation
even easier:
- create.empty.authors.list
- create.empty.commits.list
- create.empty.issues.list
- create.empty.mails.list

As the name implies, each function is creating and returning an dataframe with
zero rows but the correct column names and datatypes (like a dataframe of the
corresponding type would have if it was filled with authors, commits, issues or
mails, respectively).

In addition, all the column names and column datatypes are made publicly
available by defining them as constants.

Signed-off-by: Jakob Kronawitter <[email protected]>
@clhunsen
Copy link
Collaborator

I am a little bit confused now. get.commits.raw does not exist anymore. Is the intended question whether to use $get.commits or $get.commits.filtered ?

Yeah, that is just an honest mistake. Surely, I meant $get.commits.filtered, as you pointed out already. 🤦‍♂️

As I remember our discussions taking place on the last Fridays and Mondays correctly, then we have agreed on using get.commits.filtered for the grouping functions. But I am not sure if I remember that correctly... In the end, we should make sure, that this is in compliance with the intended network construction afterwards...

Honestly, I could not remember completely and wanted to point this out explicitly before forgetting about it eventually. To make it clear: I am fine with the current way. As we use the filtered commit data for network construction right now, we are compliant already.

util-read.R Show resolved Hide resolved
Change the behaviour of method 'set.commits' and 'get.authors.by.data.source'
whenever a 'NULL' is passed as parameter.

Signed-off-by: Jakob Kronawitter <[email protected]>
@jkronaw
Copy link
Author

jkronaw commented Dec 20, 2018

First of all sorry for rush-committing while discussions are / were still ongoing ;)
I'll try to avoid this for future pull requests and commits, but wanted to get the things done quickly before the holidays 😄

To the points of you, @clhunsen:

  1. I did that, the idea was great and makes the code much better!
  2. I have not tried it out, yet, however, as you said, this is a topic for another pull request. I agree, that we should add lines for commits to untracked files to our commits.list
  3. I vote for staying with get.commits.filtered for now. Just because it makes the PR easier.

Copy link
Collaborator

@bockthom bockthom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for your work, @SCPhantom!

I quickly looked over the current version of this PR. Looks quite good already.

For now, I have only a few little comments. (I need to have a closer look at the new network-construction algorithm regarding the respective nodes next year.)

With that said, this is my last action on this PR before the Christmas break. Have a Merry Christmas and a Happy New Year!

README.md Outdated Show resolved Hide resolved
util-misc.R Outdated Show resolved Hide resolved
util-misc.R Outdated Show resolved Hide resolved
util-networks.R Outdated Show resolved Hide resolved
util-misc.R Outdated Show resolved Hide resolved
util-conf.R Show resolved Hide resolved
Jakob Kronawitter added 3 commits January 7, 2019 18:09
Signed-off-by: Jakob Kronawitter <[email protected]>
Previously, next to the declaration of column names for commonly used dataframes
(dataframes containing commits, mails, etc.), there was an SQL statement
describing how the data was retrieved before it is read by the network library
(in the tool 'codeface-extraction').
These SQL statements are now removed and instead, it is referred to the tool
'codeface-extraction' itself on GitHub.

Signed-off-by: Jakob Kronawitter <[email protected]>
@jkronaw
Copy link
Author

jkronaw commented Jan 9, 2019

Once again, I think that most of the work should be done for now.

I have updated the changelog once again including the relevant changes of all commits.

I have ignored the two commits: 7e27a18 and 137d833, as they are not 100% clear to me at the moment. I would be glad, as they were authored by @clhunsen, if you could give me the corresponding lines for the changelog. You are the one with the better understanding here 😏

@jkronaw jkronaw closed this Jan 9, 2019
@jkronaw jkronaw reopened this Jan 9, 2019
@jkronaw
Copy link
Author

jkronaw commented Jan 9, 2019

Whoops, that was an accident!

@clhunsen
Copy link
Collaborator

clhunsen commented Jan 9, 2019

I will have a closer look tomorrow. Thanks for everything! 👍

I have ignored the two commits: 7e27a18 and 137d833, as they are not 100% clear to me at the moment. I would be glad, as they were authored by @clhunsen, if you could give me the corresponding lines for the changelog. You are the one with the better understanding here 😏

Here you are:

### Added
- Add method `ProjectData$get.authors.by.data.source` to retrieve authors by given data-source name (#149, 65804276dd2ada9b2f00b2cab7b6ad0cecbe733e, 137d8337bc35f5a83aa16a48ef8e47fc0d36b36c)

### Fixed  
- Fix vertices for networks without edges (#150, PR #149, 0d7c2226da67f3537f3ff9d013607fe19df8a4c0, 7e27a182de282f054f08e3a2fb04d852c2c55102)

clhunsen
clhunsen previously approved these changes Jan 14, 2019
Copy link
Collaborator

@clhunsen clhunsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is very, very close to being merged. 😀👍 Here the last changes to be made:

  • two small changes to the README file (with suggestions) and
  • some missing updates to copyright headers.

NEWS.md Outdated Show resolved Hide resolved
NEWS.md Show resolved Hide resolved
tests/test-networks-artifact.R Show resolved Hide resolved
tests/test-networks-author.R Show resolved Hide resolved
tests/test-networks-bipartite.R Show resolved Hide resolved
tests/test-networks-covariates.R Show resolved Hide resolved
Jakob Kronawitter added 2 commits January 14, 2019 23:04
Signed-off-by: Jakob Kronawitter <[email protected]>
Signed-off-by: Jakob Kronawitter <[email protected]>
@clhunsen
Copy link
Collaborator

Thank you very much again for these extensive changes, @SCPhantom! And for your endurance. 😉 I will merge now. 👍

FYI: The test for R 3.3 is failing due to issue #152 (comment). I will add a patch for this in my next PR.

@clhunsen clhunsen merged commit 2488039 into se-sic:dev Jan 15, 2019
@clhunsen clhunsen mentioned this pull request Jun 7, 2019
fehnkera pushed a commit to fehnkera/coronet that referenced this pull request Sep 23, 2020
This patch consists of three related fix and adaptations:

First, the method 'ProjectData$get.authors.by.data.source' does not
correct the column names of the returned data.frame anymore. This
establishes compatibility with the method 'ProjectData$get.authors'.
Additionally, the returned data.frame only contains unique entries. The
documentation is tidied.

Second, the method 'NetworkBuilder$get.author.network.cochange' is fixed
by adding the missing 'private$' prefix when accessing the project data.

Third, the assignment of author vertices is corrected to use only author
names with the correct vertex attribute (i.e., "name"). This adapts the
code with respect to the first change mentioned above.

This change fixes all failing tests in PR se-sic#149.

Signed-off-by: Claus Hunsen <[email protected]>
fehnkera pushed a commit to fehnkera/coronet that referenced this pull request Sep 23, 2020
To properly test the changes introduced in PR se-sic#149 (configuration
parameter 'commits.filter.untracked.files'), commit information on
untracked files is added to the test suite (in detail, two commits by
two distinct authors). The test suite is adapted accordingly.

This is a first step for issue se-sic#153.

Signed-off-by: Claus Hunsen <[email protected]>
fehnkera pushed a commit to fehnkera/coronet that referenced this pull request Sep 23, 2020
Augmenting the changes from PR se-sic#149, the empty artifact, which results
from untracked files, is not a part of any network type anymore.

This is change is based on commit d9f527c
and fixes se-sic#153.

This is a preliminary implementation with regard to issue se-sic#154.

Signed-off-by: Claus Hunsen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants