Update docs re train/test leakage and min_percent
mmcdermott authored Dec 2, 2020
1 parent 50b3077 commit 420bfc4
Showing 1 changed file with 6 additions and 2 deletions.
8 changes: 6 additions & 2 deletions mimic_direct_extract.py
@@ -744,8 +744,12 @@ def plot_variable_histograms(col_names, df):
 help="Don't group by level2.")
 
 ap.add_argument('--min_percent', type=float, default=0.0,
-help='Minimum percentage of row numbers need to be observations for each numeric column.' +
-'min_percent = 1 means columns with more than 99 percent of nan will be removed')
+help='Minimum percentage of row numbers need to be observations for each numeric column. ' +
+'min_percent = 1 means columns with more than 99 percent of nan will be removed. ' +
+'Note that as our code does not split the data into train/test sets, ' +
+'removing columns in this way prior to train/test splitting yields in a (very minor) ' +
+'form of leakage across the train/test set, as the overall missingness measures are used ' +
+'that are based on both the train and test sets, rather than just the train set.')
 ap.add_argument('--min_age', type=int, default=15,
 help='Minimum age of patients to be included')
 ap.add_argument('--min_duration', type=int, default=12,
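
The leakage described in the new help text can be illustrated with a small, standard-library-only sketch. It is not code from mimic_direct_extract.py: the helper names (observed_percent, columns_to_keep), the toy data, the train/test row split, and the use of ">=" for the threshold comparison are all assumptions made for illustration. It shows how a column can pass the min_percent filter when missingness is measured over all rows, yet fail it when measured over the training rows alone.

```python
# Hypothetical helpers for illustration; not part of mimic_direct_extract.py.

def observed_percent(values):
    """Return the percentage (0-100) of entries that are not None."""
    if not values:
        return 0.0
    observed = sum(v is not None for v in values)
    return 100.0 * observed / len(values)


def columns_to_keep(columns, row_indices, min_percent):
    """Return names of columns whose observed percentage, measured only
    over row_indices, is at least min_percent (assumed comparison)."""
    keep = []
    for name, values in columns.items():
        subset = [values[i] for i in row_indices]
        if observed_percent(subset) >= min_percent:
            keep.append(name)
    return keep


# Toy data: None marks a missing measurement.
columns = {
    "heart_rate": [72, 80, None, 65, 90, 77],
    "rare_lab": [None, None, None, None, 1.2, 3.4],
}
train_rows = [0, 1, 2, 3]  # assumed split, chosen to make the effect visible
test_rows = [4, 5]
all_rows = train_rows + test_rows

# Filtering on all rows lets test-set observations influence column selection.
kept_all = columns_to_keep(columns, all_rows, min_percent=25.0)

# Filtering on the training rows only avoids that dependence.
kept_train = columns_to_keep(columns, train_rows, min_percent=25.0)

print("kept using all rows:  ", kept_all)
print("kept using train rows:", kept_train)
```

In this toy case "rare_lab" survives the all-rows filter only because of its two test-set observations. To avoid this kind of leakage entirely, one option is to leave --min_percent at its default of 0.0 during extraction and apply a train-only filter after splitting.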