Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abstracted Replay Data Parsers #119

Open
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

cole-maclean
Copy link
Contributor

PR to implement an abstraction of the replay_actions module to separate running the replays and the data scrapping as per issue #51 , similar to how running agents is abstracted to allow for arbitrary user written agents. This will likely be a fairly large redesign, so PR is currently a Work In Progress to get things started. Current commit is proof of concept of abstracting a parser class from the replay setup to run the action stats scrapping of the current replay_actions script.

@cole-maclean
Copy link
Contributor Author

Ok @tewalds , do you wanna take a look and see if this is on the right track? I've separated the replay_actions (this name should probably change, I'm thinking parse_replays) from the actual replay parsers, which can be found in the new folder "replay_parsers". There's a base_parser class that implements basic stats and methods needed to parse a replay. The key parsing function is parse_step, which takes as input from the ReplayParser obs, feat and info. Other optional overrides are valid_replay for users to write their own definitions of what constitutes a valid replay, and stats merging/ printing if they want custom printed stats. If the parser parse_step function returns anything, whatever is returned is appended to a data list at each step in the replay, and this data list is saved into a json data file for that specific replay&playerId. There's an optional data_dir argument for users to pass in for where to save these files, if their parser requires it.

I've also included my state scrapping script I wrote for a project as an example of how to parse state info from a replay and save them to files.

Let me know your thoughts, and if you think this is the right way to go I'll work on updating documentation. Could probably use some testing as well.

Thanks!
-Cole

flags.DEFINE_integer("screen_resolution", 16,
"Resolution for screen feature layers.")
flags.DEFINE_integer("minimap_resolution", 16,
"Resolution for minimap feature layers.")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added resolution args to be consistent with play.py

flags.mark_flag_as_required("replays")
FLAGS(sys.argv)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed this to get around Issue #77

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be dealt with by app.run and shouldn't be explicitly needed.

interface.feature_layer.resolution.x = FLAGS.screen_resolution
interface.feature_layer.resolution.y = FLAGS.screen_resolution
interface.feature_layer.minimap_resolution.x = FLAGS.minimap_resolution
interface.feature_layer.minimap_resolution.y = FLAGS.minimap_resolution

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copied from play.py to be consistent

self.proc_id = proc_id
self.time = time.time()
self.stage = ""
self.replay = ""
self.replay_stats = ReplayStats()
self.parser = parser_cls()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed replay_stats to parser, updated throughout code

merge_dict(self.made_actions, other.made_actions)
self.crashing_replays |= other.crashing_replays
self.invalid_replays |= other.invalid_replays

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge moved to parser class

@@ -192,7 +122,7 @@ def run(self):
self._print("Empty queue, returning")
return
try:
replay_name = os.path.basename(replay_path)[:10]
replay_name = os.path.basename(replay_path)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed 10 character truncating to full replay name

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason it was just the first 10 is that I was running mainly over replays that were sha1 named, so very long and the prefix gave enough uniqueness. I'm fine with showing the full as long as it's readable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I've left the full replay name but I'll leave it up to you to decide on readability. The only catch is that data files are uniquely named from their replay name and will be overwritten if there are colliding names.

@@ -216,16 +146,16 @@ def run(self):
self._print("Starting %s from player %s's perspective" % (
replay_name, player_id))
self.process_replay(controller, replay_data, map_data,
player_id)
player_id,info,replay_name)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

passing info and replay_name to process_replay to allow user to parse info data, replay_name for file naming

for ability_id in feat.available_actions(obs.observation):
self.stats.replay_stats.valid_actions[ability_id] += 1

if obs.player_result:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

action stats parsing moved to custom ActionParser class

if data:
data_file = FLAGS.data_dir + replay_name + "_" + str(player_id) + '.json'
with open(data_file,'w') as outfile:
json.dump(data,outfile)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to disk file saving at end of replay if custom parser parse_step returns data

friendly_army,enemy_army,all_features['player'].tolist(),
all_features['available_actions'].tolist(),actions,winner,
race,enemy_race]
return full_state
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning from this function initiates a data list for each step that will be saved to a file for this replay and player perspective

FLAGS = flags.FLAGS
flags.DEFINE_integer("parallel", 1, "How many instances to run in parallel.")
flags.DEFINE_integer("step_mul", 8, "How many game steps per observation.")
flags.DEFINE_string("replays", None, "Path to a directory of replays.")
flags.DEFINE_string("parser", "pysc2.replay_parsers.base_parser.BaseParser",
"Which agent to run")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update help text.

FLAGS = flags.FLAGS
flags.DEFINE_integer("parallel", 1, "How many instances to run in parallel.")
flags.DEFINE_integer("step_mul", 8, "How many game steps per observation.")
flags.DEFINE_string("replays", None, "Path to a directory of replays.")
flags.DEFINE_string("parser", "pysc2.replay_parsers.base_parser.BaseParser",
"Which agent to run")
flags.DEFINE_string("data_dir", "C:/",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't set a default value that will be wrong or invalid for a large number of users. C:/ doesn't exist on linux, and any default location will be wrong for someone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I've changed the default to be None, and data is only saved to a file if the user supplies a data_dir. Another option would be to use os.getcwd() and use the current working directory as the default directory.

@@ -192,7 +122,7 @@ def run(self):
self._print("Empty queue, returning")
return
try:
replay_name = os.path.basename(replay_path)[:10]
replay_name = os.path.basename(replay_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason it was just the first 10 is that I was running mainly over replays that were sha1 named, so very long and the prefix gave enough uniqueness. I'm fine with showing the full as long as it's readable.

@@ -216,16 +146,16 @@ def run(self):
self._print("Starting %s from player %s's perspective" % (
replay_name, player_id))
self.process_replay(controller, replay_data, map_data,
player_id)
player_id,info,replay_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the google python style guide: spaces between args.

@@ -237,7 +167,7 @@ def _update_stage(self, stage):
self.stats.update(stage)
self.stats_queue.put(self.stats)

def process_replay(self, controller, replay_data, map_data, player_id):
def process_replay(self, controller, replay_data, map_data, player_id,info,replay_name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more spaces for style guide

def calc_armies(screen):
friendly_army = []
enemy_army = []
unit_list = np.unique(screen[6])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not hardcode these numbers. Look them up in features.SCREEN_FEATURES. That's both for readability and so they don't break if we add new layers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

class StateParser(base_parser.BaseParser):
"""Action statistics parser for replays."""
def valid_replay(self,info, ping):
return True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh yeup. Removed.

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Action statistics parser for replays."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking through this file it may be better to put it in your repo. The action stats is already a little bit complicated as an example. It'd be nice to have a simpler example. Some ideas:

  • distribution of steps between actions, which could be useful for finding reasonable APMs and step_mul
  • distribution of APM and MMRs to graph APM vs MMR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, updated example to just scrape the General player information layer from each step, along with the winning player. Could be used to build a winner prediction model based on resource/supply/army size data as the game progresses.

import collections
import six

class BaseParser(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you read about MapReduce? I haven't done enough research, but consider whether that would be a good framework to use here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately no, not currently in my wheelhouse, but might be a good opportunity to dig into it a little.

flags.mark_flag_as_required("replays")
FLAGS(sys.argv)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be dealt with by app.run and shouldn't be explicitly needed.

# Save scraped replay data to file at end of replay if parser returns
# and data_dir provided
if data:
if FLAGS.data_dir:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data only saved to file if data_dir provided from user

@cole-maclean
Copy link
Contributor Author

Alright, @tewalds can you take a look when you get a chance? Think I hit all your comments. I also added some documentation about replays and parsers.

@tewalds
Copy link
Contributor

tewalds commented Nov 11, 2017

I haven't looked at the recent changes, but it's probably also worth renaming replay_actions.py to something more general like process_replays.py or similar. Also, thanks for adding documentation. If you have a reasonable way to do it, some unit tests would be appreciated as well.

@cole-maclean
Copy link
Contributor Author

Agreed, updated to process_replays. I'll put some thought to testing.

self._print("Empty queue, returning")
return
try:
self.load_replay(replay_path, controller, ping)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replay loading refactored into own method 'load_replay'

return

def load_replay(self, replay_path, controller, ping):
replay_name = os.path.basename(replay_path)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored load_replay method to modularize replay processing

@cole-maclean
Copy link
Contributor Author

Ok @tewalds I've added some tests, replay_parser_test. Probably not comprehensive, but should provide a good framework for building up future tests if needed, and follows pretty closely to the design of other test scripts in the repo. I think at this point this should close out a first pass for this functionality. We can discover and experiment with improvements (ie. MapReduce) in a future PR. Let me know if you have any feedback or see some snags! Thanks Timo.

@cole-maclean cole-maclean changed the title [WIP] Abstracted Replay Data Parsers Abstracted Replay Data Parsers Nov 17, 2017
merge_dict(self.made_actions, other.made_actions)

def valid_replay(self, info, ping):
"""Make sure the replay isn't corrupt, and is worth looking at."""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed valid replay conditions to be very low for test replay pass. The checks can stand as an example for users who wish to make valid_replay checks more aggressive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants