Multi-Agent data collection #348
That's an interesting question. My inclination is to say that since the scheduler keeps track of time in a model, models should have just one scheduler. Check out the custom scheduler in the Wolf-Sheep example (if you haven't already) to see how you can activate different types of agents in different orders. That doesn't solve the problem of how to handle data collection for heterogeneous attributes, though. The easy but ugly way is to give all agents the attribute, and just set some of them to 0, or None, or some similar 'N/A' value. I think the better way would be to give the DataCollector a default 'collect attribute' behavior, which would also let us get away from some of the ugliness with passing lambdas, etc.
Aha! I did not look into the example, because I thought the order was not important for the Wolf-Sheep example. And I did not think about implementing a custom scheduler. But after reconsidering, I strongly agree that there should be exactly one scheduler per model. Regarding the DataCollector, a "collect_attribute" would indeed be nice and more intuitive than the lambda functions. But I would still lean towards additionally defining the agent type. Staying with the Wolf-Sheep example, one might be interested only in the position of the wolves, but querying the `pos` attribute would still query all agents.
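For illustration, a minimal sketch of per-type activation with a single scheduler; the names assume mesa.time's `RandomActivationByType` (the generalized form of the Wolf-Sheep example's custom scheduler) and hypothetical `Wolf`/`Sheep` classes:

```python
from mesa import Model
from mesa.time import RandomActivationByType

class PredatorPrey(Model):
    def __init__(self):
        super().__init__()
        self.schedule = RandomActivationByType(self)

    def step(self):
        # One scheduler, but each type is activated as its own shuffled batch:
        # first all wolves in random order, then all sheep in random order.
        self.schedule.step_type(Wolf, shuffle_agents=True)
        self.schedule.step_type(Sheep, shuffle_agents=True)
```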
To advance on this, in the datacollection module I modified `_new_agent_reporter`:

```python
def _new_agent_reporter(self, reporter_name, reporter_function=None):
    """Add a new agent-level reporter to collect.

    Args:
        reporter_name: Name of the agent-level variable to collect.
        reporter_function: Function object that returns the variable when
            given an agent object.
    """
    if isinstance(reporter_function, str):
        reporter_function = self._collect_attribute(reporter_function)
    self.agent_reporters[reporter_name] = reporter_function
    self.agent_vars[reporter_name] = []

def _collect_attribute(self, attribute):
    """Return a reporter function that gets an attribute with the name of
    the reporter, if an agent has that attribute.
    """
    def reporter_function(agent):
        if hasattr(agent, attribute):
            return getattr(agent, attribute)
    return reporter_function
```

So instead of calling the agent-reporter with something like `lambda a: a.wealth`, you can simply pass the attribute name as a string. What do you think?
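Under that change, usage would look like the following sketch (the attribute name `wealth` is just an example):

```python
# Before: a lambda even for plain attribute access
dc = DataCollector(agent_reporters={"wealth": lambda a: a.wealth})

# After: just name the attribute; agents without it report None
dc = DataCollector(agent_reporters={"wealth": "wealth"})
```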
I just stumbled upon this issue again in my current model and I still think my last comment offers a nice solution. If you think this is a good way, I will create a test for this and submit a PR.
I ran into this as well and used another approach. What I did was simply altering the uids. E.g. all kings would have uids ['k0', 'k1', ...] and birds would have ['b0', 'b1', ...]. You get the idea. The agent reporter then guards on the prefix. I didn't really explore the above solution from Corvince, but I just wanted to add mine here, which seems to work well for me and is stupidly simple. An alternative to changing uids would be to simply add a property to the class, something like `Agent.type = 'king'`, and then verify this when you collect data.
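A minimal sketch of that uid-prefix approach (all names illustrative):

```python
# Kings get uids 'k0', 'k1', ...; birds get 'b0', 'b1', ...
king = King(unique_id=f"k{i}", model=model)

# The reporter guards on the prefix, so only kings report their treasury
dc = DataCollector(
    agent_reporters={
        "treasury": lambda a: a.treasury if a.unique_id.startswith("k") else None
    }
)
```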
Nowadays the data collector simply returns `None` if an attribute doesn't exist.
@Corvince
There is https://github.com/projectmesa/mesa/pull/1702/files, which adds an optional argument for this. Those files are the only documentation of the new feature, so there hasn't been a proper guide written for this yet. It should automatically ignore agents that don't have the attribute.
Thank you for your instructions. Would you mind guiding me again? [update]
That sounds like a serious bug. Do you have a minimal code reproducer, so I can tinker with it and fix it?
Hi @rht,
|
Mesa only automatically returns None if your data collector consists solely of string collectors, i.e. for your model:

```python
agent_reporters={"satisfication": "satisfication",
                 "unique_id": "unique_id"}
```

At some point this was planned to be the main way to collect attributes (because it is also the fastest), but custom functions are still around, so I kind of closed this issue too early.
Yes, this is indeed a bug. Looking at the code of #1702 now, we can see that it completely removes None values. But of course that leaves blank spaces in the DataFrame, so the values move to the left (because there are too few values to fill the DF and no indication of where values should go). Not sure how that option was supposed to work. @rht? You can also see in the discussion of #1702 that the feature alone wasn't meant to remove the need for handling attribute errors on the user side. Sorry for that.
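A standalone pandas sketch (not Mesa's actual collection code) of the misalignment: when a row comes up short because a None was dropped, pandas pads it on the right, so the remaining values slide under the wrong columns:

```python
import pandas as pd

rows = [
    (1, 10, "wolf"),  # complete row: unique_id, energy, type
    (2, "sheep"),     # 'energy' was None and got removed
]
print(pd.DataFrame(rows, columns=["unique_id", "energy", "type"]))
#    unique_id energy  type
# 0          1     10  wolf
# 1          2  sheep   NaN   <- 'sheep' lands in the energy column
```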
It looks like we have to make a choice: either keep the None values and get a correctly aligned DataFrame, or drop them and save memory. I think one solution would be to transpose the data collection:

```python
agent_records = {}
for k, func in self.agent_reporters.items():
    # keep (unique_id, value) pairs only where the reporter returns a value
    record = tuple(
        (agent.unique_id, r)
        for agent in model.schedule.agents
        if (r := func(agent)) is not None
    )
    agent_records[k] = record
```

The records can then be merged into one DataFrame, keyed by `unique_id`.
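For illustration, such per-reporter records could be merged afterwards with plain pandas (the data is invented for the sketch):

```python
import pandas as pd

agent_records = {
    "energy": (("w1", 3), ("w2", 7)),
    "wool": (("s1", 2.5),),
}
frames = [
    pd.DataFrame(list(records), columns=["unique_id", name]).set_index("unique_id")
    for name, records in agent_records.items()
]
merged = pd.concat(frames, axis=1)  # outer join on unique_id; gaps become NaN
```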
Maybe I am being dumb right now, but what was the purpose of removing None values in the first place? It doesn't facilitate multi-agent data collection. So I think the choice is obvious: don't exclude None values, and receive the correct DF. What would be the advantage of receiving a wrong dataframe, just to save oneself the slight inconvenience (?) of having lots of None values? Am I missing something here?
It wasn't raised as a GH issue, but @tpike3 encountered OOM when running Sugarscape G1MT on Colab (see #1702 (comment)). I suppose the storage problem here is due to Python's dict of lists growing with list size even though the constituents are Nones. Another option: maybe a DF for …
But then we really need to clear things up, because until now I thought the remove-None-values feature was somehow related to multi-agent data collection: it is discussed here and in #1702, which "fixed" #1419, which was also related to multi-agent data collection. If this is just about resolving that memory issue, then it needs to be investigated further, because it sounds very strange that removing some None values solves it. None values themselves take up nearly no memory. And I don't know which "dict of list" you are referring to, but yes, something like that must be going on. It still sounds fishy, since Colab has 13 GB of RAM, more than most consumer hardware, so I wonder why this hasn't been encountered previously. But right now we should focus on resolving the bug found by @philip928lin, because that might really mess up some people's research.
@Corvince I had a very long explanation, but as I am digging in I am finding inconsistencies in my understanding, so I will need to dig into this some more. Regardless, when updating the Sugarscape with traders, the memory issue became apparent: the code was collecting ~2500 None values each step for the sugar and spice, which started to break Colab's memory. The Sugarscape examples are [here](https://github.com/SFIComplexityExplorer/Mesa-ABM-Tutorial/tree/main). I still need to make updates for Mesa 2.0, but I think I will need to work through this issue first. Short version: while I appreciate that None values take up a very small amount of memory, when you have agents at each grid cell (like plant life) and collect against them, it still becomes problematic.
It's still hard to imagine; it would be great if you could look into this. For reference (and for the fun of it), take a simple list of 2500 records:

```python
x = [[i, 1, None] for i in range(2500)]
```

Using

```python
from pympler import asizeof
asizeof.asizeof(x)
```

we find that this list of lists consumes ~300 kB. So after 1000 steps we are at ~300 MB. That's still quite far away from Colab's 13 GB of RAM.
This is the original `agent_records` format, where the tuple element can be dropped while safely retaining which agents have which values.
You should measure/debug on the actual agent records object at https://colab.research.google.com/github/SFIComplexityExplorer/Mesa-ABM-Tutorial/blob/main/Session_19_Data_Collector_Agent.ipynb.
Thank you for the link, I couldn't find the right version. In your link I only had to change one thing. Analyzing the actual agent_records object gave me 310 MB of memory usage, and for the None-removed version 9 MB. So it was very nice to see my approximation of 300 MB being almost exactly true. But this also shows that while removing None can save lots of space compared to the full dataset, it doesn't prevent the model from being run on Colab: I could easily store 10 model runs in Colab's memory. @tpike3 I realized that when having multiple tabs of Colab open, each session shares the same memory. So maybe you were simply doing too much Colab work at the same time?
At first this looks nice, and I like the semantics of retaining what value is being collected. But I am afraid this won't scale very well. For this small example your version has a larger memory footprint (with None being removed, of course) due to the dictionary overhead. That probably goes away with larger sizes, but it doesn't scale with collecting more attributes, because you always have to store the unique_id with each data value. For example:

```python
('A0', 'a', 'b', 'c', 'd')
```

would become

```python
{'A': ('A0', 'a'),
 'B': ('A0', 'b'),
 'C': ('A0', 'c'),
 'D': ('A0', 'd')}
```

which can easily take up more memory. So it will really depend on how many None values you have. Also, I am worried that we need additional code to put the dataframe back together, and this will further complicate the code; the reason to favor #1702 over #1701 was to have simpler code. That goes away for something that could also be done after the fact by simply dropping the missing values from the final frame.
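For comparison, removing the missing values after the fact is a one-liner on the collected frame (using Mesa's standard getter):

```python
df = model.datacollector.get_agent_vars_dataframe()
df_clean = df.dropna()  # drop rows with any missing attribute value
```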
That makes a lot of sense now. Maybe it was a coincidence that @tpike3's memory usage was relieved by freeing ~300 MB in each session (as such, it could be gigabytes)? Regarding multi-agent data collection, @philip928lin already had the correct DF by using string reporters only.
I just went back through and actually found another change in Mesa 2.0 that broke the tutorial, which I need to go back in and fix. So I will try to get to that this weekend. However, if you run session 20 (batch_run) and comment out line 204 in the Model cell, this results in GBs of memory usage with one Colab open. You also need to change the instantiation of the sugar and spice landscape (lines 92 to 105) to …
@Corvince, @rht, @philip928lin let me know what you think, on either what I am messing up or the best way to move forward.
@tpike3 I can confirm that batch run leads to excessive memory usage, although it doesn't actually start that many model runs. I need to investigate this further, but my first impression is that something is off with batch_run.
Thanks @Corvince, I am wondering that too; maybe it wasn't the datacollector but batch_run. I am always behind, but I will dabble with it as well.
I'm a big fan of Corvince's latest proposal. I think it's both elegant and adds a huge amount of capability and flexibility! @jackiekazil @tpike3 @rht I'm really curious what you think!
I would like to give implementing @Corvince's proposal a go. But before doing that, I need to know if there is broader support, or if we need to go in a different direction, like my previous proposal or otherwise. I would also really hope we can move this forward. "No", "I disagree", "I don't have time", "this shouldn't be a priority" are all legitimate answers. But please just communicate something, so everyone knows how the deck is stacked.
Tracks agents in the model with a defaultdict. This PR adds a new `agents` dictionary to the Mesa `Model` class, enabling native support for handling multiple agent types within models. This way all modules can know which agents and agent types are in the model at any given time, by calling `model.agents`.

NetLogo has had agent types, called [`breeds`](https://ccl.northwestern.edu/netlogo/docs/dict/breed.html), built in from the start. It works perfectly in all NetLogo components, because it's a first-class citizen and all components need to be designed to consider different breeds. In Mesa, agent types are an afterthought at best. Almost nothing is currently designed with multiple agent types in mind. That has caused several issues and limitations over the years, including:

- projectmesa#348
- projectmesa#1142
- projectmesa#1162

Especially in scheduling, space, and data collection, the lack of a native, consistent construct for agent types severely limits the possibilities. With the discussion about patches and "empty", this discussion comes up again: you might want "empty" to refer to all agents, or only a subset of types, or a single type. That's currently cumbersome to implement. Basically, by always having a dictionary available of which agents of which types are in the model, you can always rely on a consistent construct to iterate over agents and agent types.

- The `Model` class now uses a `defaultdict` to store agents, ensuring a set is automatically created for each new agent type.
- The `Agent` class has been updated to leverage this feature, simplifying the registration process when an agent is created.
- The `remove` method in the `Agent` class now uses `discard`, providing a safer way to remove agents from the model.
Note that this discussion is largely continued here:
I'm posting back here instead of #1944, because it directly follows a proposal here. I'm inclined to say that @Corvince was closest with his API:

```python
dc = DataCollector(
    items={
        "wolf_vars": collect(
            target=model.get_agents_of_type(Wolf),
            attributes={
                "energy": "energy",
                "healthy": lambda a: a.energy > 5,
            },
            aggregates={
                "mean_energy": ("energy", np.mean),
                "number_healthy": ("healthy", sum),
            },
        ),
    }
)
```

This would return the following dictionary:

```python
{
    "wolf_vars": {
        "attributes": {
            "agent_id": [1, 2, 3],  # List of agent IDs
            "energy": [3, 7, 10],  # Energy levels of each wolf
            "healthy": [False, True, True],  # Whether each wolf is healthy (energy > 5)
        },
        "aggregates": {
            "mean_energy": 6.67,  # Mean energy of all wolves
            "number_healthy": 2,  # Number of healthy wolves
        },
    }
}
```

Implementation-wise, this could roughly look like:

```python
class DataCollector:
    def __init__(self, items):
        self.items = items
        self.data = {
            key: {"attributes": {}, "aggregates": {}}
            for key in items
        }

    def collect(self, model):
        for item_name, item_details in self.items.items():
            agents = item_details["target"]
            attributes = item_details["attributes"]
            aggregates = item_details["aggregates"]

            # Collect agent IDs
            self.data[item_name]["attributes"]["agent_id"] = agents.get("unique_id")

            # Collect attributes for each agent
            for attr_name, attr_func in attributes.items():
                if isinstance(attr_func, str):
                    # Attribute name: use AgentSet.get()
                    values = agents.get(attr_func)
                else:
                    # Callable: apply it to each agent (AgentSet.apply() style)
                    values = [attr_func(agent) for agent in agents]
                self.data[item_name]["attributes"][attr_name] = values

            # Collect aggregates from the already collected attribute values
            for agg_name, (attr_name, agg_func) in aggregates.items():
                values = self.data[item_name]["attributes"][attr_name]
                self.data[item_name]["aggregates"][agg_name] = agg_func(values)
```

I think this gives a huge amount of flexibility while offering a logical code path: first collect the raw agent data, then aggregate if needed. A nice benefit is that …
One thing which could be considered is not running aggregates per …
The new ideas that I can incorporate into #2199: …

I still find the API too hectic, too verbose for casual users to intuitively remember, unless there is a key feature that the simple API in #2199 can't cover. That's why I implemented it the way I did in #2199 and ditched the fancy measure classes. Reminder that there is not much time left on the drawing board: only ~2 weeks left.
One challenge I keep encountering is with the terminology we're using. I believe we're conflating data collection and data analysis too often, which muddies the distinction between the two and distracts us from what we want to achieve. In my current job, I've had to re-evaluate various libraries, focusing on what makes some more user-friendly than others. I've found that the deciding factor in terms of ease of use is having sensible defaults combined with the ability to fully customize under the hood. An intuitive API for data collection, in my view, would look something like this:

```python
data = run_model(model)
```

However, this kind of simplicity is missing from the current API. The reason I advocate for this approach is that I typically prefer to collect as much data as possible during the model run and perform the analysis afterward, either through custom functions or with built-in Mesa functions. By default, I believe we should automatically collect all agent and model attributes (and possibly every property) at every step. Aggregates, by their nature, can be calculated post-run, and derived expressions (like the `healthy` lambda above) can equally be evaluated after the fact.

I anticipate concerns about the potential performance impact of this approach. However, I don't think this will be a significant issue for most models. Data collection should be implemented at a low level, with more "expensive" convenience functions layered on top for users with specific requirements:

```python
data = run_model(model, data_collector=DataCollector(...))
```

This way, the API can focus on being flexible rather than overly concise. It doesn't need to be memorized for every model but can be something that users opt into when they have specific requirements.
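To make the "collect everything by default" idea concrete, a rough sketch of such a default collector; `collect_all_attributes` is a hypothetical helper (not an existing Mesa function), leaning on the `model.agents` construct introduced above:

```python
def collect_all_attributes(model):
    """Snapshot every public attribute of every agent at the current step."""
    return {
        agent.unique_id: {
            name: value
            for name, value in vars(agent).items()
            if not name.startswith("_")  # skip private state
        }
        for agent in model.agents
    }
```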
I broadly agree with the vision of @Corvince. However, for me, there is still a fundamental difference between the various attributes within a model/agent and the data one wants to collect about the model (and whether to collect this data over time). So, I am not in favor of just collecting everything by default. Rather, I want users to declare explicitly within the model/agent that a given attribute is collectable/observable. Next, outside the model, the user can specify how to collect it (over time or at the end of the run). This is still in line with having a simple and concise API that is incredibly flexible and so small that it is easy to remember. So you would get something like this:

```python
class MyModel(Model):
    gini = Observable()
    # rest of model goes here

class MyAgent(Agent):
    wealth = Observable()
    # rest of agent goes here

model = MyModel()
collectors = [AgentCollector(MyAgent.wealth),
              ModelCollector(MyModel.gini)]
model.run(ticks=100)
```
I still don't fully understand this sentiment. If I declare an attribute, it must play a decisive role in my model, otherwise I wouldn't need it. And so of course it seems advisable to keep its data. Nothing is lost in keeping more data than I actually analyze. The other way around is much more annoying: if I want to analyze something additional that I haven't thought about before, I now have to rerun my model just to collect additional data. The same logic applies to aggregating: I would never do this at runtime. It's not possible to disaggregate later, while aggregating can be done as late as needed. But for other reasons I am now much more in favor of exploring observables/signals as main building blocks for Mesa models. And then I agree that their declaration could be a nice entry point also for declaring how/if they should be collected. But I wouldn't tie this too much into the data collection discussion. I also agree that for my envisioned API an easy way to declare included/excluded attributes would be desirable, and doing this directly at the attribute level would be viable.
Yes, but from this, it does not follow that all attributes are part of the outcomes that you, as an analyst, are interested in. I see models as objects on which one performs experiments. It is good practice to carefully design the experimental setup. This includes specifying the variables to explicitly control and the data to gather for this experiment. Collecting simply everything, which I see a lot of people do, in my experience, often devolves into data dredging and bad science. I hope this clarifies where I am coming from.
I started exploring psygnal, and it seems like a nice library on which to build. For example, it solves the problem of observing changes to collections. I hope to find some time in the coming weeks to rerig my datacollection branch on top of this.
Philosophically I understand your point much better now, but practically speaking I think it applies much earlier. The only distinction I immediately see is between private and public variables. In some examples there are attributes that hold some private state (e.g. countdown for the grass patches in the wolf sheep example). Those should probably be declared as private variables and need not be collected. But as soon as they are used for some form of interaction (i.e. public), they should be measured.
Yes, this is what I was getting at. However, the cutoff is not between private and public per se. In the wolf-sheep example, the position is critical to the model's functioning, but for analysis (note: not visualization) purposes it is not so relevant. In contrast, in Epstein, the agent's internal state (wanting to protest but not daring to do so) can be quite relevant to track for analysis purposes.
It might be quite easy to automatically track all attributes declared as "Observable" by default, as you envisioned in your API. This would still involve some additional collector classes that are part of the basic model class or part of some run_experiment function (as in your sketched API). In my view, data collection is independent of the model and conceptually does not belong inside a model class. So you would get something like:

```python
class MyModel(Model):
    gini = Observable()
    # rest of model goes here

class MyAgent(Agent):
    wealth = Observable()
    # rest of agent goes here

model = MyModel()
data = run_experiment(model, ticks=100)
```

Here, data would be some new Data class from which you can grab, say, the agent-level data. It could have some convenience methods for converting to, e.g., data frames, but it would leave it up to the user to decide how to make the results persistent.
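To make the Observable idea concrete, a minimal descriptor-based sketch; this is purely illustrative (the real implementation would presumably build on psygnal rather than this hand-rolled notification):

```python
class Observable:
    """Descriptor that notifies registered callbacks on every assignment."""

    def __set_name__(self, owner, name):
        self.private_name = "_" + name
        self.public_name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return getattr(obj, self.private_name)

    def __set__(self, obj, value):
        setattr(obj, self.private_name, value)
        # notify any collectors registered on the instance
        for callback in getattr(obj, "_observers", []):
            callback(obj, self.public_name, value)
```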
```python
class MyModel(Model):
    gini = Observable()
    # rest of model goes here

class MyAgent(Agent):
    wealth = Observable()
    # rest of agent goes here
```

Unless it is a requirement/constraint of the psygnal syntax, marking both as an `Observable` object should be optional. With the current data collector API, one could already move it outside of the model specification. And it just needs the wrapper …
Thanks for this extensive and insightful discussion. I will try to wrap my mind some more around the Observable() ideas. One thing that's interesting is that with #2219 being merged, the AgentSet can now do practically anything we originally wanted to do in the DataCollector. Some simple aggregates are also already possible, with …
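For example, a sketch of AgentSet-based collection (`Model.agents_by_type` and `AgentSet.get` as in current Mesa; the `Wolf` model specifics are assumed):

```python
wolves = model.agents_by_type[Wolf]          # AgentSet of a single agent type
energies = wolves.get("energy")              # list of attribute values
healthy = [e > 5 for e in energies]          # derived attribute, computed ad hoc
mean_energy = sum(energies) / len(energies)  # simple aggregate, after the fact
```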
@Corvince with the new `agenttype_reporters` argument, this now works:

```python
class MyModel(Model):
    def __init__(self):
        super().__init__()
        self.datacollector = DataCollector(
            agent_reporters={"life_span": "life_span"},
            # The new agenttype_reporters argument
            agenttype_reporters={
                Wolf: {"sheep_eaten": "sheep_eaten"},
                Sheep: {"wool": "wool_amount"},
                Animal: {"energy": "energy"},  # Collects from all animals
            },
        )
```
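The per-type data can then be retrieved after the run; assuming the accompanying getter is named `get_agenttype_vars_dataframe` (treat the name as an assumption):

```python
wolf_df = model.datacollector.get_agenttype_vars_dataframe(Wolf)
```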
Yes! Finally 🎉
Dear all,
This is actually how I came across this issue: I wanted to activate different agent types sequentially (but both randomly), so I used two different schedulers, but this broke the data collection. Currently, agent reporters get their agents hard-coded from `model.schedule.agents`, assuming it exists and failing if your scheduler is named differently. One way to fix this would be to (optionally?) supply the scheduler to the DataCollector, as sketched below.
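Something like this, where the `schedule` argument is hypothetical and merely sketches the proposal:

```python
dc = DataCollector(
    agent_reporters={"pos": "pos"},
    schedule=wolf_schedule,  # hypothetical argument; would default to model.schedule
)
```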
The downside is that if you have agents that are not part of any schedule you still can't collect data for them. That's already a problem right now, so it wouldn't worsen the situation, but maybe someone has a better long-term solution to this?
Also, if you use the same scheduler, there seems to be no way to collect data from different agent types. If you, for example, want to collect the wealth of some of your agents, but not all agents have a wealth attribute, it fails.