
BUGFIX: [graph] Increase count of related elements fetched by Correlation Graph to 500000 #9240

Open

wants to merge 4 commits into master
Conversation

ckane
Contributor

@ckane ckane commented Dec 5, 2024

The 500 upper limit is very low and results in many correlation graphs displaying no correlations in the chart, even though the main page of the Graph says correlations exist. In some cases this limit was just 20. This fixes many of the reported problems where related reporting is visible on the Overview page of an analysis/case/etc. but the Correlation Graph is blank (because it isn't fetching all nodes, and the correlations don't appear in the first 500 fetched). In some scenarios the graph does show some items, but is missing other expected ones.

Proposed changes

  • Set `first: 500000` when fetching first-degree and second-degree related nodes for Correlation Graphs
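For illustration, the change boils down to raising the `first` argument on the GraphQL connection fields the correlation graph queries. A minimal sketch, with hypothetical query and field names (not the exact OpenCTI schema); only the `first` argument is the point of the change:

```typescript
// Hypothetical sketch of the kind of connection query the correlation
// graph issues. Query and field names are illustrative; the fix raises
// the `first` argument (previously 500, in some cases 20) to 500000.
const relatedContainersQuery = `
  query CorrelationRelatedContainers($id: String!) {
    stixCoreObject(id: $id) {
      containers(first: 500000) {
        edges {
          node { id entity_type name }
        }
      }
    }
  }
`;
```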

Related issues

Checklist

  • I consider the submitted work as finished
  • I tested the code for its functionality
  • I wrote test cases for the relevant use cases (coverage and e2e)
  • I added/updated the relevant documentation (either on GitHub or on Notion)
  • Where necessary I refactored code to improve the overall quality

@ckane ckane mentioned this pull request Dec 5, 2024

codecov bot commented Dec 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 65.26%. Comparing base (f8d757f) to head (ebcd2af).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9240      +/-   ##
==========================================
- Coverage   65.26%   65.26%   -0.01%     
==========================================
  Files         630      630              
  Lines       60311    60311              
  Branches     6777     6771       -6     
==========================================
- Hits        39361    39359       -2     
- Misses      20950    20952       +2     

☔ View full report in Codecov by Sentry.

@labo-flg
Member

labo-flg commented Dec 5, 2024

Thanks for your contributions!
I totally agree this limit makes the correlation views inconsistent; however, it's a bit delicate to entirely remove the limit, for performance reasons.

Imagine you have a widely used entity like "APT28" or "USA", present in thousands of reports. GQL will wait for the complete resolution of this huge graph, which will take a long time. Display would suffer a lot too.

@ckane
Contributor Author

ckane commented Dec 5, 2024

> Thanks for your contributions! I totally agree this limit makes the correlation views inconsistent; however, it's a bit delicate to entirely remove the limit, for performance reasons.
>
> Imagine you have a widely used entity like "APT28" or "USA", present in thousands of reports. GQL will wait for the complete resolution of this huge graph, which will take a long time. Display would suffer a lot too.

I agree, though in my experience that behavior will probably hit a browser or network timeout, which a user can still escape from if they don't like it. So with the low upper limit, those "big reports" are unusable, but a bunch of medium-sized reports are unusable too. With this change, the big reports may still be unusable, but medium and small reports become more usable and accurate.

The graphing library seems fairly performant; here's an example of a graph with thousands of items that rendered fine for me:
[screenshot]

@ckane
Contributor Author

ckane commented Dec 5, 2024

"Big reports" like the "Operation SMN (Novetta)" report available via MISP Feeds are still unusable, and eventually display an error. However, the normal entity graph for that report also doesn't work, and times out with an error while trying to fetch its content. This report is my favorite use case when testing and evaluating "large report" problems.
[screenshot]

In my opinion, the "too big" failure mode occurs in a much more limited set of circumstances than the current "missing data" behavior does today. Between the two, I would prefer to tolerate the failure mode where a small number of very large or highly-related reports might error out with a render failure.

@ckane
Contributor Author

ckane commented Dec 5, 2024

If you try to render a graph that's too big, it shows this error after about 15 seconds, so IMO that behavior is preferable:
[screenshot]

@ckane
Contributor Author

ckane commented Dec 6, 2024

I took this work a step further and have the correlation traversal algorithm traverse all related nodes of type cases, groupings, and reports regardless of which entity type is the "current view". This allows for the exploration of related OSINT reporting from within an Incident Response or Grouping container's correlation graph, and vice-versa.

Example view from within an RFI object I created:
[screenshot]

@labo-flg labo-flg added the community (use to identify PR from community) and graph (linked to graph display and manipulation) labels Dec 6, 2024
@ckane ckane force-pushed the corr-graph-missing-items-issue9198 branch 4 times, most recently from ed0ab53 to 1ee9da9 (December 12, 2024 04:12)
@labo-flg
Member

> I took this work a step further and have the correlation traversal algorithm traverse all related nodes of type cases, groupings, and reports regardless of which entity type is the "current view". This allows for the exploration of related OSINT reporting from within an Incident Response or Grouping container's correlation graph, and vice-versa.

That's a good point. Part of our work on #3227 for the next milestone addressed this issue; it has been merged in #9175.

Note that in the details view, correlated containers are detected through common observables and indicators; for consistency, we aligned the graph view the same way.
In that PR we also added the ability to switch the graph view from "observables and indicators" to "all entities" in case the user needs extended correlations.

Now, concerning the limit and the performance issues: this raised a lot of discussion.
We need to be careful about the impact of such changes: in some conditions it might generate such a huge amount of data that your backend process would just eat all the memory and die.
See this issue: #9293

We think the best solution is a heavy refactoring of the graph code to fetch data bit by bit, paginating results and loading the graph incrementally.
I've drafted an issue for that: #9359

@ckane
Contributor Author

ckane commented Dec 16, 2024

> > I took this work a step further and have the correlation traversal algorithm traverse all related nodes of type cases, groupings, and reports regardless of which entity type is the "current view". This allows for the exploration of related OSINT reporting from within an Incident Response or Grouping container's correlation graph, and vice-versa.
>
> That's a good point. Part of our work on #3227 for the next milestone addressed this issue; it has been merged in #9175.
>
> Note that in the details view, correlated containers are detected through common observables and indicators; for consistency, we aligned the graph view the same way. In that PR we also added the ability to switch the graph view from "observables and indicators" to "all entities" in case the user needs extended correlations.
>
> Now, concerning the limit and the performance issues: this raised a lot of discussion. We need to be careful about the impact of such changes: in some conditions it might generate such a huge amount of data that your backend process would just eat all the memory and die. See this issue: #9293
>
> We think the best solution is a heavy refactoring of the graph code to fetch data bit by bit, paginating results and loading the graph incrementally. I've drafted an issue for that: #9359

Thanks! I agree that the best implementation would be a paginated walk that incrementally builds the graph with smaller requests that don't overload the back-end. I don't understand well enough how to do that for these 2nd-degree related items, however (hence my solution of just setting the limits very high).

I do think it would be helpful for the front-end to query the "total set" size ahead of time, so it can render a progress meter during the incremental graph build-out and end users know when data is still being retrieved. The partial graph could be shown in real time, with additions displayed as they're fetched from the server.

Additionally, a "stop" and/or "pause" button that allows the data fetch/graph build to be halted would also be helpful. For larger graphs, if the end user sees the data they're interested in at some point during the paginated ingest, they could stop or pause the fetch and then explore the portion of the graph built up to that point.
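The incremental approach described above could be sketched roughly as follows. This is a minimal, synchronous sketch assuming a cursor-based page shape (`nodes`/`endCursor`/`hasNextPage`, as in a GraphQL connection); real code would be asynchronous, all names here are illustrative, and the `shouldStop` callback stands in for the proposed stop/pause button:

```typescript
// Minimal sketch of incremental, cursor-paginated graph loading.
// Page shape mirrors a GraphQL connection; names are illustrative.
interface Page<T> {
  nodes: T[];
  endCursor: string | null;
  hasNextPage: boolean;
}

function loadGraphIncrementally<T>(
  fetchPage: (cursor: string | null) => Page<T>, // one paginated request
  onBatch: (nodes: T[]) => void,                 // render partial graph
  shouldStop: () => boolean = () => false,       // "stop"/"pause" hook
): T[] {
  const all: T[] = [];
  let cursor: string | null = null;
  let hasNextPage = true;
  while (hasNextPage && !shouldStop()) {
    const page = fetchPage(cursor);
    all.push(...page.nodes);
    onBatch(page.nodes); // caller can draw these nodes immediately
    cursor = page.endCursor;
    hasNextPage = page.hasNextPage;
  }
  return all;
}
```

The `onBatch` callback is where the UI would add nodes to the rendered graph as each page arrives, and checking `shouldStop` between pages is what lets a user freeze the build once the data they care about is visible.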

@ckane ckane force-pushed the corr-graph-missing-items-issue9198 branch from 2bd60d5 to 76acd0f (December 18, 2024 03:29)
@labo-flg
Member

That's where we want to go next; we're aligned on that :)

I'm thinking step-by-step queries, all properly paginated: the frontend would query the main container, then its content, then for each one the related containers, etc.
The graph would load bit by bit on screen, with a proper loading bar.

Knowing the size of the full graph in advance is challenging, but at least we can give visual feedback with all the info we can get, as soon as we get it.

@ckane ckane force-pushed the corr-graph-missing-items-issue9198 branch from 76acd0f to f89fbb0 (December 24, 2024 22:59)
@richard-julien richard-julien force-pushed the master branch 5 times, most recently from b414944 to c7f4cb7 (January 13, 2025 09:27)
ckane added 4 commits January 27, 2025 19:34
The 500 upper limit is very low and results in a lot of correlation
graphs displaying no correlations in the chart, while the main page of
the Graph says there exist correlations. In some cases this limit was
just 20.
For correlation graphs, include any cases, groupings, or reports as
correlations, regardless of the current container type
This limit was still at 20, resulting in empty graphs when correlations
were expected.
@ckane ckane force-pushed the corr-graph-missing-items-issue9198 branch from f89fbb0 to ebcd2af (January 28, 2025 00:34)