
feat(web): add input schema to improve web operator #819

Merged
chuang8511 merged 12 commits into main from chunhao/ins-6739-web-operator-improve on Nov 18, 2024

Conversation

chuang8511 (Member) commented Nov 6, 2024

Because

  • we want to filter URLs

This commit

  • adds a filter function to the web operator
  • adds a termination system for the context
  • fixes a compogen bug

Note

  • we updated compogen to add the blank line back in the template, which is why so many more files changed.


Several earlier review threads on pkg/component/operator/web/v0 (config/tasks.json, crawl_website.go, helper.go) were marked as resolved.
}
}
}()

<-ctx.Done()
Collaborator

Is this redundant with the previous <-ctx.Done()?

Member Author

We need this to block the main execution.

Collaborator

Yeah but you can skip the go func() in the previous block and run the for / select in the main thread, right? Otherwise it looks like you're dispatching work in order to wait for it immediately.

Member Author

Oh, do you mean we actually don't need another goroutine for this block?
So <-ctx.Done() in the main thread is not necessary.

		inactivityTimer := time.NewTimer(2 * time.Second)
		defer inactivityTimer.Stop()

		for {
			select {
			case <-pageUpdateCh:
				inactivityTimer.Reset(2 * time.Second)
			// If no new pages for 2 seconds, cancel the context
			case <-inactivityTimer.C:
				cancel()
				return
			// If the context is done, we should return
			case <-ctx.Done():
				return
			}
		}

Collaborator

Almost what I meant, yes. In this case you'll need to break from the for loop instead of returning. But this whole block might not be necessary, depending on the thread about c.Wait()

Member Author

depending on the thread about c.Wait()

Yes, I will try it out later! Thanks for carefully reviewing!

Member Author

depending on the thread about c.Wait()

It does not work, so I modified the code here.

Two more review threads on pkg/component/operator/web/v0/crawl_website.go were marked as resolved.
@@ -148,6 +148,11 @@ func (e *execution) CrawlWebsite(input *structpb.Struct) (*structpb.Struct, erro
r.Headers.Set("User-Agent", randomString())
})

// colly.Wait() does not terminate the program. So, we need a system to terminate the program when there is no collector.
// We use a channel to notify the main goroutine that a new page has been scraped.
// When there is no new page for 2 seconds, we cancel the context.
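(For context, a minimal sketch of the notification mechanism these comments describe. The pageUpdateCh name follows the snippets later in this thread; the OnScraped hook is an assumption, not necessarily the PR's actual code.)

	// Sketch: signal the watchdog loop whenever a page finishes scraping.
	pageUpdateCh := make(chan struct{}, 1)

	c.OnScraped(func(r *colly.Response) {
		select {
		case pageUpdateCh <- struct{}{}: // wake the inactivity watchdog
		default: // a notification is already pending; don't block the callback
		}
	})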
Collaborator

Silly question: is there a way to know there aren't remaining URLs to scrape (e.g. cancelling the context after Wait finishes)? Even if there's a sweet spot, waiting for any amount of time is always a compromise between being inefficient (waiting when we could return) and risking data loss (there's new data but it's taking longer than the threshold).

Member Author

Originally, I expected colly to finish the job when there are no remaining URLs.

But I could not find a way to close colly when there are no callback jobs.
So I decided to use this approach to handle the case where the output length hasn't reached max-k but there are no URLs left.
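(For illustration only, a hypothetical sketch of such a max-k cut-off; maxK, the scraped counter, and the OnResponse hook are assumptions, not the PR's actual code. sync/atomic would need to be imported.)

	var scraped int64
	c.OnResponse(func(r *colly.Response) {
		// Cancel the crawl once maxK pages have been collected.
		if atomic.AddInt64(&scraped, 1) >= int64(maxK) {
			cancel() // unblocks anything waiting on ctx.Done()
		}
	})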

Collaborator

Something that isn't clear to me is what you mean by

does not terminate the program. So, we need a system to terminate the program when there is no collector.

I'm not familiar with colly, so what I'm proposing might not make sense (in which case, I'm just trying to understand the code). Colly is a popular enough project, so my intuition tells me there has to be a simpler way to finish the execution when all the websites have been crawled. Also, I'm assuming that c.Wait() does work and that it halts the execution until all the sites have been scraped.

From the assumption I just mentioned, what I see you're trying to accomplish is waiting for the first of:

  • All the sites have been scraped. c.Wait() will signal this.
  • We've reached the 2 minute timeout. ctx.Done() will signal this.
  • We've reached the limit of pages. ctx.Done() will also signal this, as we cancel the context when we find this condition.

Therefore, we need to wait for either c.Wait() or ctx.Done(). I think we can avoid timing the duration between 2 completed pages and select between ctx.Done() and a channel that is closed after c.Wait() ends:

	scrapeDone := make(chan struct{})

	go func() {
		defer close(scrapeDone)
		_ = c.Visit(inputStruct.URL)
		c.Wait()
	}()

	select {
	case <-scrapeDone:
		// No more pages to scrape.
	case <-ctx.Done():
		// Timeout or page limit reached
	}

Member Author

All the sites have been scraped. c.Wait() will signal this.

Yes! That is the problem. I expected c.Wait() to finish when there are no more callbacks (jobs).
However, when I ran an end-to-end test, I noticed it did not close this goroutine.
If c.Wait() can close it, we won't need another channel (if I understand it correctly).

Collaborator

I noticed it did not close this go-routine.

I don't fully understand this. How were you checking that the goroutine had ended? The moment you use go func(), the main thread stops caring about the function you're calling, so it will never know when c.Wait() finished. That's why in my snippet I added a channel that's closed at the end of the spawned function. By listening to that channel, the main thread is signaled when the wait group unblocks.

Collaborator

Also, isn't it simpler to use channels here than mutexes?

I am not quite sure what it specifically means in which part. We do use channels to signal the stop process here, right?

What I mean is that you're defining variables outside of the goroutines and using mutexes to make sure you modify them atomically. To me, a much more natural pattern is using channels to communicate the information outside of the routines. There's this article and also the Go proverb:

Channels orchestrate; mutexes serialize.
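For illustration, a minimal sketch of that pattern: a single goroutine owns the result slice and receives pages over a channel, so no mutex is needed. The pageCh/done names and the string payload are hypothetical, not the PR's actual types.

	pageCh := make(chan string) // scraped page contents (hypothetical payload)
	done := make(chan []string) // hands the final slice back to the main goroutine

	go func() {
		var pages []string
		for p := range pageCh { // only this goroutine touches the slice
			pages = append(pages, p)
		}
		done <- pages
	}()

	c.OnScraped(func(r *colly.Response) {
		pageCh <- string(r.Body) // callbacks send instead of locking
	})

	_ = c.Visit(inputStruct.URL)
	c.Wait()
	close(pageCh)   // no more sends; the collector goroutine finishes
	pages := <-done // safe to read: all sends happened before the close
	_ = pages

This way the happens-before relationship comes from the channel itself rather than from a lock.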

Member Author

What I mean is that you're defining variables outside of the goroutines and using mutexes to make sure you modify them atomically.

I got what you mean here. You mean the output pages, right?
I will address this point in another PR and take care of this part in future development.

Adding this, we can check the 3 scenarios work (in your sample, we'd need to port that to the component code):

Let's go back to the current code (PR), not the sample one.
How about I do the following?

I will add back the scrapeDone channel and keep the "2-second part" in the codebase.
That way we can handle 4 scenarios (see the sketch after this list):

  1. No more pages to scrape: c.Wait() returns and the goroutine closes scrapeDone. The program finishes before the timeout.
  2. Max pages scraped: c.OnResponse cancels the context / closes scrapeDone. The program finishes before the timeout.
  3. Max pages haven't been collected before the timeout: the context is canceled and the program finishes at the timeout.
  4. Max pages haven't been collected before the timeout: the context is canceled and the program finishes when there has been no new valid data for 2 seconds.

The reason we need the 4th one is what I mentioned here.
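For illustration, a rough sketch of how those four scenarios could be wired together. This is a hypothetical composition of the snippets above, not the PR's actual code; it assumes ctx, cancel, c, inputStruct, and pageUpdateCh are in scope as in the earlier snippets, with time imported.

	scrapeDone := make(chan struct{})

	go func() {
		// Scenario 1: c.Wait() returns when nothing is left to scrape.
		defer close(scrapeDone)
		_ = c.Visit(inputStruct.URL)
		c.Wait()
	}()

	inactivityTimer := time.NewTimer(2 * time.Second)
	defer inactivityTimer.Stop()

	for {
		select {
		case <-scrapeDone:
			return // scenario 1 (and 2, if the max-k hook also closes scrapeDone)
		case <-pageUpdateCh:
			inactivityTimer.Reset(2 * time.Second) // pages are still arriving
		case <-inactivityTimer.C:
			cancel() // scenario 4: no new valid data for 2 seconds
			return
		case <-ctx.Done():
			return // scenarios 2 and 3: page limit or timeout canceled the context
		}
	}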

Collaborator

Agreed, make sure you document the 4th in the code 🤝 I updated the sample to use channels.

Member Author

Added it back.

Member Author

@jvallesm
Could you help me do a final check?
If there are no problems, I'd like to merge it and release it in this sprint.

@chuang8511 chuang8511 changed the title feat: add input schema to improve web operator feat(web): add input schema to improve web operator Nov 11, 2024
chuang8511 (Member Author)

@jvallesm
Thanks for reviewing.
I modified and replied to all of them.
Please take a look when you have time!

@chuang8511 chuang8511 force-pushed the chunhao/ins-6739-web-operator-improve branch from b07a167 to 9d4cd61 Compare November 11, 2024 19:07

chuang8511 (Member Author)

@jvallesm I modified it again. Please take a look! Thank you!

@chuang8511 chuang8511 force-pushed the chunhao/ins-6739-web-operator-improve branch from 4a9055a to e3a186f Compare November 15, 2024 09:33

codecov bot commented Nov 15, 2024

Codecov Report

Attention: Patch coverage is 25.00000% with 45 lines in your changes missing coverage. Please review.

Project coverage is 20.50%. Comparing base (7e5d3de) to head (e3a186f).
Report is 20 commits behind head on main.

Files with missing lines                        Patch %   Lines
pkg/component/operator/web/v0/crawl_website.go  13.46%    43 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #819      +/-   ##
==========================================
+ Coverage   20.01%   20.50%   +0.48%     
==========================================
  Files         354      359       +5     
  Lines       74750    75198     +448     
==========================================
+ Hits        14963    15416     +453     
+ Misses      57571    57484      -87     
- Partials     2216     2298      +82     
Flag        Coverage Δ
unittests   20.50% <25.00%> (+0.48%) ⬆️

Flags with carried forward coverage won't be shown.


@chuang8511 chuang8511 force-pushed the chunhao/ins-6739-web-operator-improve branch from e3a186f to cdfcada Compare November 18, 2024 15:11
@chuang8511 chuang8511 merged commit f7e1fe9 into main Nov 18, 2024
12 checks passed
@chuang8511 chuang8511 deleted the chunhao/ins-6739-web-operator-improve branch November 18, 2024 17:38