Commit 215c706

perf(encoding): add html size buffer allocation streaming
j-mendez committed Dec 26, 2023
1 parent cbb4cc9 commit 215c706
Showing 8 changed files with 83 additions and 63 deletions.
8 changes: 4 additions & 4 deletions Cargo.lock

Some generated files are not rendered by default.

4 changes: 2 additions & 2 deletions examples/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider_examples"
version = "1.80.19"
version = "1.80.20"
authors = ["madeindjs <[email protected]>", "j-mendez <[email protected]>"]
description = "Multithreaded web crawler written in Rust."
repository = "https://github.com/spider-rs/spider"
@@ -22,7 +22,7 @@ htr = "0.5.27"
flexbuffers = "2.0.0"

[dependencies.spider]
version = "1.80.19"
version = "1.80.20"
path = "../spider"
features = ["serde"]

2 changes: 0 additions & 2 deletions examples/encoding.rs
@@ -22,6 +22,4 @@ async fn main() {
    });

    website.crawl().await;
-
-    println!("Links found {:?}", website.get_links().len());
}
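The example file itself is mostly collapsed above. As a point of reference, here is a minimal sketch of how the `get_html_encoded` getter touched by this commit could be driven from a crawl, assuming the `sync` and `encoding` feature flags and the broadcast-style `subscribe` API described in the spider README; the capacity argument, the `Option` return, and the target URL are assumptions, not taken from this diff:

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");

    // Subscribe before crawling so pages stream in as they are fetched.
    let mut rx = website.subscribe(16).expect("requires the `sync` feature");

    let handle = tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            // Decode the fetched body with an explicit encoding label.
            let html = page.get_html_encoded("SHIFT_JIS");
            println!("decoded {} bytes", html.len());
        }
    });

    website.crawl().await;
    handle.abort();
}
```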
2 changes: 1 addition & 1 deletion spider/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider"
version = "1.80.19"
version = "1.80.20"
authors = ["madeindjs <[email protected]>", "j-mendez <[email protected]>"]
description = "The fastest web crawler written in Rust."
repository = "https://github.com/spider-rs/spider"
20 changes: 10 additions & 10 deletions spider/README.md
@@ -16,7 +16,7 @@ This is a basic async example crawling a web page, add spider to your `Cargo.tom

```toml
[dependencies]
spider = "1.80.19"
spider = "1.80.20"
```

And then the code:
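The README's code block at this point is collapsed in the diff. A minimal sketch of the basic crawl it refers to, assuming the `Website::new`, `crawl`, and `get_links` calls that appear elsewhere in this repository (the target URL is a placeholder):

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    website.crawl().await;

    // Print every unique link discovered during the crawl.
    for link in website.get_links() {
        println!("- {:?}", link);
    }
}
```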
@@ -91,7 +91,7 @@ We have a couple optional feature flags. Regex blacklisting, jemaloc backend, gl

```toml
[dependencies]
-spider = { version = "1.80.19", features = ["regex", "ua_generator"] }
+spider = { version = "1.80.20", features = ["regex", "ua_generator"] }
```

1. `ua_generator`: Enables auto generating a random real User-Agent.
@@ -129,7 +129,7 @@ Move processing to a worker, drastically increases performance even if worker is

```toml
[dependencies]
-spider = { version = "1.80.19", features = ["decentralized"] }
+spider = { version = "1.80.20", features = ["decentralized"] }
```

```sh
@@ -149,7 +149,7 @@ Use the subscribe method to get a broadcast channel.

```toml
[dependencies]
-spider = { version = "1.80.19", features = ["sync"] }
+spider = { version = "1.80.20", features = ["sync"] }
```

```rust,no_run
@@ -179,7 +179,7 @@ Allow regex for blacklisting routes

```toml
[dependencies]
-spider = { version = "1.80.19", features = ["regex"] }
+spider = { version = "1.80.20", features = ["regex"] }
```

```rust,no_run
@@ -206,7 +206,7 @@ If you are performing large workloads you may need to control the crawler by ena

```toml
[dependencies]
-spider = { version = "1.80.19", features = ["control"] }
+spider = { version = "1.80.20", features = ["control"] }
```

```rust
@@ -276,7 +276,7 @@ Use cron jobs to run crawls continuously at anytime.

```toml
[dependencies]
-spider = { version = "1.80.19", features = ["sync", "cron"] }
+spider = { version = "1.80.20", features = ["sync", "cron"] }
```

```rust,no_run
@@ -315,7 +315,7 @@ the feature flag [`chrome_intercept`] to possibly speed up request using Network

```toml
[dependencies]
-spider = { version = "1.80.19", features = ["chrome", "chrome_intercept"] }
+spider = { version = "1.80.20", features = ["chrome", "chrome_intercept"] }
```

You can use `website.crawl_concurrent_raw` to perform a crawl without chromium when needed. Use the feature flag `chrome_headed` to enable headful browser usage if needed to debug.
@@ -347,7 +347,7 @@ Enabling HTTP cache can be done with the feature flag [`cache`] or [`cache_mem`]

```toml
[dependencies]
-spider = { version = "1.80.19", features = ["cache"] }
+spider = { version = "1.80.20", features = ["cache"] }
```

You need to set `website.cache` to true to enable as well.
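A short sketch of that cache opt-in, assuming the public `website.cache` field mentioned in the sentence above plus the basic crawl API; the target URL is a placeholder and the cache reuse behavior is not shown in this diff:

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://choosealicense.com");
    // Opt in at runtime; also requires building with the `cache` (or `cache_mem`) feature.
    website.cache = true;

    website.crawl().await;
    // Re-running the same crawl later could then be answered from the HTTP cache
    // (exact reuse semantics are assumed here, not taken from this diff).
}
```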
@@ -378,7 +378,7 @@ Intelligently run crawls using HTTP and JavaScript Rendering when needed. The be

```toml
[dependencies]
-spider = { version = "1.80.19", features = ["smart"] }
+spider = { version = "1.80.20", features = ["smart"] }
```

```rust,no_run
102 changes: 62 additions & 40 deletions spider/src/page.rs
@@ -400,55 +400,77 @@ impl Page {
    /// Html getter for getting the content with proper encoding. Pass in a proper encoding label like SHIFT_JIS. This fallsback to get_html without the [encoding] flag enabled.
    #[cfg(feature = "encoding")]
    pub fn get_html_encoded(&self, label: &str) -> String {
+        use encoding_rs::CoderResult;
+
        match self.html.as_ref() {
            Some(html) => match encoding_rs::Encoding::for_label(label.as_bytes()) {
                Some(enc) => {
-                    use encoding_rs::CoderResult;
-                    let mut buffer_bytes = [0u8; 2048];
-                    let buffer: &mut str = std::str::from_utf8_mut(&mut buffer_bytes[..]).unwrap();
-
-                    let mut bytes_in_buffer = 0usize;
-                    let mut output = String::new();
-                    let mut decoder = enc.new_decoder();
-                    let mut total_read_from_current_input = 0usize;
-
-                    loop {
-                        let (result, read, written, _had_errors) = decoder.decode_to_str(
-                            &html[total_read_from_current_input..],
-                            &mut buffer[bytes_in_buffer..],
-                            false,
-                        );
-                        total_read_from_current_input += read;
-                        bytes_in_buffer += written;
-                        match result {
-                            CoderResult::InputEmpty => {
-                                break;
-                            }
-                            CoderResult::OutputFull => {
-                                output.push_str(&buffer[..bytes_in_buffer]);
-                                bytes_in_buffer = 0usize;
-                                continue;
-                            }
-                        }
-                    }
-
-                    loop {
-                        let (result, _, written, _had_errors) =
-                            decoder.decode_to_str(b"", &mut buffer[bytes_in_buffer..], true);
-                        bytes_in_buffer += written;
-                        output.push_str(&buffer[..bytes_in_buffer]);
-                        bytes_in_buffer = 0usize;
-                        match result {
-                            CoderResult::InputEmpty => {
-                                break;
-                            }
-                            CoderResult::OutputFull => {
-                                continue;
-                            }
-                        }
-                    }
-
-                    output
+                    let process = |buffer: &mut str| {
+                        let mut bytes_in_buffer = 0usize;
+                        let mut output = String::new();
+                        let mut decoder = enc.new_decoder();
+                        let mut total_read_from_current_input = 0usize;
+
+                        loop {
+                            let (result, read, written, _had_errors) = decoder.decode_to_str(
+                                &html[total_read_from_current_input..],
+                                &mut buffer[bytes_in_buffer..],
+                                false,
+                            );
+                            total_read_from_current_input += read;
+                            bytes_in_buffer += written;
+                            match result {
+                                CoderResult::InputEmpty => {
+                                    break;
+                                }
+                                CoderResult::OutputFull => {
+                                    output.push_str(&buffer[..bytes_in_buffer]);
+                                    bytes_in_buffer = 0usize;
+                                    continue;
+                                }
+                            }
+                        }
+
+                        loop {
+                            let (result, _, written, _had_errors) =
+                                decoder.decode_to_str(b"", &mut buffer[bytes_in_buffer..], true);
+                            bytes_in_buffer += written;
+                            output.push_str(&buffer[..bytes_in_buffer]);
+                            bytes_in_buffer = 0usize;
+                            match result {
+                                CoderResult::InputEmpty => {
+                                    break;
+                                }
+                                CoderResult::OutputFull => {
+                                    continue;
+                                }
+                            }
+                        }
+
+                        output
+                    };
+
+                    match html.len() {
+                        15001..=usize::MAX => {
+                            let mut buffer_bytes = [0u8; 2048];
+                            process(
+                                std::str::from_utf8_mut(&mut buffer_bytes[..]).unwrap_or_default(),
+                            )
+                        }
+                        1000..=15000 => {
+                            let mut buffer_bytes = [0u8; 1024];
+                            process(
+                                std::str::from_utf8_mut(&mut buffer_bytes[..]).unwrap_or_default(),
+                            )
+                        }
+                        _ => {
+                            let mut buffer_bytes = [0u8; 512];
+                            process(
+                                std::str::from_utf8_mut(&mut buffer_bytes[..]).unwrap_or_default(),
+                            )
+                        }
+                    }
+                    .into()
                }
                _ => Default::default(),
            },
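The hunk above replaces the fixed 2048-byte scratch buffer with one sized from the HTML byte length (2 KB for bodies above ~15 KB, 1 KB for mid-sized bodies, 512 bytes otherwise) and keeps the same streaming `encoding_rs` decode loop. Below is a self-contained sketch of that technique written against `encoding_rs` alone; the function name, tier thresholds, and sample bytes mirror the diff for illustration and are not part of the spider API, and a heap `Vec` stands in for the stack arrays used in the commit:

```rust
use encoding_rs::{CoderResult, Encoding};

/// Decode `bytes` using the encoding named by `label` (e.g. "SHIFT_JIS"),
/// streaming through a scratch buffer sized from the input length.
fn decode_streaming(bytes: &[u8], label: &str) -> String {
    let enc = match Encoding::for_label(label.as_bytes()) {
        Some(enc) => enc,
        None => return String::new(),
    };

    // Size tiers mirroring the commit: bigger inputs get a bigger scratch buffer.
    let mut scratch = vec![
        0u8;
        match bytes.len() {
            15001.. => 2048,
            1000..=15000 => 1024,
            _ => 512,
        }
    ];
    // An all-zero buffer is valid UTF-8, so this cannot fail.
    let buffer: &mut str = std::str::from_utf8_mut(&mut scratch).unwrap();

    let mut decoder = enc.new_decoder();
    let mut output = String::with_capacity(bytes.len());
    let mut read_total = 0usize;

    // Feed the input in chunks; each pass fills at most one scratch buffer.
    loop {
        let (result, read, written, _had_errors) =
            decoder.decode_to_str(&bytes[read_total..], &mut buffer[..], false);
        read_total += read;
        output.push_str(&buffer[..written]);
        if let CoderResult::InputEmpty = result {
            break;
        }
    }

    // Flush any partial sequence the decoder is still holding.
    loop {
        let (result, _, written, _had_errors) =
            decoder.decode_to_str(b"", &mut buffer[..], true);
        output.push_str(&buffer[..written]);
        if let CoderResult::InputEmpty = result {
            break;
        }
    }

    output
}

fn main() {
    // "日本語" encoded as Shift_JIS, decoded back into UTF-8.
    let sjis = [0x93u8, 0xFA, 0x96, 0x7B, 0x8C, 0xEA];
    println!("{}", decode_streaming(&sjis, "SHIFT_JIS"));
}
```

The tiering keeps the scratch allocation proportional to the page size, so small pages stop paying for a 2 KB buffer while large pages make fewer `OutputFull` round-trips.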
4 changes: 2 additions & 2 deletions spider_cli/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider_cli"
version = "1.80.19"
version = "1.80.20"
authors = ["madeindjs <[email protected]>", "j-mendez <[email protected]>"]
description = "The fastest web crawler CLI written in Rust."
repository = "https://github.com/spider-rs/spider"
@@ -26,7 +26,7 @@ quote = "1.0.18"
failure_derive = "0.1.8"

[dependencies.spider]
version = "1.80.19"
version = "1.80.20"
path = "../spider"

[[bin]]
4 changes: 2 additions & 2 deletions spider_worker/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "spider_worker"
version = "1.80.19"
version = "1.80.20"
authors = ["madeindjs <[email protected]>", "j-mendez <[email protected]>"]
description = "The fastest web crawler as a worker or proxy."
repository = "https://github.com/spider-rs/spider"
@@ -22,7 +22,7 @@ lazy_static = "1.4.0"
env_logger = "0.10.0"

[dependencies.spider]
version = "1.80.19"
version = "1.80.20"
path = "../spider"
features = ["serde", "flexbuffers"]

