[Self-Host] Build Failure in html-transformer Module Due to Type Mismatch (*mut u8 vs *mut i8) When Using Docker #1103

Open
hacksman opened this issue Jan 28, 2025 · 3 comments

@hacksman

Describe the Issue
When self-hosting Firecrawl with Docker, the build fails in the html-transformer module with type-mismatch errors. CString::into_raw() returns *mut c_char, which is *mut u8 on aarch64 Linux (the Docker build target on Apple Silicon), while the module's functions hardcode a *mut i8 return type.
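
Root cause in one sketch (hypothetical example function, not Firecrawl code): c_char is a platform-dependent alias, i8 on x86_64 but u8 on aarch64, so returning *mut c_char compiles on both targets, unlike a signature hardcoded to *mut i8:

use std::ffi::CString;
use std::os::raw::c_char;

#[no_mangle]
pub unsafe extern "C" fn example() -> *mut c_char {
    // into_raw() returns *mut c_char, which matches on every platform.
    CString::new("ok").unwrap().into_raw()
}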

To Reproduce
Steps to reproduce the issue:

  1. Clone the Firecrawl repository: git clone https://github.com/xxx/Firecrawl.git
  2. Navigate to the project directory: cd Firecrawl
  3. Run the Docker build: docker-compose up --build
  4. The build process fails with the following errors:
#0 60.75    Compiling tinystr v0.7.6
#0 60.95    Compiling icu_locid v1.5.0
#0 61.09    Compiling phf_macros v0.8.0
#0 61.64    Compiling phf v0.10.1
#0 61.81    Compiling icu_provider v1.5.0
#0 61.99    Compiling phf v0.8.0
#0 62.16    Compiling icu_collections v1.5.0
#0 62.94    Compiling icu_locid_transform v1.5.0
#0 63.07    Compiling string_cache_codegen v0.5.2
#0 63.49    Compiling phf_codegen v0.10.0
#0 63.59    Compiling icu_properties v1.5.1
#0 63.66    Compiling markup5ever v0.11.0
#0 64.17    Compiling selectors v0.24.0
#0 64.37    Compiling selectors v0.22.0
#0 65.58    Compiling string_cache v0.8.7
#0 65.83    Compiling icu_normalizer v1.5.0
#0 66.56    Compiling idna_adapter v1.2.0
#0 66.68    Compiling thiserror-impl v2.0.11
#0 66.76    Compiling thin-slice v0.1.1
#0 68.00    Compiling idna v1.0.3
#0 68.23    Compiling hashbrown v0.15.2
#0 68.40    Compiling form_urlencoded v1.2.1
#0 68.59    Compiling encoding_rs v0.8.35
#0 68.75    Compiling mime v0.3.17
#0 68.91    Compiling ryu v1.0.18
#0 68.96    Compiling bitflags v2.8.0
#0 69.24    Compiling kuchikiki v0.8.2
#0 69.29    Compiling url v2.5.4
#0 70.49    Compiling lol_html v2.2.0
#0 72.49    Compiling html-transformer v0.1.0 (/app/sharedLibs/html-transformer)
#0 72.62 error[E0308]: mismatched types
#0 72.62   --> src/lib.rs:33:5
#0 72.62    |
#0 72.62 13 | pub unsafe extern "C" fn extract_links(html: *const libc::c_char) -> *mut i8 {
#0 72.62    |                                                                      ------- expected `*mut i8` because of return type
#0 72.62 ...
#0 72.62 33 |     CString::new(serde_json::ser::to_string(&out).unwrap()).unwrap().into_raw()
#0 72.62    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `*mut i8`, found `*mut u8`
#0 72.62    |
#0 72.62    = note: expected raw pointer `*mut i8`
#0 72.62               found raw pointer `*mut u8`
#0 72.62
#0 72.63 error[E0308]: mismatched types
#0 72.63    --> src/lib.rs:153:5
#0 72.63     |
#0 72.63 57  | pub unsafe extern "C" fn extract_metadata(html: *const libc::c_char) -> *mut i8 {
#0 72.63     |                                                                         ------- expected `*mut i8` because of return type
#0 72.63 ...
#0 72.63 153 |     CString::new(serde_json::ser::to_string(&out).unwrap()).unwrap().into_raw()
#0 72.63     |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `*mut i8`, found `*mut u8`
#0 72.63     |
#0 72.63     = note: expected raw pointer `*mut i8`
#0 72.63                found raw pointer `*mut u8`
#0 72.63
#0 72.65 error[E0308]: mismatched types
#0 72.65    --> src/lib.rs:341:20
#0 72.65     |
#0 72.65 337 | pub unsafe extern "C" fn transform_html(opts: *const libc::c_char) -> *mut i8 {
#0 72.65     |                                                                       ------- expected `*mut i8` because of return type
#0 72.65 ...
#0 72.65 341 |             return CString::new("RUSTFC:ERROR").unwrap().into_raw();
#0 72.65     |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `*mut i8`, found `*mut u8`
#0 72.65     |
#0 72.65     = note: expected raw pointer `*mut i8`
#0 72.65                found raw pointer `*mut u8`
#0 72.65
#0 72.65 error[E0308]: mismatched types
#0 72.65    --> src/lib.rs:350:5
#0 72.65     |
#0 72.65 337 | pub unsafe extern "C" fn transform_html(opts: *const libc::c_char) -> *mut i8 {
#0 72.65     |                                                                       ------- expected `*mut i8` because of return type
#0 72.65 ...
#0 72.65 350 |     CString::new(out).unwrap().into_raw()
#0 72.65     |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `*mut i8`, found `*mut u8`
#0 72.65     |
#0 72.65     = note: expected raw pointer `*mut i8`
#0 72.65                found raw pointer `*mut u8`
#0 72.65
#0 72.65 error[E0308]: mismatched types
#0 72.65    --> src/lib.rs:359:37
#0 72.65     |
#0 72.65 359 |     drop(unsafe { CString::from_raw(ptr) })
#0 72.65     |                   ----------------- ^^^ expected `*mut u8`, found `*mut i8`
#0 72.65     |                   |
#0 72.65     |                   arguments to this function are incorrect
#0 72.65     |
#0 72.65     = note: expected raw pointer `*mut u8`
#0 72.65                found raw pointer `*mut i8`
#0 72.65 note: associated function defined here
#0 72.65    --> /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/alloc/src/ffi/c_str.rs:396:19
#0 72.65
#0 72.67 For more information about this error, try `rustc --explain E0308`.
#0 72.69 error: could not compile `html-transformer` (lib) due to 5 previous errors
------
failed to solve: executor failed running [/bin/sh -c cd /app/sharedLibs/html-transformer &&     cargo build --release &&     chmod +x target/release/libhtml_transformer.so]: exit code: 101

Expected Behavior
The build process should complete successfully, producing the required services so that Firecrawl runs without issues.

Environment (please complete the following information):

  • OS: macOS Sequoia Version 15.1.1 (M1)
  • Firecrawl Version: v1.4.1
  • Node.js Version: v20.18.1
  • Docker Version (if applicable): Docker version 20.10.21, build baeda1f

Configuration

# ===== Required ENVS ======
NUM_WORKERS_PER_QUEUE=8
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379 #for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379 #for self-hosting using docker, use redis://redis:6379. For running locally, use redis://localhost:6379
# PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/html
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/scrape
# PLAYWRIGHT_MICROSERVICE_URL=http://localhost:3000/scrape
# PLAYWRIGHT_MICROSERVICE_URL=http://playwright:3000/scrape


## To turn on DB authentication, you need to set up supabase.
USE_DB_AUTHENTICATION=false

# ===== Optional ENVS ======

# # SearchApi key. Head to https://searchapi.com/ to get your API key
# SEARCHAPI_API_KEY=
# # SearchApi engine, defaults to google. Available options: google, bing, baidu, google_news, etc. Head to https://searchapi.com/ to explore more engines
# SEARCHAPI_ENGINE=

# # Supabase Setup (used to support DB authentication, advanced logging, etc.)
# SUPABASE_ANON_TOKEN=
# SUPABASE_URL=
# SUPABASE_SERVICE_TOKEN=

# # Other Optionals
# # use if you've set up authentication and want to test with a real API key
# TEST_API_KEY=
# # set if you'd like to test the scraping rate limit
# RATE_LIMIT_TEST_API_KEY_SCRAPE=
# # set if you'd like to test the crawling rate limit
# RATE_LIMIT_TEST_API_KEY_CRAWL=
# # set if you'd like to use ScrapingBee to handle JS blocking
# SCRAPING_BEE_API_KEY=
# # add for LLM-dependent features (image alt generation, etc.)
# OPENAI_API_KEY=
# BULL_AUTH_KEY=@
# # set if you have a llamaparse key you'd like to use to parse pdfs
# LLAMAPARSE_API_KEY=
# # set if you'd like to send slack server health status messages
# SLACK_WEBHOOK_URL=
# # set if you'd like to send posthog events like job logs
# POSTHOG_API_KEY=
# # set if you'd like to send posthog events like job logs
# POSTHOG_HOST=

# STRIPE_PRICE_ID_STANDARD=
# STRIPE_PRICE_ID_SCALE=
# STRIPE_PRICE_ID_STARTER=
# STRIPE_PRICE_ID_HOBBY=
# STRIPE_PRICE_ID_HOBBY_YEARLY=
# STRIPE_PRICE_ID_STANDARD_NEW=
# STRIPE_PRICE_ID_STANDARD_NEW_YEARLY=
# STRIPE_PRICE_ID_GROWTH=
# STRIPE_PRICE_ID_GROWTH_YEARLY=

# # set if you'd like to use the fire engine closed beta
# FIRE_ENGINE_BETA_URL=

# # Proxy Settings for Playwright (alternatively, you can use a proxy service like Oxylabs, which rotates IPs for you on every request)
# PROXY_SERVER=
# PROXY_USERNAME=
# PROXY_PASSWORD=
# # set if you'd like to block media requests to save proxy bandwidth
# BLOCK_MEDIA=

# # Set this to the URL of your webhook when using the self-hosted version of FireCrawl
# SELF_HOSTED_WEBHOOK_URL=

# # Resend API Key for transactional emails
# RESEND_API_KEY=

# LOGGING_LEVEL determines the verbosity of logs that the system will output.
# Available levels are:
# NONE - No logs will be output.
# ERROR - For logging error messages that indicate a failure in a specific operation.
# WARN - For logging potentially harmful situations that are not necessarily errors.
# INFO - For logging informational messages that highlight the progress of the application.
# DEBUG - For logging detailed information on the flow through the system, primarily used for debugging.
# TRACE - For logging more detailed information than the DEBUG level.
# Set LOGGING_LEVEL to one of the above options to control logging output.
LOGGING_LEVEL=INFO

@sebapoole

sebapoole commented Feb 2, 2025

@hacksman the below changes to html-transformer/src/lib.rs resolved the issue for me (the return types hardcoded as *mut i8 are replaced with the platform-dependent *mut libc::c_char, casting where needed):

use std::{collections::HashMap, ffi::{CStr, CString}};

use kuchikiki::{parse_html, traits::TendrilSink};
use serde::Deserialize;
use serde_json::Value;
use url::Url;
use libc;

/// Extracts links from HTML
/// 
/// # Safety
/// Input options must be a C HTML string. Output will be a JSON string array. Output string must be freed with free_string.
#[no_mangle]
pub unsafe extern "C" fn extract_links(html: *const libc::c_char) -> *mut libc::c_char {
    let html = unsafe { CStr::from_ptr(html) }.to_str().unwrap();

    let document = parse_html().one(html);

    let mut out: Vec<String> = Vec::new();

    let anchors: Vec<_> = document.select("a[href]").unwrap().collect();
    for anchor in anchors {
        let mut href = anchor.attributes.borrow().get("href").unwrap().to_string();
        
        if href.starts_with("http:/") && !href.starts_with("http://") {
            href = format!("http://{}", &href[6..]);
        } else if href.starts_with("https:/") && !href.starts_with("https://") {
            href = format!("https://{}", &href[7..]);
        }

        out.push(href);
    }

    CString::new(serde_json::ser::to_string(&out).unwrap()).unwrap().into_raw() as *mut libc::c_char
}

macro_rules! insert_meta_name {
    ($out:ident, $document:ident, $metaName:expr, $outName:expr) => {
        if let Some(x) = $document.select(&format!("meta[name=\"{}\"]", $metaName)).unwrap().next().and_then(|description| description.attributes.borrow().get("content").map(|x| x.to_string())) {
            $out.insert(($outName).to_string(), Value::String(x));
        }
    };
}

macro_rules! insert_meta_property {
    ($out:ident, $document:ident, $metaName:expr, $outName:expr) => {
        if let Some(x) = $document.select(&format!("meta[property=\"{}\"]", $metaName)).unwrap().next().and_then(|description| description.attributes.borrow().get("content").map(|x| x.to_string())) {
            $out.insert(($outName).to_string(), Value::String(x));
        }
    };
}

/// Extracts metadata from HTML
/// 
/// # Safety
/// Input options must be a C HTML string. Output will be a JSON object. Output string must be freed with free_string.
#[no_mangle]
pub unsafe extern "C" fn extract_metadata(html: *const libc::c_char) -> *mut libc::c_char {
    let html = unsafe { CStr::from_ptr(html) }.to_str().unwrap();

    let document = parse_html().one(html);
    let mut out = HashMap::<String, Value>::new();

    if let Some(title) = document.select("title").unwrap().next() {
        out.insert("title".to_string(), Value::String(title.text_contents()));
    }
    // insert_meta_name!(out, document, "description", "description");

    if let Some(favicon_link) = document.select("link[rel=\"icon\"]").unwrap().next()
        .and_then(|x| x.attributes.borrow().get("href").map(|x| x.to_string()))
        .or_else(|| document.select("link[rel*=\"icon\"]").unwrap().next()
            .and_then(|x| x.attributes.borrow().get("href").map(|x| x.to_string()))) {
        out.insert("favicon".to_string(), Value::String(favicon_link));
    }

    if let Some(lang) = document.select("html[lang]").unwrap().next().and_then(|x| x.attributes.borrow().get("lang").map(|x| x.to_string())) {
        out.insert("language".to_string(), Value::String(lang));
    }

    // insert_meta_name!(out, document, "keywords", "keywords");
    // insert_meta_name!(out, document, "robots", "robots");
    insert_meta_property!(out, document, "og:title", "ogTitle");
    insert_meta_property!(out, document, "og:description", "ogDescription");
    insert_meta_property!(out, document, "og:url", "ogUrl");
    insert_meta_property!(out, document, "og:image", "ogImage");
    insert_meta_property!(out, document, "og:audio", "ogAudio");
    insert_meta_property!(out, document, "og:determiner", "ogDeterminer");
    insert_meta_property!(out, document, "og:locale", "ogLocale");

    for meta in document.select("meta[property=\"og:locale:alternate\"]").unwrap() {
        let attrs = meta.attributes.borrow();

        if let Some(content) = attrs.get("content") {
            if let Some(v) = out.get_mut("og:locale:alternate") {
                match v {
                    Value::Array(x) => {
                        x.push(Value::String(content.to_string()));
                    },
                    _ => unreachable!(),
                }
            } else {
                out.insert("og:locale:alternate".to_string(), Value::Array(vec! [Value::String(content.to_string())]));
            }
        }
    }

    insert_meta_property!(out, document, "og:site_name", "ogSiteName");
    insert_meta_property!(out, document, "og:video", "ogVideo");
    insert_meta_name!(out, document, "article:section", "articleSection");
    insert_meta_name!(out, document, "article:tag", "articleTag");
    insert_meta_property!(out, document, "article:published_time", "publishedTime");
    insert_meta_property!(out, document, "article:modified_time", "modifiedTime");
    insert_meta_name!(out, document, "dcterms.keywords", "dcTermsKeywords");
    insert_meta_name!(out, document, "dc.description", "dcDescription");
    insert_meta_name!(out, document, "dc.subject", "dcSubject");
    insert_meta_name!(out, document, "dcterms.subject", "dcTermsSubject");
    insert_meta_name!(out, document, "dcterms.audience", "dcTermsAudience");
    insert_meta_name!(out, document, "dc.type", "dcType");
    insert_meta_name!(out, document, "dcterms.type", "dcTermsType");
    insert_meta_name!(out, document, "dc.date", "dcDate");
    insert_meta_name!(out, document, "dc.date.created", "dcDateCreated");
    insert_meta_name!(out, document, "dcterms.created", "dcTermsCreated");

    for meta in document.select("meta").unwrap() {
        let meta = meta.as_node().as_element().unwrap();
        let attrs = meta.attributes.borrow();

        if let Some(name) = attrs.get("name").or_else(|| attrs.get("property")) {
            if let Some(content) = attrs.get("content") {
                if let Some(v) = out.get(name) {
                    match v {
                        Value::String(_) => {
                            if name != "title" { // preserve title tag in metadata
                                out.insert(name.to_string(), Value::Array(vec! [v.clone(), Value::String(content.to_string())]));
                            }
                        },
                        Value::Array(_) => {
                            match out.get_mut(name) {
                                Some(Value::Array(x)) => {
                                    x.push(Value::String(content.to_string()));
                                },
                                _ => unreachable!(),
                            }
                        },
                        _ => unreachable!(),
                    }
                } else {
                    out.insert(name.to_string(), Value::String(content.to_string()));
                }
            }
        }
    }

    CString::new(serde_json::ser::to_string(&out).unwrap()).unwrap().into_raw() as *mut libc::c_char
}

const EXCLUDE_NON_MAIN_TAGS: [&str; 41] = [
    "header",
    "footer",
    "nav",
    "aside",
    ".header",
    ".top",
    ".navbar",
    "#header",
    ".footer",
    ".bottom",
    "#footer",
    ".sidebar",
    ".side",
    ".aside",
    "#sidebar",
    ".modal",
    ".popup",
    "#modal",
    ".overlay",
    ".ad",
    ".ads",
    ".advert",
    "#ad",
    ".lang-selector",
    ".language",
    "#language-selector",
    ".social",
    ".social-media",
    ".social-links",
    "#social",
    ".menu",
    ".navigation",
    "#nav",
    ".breadcrumbs",
    "#breadcrumbs",
    ".share",
    "#share",
    ".widget",
    "#widget",
    ".cookie",
    "#cookie",
];

const FORCE_INCLUDE_MAIN_TAGS: [&str; 1] = [
    "#main"
];

#[derive(Deserialize)]
struct TranformHTMLOptions {
    html: String,
    url: String,
    include_tags: Vec<String>,
    exclude_tags: Vec<String>,
    only_main_content: bool,
}

struct ImageSource {
    url: String,
    size: i32,
    is_x: bool,
}

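/// Core transform: applies include/exclude tag filters; strips head, meta,
/// noscript, style, and script elements; optionally prunes non-main-content
/// tags; flattens srcset to src; and absolutizes img src / a href against opts.url.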
fn _transform_html_inner(opts: TranformHTMLOptions) -> Result<String, ()> {
    let mut document = parse_html().one(opts.html);
    
    if !opts.include_tags.is_empty() {
        let new_document = parse_html().one("<div></div>");
        let root = new_document.select_first("div")?;

        for x in opts.include_tags.iter() {
            let matching_nodes: Vec<_> = document.select(x)?.collect();
            for tag in matching_nodes {
                root.as_node().append(tag.as_node().clone());
            }
        }

        document = new_document;
    }

    while let Ok(x) = document.select_first("head") {
        x.as_node().detach();
    }

    while let Ok(x) = document.select_first("meta") {
        x.as_node().detach();
    }

    while let Ok(x) = document.select_first("noscript") {
        x.as_node().detach();
    }

    while let Ok(x) = document.select_first("style") {
        x.as_node().detach();
    }

    while let Ok(x) = document.select_first("script") {
        x.as_node().detach();
    }

    for x in opts.exclude_tags.iter() {
        // TODO: implement weird version
        while let Ok(x) = document.select_first(x) {
            x.as_node().detach();
        }
    }

    if opts.only_main_content {
        for x in EXCLUDE_NON_MAIN_TAGS.iter() {
            let x: Vec<_> = document.select(x)?.collect();
            for tag in x {
                if !FORCE_INCLUDE_MAIN_TAGS.iter().any(|x| tag.as_node().select(x).is_ok_and(|mut x| x.next().is_some())) {
                    tag.as_node().detach();
                }
            }
        }
    }

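    // Rewrite img[srcset] to a plain src: parse each srcset candidate, treating
    // "Nx" density and "Nw" width descriptors as comparable sizes (default "1x"),
    // and keep the URL of the largest one.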
    let srcset_images: Vec<_> = document.select("img[srcset]")?.collect();
    for img in srcset_images {
        let mut sizes: Vec<ImageSource> = img.attributes.borrow().get("srcset").ok_or(())?.split(",").filter_map(|x| {
            let tok: Vec<&str> = x.trim().split(" ").collect();
            let tok_1 = if tok.len() > 1 && !tok[1].is_empty() {
                tok[1]
            } else {
                "1x"
            };
            if let Ok(parsed_size) = tok_1[..tok_1.len()-1].parse() {
                Some(ImageSource {
                    url: tok[0].to_string(),
                    size: parsed_size,
                    is_x: tok_1.ends_with("x")
                })
            } else {
                None
            }
        }).collect();

        if sizes.iter().all(|x| x.is_x) {
            if let Some(src) = img.attributes.borrow().get("src").map(|x| x.to_string()) {
                sizes.push(ImageSource {
                    url: src,
                    size: 1,
                    is_x: true,
                });
            }
        }

        sizes.sort_by(|a, b| b.size.cmp(&a.size));

        if let Some(biggest) = sizes.first() {
            img.attributes.borrow_mut().insert("src", biggest.url.clone());
        }
    }

    let url = Url::parse(&opts.url).map_err(|_| ())?;
    
    let src_images: Vec<_> = document.select("img[src]")?.collect();
    for img in src_images {
        let old = img.attributes.borrow().get("src").map(|x| x.to_string()).ok_or(())?;
        if let Ok(new) = url.join(&old) {
            img.attributes.borrow_mut().insert("src", new.to_string());            
        }
    }

    let href_anchors: Vec<_> = document.select("a[href]")?.collect();
    for anchor in href_anchors {
        let old = anchor.attributes.borrow().get("href").map(|x| x.to_string()).ok_or(())?;
        if let Ok(new) = url.join(&old) {
            anchor.attributes.borrow_mut().insert("href", new.to_string());            
        }
    }

    Ok(document.to_string())
}

/// Transforms rawHtml to html (formerly removeUnwantedElements)
/// 
/// # Safety
/// Input options must be a C JSON string. Output will be an HTML string. Output string must be freed with free_string.
#[no_mangle]
pub unsafe extern "C" fn transform_html(opts: *const libc::c_char) -> *mut libc::c_char {
    let opts: TranformHTMLOptions = match unsafe { CStr::from_ptr(opts) }.to_str().map_err(|_| ()).and_then(|x| serde_json::de::from_str(x).map_err(|_| ())) {
        Ok(x) => x,
        Err(_) => {
            return CString::new("RUSTFC:ERROR").unwrap().into_raw() as *mut libc::c_char;
        }
    };

    let out = match _transform_html_inner(opts) {
        Ok(x) => x,
        Err(_) => "RUSTFC:ERROR".to_string(),
    };

    CString::new(out).unwrap().into_raw()
}

/// Frees a string allocated in Rust-land.
/// 
/// # Safety
/// ptr must be a non-freed string pointer returned by Rust code.
#[no_mangle]
pub unsafe extern "C" fn free_string(ptr: *mut libc::c_char) {
    if !ptr.is_null() {
        // No cast needed: `ptr` is already the platform's c_char pointer type,
        // which is what CString::from_raw expects on every target.
        drop(unsafe { CString::from_raw(ptr) })
    }
}
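
For anyone verifying the fix, here is a minimal smoke test; this is a sketch rather than Firecrawl code, and it assumes extract_links and free_string are in scope (e.g. compiled into the same crate as a test or bin target):

use std::ffi::{CStr, CString};

fn main() {
    let html = CString::new("<a href=\"https://example.com\">example</a>").unwrap();
    unsafe {
        // Round-trip across the FFI boundary: extract the links, copy the JSON
        // result out, then release the Rust-allocated string with free_string.
        let ptr = extract_links(html.as_ptr());
        let json = CStr::from_ptr(ptr).to_str().unwrap().to_owned();
        assert_eq!(json, "[\"https://example.com\"]");
        free_string(ptr);
    }
}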

@anuragphadke

Thx @sebapoole
Can confirm that the above change fixed it for me on an M4 Mac mini. Can we merge this PR into main so others can benefit?

@dexterchan

I got a similar issue today, building on a MacBook M4.
11.11 error[E0308]: mismatched types
11.11    --> src/lib.rs:379:37
11.11     |
11.11 379 |     drop(unsafe { CString::from_raw(ptr) })
11.11     |                   ----------------- ^^^ expected `*mut u8`, found `*mut i8`
11.11     |                   |
11.11     |                   arguments to this function are incorrect
11.11     |
11.11     = note: expected raw pointer `*mut u8`
11.11                found raw pointer `*mut i8`
11.11 note: associated function defined here
11.11    --> /rustc/e71f9a9a98b0faf423844bf0ba7438f29dc27d58/library/alloc/src/ffi/c_str.rs:396:19
11.11
11.11 For more information about this error, try `rustc --explain E0308`.
