[Bug] Crawler reads binary content of ePub #1153

Open
rafalzawadzki opened this issue Feb 9, 2025 · 0 comments
Labels
bug (Something isn't working)

Comments

@rafalzawadzki

Describe the Bug
When crawling a website, Firecrawl also follows links to binary files such as ePub and scrapes them as if they were pages. Because Firecrawl does not support these formats, the crawl results contain garbage (the raw binary content of the files).

To Reproduce
Steps to reproduce the issue:

  1. Run /crawl with url: "https://www.gutenberg.org/ebooks/100"
  2. Call /crawl/{jobId} and retrieve all pages
  3. Observe that results include content from unparsed files
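
For step 3, a rough way to flag affected pages in the results is to check whether the scraped text looks like decoded binary data. The sketch below is only a heuristic and is not part of Firecrawl: the function name `looks_binary` and the 10% threshold are assumptions made for illustration. It relies on the fact that ePub files are ZIP archives, which start with the bytes `PK\x03\x04`, and on the observation that binary data decoded as text tends to contain many control characters or U+FFFD replacement characters.

```python
def looks_binary(text: str, threshold: float = 0.10) -> bool:
    """Heuristic (hypothetical helper): True if text resembles decoded binary data."""
    if not text:
        return False
    # ePub is a ZIP container; ZIP local file headers begin with "PK\x03\x04".
    if text.startswith("PK\x03\x04"):
        return True
    # Count control characters (except common whitespace) and U+FFFD,
    # the replacement character produced by lossy decoding.
    suspicious = sum(
        1
        for ch in text
        if ch == "\ufffd" or (ord(ch) < 32 and ch not in "\t\n\r")
    )
    # The threshold is an arbitrary assumption; tune it for real data.
    return suspicious / len(text) > threshold
```

Applying `looks_binary` to the text of each returned page would give a list of suspect entries to inspect by hand.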

Expected Behavior
I would expect either an option to exclude files from crawling or proper support for various file formats.
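
Until such an option exists, one possible client-side workaround is to drop results whose URL path looks like a file download. This is a minimal sketch under stated assumptions, not Firecrawl behavior: the extension list, the helper names `is_binary_url` and `filter_pages`, and the `example.com` URLs are all hypothetical. The pattern also matches suffixed variants such as `.epub3.images`, since some sites append extra segments after the extension.

```python
import re
from urllib.parse import urlparse

# Hypothetical list of extensions to treat as unsupported binary formats.
BINARY_EXTENSIONS = ("epub", "mobi", "pdf", "zip")

# Match ".epub", ".pdf", etc. in the URL path, as long as the extension
# is not immediately followed by another letter (so ".zipcode" is ignored).
_BINARY_PATTERN = re.compile(
    r"\.(?:%s)(?![a-z])" % "|".join(BINARY_EXTENSIONS),
    re.IGNORECASE,
)


def is_binary_url(url: str) -> bool:
    """True if the URL path contains one of the binary extensions."""
    path = urlparse(url).path  # ignore query string and fragment
    return bool(_BINARY_PATTERN.search(path))


def filter_pages(urls: list[str]) -> list[str]:
    """Keep only URLs that do not look like binary file downloads."""
    return [u for u in urls if not is_binary_url(u)]


if __name__ == "__main__":
    sample = [
        "https://example.com/books/100",
        "https://example.com/books/100.epub",
        "https://example.com/books/100.epub3.images",
    ]
    print(filter_pages(sample))
```

A URL-based filter is cheap but incomplete: a file served from an extensionless URL would slip through, so checking the response's Content-Type header inside the crawler itself would be the more robust fix.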

Screenshots
Image

Environment (please complete the following information):

  • OS: Any
  • Firecrawl Version: 1.16.0
  • Node.js Version: 22
rafalzawadzki added the bug label on Feb 9, 2025