Merge pull request #18 from carlosiborra/main
Refactor PDF Cleaning Tests for Improved Modularity and Error Handling + Add Rust Distribution README
YM162 authored Apr 3, 2024
2 parents 81e7aa8 + b10d0d7 commit 325c3a2
Showing 3 changed files with 156 additions and 35 deletions.
31 changes: 18 additions & 13 deletions README.md
@@ -1,12 +1,9 @@

# Gulag Cleaner


[![Twitter](https://a11ybadges.com/badge?logo=twitter)](https://twitter.com/gulagcleaner)
[![Instagram](https://a11ybadges.com/badge?logo=instagram)](https://www.instagram.com/gulagcleaner/)
[![Ko-fi](https://a11ybadges.com/badge?logo=kofi)](https://ko-fi.com/L3L86VEX9)


Gulag Cleaner is a tool designed to remove advertisements from PDFs, making it easier to read and navigate documents without being disrupted by unwanted ads.

This tool does not just crop the ads out of the PDF. Instead, it extracts the original file without ads by manipulating the internal structure of the PDF, ensuring maximum quality.
@@ -22,6 +19,7 @@ This tool can be used without installation directly from [our website](https://g
# Installation

To install Gulag Cleaner, please [download](https://www.python.org/downloads/) and [install](https://wiki.python.org/moin/BeginnersGuide/Download) Python and then run the following command in your terminal:

```
pip install gulagcleaner
```
@@ -42,11 +40,11 @@ gulagcleaner [-r] [-s] [-n] [-h] [-v] <filename>...

Gulag Cleaner provides several options for its usage:

-> * '-r': Replace the original file with the cleaned version.
-> * '-s': Do not show metadata about cleaned files.
-> * '-n': Force the naive cleaning method.
-> * '-h': Display the help message, providing information on how to use Gulag Cleaner.
-> * '-v': Display the current version of Gulag Cleaner.
+> - '-r': Replace the original file with the cleaned version.
+> - '-s': Do not show metadata about cleaned files.
+> - '-n': Force the naive cleaning method.
+> - '-h': Display the help message, providing information on how to use Gulag Cleaner.
+> - '-v': Display the current version of Gulag Cleaner.
## Code

@@ -58,17 +56,24 @@ from gulagcleaner.clean import clean_pdf_path
return_msg = clean_pdf_path("input.pdf","output.pdf")
```

## Rust Distribution

If you want to use the Rust distribution of Gulag Cleaner, you can find the instructions in the [Rust distribution README.md](gulagcleaner_rs/README.md) file.

# License

Gulag Cleaner is distributed under the GPL-3 license, which means it's open-source and free to use.

# Contributing

We're always looking for ways to improve Gulag Cleaner, and we welcome contributions from the community. If you have ideas for improvements or bug fixes, please feel free to submit a pull request.

## TODO

If you want to help, these are the top priorities right now:

-* Write tests for the package.
-* Add README.md (With code examples) for the rust and JS distributions.
-* Add comments to a lot of the rust code.
-* Optimize the rust code for performance improvements.
-* Add a new "clean_pdf_bytes()" function in python that does not require a file path, just the bytes.
+- Write tests for the package.
+- Add README.md (With code examples) for the rust and JS distributions.
+- Add comments to a lot of the rust code.
+- Optimize the rust code for performance improvements.
+- Add a new "clean_pdf_bytes()" function in python that does not require a file path, just the bytes.
54 changes: 54 additions & 0 deletions gulagcleaner_rs/README.md
@@ -0,0 +1,54 @@
# Gulag Cleaner Rust Distribution

## Setting Up Rust

To incorporate Rust components within Gulag Cleaner, ensure Rust is correctly installed on your system. Follow the installation guide on the [official Rust website](https://www.rust-lang.org/tools/install) for detailed instructions. This includes installing `rustup`, which is the Rust toolchain manager, and the Rust compiler (`rustc`).

## Running Rust Tests

Gulag Cleaner leverages Rust for certain operations, providing performance and safety benefits. To ensure these components work as expected, comprehensive tests are included.

To run the tests:

1. Open a terminal.
2. Navigate to the root directory of Gulag Cleaner.
3. Execute the following command to run all tests:

```bash
cargo test
```

For more detailed test outputs, including print statements from your tests, use:

```bash
cargo test --package gulagcleaner_rs --lib -- tests --nocapture
```

This command targets the specific Rust package (`gulagcleaner_rs`) and enables detailed outputs with `--nocapture`.
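
As a minimal sketch of what `--nocapture` changes: by default the test harness captures the standard output of passing tests, so the `println!` below would only be visible when the flag is passed. The helper `summarize_run` and the values passed to it are hypothetical and are not part of `gulagcleaner_rs`.

```rust
/// Hypothetical helper (not part of gulagcleaner_rs): formats a one-line
/// summary of a cleaning run so a test can print it.
fn summarize_run(input_path: &str, bytes_in: usize, bytes_out: usize) -> String {
    format!(
        "cleaned `{}`: {} bytes in, {} bytes out",
        input_path, bytes_in, bytes_out
    )
}

#[cfg(test)]
mod nocapture_demo {
    use super::summarize_run;

    #[test]
    fn prints_summary() {
        // Captured by default when the test passes; shown when running
        // `cargo test -- --nocapture`.
        println!("{}", summarize_run("example.pdf", 2048, 1024));
    }
}
```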

Note: at the moment these tests only cover reading, cleaning and writing two example PDFs, one from Wuolah and one from Studocu.

## Rust Development Guidelines

To contribute to the Rust portion of Gulag Cleaner, please adhere to the following guidelines:

- **Code Clarity**: Write clear, readable code with meaningful variable names and concise functions.
- **Comments and Documentation**: Add comments explaining complex logic or important decisions. Update the `README.md` with relevant examples and instructions when adding new features or making significant changes.
- **Performance**: Optimize for efficiency. Rust is known for its performance, so ensure your contributions enhance or maintain the current speed and memory usage.
- **Testing**: Write tests for new features or bug fixes if possible. Ensure existing tests pass without modifications unless the changes are intended to update the test behavior.
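
The testing guideline above could be followed with small, focused checks like the sketch below. It assumes that a cleaned document should still begin with the `%PDF-` marker that opens a PDF file; whether every output of the cleaner satisfies this is an assumption, and `looks_like_pdf` is a hypothetical helper rather than an existing function in the crate.

```rust
/// Hypothetical helper: returns true when the bytes start with the `%PDF-`
/// marker that opens a PDF file.
fn looks_like_pdf(bytes: &[u8]) -> bool {
    bytes.starts_with(b"%PDF-")
}

#[cfg(test)]
mod tests {
    use super::looks_like_pdf;

    #[test]
    fn accepts_pdf_header() {
        assert!(looks_like_pdf(b"%PDF-1.7\n..."));
    }

    #[test]
    fn rejects_other_data() {
        assert!(!looks_like_pdf(b"not a pdf"));
        assert!(!looks_like_pdf(b""));
    }
}
```

In a real test the byte slice would come from the cleaner's output instead of a literal.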

## TODO for Rust

If you're looking to contribute, here are some areas that need attention:

- **Writing Tests**: Our test coverage could be improved. Writing additional unit and integration tests for the Rust code is a priority.
- **Documentation**: A detailed README.md needs to be added, including setup instructions, examples of usage, and a description of the functions available.
- **Code Optimization**: There's always room for performance improvements. Profiling and optimizing existing Rust code can significantly impact overall tool performance.

## Contributing

Contributions to the Rust codebase of Gulag Cleaner are highly encouraged. Whether you're fixing bugs, optimizing performance, or adding new features, your input is valued. Follow the project's contribution guidelines and submit pull requests with your changes.

## License

Gulag Cleaner is distributed under the GPL-3 license, which means it's open-source and free to use.
106 changes: 84 additions & 22 deletions gulagcleaner_rs/src/tests.rs
@@ -1,36 +1,98 @@
 use crate::clean::clean_pdf;
 use std::fs;
+use std::time::Instant;

 const OUT_PATH: &str = "example_docs/out";

-/// Creates out folder if missing so tests won't fail
-fn create_out_folder() {
-    fs::create_dir_all(OUT_PATH).unwrap();
+/// Represents configuration for running a test, including the paths for input and output files.
+struct TestConfig {
+    input_path: &'static str,
+    output_filename: &'static str,
 }

-#[test]
-fn test_wuolah() {
-    create_out_folder();
+/// Ensures the output directory exists, creating it if necessary.
+/// This function is invoked before running tests to ensure a location
+/// is available for storing cleaned PDFs.
+fn create_output_directory() {
+    fs::create_dir_all(OUT_PATH).expect("Failed to create output directory");
+}

-    //Load some pdf bytes and clean it
-    let data = std::fs::read("example_docs/wuolah-free-example.pdf").expect(
-        "Missing Wuolah test PDF, please store one in path `example_docs/wuolah-free-example.pdf",
-    );
-    let (clean_pdf, _) = clean_pdf(data, false);
+/// Reads a PDF from the specified path, cleans it, and returns the cleaned PDF data.
+///
+/// # Arguments
+///
+/// * `in_path` - A string slice that holds the path to the input PDF file.
+///
+/// # Returns
+///
+/// A `Result` which is `Ok` with a `Vec<u8>` containing the cleaned PDF data,
+/// or an `Err` with a string describing the error.
+fn read_and_clean_pdf(in_path: &str) -> Result<Vec<u8>, String> {
+    let data =
+        std::fs::read(in_path).map_err(|e| format!("Failed to read `{}`: {}", in_path, e))?;
+    let (clean_file, _) = clean_pdf(data, false);
+    Ok(clean_file)
+}

-    //Stores the clean pdf in the out directory
-    std::fs::write(format!("{}/wuolah_clean.pdf", OUT_PATH), clean_pdf).unwrap();
+/// Writes the cleaned PDF data to a file in the output directory.
+///
+/// # Arguments
+///
+/// * `out_path` - The path where the cleaned PDF will be stored.
+/// * `clean_file` - A vector of bytes representing the cleaned PDF data.
+///
+/// # Returns
+///
+/// A `Result` which is `Ok` if the file was successfully written, or an `Err`
+/// with a string describing the error.
+fn store_pdf(out_path: &str, clean_file: Vec<u8>) -> Result<(), String> {
+    std::fs::write(out_path, clean_file)
+        .map_err(|e| format!("Failed to write `{}`: {}", out_path, e))
 }
-#[test]

-fn test_studocu() {
-    create_out_folder();
+/// Executes a cleaning test using the provided `TestConfig`.
+///
+/// This function orchestrates the test process: creating the output directory,
+/// cleaning the PDF specified in the `TestConfig`, and storing the cleaned PDF
+/// in the output directory. It also measures and prints the duration of the test.
+///
+/// # Arguments
+///
+/// * `test_config` - A reference to the `TestConfig` containing the test parameters.
+fn run_test_for_config(test_config: &TestConfig) {
+    create_output_directory();

+    let start = Instant::now();
+
+    let clean_file = read_and_clean_pdf(test_config.input_path).expect("Failed to clean PDF");
+    store_pdf(
+        &format!("{}/{}", OUT_PATH, test_config.output_filename),
+        clean_file,
+    )
+    .expect("Failed to store PDF");
+
+    let duration = start.elapsed();
+
-    //Stores the clean pdf in the out directory
-    let data = std::fs::read("example_docs/studocu-example.pdf").expect(
-        "Missing Studocu test PDF, please store one in path `example_docs/studocu-example.pdf",
+    println!(
+        "Test for `{}` completed in {:?}",
+        test_config.input_path, duration
     );
-    let (clean_pdf, _) = clean_pdf(data, false);
-    //Print the length of the pdf
-    std::fs::write(format!("{}/studocu_clean.pdf", OUT_PATH), clean_pdf).unwrap();
 }
+
+// Define tests for specific PDF files, utilizing the TestConfig structure.
+
+#[test]
+fn test_wuolah_pdf() {
+    run_test_for_config(&TestConfig {
+        input_path: "example_docs/wuolah-free-example.pdf",
+        output_filename: "wuolah_clean.pdf",
+    });
+}
+
+#[test]
+fn test_studocu_pdf() {
+    run_test_for_config(&TestConfig {
+        input_path: "example_docs/studocu-example.pdf",
+        output_filename: "studocu_clean.pdf",
+    });
+}
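
The refactored helpers return `Result<_, String>`, which also makes the failure path testable. The sketch below illustrates that idea under stated assumptions: `read_pdf_bytes` is a stand-in that mirrors only the `map_err` pattern of `read_and_clean_pdf` (the call to `clean_pdf` is left out so the snippet compiles on its own), and the path is a hypothetical file assumed not to exist.

```rust
use std::fs;

/// Stand-in for `read_and_clean_pdf`: reads the file and maps any I/O error
/// to a descriptive `String`. The cleaning step is omitted in this sketch.
fn read_pdf_bytes(in_path: &str) -> Result<Vec<u8>, String> {
    fs::read(in_path).map_err(|e| format!("Failed to read `{}`: {}", in_path, e))
}

#[cfg(test)]
mod tests {
    use super::read_pdf_bytes;

    #[test]
    fn missing_input_reports_path() {
        // Hypothetical path that is assumed not to exist.
        let err = read_pdf_bytes("example_docs/does-not-exist.pdf").unwrap_err();
        assert!(err.contains("example_docs/does-not-exist.pdf"));
    }
}
```

A check like this would fail with a readable message that names the missing file, rather than a bare panic from `unwrap()`.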
