Modify create_hnsw_index #49

Open
wants to merge 175 commits into
base: master
0278e06
Merge pull request #1 from ahill187/fix-simplicial-set-embedding-bug
ahill187 Feb 20, 2024
54cfd50
create a custom logger
ahill187 Aug 11, 2024
891ae17
replace all print statements with logging statements
ahill187 Aug 11, 2024
119a1af
remove commented print statements
ahill187 Aug 11, 2024
7b87ff2
Merge remote-tracking branch 'upstream/master'
ahill187 Aug 11, 2024
8e2749c
Merge pull request #8 from ahill187/logging
ahill187 Aug 11, 2024
3b770f6
fix code line layout
ahill187 Aug 11, 2024
1a410f5
remove unnecessary ==True and ==False
ahill187 Aug 11, 2024
6901748
remove unnecessary dummy variables
ahill187 Aug 11, 2024
6aaae03
remove unused variables
ahill187 Aug 11, 2024
09b3ca6
clean up comments
ahill187 Aug 11, 2024
42a302e
make strings consistent
ahill187 Aug 11, 2024
6f7a323
make strings consistent
ahill187 Aug 11, 2024
87d968a
remove trailing whitespace
ahill187 Aug 11, 2024
7b87cd2
update gitignore
ahill187 Aug 11, 2024
f7430a1
Merge branch 'code-cleanup' of https://github.com/ahill187/PARC into …
ahill187 Aug 11, 2024
966359f
rearrange run_toobig_PARC signature
ahill187 Aug 11, 2024
fc4e249
Merge pull request #9 from ahill187/code-cleanup
ahill187 Aug 11, 2024
68042ac
move func_mode to utils
ahill187 Aug 11, 2024
8ac7a4e
add test for get_mode
ahill187 Aug 11, 2024
b485cc7
modify get_mode docstring
ahill187 Aug 11, 2024
61eb152
Merge pull request #10 from ahill187/func-mode
ahill187 Aug 11, 2024
dc99e66
rename `X_data`, `X_input`, `data` -> `x_data`
ahill187 Aug 12, 2024
56f4a9b
rename `true_label`, `true_labels` -> `y_data_true`
ahill187 Aug 12, 2024
8c45f3d
rename `PARC_labels`, `labels` -> `y_data_pred`
ahill187 Aug 12, 2024
fe8ecc8
rename `PARC_labels_array` -> `y_data_pred_array`
ahill187 Aug 12, 2024
7e6982d
rename `num_threads` -> `n_threads`
ahill187 Aug 12, 2024
571ce1a
rename `N`, `n_cells`, `n_elements` -> `n_samples`
ahill187 Aug 12, 2024
af15b72
rename `small_pop` -> `small_community_size`
ahill187 Aug 12, 2024
37fcf0f
rename `time_smallpop` -> `small_community_timeout`
ahill187 Aug 12, 2024
cce91b3
rename `time_smallpop_start` -> `time_start`
ahill187 Aug 12, 2024
44f92b5
rename `small_pop_exist` -> `small_community_exists`
ahill187 Aug 12, 2024
58c837d
rename `too_big_factor` -> `large_community_factor`
ahill187 Aug 12, 2024
ae12407
rename `distance` -> `distance_metric`
ahill187 Aug 12, 2024
5c4a59a
rename `dist_std_local` -> `l2_std_factor`
ahill187 Aug 12, 2024
621ab3a
rename `jac_std_global`, `jac_std_toobig` -> `jac_std_factor`
ahill187 Aug 12, 2024
729907f
rename `G` -> `graph`
ahill187 Aug 12, 2024
e928027
rename `G_sim` -> `graph_pruned`
ahill187 Aug 12, 2024
52d83dd
rename `sim_list` -> `similarities`
ahill187 Aug 12, 2024
2964602
rename `sim_list_new` -> `similarities_new`
ahill187 Aug 12, 2024
b5371de
rename `sim_list_array` -> `similarities_array`
ahill187 Aug 12, 2024
bc22ece
rename `edgelist` -> `edges`
ahill187 Aug 12, 2024
307accb
rename `edgelist_copy` -> `edges_copy`
ahill187 Aug 12, 2024
f1576e4
rename `new_edgelist` -> `new_edges`
ahill187 Aug 12, 2024
67b456c
remove edge_list_copy_array
ahill187 Aug 12, 2024
f2552c9
rename `strong_locs` -> `indices_similar`
ahill187 Aug 12, 2024
94bed24
rename `PARC_labels_leiden` -> `node_communities`
ahill187 Aug 12, 2024
2b3c0f2
rename `PARC_labels_leiden_big` -> `node_communities_big`
ahill187 Aug 12, 2024
3dda2d2
rename `set_PARC_labels_leiden` -> `set_node_communities`
ahill187 Aug 12, 2024
20f7d9c
rename `cluster_ii` -> `community_id`
ahill187 Aug 12, 2024
8f8810d
rename `pop_i`, `pop_ii` -> `community_size`
ahill187 Aug 12, 2024
d3e6a54
rename `cluster_i_loc`, `cluster_ii_loc` -> `community_indices`
ahill187 Aug 12, 2024
400bbda
rename `cluster_big_loc` -> `large_community_indices`
ahill187 Aug 12, 2024
4a37121
rename `cluster_big` -> `large_community_id`
ahill187 Aug 12, 2024
ca6a23c
rename `big_pop` -> `large_community_size`
ahill187 Aug 12, 2024
9601904
rename `n_cancer` -> `n_target`
ahill187 Aug 12, 2024
e64a860
rename `cluster_i` -> `community_id`
ahill187 Aug 12, 2024
c66a311
rename `pbmc_labels` -> `negative_labels`
ahill187 Aug 12, 2024
3fb65c4
rename `thp1_labels` -> `positive_labels`
ahill187 Aug 12, 2024
77525e3
rename `onevsall`, `onevsall_val` -> `target`
ahill187 Aug 12, 2024
8c4294b
rename `sources` -> `input_nodes`
ahill187 Aug 12, 2024
a7a7be6
rename `targets` -> `output_nodes`
ahill187 Aug 12, 2024
5a8de18
remove X_data_big
ahill187 Aug 12, 2024
f16c831
Merge pull request #11 from ahill187/variable-renaming
ahill187 Aug 12, 2024
236d927
add docstrings to constructor
ahill187 Aug 12, 2024
0ebd221
change default `keep_all_local_dist` to None
ahill187 Aug 12, 2024
fb559fa
create setter for keep_all_local_dist
ahill187 Aug 12, 2024
321372b
fix bug from last commit
ahill187 Aug 12, 2024
b77b9ec
add partition_type setter
ahill187 Aug 12, 2024
774a59f
add setter for `y_data_true`
ahill187 Aug 12, 2024
699f957
add typehints for constructor
ahill187 Aug 12, 2024
2846751
clean up the Attributes section
ahill187 Aug 12, 2024
ecf2edc
Merge pull request #12 from ahill187/constructor
ahill187 Aug 12, 2024
74534b5
factor out compute_performance_metrics
ahill187 Aug 12, 2024
12cbf72
rename `time_start_total` -> `time_start`
ahill187 Aug 12, 2024
e2edfa9
remove run_subPARC
ahill187 Aug 12, 2024
5db3326
Merge pull request #13 from ahill187/compute-performance-metrics
ahill187 Aug 12, 2024
3a21b6a
rename `run_PARC` -> `run_parc`
ahill187 Aug 12, 2024
af0cf57
simplify find_partition logic
ahill187 Aug 12, 2024
8d5a131
update logs
ahill187 Aug 12, 2024
dab84ba
factor out get_leiden_partition
ahill187 Aug 12, 2024
63083bf
add tests for get_leiden_partition
ahill187 Aug 12, 2024
0fde470
Merge pull request #14 from ahill187/get-leiden-partition
ahill187 Aug 12, 2024
3a9a60a
rename knngraph_full to create_knn_graph
ahill187 Aug 12, 2024
678fa34
clean up create_knn_graph
ahill187 Aug 12, 2024
cb2b01d
rename `k_umap` -> `knn`
ahill187 Aug 12, 2024
d323fa2
set knn as argument
ahill187 Aug 12, 2024
dc35c8e
add tests for create_knn_graph
ahill187 Aug 13, 2024
7ec52fc
rename `graph` -> `csr_array`
ahill187 Aug 13, 2024
65434d8
Merge pull request #15 from ahill187/create-knn-graph
ahill187 Aug 13, 2024
6ba48ef
Rename `make_csrmatrix_noselfloop` -> `prune_local`
ahill187 Aug 13, 2024
7f1a2a3
rename `keep_all_local_dist` -> `do_prune_local`
ahill187 Aug 13, 2024
095a0b3
remove unused `discard_count`
ahill187 Aug 13, 2024
91fe748
remove unnecessary comments
ahill187 Aug 13, 2024
8110841
rename `distlist` -> `distances`
ahill187 Aug 13, 2024
b02b813
add `max_distance`
ahill187 Aug 13, 2024
2bb3e68
rename `row` -> `neighbors`
ahill187 Aug 13, 2024
0c3526d
rename `updated_nn_ind` -> `updated_neighbors`
ahill187 Aug 13, 2024
c7bb875
rename `updated_nn_weights` -> `updated_distances`
ahill187 Aug 13, 2024
54c1b08
rename `ik` -> `index`
ahill187 Aug 13, 2024
a90113e
rename `rowi` -> `community_id`
ahill187 Aug 13, 2024
1ed4b40
rename `csr_graph` -> `csr_array`
ahill187 Aug 13, 2024
0cfc2c2
rename `dist` -> `distance`
ahill187 Aug 13, 2024
73b5551
clean up the `prune_local` function
ahill187 Aug 13, 2024
8b58a53
add `l2_std_factor` as argument to `prune_local`
ahill187 Aug 13, 2024
5bf9097
add tests for `prune_local`
ahill187 Aug 13, 2024
00ad8d1
Merge pull request #16 from ahill187/prune-local
ahill187 Aug 13, 2024
be8a959
add `jac_threshold_type`
ahill187 Aug 13, 2024
4478039
Merge pull request #17 from ahill187/jac-threshold-type
ahill187 Aug 13, 2024
9107b03
create new function `prune_global`
ahill187 Aug 13, 2024
2ce919c
factor out `prune_global` from `run_parc`
ahill187 Aug 13, 2024
f403e0a
factor out `prune_global` from `run_toobig_subPARC`
ahill187 Aug 13, 2024
7943b4e
modify the log messages
ahill187 Aug 13, 2024
a6dc4a7
add tests for `prune_global`
ahill187 Aug 13, 2024
10d0a9a
remove `similarities_new`
ahill187 Aug 13, 2024
42c58ca
remove `new_edges`
ahill187 Aug 13, 2024
6f0fbe1
remove `similarities_array`
ahill187 Aug 13, 2024
cf94246
docstring cleanup
ahill187 Aug 13, 2024
711c70d
Merge pull request #18 from ahill187/prune-global
ahill187 Aug 13, 2024
d6e7440
rename `hnsw` -> `knn_struct`
ahill187 Aug 13, 2024
d794d84
change `x_data` to parameter and removed `big_cluster`
ahill187 Aug 13, 2024
fa98813
rename `num_dims` -> `n_features`
ahill187 Aug 13, 2024
4f8bbc3
move `n_featuers`, `n_samples`, `add_items`
ahill187 Aug 13, 2024
e56a773
add `distance_metric` parameter
ahill187 Aug 13, 2024
42ed8de
move hnswlib.Index call
ahill187 Aug 13, 2024
ae1279e
move init_index
ahill187 Aug 13, 2024
c217d9b
change `n_threads` to parameter
ahill187 Aug 13, 2024
6c332d0
change `M` to parameter
ahill187 Aug 13, 2024
b6fe4c0
convert `ef_construction` to parameter
ahill187 Aug 13, 2024
621d5bb
rename `p` -> `knn_struct`
ahill187 Aug 13, 2024
4079b5d
add `knn` as parameter
ahill187 Aug 13, 2024
6eef35c
add docstrings
ahill187 Aug 13, 2024
6a46286
add `ef_query` as parameter and remove `too_big`
ahill187 Aug 13, 2024
0e4d02b
remove unnecessary comment
ahill187 Aug 13, 2024
dc61bee
update the docstrings
ahill187 Aug 13, 2024
588d0ee
add tests for `make_knn_struct`
ahill187 Aug 13, 2024
f00473a
Merge pull request #19 from ahill187/refactor-knn-struct-2
ahill187 Aug 13, 2024
5e8cec6
fix `small_community_size` bug
ahill187 Aug 13, 2024
6dffd7d
Merge pull request #20 from ahill187/small-community-size
ahill187 Aug 13, 2024
63d5e1f
add tests for run_parc
ahill187 Aug 14, 2024
1e66c5b
add pytest.ini
ahill187 Aug 14, 2024
fe2e2b1
fix & bug
ahill187 Aug 14, 2024
f44e4a5
add `large_community_id`
ahill187 Aug 14, 2024
5390162
update log messages
ahill187 Aug 14, 2024
78175a3
rearrange constructor to match order of attributes
ahill187 Aug 14, 2024
ab0bf8f
add tests_workflow
ahill187 Aug 30, 2024
59b7e2c
fix bugs in tests
ahill187 Aug 30, 2024
d1f960e
cancel tests if one fails
ahill187 Aug 30, 2024
fb3617f
change default logging level to 25
ahill187 Aug 30, 2024
119c664
Merge pull request #21 from ahill187/tests
ahill187 Aug 30, 2024
60b484c
add x_data setter
ahill187 Aug 30, 2024
0a1f3fa
modify the` y_data_true` setter
ahill187 Aug 30, 2024
acc585a
add `y_data_pred` setter
ahill187 Aug 30, 2024
0cf4fd4
allow `x_data` to be passed in as `pd.DataFrame`
ahill187 Aug 30, 2024
4419f67
fix `y_data_true` setter
ahill187 Aug 30, 2024
0c8f322
Merge pull request #22 from ahill187/input-data-types
ahill187 Aug 30, 2024
23510b3
clean up `setup.py`
ahill187 Aug 31, 2024
b72c2f1
add requirements.txt
ahill187 Aug 31, 2024
8862598
Merge pull request #23 from ahill187/requirements-setup
ahill187 Aug 31, 2024
8cf1d9b
rename run_parc -> fit_predict
ahill187 Nov 9, 2024
580c3e8
refactor pop_list -> community_counts
ahill187 Nov 9, 2024
4c89daf
convert logging message to info
ahill187 Nov 11, 2024
01e44ca
add docstrings to fit_predict
ahill187 Nov 11, 2024
296e320
move time_start to top
ahill187 Nov 11, 2024
3854bbf
Merge pull request #25 from ahill187/fit
ahill187 Nov 11, 2024
a824f7b
add save function to PARC
ahill187 Nov 11, 2024
2988e0f
add tests for save model
ahill187 Nov 11, 2024
7fb85cf
add load function
ahill187 Nov 11, 2024
2f3f66b
test load function
ahill187 Nov 11, 2024
4d55f8c
Merge pull request #26 from ahill187/save-model
ahill187 Nov 13, 2024
057df63
rename `create_knn_struct` -> `create_hnsw_index`
ahill187 Nov 13, 2024
8325745
rename `knn_struct` -> `hnsw_index`
ahill187 Nov 13, 2024
100220f
update logging
ahill187 Nov 13, 2024
7c1d4ea
move ef_query settings to create_hnsw_index
ahill187 Nov 13, 2024
3ca6478
update ef_query in run_toobig_subPARC
ahill187 Nov 13, 2024
27 changes: 27 additions & 0 deletions .github/workflows/tests_workflow.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
name: tests_workflow

# execute this workflow automatically when we push to any branch
on: [push]

jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install pytest
      - name: Install PARC
        run: |
          pip install .
      - name: Run Python tests -x
        run: |
          pytest tests/
        continue-on-error: false

concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,3 +1,5 @@
env/
*/__pycache__
*.egg-info
.eggs/
.DS_Store
2 changes: 1 addition & 1 deletion README.md
@@ -76,7 +76,7 @@ plt.scatter(X[:, 0], X[:, 1], c=parc_labels, cmap='rainbow')
plt.show()

# Run umap on the HNSW knngraph already built in PARC (more time and memory efficient for large datasets)
// Parc1.knn_struct = p1.make_knn_struct() // if you choose to visualize before running PARC clustering. then you need to include this line
// Parc1.hnsw_index = p1.create_hnsw_index() // if you choose to visualize before running PARC clustering. then you need to include this line
graph = Parc1.knngraph_full()
X_umap = Parc1.run_umap_hnsw(X, graph)
plt.scatter(X_umap[:, 0], X_umap[:, 1], c=Parc1.labels)
2 changes: 1 addition & 1 deletion docs/Examples.rst
@@ -31,7 +31,7 @@ Examples
plt.show()

# Run umap on the HNSW knngraph already built in PARC (more time and memory efficient for large datasets)
# If you choose to visualize before running PARC clustering. then you need to include this line: Parc1.knn_struct = p1.make_knn_struct()
# If you choose to visualize before running PARC clustering. then you need to include this line: Parc1.hnsw_index = p1.create_hnsw_index()
graph = Parc1.knngraph_full()
X_umap = Parc1.run_umap_hnsw(X, graph)
plt.scatter(X_umap[:, 0], X_umap[:, 1], c=Parc1.labels)
1,377 changes: 954 additions & 423 deletions parc/_parc.py

Large diffs are not rendered by default.

95 changes: 95 additions & 0 deletions parc/logger.py
@@ -0,0 +1,95 @@
"""Custom colored logger.

See:

`StackOverflow <http://stackoverflow.com/a/24956305/408556>`__
`python-colors GitHub <https://gist.github.com/dideler/3814182>`__

"""

import logging
import sys

MIN_LEVEL = logging.DEBUG
MESSAGE = 25
logging.addLevelName(MESSAGE, "MESSAGE")
LOGGING_LEVEL = 25


class LogFilter(logging.Filter):
    """Filters (lets through) all messages with level < LEVEL"""
    def __init__(self, level):
        self.level = level

    def filter(self, record):
        # "<" instead of "<=": since logger.setLevel is inclusive, this should
        # be exclusive
        return record.levelno < self.level


class Logger(logging.Logger):
    def message(self, msg, *args, **kwargs):
        if self.isEnabledFor(MESSAGE):
            self._log(MESSAGE, msg, args, **kwargs)


class ColoredFormatter(logging.Formatter):
    def __init__(self, fmt_prefix, fmt_msg):
        self.use_color = self.supports_color()
        self.fmt_prefix = fmt_prefix
        self.fmt_msg = fmt_msg
        super().__init__(fmt=f"{fmt_prefix} {fmt_msg}")

    def format(self, record: logging.LogRecord) -> str:
        if record.levelno == logging.WARNING:
            if self.use_color:
                fmt = f"\x1b[93m{self.fmt_prefix}\x1b[0m {self.fmt_msg}"
                formatter = logging.Formatter(fmt)
                return formatter.format(record)
            else:
                return super().format(record)
        elif record.levelno == logging.ERROR:
            if self.use_color:
                fmt = f"\x1b[91m{self.fmt_prefix}\x1b[0m {self.fmt_msg}"
                formatter = logging.Formatter(fmt)
                return formatter.format(record)
            else:
                return super().format(record)
        elif record.levelno == 25 or record.levelno == logging.INFO:
            if self.use_color:
                fmt = f"\x1b[96m{self.fmt_prefix}\x1b[0m {self.fmt_msg}"
                formatter = logging.Formatter(fmt)
                return formatter.format(record)
            else:
                return super().format(record)

    def supports_color(self) -> bool:
        """Check if the system supports ANSI color formatting."""
        return hasattr(sys.stdout, "isatty") and sys.stdout.isatty()


def get_logger(module_name, level: int = LOGGING_LEVEL) -> Logger:
    logging.setLoggerClass(Logger)
    stdout_handler = logging.StreamHandler(sys.stdout)
    stderr_handler = logging.StreamHandler(sys.stderr)
    stdout_handler.addFilter(LogFilter(logging.WARNING))
    stdout_handler.setLevel(level)
    if level == 25:
        formatter = ColoredFormatter(
            fmt_prefix="[%(levelname)s]:",
            fmt_msg="%(message)s"
        )
    else:
        formatter = ColoredFormatter(
            fmt_prefix="[%(levelname)s] %(name)s.%(funcName)s (line %(lineno)d):",
            fmt_msg="%(message)s"
        )
    stdout_handler.setFormatter(formatter)
    stderr_handler.setLevel(max(MIN_LEVEL, logging.WARNING))
    # messages lower than WARNING go to stdout
    # messages >= WARNING (and >= STDOUT_LOG_LEVEL) go to stderr
    logger = logging.getLogger(module_name)
    logger.propagate = False
    logger.handlers = [stdout_handler, stderr_handler]
    logger.setLevel(level)
    return logger
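
The routing in `parc/logger.py` (a custom level 25 between INFO and WARNING, low-severity records to one stream, WARNING and above to another) can be tried in isolation. The sketch below is not the PR's code: `BelowLevelFilter` and `build_demo_logger` are hypothetical names, in-memory streams stand in for stdout/stderr, and the color handling is left out. It only uses the standard `logging` module.

```python
import io
import logging

# Custom level between INFO (20) and WARNING (30), mirroring MESSAGE = 25 in the diff.
MESSAGE = 25
logging.addLevelName(MESSAGE, "MESSAGE")


class BelowLevelFilter(logging.Filter):
    """Pass only records strictly below `level` (hypothetical stand-in for LogFilter)."""

    def __init__(self, level: int):
        super().__init__()
        self.level = level

    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno < self.level


def build_demo_logger(name, out_stream, err_stream, level=MESSAGE):
    """Build a logger that splits records between two streams (illustrative helper)."""
    low = logging.StreamHandler(out_stream)
    low.setLevel(level)
    low.addFilter(BelowLevelFilter(logging.WARNING))  # keep WARNING+ off this stream

    high = logging.StreamHandler(err_stream)
    high.setLevel(logging.WARNING)

    formatter = logging.Formatter("[%(levelname)s]: %(message)s")
    low.setFormatter(formatter)
    high.setFormatter(formatter)

    logger = logging.getLogger(name)
    logger.propagate = False
    logger.handlers = [low, high]
    logger.setLevel(level)
    return logger


out, err = io.StringIO(), io.StringIO()
demo = build_demo_logger("parc_demo", out, err)
demo.info("dropped: INFO (20) is below the logger level of 25")
demo.log(MESSAGE, "shown on the low-severity stream")
demo.warning("shown on the high-severity stream")
```

With the default level of 25, the INFO call is discarded by the logger itself, the level-25 record reaches only the first stream, and the warning reaches only the second.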
13 changes: 13 additions & 0 deletions parc/utils.py
@@ -0,0 +1,13 @@
def get_mode(a_list: list[any]) -> any:
    """Get the value which appears most often in the list (the mode).

    If multiple items are maximal, the function returns the first one encountered
    (not necessarily the first one in the list).

    Args:
        a_list: A list with values.

    Returns:
        The most frequent value in the list
    """
    return max(set(a_list), key=a_list.count)
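
For comparison, here is a minimal sketch of a mode function built on `collections.Counter`. It is an alternative, not what the PR implements: `get_mode_counter` is a hypothetical name. It counts in a single pass, and because `Counter.most_common` orders equal counts by first appearance, ties resolve to the value that appears first in the input rather than depending on set iteration order.

```python
from collections import Counter
from typing import Any, Hashable, Sequence


def get_mode_counter(values: Sequence[Hashable]) -> Any:
    """Return the most frequent value; ties go to the value seen first.

    Hypothetical alternative to get_mode, shown for illustration only.
    """
    if not values:
        raise ValueError("values must not be empty")
    # most_common(1) returns a list with one (value, count) pair.
    return Counter(values).most_common(1)[0][0]


mode_simple = get_mode_counter([1, 2, 2, 3])
mode_tie = get_mode_counter(["b", "a", "a", "b"])  # both appear twice; "b" is seen first
```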
4 changes: 4 additions & 0 deletions pytest.ini
@@ -0,0 +1,4 @@
[pytest]
log_cli = true
log_cli_level = 25
addopts = -rP
8 changes: 8 additions & 0 deletions requirements.txt
@@ -0,0 +1,8 @@
pybind11
numpy
scipy
pandas
hnswlib
igraph
leidenalg>=0.7.0
umap-learn
26 changes: 13 additions & 13 deletions setup.py
@@ -1,25 +1,25 @@
import setuptools

from os import path
this_directory = path.abspath(path.dirname(__file__))
with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f:

PARC_DIR = path.abspath(path.dirname(__file__))

with open(path.join(PARC_DIR, "README.md"), encoding="utf-8") as f:
long_description = f.read()

setuptools.setup(
name='parc',
version='0.40', #May 28,2024
packages=['parc',],
license='MIT',
author_email='[email protected]',
url='https://github.com/ShobiStassen/PARC',
setup_requires=['numpy', 'pybind11'],
name="parc",
version="0.40",
packages=["parc"],
license="MIT",
author_email="[email protected]",
url="https://github.com/ShobiStassen/PARC",
setup_requires=["numpy", "pybind11"],
install_requires=[
'pybind11', 'numpy', 'scipy', 'pandas', 'hnswlib', 'igraph',
'leidenalg>=0.7.0', 'umap-learn'
line.strip() for line in open("requirements.txt")
],
extras_require={
"dev": ["pytest", "scikit-learn"]
},
long_description=long_description,
long_description_content_type='text/markdown'
long_description_content_type="text/markdown"
)
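
The new `install_requires` expression reads `requirements.txt` with a bare `open(...)` relative to the current working directory, and would pass blank lines through as empty strings. One possible way to tighten this is sketched below. It is not part of the PR: `read_requirements` is a hypothetical helper, and it assumes the file holds only plain requirement specifiers (no URLs containing `#`).

```python
import tempfile
from pathlib import Path


def read_requirements(path):
    """Parse a requirements file, skipping blank lines and `#` comments.

    Hypothetical helper; assumes plain specifiers such as "leidenalg>=0.7.0".
    """
    requirements = []
    # The context manager closes the file handle deterministically.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            entry = line.split("#", 1)[0].strip()  # drop trailing comments
            if entry:
                requirements.append(entry)
    return requirements


# Demonstration against a temporary file rather than the real requirements.txt.
with tempfile.TemporaryDirectory() as tmp:
    req_file = Path(tmp) / "requirements.txt"
    req_file.write_text(
        "numpy\n\n# clustering\nleidenalg>=0.7.0  # minimum version\n",
        encoding="utf-8",
    )
    parsed = read_requirements(req_file)
```

In `setup.py` such a helper could be called with a path joined onto `PARC_DIR`, so the lookup does not depend on where the install command is run from.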