V0.0.1 packaging #30

Merged Aug 1, 2023 (77 commits). Changes shown from 74 commits.

Commits
559ab5a
add beginnings of a package
Jul 24, 2023
d7633f8
Merge branch 'main' into fix-conflict-packaging
EliahKagan Jul 25, 2023
e8628cf
Merge pull request #28 from EliahKagan/fix-conflict-packaging
zbloss Jul 25, 2023
014a25d
add bz2 compressor
Jul 25, 2023
bddbd4e
refactor get_bits_per_char method to be more verbose
Jul 25, 2023
005edc8
add stringtooshortexception
Jul 25, 2023
4625c56
add distance calculation class
zbloss Jul 26, 2023
332723d
add github actions
zbloss Jul 26, 2023
013ffb3
add pytest to test dependencies
zbloss Jul 26, 2023
d81ef2c
remove unused exception
zbloss Jul 26, 2023
5b40496
add helper function to open files
zbloss Jul 26, 2023
c073764
fix bz2 compressor from being a duplicate lzma compressor
zbloss Jul 26, 2023
74b5d85
add initial tests
zbloss Jul 26, 2023
b17e0b1
modify test-package.yml to only run tests/
zbloss Jul 26, 2023
c27597f
simplify aggregate_strings logic with itertools
zbloss Jul 27, 2023
419d545
add test cases for aggregations
zbloss Jul 27, 2023
14b5c22
remove unused StringTooShortException
zbloss Jul 28, 2023
f7d4d54
formatting
zbloss Jul 28, 2023
83c2eef
add new exceptions for knn_compressor
zbloss Jul 28, 2023
41fd76d
add core KNN Compressor
zbloss Jul 28, 2023
671544d
add utils for test cases
zbloss Jul 28, 2023
2cf38b4
formatting
zbloss Jul 28, 2023
9939ed4
add knn compressor and tests
zbloss Jul 28, 2023
3241ef6
formatting
zbloss Jul 28, 2023
e27c927
don't commit .npy files
zbloss Jul 28, 2023
7721f3d
add torchtext to dev dependencies for examples
zbloss Jul 28, 2023
97f3dc6
add tqdm
zbloss Jul 28, 2023
050ba6a
add sampling_percentage to predict to speed up predictions
zbloss Jul 28, 2023
e952025
add scikit-learn to dev dependencies for examples
zbloss Jul 28, 2023
ca61d8d
add imdb prediction example
zbloss Jul 28, 2023
d9eddb9
add sampling_percentage tests
zbloss Jul 28, 2023
41f58f1
add install and test descriptions
zbloss Jul 28, 2023
db6ae4e
move original codebase to a separate directory
zbloss Jul 28, 2023
d6883ad
Merge branch 'main' into v0.0.1-packaging
zbloss Jul 28, 2023
4234261
bump version to 0.1.0
zbloss Jul 28, 2023
6a05984
update after feedback
zbloss Jul 29, 2023
1ffa682
fix random selection of data
zbloss Jul 29, 2023
5afe9cf
update python version dependency
zbloss Jul 29, 2023
6be71e2
bump back to 3.9 minimum through 3.11
zbloss Jul 29, 2023
8bd8bc3
update usage of Union[list-like] with Sequence
zbloss Jul 29, 2023
25350aa
update pipelines
zbloss Jul 29, 2023
5429bad
update spacing
zbloss Jul 29, 2023
d2855d1
remove object assignment from tests that don't need it.
zbloss Jul 29, 2023
39db00c
added type hinting
zbloss Jul 29, 2023
ed7e4ec
add ag_news example
zbloss Jul 29, 2023
c139ca3
only publish the package one time
zbloss Jul 29, 2023
c812503
remove unused --doctest-modules
zbloss Jul 29, 2023
bf86554
isort + black formatting
zbloss Jul 29, 2023
0424c61
tpyos on respectively
zbloss Jul 29, 2023
27c1692
raise exception instead of returning text.
zbloss Jul 29, 2023
8c2e56f
add return type hint to knn_classifier.predict
zbloss Jul 29, 2023
c2ea22f
rename compressed_x to compressed_input
zbloss Jul 29, 2023
7a3e855
add replace=False to fix training data from resampling
zbloss Jul 30, 2023
9b9ee3a
add several -> None type hints to __init__
zbloss Jul 30, 2023
67d76a5
add --doctest-modules back into the cicd pipeline
zbloss Jul 30, 2023
e444346
ignore examples in pytest
zbloss Jul 30, 2023
aadb104
ignore examples/ in pytest
zbloss Jul 30, 2023
baa72ab
add sample_data method to handle random sampling logic
zbloss Jul 30, 2023
b317492
add new classification report results
zbloss Jul 30, 2023
e2b1955
add correct load_filipino function
zbloss Jul 30, 2023
a18b56a
black + isort
zbloss Jul 30, 2023
cef8c8d
add hyperlink in README
zbloss Jul 31, 2023
6b44056
remove pip upgrade
zbloss Jul 31, 2023
b00bae7
simplify utils by only expecting ints
zbloss Jul 31, 2023
3fe355e
add return hint of None
zbloss Jul 31, 2023
9208fa7
update tests
zbloss Jul 31, 2023
0880102
simplify poetry install
zbloss Jul 31, 2023
241595c
move doctest to top of file
zbloss Jul 31, 2023
5a649d0
remove redundant poetry run pytest calls
zbloss Jul 31, 2023
feae49f
undo pipeline change
zbloss Jul 31, 2023
994f9b0
add tool.pytest.ini_options to simplify pytest call
zbloss Jul 31, 2023
22f95f3
remove unused if/main
zbloss Jul 31, 2023
a08a9dd
update numpy
zbloss Jul 31, 2023
a241b36
fix spacing in pipeline
zbloss Jul 31, 2023
9c8b787
update numpy
zbloss Jul 31, 2023
2b53ac3
cleaning up from recent feedback
zbloss Jul 31, 2023
8293be9
simplifying the aggregation line
zbloss Aug 1, 2023
44 changes: 44 additions & 0 deletions .github/workflows/publish-package.yml
@@ -0,0 +1,44 @@
name: Publish npc_gzip package

on:
  release:
    types: [published]

jobs:

  publish:

    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9"]

    steps:
    - uses: actions/checkout@v3

    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v4
      with:
        python-version: ${{ matrix.python-version }}

    - name: Install Poetry
      run: |
        curl -sSL https://install.python-poetry.org | python3 -
        poetry --version

    - name: Run Tests
      run: |
        poetry install
        poetry run pytest --junit-xml=junit/test-results-${{ matrix.python-version }}.xml

    - name: Upload pytest test results
      if: ${{ !cancelled() }} # Upload results even if tests fail.
      uses: actions/upload-artifact@v3
      with:
        name: pytest-results-${{ matrix.python-version }}
        path: junit/test-results-${{ matrix.python-version }}.xml

    - name: Poetry Build & Publish
      run: |
        poetry build
        poetry publish
Comment on lines +41 to +44
@EliahKagan (Collaborator), Jul 31, 2023

This will need to access GitHub Actions secrets to work, because `poetry publish` has to authenticate with PyPI. However that is done, this code will probably have to be expanded slightly to facilitate it. In particular, if environment variables are used, the relevant secrets will have to be placed in them using an `env:` key. I don't know if this needs to be taken care of before the PR is merged or not. If this CI job is only going to be used in the future, and not for initially publishing the package, then there is no need to make any further changes to this file before merging the PR.
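
For illustration, the `env:` approach mentioned above might look roughly like the step below. This is only a sketch: `PYPI_API_TOKEN` is a hypothetical secret name that would have to be created in the repository settings, and `POETRY_PYPI_TOKEN_PYPI` is the environment variable Poetry reads for a PyPI API token.

```yaml
    - name: Poetry Build & Publish
      env:
        # Poetry picks up the PyPI API token from this variable.
        # PYPI_API_TOKEN is an assumed secret name, not one defined in this PR.
        POETRY_PYPI_TOKEN_PYPI: ${{ secrets.PYPI_API_TOKEN }}
      run: |
        poetry build
        poetry publish
```

The token is exposed only to this one step, so the test and upload steps never see it.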

PR author (Collaborator)

Yeah, so this is a new one for me. Usually I set a PyPI API token as an env var. However, when I logged into PyPI I saw a new feature for "trusted publishers". I set up this repo as a trusted publisher, which I think means this GitHub Action will be able to publish the package without storing an API token.

Let's merge with this as is, and if it doesn't work we know how to quickly fix it.

@EliahKagan (Collaborator), Aug 1, 2023

Sounds good! Based on "Using trusted publishing with GitHub Actions", I think using trusted publishing in `publish-package.yml` may require either doing the publish step with the `pypi-publish` action instead of by running `poetry publish`, or taking explicit steps to provide Poetry with the OIDC-based token. (However publishing is achieved, the build could definitely still be done with `poetry build`.) Either way, as you say, it can be fixed pretty quickly if it doesn't work.
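
As a rough sketch of the first option (building with Poetry, publishing with the `pypi-publish` action), the job could look something like the following. It assumes the repository is already registered as a trusted publisher on PyPI; the step names are illustrative and the test steps are omitted for brevity.

```yaml
jobs:
  publish:
    runs-on: ubuntu-latest
    permissions:
      contents: read   # lets actions/checkout read the repository
      id-token: write  # lets the job request the OIDC token used by trusted publishing
    steps:
    - uses: actions/checkout@v3

    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: "3.9"

    - name: Install Poetry
      run: curl -sSL https://install.python-poetry.org | python3 -

    - name: Build
      run: poetry build  # writes the sdist and wheel to dist/

    - name: Publish to PyPI
      # Uploads everything in dist/ by default; no API token is stored.
      uses: pypa/gh-action-pypi-publish@release/v1
```

The `id-token: write` permission is the part that is easy to miss: without it the action cannot obtain the OIDC token.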

It occurs to me that fixing it may be even easier if `workflow_dispatch` is added as a second event trigger. Then, if a release is created on GitHub and publishing fails in a way that requires the workflow to be modified, publishing can be manually reattempted from the Actions tab after the fix, without having to make a second release on GitHub, or delete and remake the tag, etc. (Jobs that have already run can be re-run, but that would use the workflow as it was for the job, rather than the updated workflow. In contrast, triggering the `workflow_dispatch` event from the Actions tab uses the tip of whatever branch it is run for.) If you choose to do this, the change I'm suggesting is:

 on:
   release:
     types: [published]
+  workflow_dispatch:

37 changes: 37 additions & 0 deletions .github/workflows/test-package.yml
@@ -0,0 +1,37 @@
name: Run npc_gzip tests

on: [push, pull_request]

jobs:

  run-tests:

    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11"]

    steps:
    - uses: actions/checkout@v3

    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v4
      with:
        python-version: ${{ matrix.python-version }}

    - name: Install Poetry
      run: |
        curl -sSL https://install.python-poetry.org | python3 -
        poetry --version

    - name: Run Tests
      run: |
        poetry install
        poetry run pytest --junit-xml=junit/test-results-${{ matrix.python-version }}.xml

    - name: Upload pytest test results
      if: ${{ !cancelled() }} # Upload results even if tests fail.
      uses: actions/upload-artifact@v3
      with:
        name: pytest-results-${{ matrix.python-version }}
        path: junit/test-results-${{ matrix.python-version }}.xml
163 changes: 163 additions & 0 deletions .gitignore
@@ -0,0 +1,163 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
notebooks/
junit/
*.npy
42 changes: 37 additions & 5 deletions README.md
@@ -1,8 +1,40 @@
### Code for Paper: “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors
# Code for Paper: “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors

This paper is accepted to Findings of [ACL2023](https://aclanthology.org/2023.findings-acl.426/).

### Require
## Getting Started

This codebase is [available on pypi.org](https://pypi.org/project/npc-gzip) and can be installed via:


```bash

pip install npc-gzip

```

## Usage

See the [examples](./examples/imdb.py) directory for example usage.


## Testing

This package utilizes `poetry` to maintain its dependencies and `pytest` to execute tests. To get started running the tests:

```bash

poetry shell
poetry install
pytest

```

-------------------------

### Original Codebase

#### Require

See `requirements.txt`.

@@ -13,7 +45,7 @@ conda activate npc
pip install -r requirements.txt
```

### Run
#### Run

```
python main_text.py
@@ -36,7 +68,7 @@ By default, this will only use 100 test and training samples per class as a quic

```

### Calculate Accuracy (Optional)
#### Calculate Accuracy (Optional)

If we want to calculate accuracy from recorded distance file <DISTANCE DIR>, use

@@ -45,7 +77,7 @@ python main_text.py --record --score --distance_fn <DISTANCE DIR>
```
to calculate accuracy. Otherwise, the accuracy will be calculated automatically using the command in the last section.

### Use Custom Dataset
#### Use Custom Dataset

You can use your own custom dataset by passing `custom` to `--dataset`; pass the data directory that contains `train.txt` and `test.txt` to `--data_dir`; pass the class number to the `--class_num`.
