Releases · databricks/lilac
v0.1.22
High-level
- Excluding a tag is now an option in the searchbox, enabling a workflow where you keep that filter on and progressively tag new data for removal.
Features
- Add dataset.transform(), where the entire input is passed as an iterable, by @dsmilkov in #897
- Add support for input paths to dataset.map (see the sketch after this list) by @nsthorat in #882
- Improve ergonomics of map, relaxing the exact requirement of kwargs={row, job_id} by @nsthorat in #883
- Add a second option in searchbox dropdown to exclude a tag by @brilee in #889
- Add rendering of string spans that were derived from a map with input path by @dsmilkov in #888
- Make schema in signals optional by @dsmilkov in #895
- Add string filters by @brilee in #892
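Both dataset.transform() and the input-path support in dataset.map are API-level changes, so a minimal Python sketch follows. It is hedged: only the names dataset.transform / dataset.map, the relaxed kwargs, and the idea of an input path come from these notes; the dataset name and the exact parameter names (input_path, output_path) are assumptions for illustration.

```python
import lilac as ll

dataset = ll.get_dataset('local', 'my_dataset')

# With the relaxed kwargs requirement, a map function no longer has to
# declare exactly (row, job_id); a single-argument function is accepted.
def lowercase(text):
  return text.lower()

# Hypothetical call: read from the 'text' column and write the result to a
# new column. The parameter names input_path/output_path are assumptions.
dataset.map(lowercase, input_path='text', output_path='text_lower')

# dataset.transform() receives the entire input as one iterable, which
# allows whole-dataset operations such as a running total of lengths.
def running_length(texts):
  total = 0
  for text in texts:
    total += len(text)
    yield total

dataset.transform(running_length, input_path='text', output_path='cum_length')
```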
Bug fixes
- Fix a few issues with batching, prefetching, and searches. by @nsthorat in #881
- Upgrade duckdb to 0.9.2, fixing a crash in a dask process with fetch_df_chunk. by @nsthorat in #884
- Fix UI bugs with span rendering of maps. by @nsthorat in #894
- Fix span resolving for map outputs by @dsmilkov in #886
- Prefer existing embedding in embedding retrieval function by @brilee in #890
- Allow lilac to run tasks outside a running event loop. by @nsthorat in #899
Other Changes
- Pass explicit schema during jsonl -> parquet conversion by @dsmilkov in #885
- Rename lilac.lilac_span to lilac.span by @dsmilkov in #887 (see the sketch after this list)
- Make the tags & namespaces in the dataset panel expandable. by @nsthorat in #893
- Fix trailing error with tests. by @nsthorat in #901
- Make the tag expandables serializable in the URL for sharing. by @nsthorat in #898
- Add the navigation store to the URL hash. by @nsthorat in #896
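Since the lilac.lilac_span to lilac.span rename touches user code, a tiny migration sketch is included; the (start, end) argument form shown is an assumption, not taken from these notes.

```python
import lilac as ll

# Before v0.1.22 the span helper was exposed as ll.lilac_span:
# span = ll.lilac_span(0, 5)

# From v0.1.22 on it is ll.span (argument form assumed for illustration):
span = ll.span(0, 5)
```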
Full Changelog: v0.1.21...v0.1.22
v0.1.21
Features
- Signal computations are now cached. If a signal fails halfway through, it resumes where it left off.
- Source loading is much faster, up to 40x for some sources (e.g. HuggingFace).
- Map dtype is now supported for parquet sources.
Details
- Add jsonl intermediate caching to signals. Introduce a central spot for this cache abstraction. by @nsthorat in #858
- Rename fast_process to load_to_parquet by @brilee in #862
- Implement fast_process for parquet sources by @brilee in #860
- Implement CSV direct to parquet by @brilee in #863
- Implement fast json source by @brilee in #865
- Add map<key, value> dtype. No support in the UI yet. by @dsmilkov in #870
- Implement fast processing for huggingface datasets by @brilee in #869
Bug Fixes & Other Changes
- add development docs on profiling by @brilee in #861
- Add docs for settings and compare mode by @dsmilkov in #859
- Add a nest_under field to dataset.map() (see the sketch after this list). by @nsthorat in #866
- Avoid computing stats for every single field on page load by @dsmilkov in #873
- Fix a sample_size yaml bug by @dsmilkov in #874
- UI fixes for expanding long rows. by @nsthorat in #875
- Fix small bug with compute signal / concepts and filtering by valid dtypes. by @nsthorat in #877
- Add support for map field in the schema and UI by @dsmilkov in #878
- Fix a bug with previewing and comparing on repeated values. by @nsthorat in #879
- Allow custom signals to work with dask processes. by @nsthorat in #880
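For the nest_under field added to dataset.map(), here is a hedged sketch of the intended shape: writing the map output as a sub-field of an existing column rather than as a new top-level column. Everything except the nest_under name (which these notes introduce) is assumed.

```python
import lilac as ll

dataset = ll.get_dataset('local', 'my_dataset')

# At this release the map function still takes the (row, job_id) kwargs
# mentioned elsewhere in these notes; the row is assumed to be the item dict.
def num_chars(row, job_id):
  return len(row['text'])

# Hypothetical usage: nest the computed field under the 'text' column so it
# appears as text.num_chars rather than as a new top-level column.
# output_path is an assumed parameter name.
dataset.map(num_chars, output_path='num_chars', nest_under='text')
```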
Full Changelog: v0.1.20...v0.1.21
v0.1.20
v0.1.19
v0.1.18
What's Changed
- Add Single Item as a view type with pagination by @dsmilkov in #846
- Add monaco and enable column-level diffing. by @nsthorat in #845
- Add parallelism to dataset.map with dask. by @nsthorat in #847
- Upgrade Cohere embeddings to v3-light by @brilee in #833
- Integrate Presidio into PII detection by @brilee in #839
- Simplify the UI for choosing media fields by @nsthorat in #844
Other Changes
- Fix the build_docs.sh and watch_docs.sh scripts to use the latest version of Lilac by @dsmilkov in #829
- Add backend support for sampling jsonl files by @brilee in #826
- Fix the HF deploy script for windows. by @nsthorat in #831
- Fix the flaky hdbscan test by setting a UMAP random_state from the unit test. by @nsthorat in #832
- Remove redundant dataset_cache call by @brilee in #835
- Invalidate the query after the redirect to avoid 500 errors from deleted dataset. by @nsthorat in #836
- Fix dataset uploading on windows. by @nsthorat in #837
- OpenAI Azure connector by @dechantoine in #838
- Expose hdbscan in the docs by @dsmilkov in #840
- Add a query type to SemanticSimilaritySignal and SemanticSearch: 'question' | 'document' by @nsthorat in #841
- Fix missing token in hf upload by @brilee in #842
- Add debouncing to file watcher recompilation by @brilee in #843
- Fix bug where missing keys in the filter constraint would raise KeyError by @brilee in #849
- Pass the job_id to the dataset.map map_fn (see the sketch after this list). by @nsthorat in #848
- Add unit tests for num_jobs=-1 by @nsthorat in #850
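Several items above relate to dataset.map running in parallel with dask: the map function now receives job_id, and num_jobs=-1 is covered by tests. A hedged sketch of how those pieces might fit together follows; parameter names other than job_id and num_jobs are assumptions for illustration.

```python
import lilac as ll

dataset = ll.get_dataset('local', 'my_dataset')

def tag_row(row, job_id):
  # job_id identifies which parallel dask job processed this row; it is the
  # value these notes say is now passed to the map_fn.
  return {'num_chars': len(row['text']), 'job_id': job_id}

# num_jobs=-1 (as exercised by the new unit tests) would run the map across
# all available dask workers; output_path is an assumed parameter name.
dataset.map(tag_row, output_path='row_meta', num_jobs=-1)
```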
Full Changelog: v0.1.17...v0.1.18
v0.1.17
What's Changed
Other Changes
- Simplify the lilac_deployer, add some links to make it easier. by @nsthorat in #817
- Add UI for dataset settings to edit tags of a dataset. by @nsthorat in #824
- Parquet source: When pseudo_shuffle=True, limit the number of shards we read from by @dsmilkov in #827 (see the sketch below)
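A hedged sketch of what the pseudo_shuffle option might look like when configuring a Parquet source: pseudo_shuffle comes from the note above, while the class and field names around it follow Lilac's usual source-config pattern and are otherwise assumptions.

```python
import lilac as ll

# Hypothetical dataset config using the Parquet source. With
# pseudo_shuffle=True, sampling reads from a limited number of shards
# instead of scanning every file, which speeds up loading large datasets.
config = ll.DatasetConfig(
  namespace='local',
  name='parquet_sample',
  source=ll.ParquetSource(
    filepaths=['gs://my-bucket/data/*.parquet'],  # field name assumed
    pseudo_shuffle=True,
  ),
)
dataset = ll.create_dataset(config)
```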
Full Changelog: v0.1.16...v0.1.17
v0.1.16
v0.1.14
Features
- Add a "not exist" filter when clicking "N/A" in the histogram by @dsmilkov in #809
- Add a Lilac deployer that lets you deploy a dataset + Lilac from a Streamlit UI. by @nsthorat in #812
Full Changelog: v0.1.13...v0.1.14
v0.1.13
What's Changed
- PaLM gcp connector by @dechantoine in #793
- Add JSONL caching for dataset.map(). by @nsthorat in #808
New Contributors
- @dechantoine made their first contribution in #793
Full Changelog: v0.1.12...v0.1.13
v0.1.12
What's Changed
Other Changes
- Sleep for 2 seconds after publishing tags in the publish pip script. by @nsthorat in #799
- Add the video to readme and website. by @nsthorat in #800
- Fix the deploy_website script. by @nsthorat in #803
- Hardcode sentence_transformer batch size to 1024 for optimal length-sorting/padding. by @brilee in #804
- Fix NaT bug by @brilee in #806
Full Changelog: v0.1.11...v0.1.12