Releases: databricks/lilac

v0.1.22

29 Nov 21:22

High-level

  • Excluding a tag is now an option in the UI searchbox, enabling a workflow where you keep that filter on and progressively tag new data for removal.
  • Signals can now be written without defining a schema.

Features

  • Add dataset.transform() where we pass the entire input as iterable by @dsmilkov in #897
  • Add support for input paths to dataset.map. by @nsthorat in #882
  • Improve ergonomics of map, relaxing the exact requirement of kwargs={row, job_id} by @nsthorat in #883
  • Add a second option in searchbox dropdown to exclude a tag by @brilee in #889
  • Add rendering of string spans that were derived from a map with input path by @dsmilkov in #888
  • Make schema in signals optional by @dsmilkov in #895
  • Add string filters by @brilee in #892
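
The relaxed `map` ergonomics in #883 suggest that a map function no longer has to declare exactly `row` and `job_id`. These notes do not show how Lilac does this, so the following is only a minimal sketch of one way a library could support it: inspect the function signature and pass just the keyword arguments it declares. The helper `call_with_supported_kwargs` and both example functions are hypothetical and are not part of the Lilac API.

```python
import inspect
from typing import Any, Callable


def call_with_supported_kwargs(fn: Callable[..., Any], **available: Any) -> Any:
  """Call `fn` with only the keyword arguments it declares (hypothetical helper)."""
  params = inspect.signature(fn).parameters
  # A function that declares **kwargs can accept everything we have.
  if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
    return fn(**available)
  # Otherwise keep only the arguments whose names appear in the signature.
  accepted = {name: value for name, value in available.items() if name in params}
  return fn(**accepted)


def upper_text(row: dict) -> str:
  # Declares only `row`; `job_id` is silently dropped by the helper.
  return row['text'].upper()


def tagged_text(row: dict, job_id: int) -> str:
  # Declares both arguments, so it receives both.
  return f"{job_id}:{row['text']}"
```

With this approach, `call_with_supported_kwargs(upper_text, row={'text': 'hi'}, job_id=3)` and the same call with `tagged_text` both succeed, even though the two functions declare different parameters.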

Bug fixes

  • Fix a few issues with batching, prefetching, and searches. by @nsthorat in #881
  • Upgrade duckdb to 0.9.2, fixing a crash in a dask process with fetch_df_chunk. by @nsthorat in #884
  • Fix UI bugs with span rendering of maps. by @nsthorat in #894
  • Fix span resolving for map outputs by @dsmilkov in #886
  • Prefer existing embedding in embedding retrieval function by @brilee in #890
  • Allow lilac to run tasks outside a running event loop. by @nsthorat in #899

Other Changes

  • Pass explicit schema during jsonl -> parquet conversion by @dsmilkov in #885
  • Rename lilac.lilac_span to lilac.span by @dsmilkov in #887
  • Make the tags & namespaces in the dataset panel expandable. by @nsthorat in #893
  • Fix trailing error with tests. by @nsthorat in #901
  • Make the tag expandables serializable in the URL for sharing. by @nsthorat in #898
  • Add the navigation store to the URL hash. by @nsthorat in #896

Full Changelog: v0.1.21...v0.1.22

v0.1.21

23 Nov 03:04

Features

  • Signal computations are now cached. If a signal fails half-way through, it will be resumed.
  • Source loading is much faster, up to 40x faster for some sources (e.g. HuggingFace)
  • Map dtype is now supported for parquet sources.

Details

  • Add jsonl intermediate caching to signals. Introduce a central spot for this cache abstraction. by @nsthorat in #858
  • Rename fast_process to load_to_parquet by @brilee in #862
  • Implement fast_process for parquet sources by @brilee in #860
  • Implement CSV direct to parquet by @brilee in #863
  • Implement fast json source by @brilee in #865
  • Add map<key, value> dtype. No support in the UI yet. by @dsmilkov in #870
  • Implement fast processing for huggingface datasets by @brilee in #869
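
To illustrate the idea behind the cached, resumable signal computations (#858), here is a minimal sketch of a jsonl-backed cache. It is not Lilac's implementation. The function name `compute_with_jsonl_cache` is hypothetical, and the sketch assumes that items arrive in the same order on every run, that results are JSON-serializable, and that the cache file never ends in a partially written line.

```python
import json
import os
from typing import Any, Callable, Iterable, Iterator


def compute_with_jsonl_cache(
  items: Iterable[Any],
  compute: Callable[[Any], Any],
  cache_path: str,
) -> Iterator[Any]:
  """Yield compute(item) per item, resuming from a jsonl cache (illustrative only)."""
  cached = []
  if os.path.exists(cache_path):
    # Each non-empty line holds the result for one already-processed item.
    with open(cache_path) as f:
      cached = [json.loads(line) for line in f if line.strip()]

  with open(cache_path, 'a') as f:
    for i, item in enumerate(items):
      if i < len(cached):
        # Computed in a previous run: reuse the stored result.
        yield cached[i]
        continue
      result = compute(item)
      # Persist immediately so a later failure does not lose this result.
      f.write(json.dumps(result) + '\n')
      f.flush()
      yield result
```

If `compute` raises partway through, rerunning with the same `cache_path` replays the stored results and calls `compute` only for the items that were not yet processed.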

Bug Fixes & Other Changes

  • add development docs on profiling by @brilee in #861
  • Add docs for settings and compare mode by @dsmilkov in #859
  • Add a nest_under field to dataset.map(). by @nsthorat in #866
  • Avoid computing stats for every single field on page load by @dsmilkov in #873
  • Fix a sample_size yaml bug by @dsmilkov in #874
  • UI fixes for expanding long rows. by @nsthorat in #875
  • Fix small bug with compute signal / concepts and filtering by valid dtypes. by @nsthorat in #877
  • Add support for map field in the schema and UI by @dsmilkov in #878
  • Fix a bug with previewing and comparing on repeated values. by @nsthorat in #879
  • Allow custom signals to work with dask processes. by @nsthorat in #880

Full Changelog: v0.1.20...v0.1.21

v0.1.20

16 Nov 13:13

Bug fixes

  • Small fix with rendering MetadataSearch in the schema view by @dsmilkov in #855
  • Fast dataset load by @brilee in #854
  • Fix a bug with single item mode and monaco diff not updating by @dsmilkov in #856

Full Changelog: v0.1.19...v0.1.20

v0.1.19

15 Nov 16:06

Bug fixes

  • Fix thread bug in hnswlib, which should fix CI python tests by @dsmilkov in #852
  • Fix bugs with the media fields selector where no fields showed up. by @nsthorat in #853

Full Changelog: v0.1.18...v0.1.19

v0.1.18

14 Nov 19:46

What's Changed

  • Add Single Item as a view type with pagination by @dsmilkov in #846
  • Add monaco and enable column-level diffing. by @nsthorat in #845
  • Add parallelism to dataset.map with dask. by @nsthorat in #847
  • Upgrade Cohere embeddings to v3-light by @brilee in #833
  • Integrate Presidio into PII detection by @brilee in #839
  • Simplify the UI for choosing media fields by @nsthorat in #844

Other Changes

  • Fix the build_docs.sh and watch_docs.sh scripts to use the latest version of Lilac by @dsmilkov in #829
  • Add backend support for sampling jsonl files by @brilee in #826
  • Fix the HF deploy script for windows. by @nsthorat in #831
  • Fix the flaky hdbscan test by setting a UMAP random_state from the unit test. by @nsthorat in #832
  • Remove redundant dataset_cache call by @brilee in #835
  • Invalidate the query after the redirect to avoid 500 errors from deleted dataset. by @nsthorat in #836
  • Fix dataset uploading on windows. by @nsthorat in #837
  • OpenAI Azure connector by @dechantoine in #838
  • Expose hdbscan in the docs by @dsmilkov in #840
  • Add a query type to SemanticSimilaritySignal and SemanticSearch: 'question' | 'document' by @nsthorat in #841
  • Fix missing token in hf upload by @brilee in #842
  • Add debouncing to file watcher recompilation by @brilee in #843
  • Fix bug where missing keys in the filter constraint would raise KeyError by @brilee in #849
  • Pass the job_id to the dataset.map map_fn. by @nsthorat in #848
  • Add unit tests for num_jobs=-1 by @nsthorat in #850
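
For readers unfamiliar with the debouncing mentioned in #843, the sketch below shows the general technique: a burst of triggers (for example, many file-change events) collapses into one run after a quiet period. It is a generic illustration built on `threading.Timer`. The `Debouncer` class is hypothetical and is not taken from Lilac's file watcher.

```python
import threading
from typing import Callable, Optional


class Debouncer:
  """Run `fn` once after `trigger` calls have been quiet for `wait_secs` (illustrative only)."""

  def __init__(self, fn: Callable[[], None], wait_secs: float) -> None:
    self._fn = fn
    self._wait_secs = wait_secs
    self._timer: Optional[threading.Timer] = None
    self._lock = threading.Lock()

  def trigger(self) -> None:
    with self._lock:
      # Cancel the pending run, if any, and restart the quiet-period countdown.
      if self._timer is not None:
        self._timer.cancel()
      self._timer = threading.Timer(self._wait_secs, self._fn)
      self._timer.daemon = True
      self._timer.start()
```

Calling `trigger()` several times in quick succession schedules only one call to `fn`, which fires `wait_secs` after the last trigger.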

Full Changelog: v0.1.17...v0.1.18

v0.1.17

07 Nov 15:21

What's Changed

Other Changes

  • Simplify the lilac_deployer, add some links to make it easier. by @nsthorat in #817
  • Add UI for dataset settings to edit tags of a dataset. by @nsthorat in #824
  • Parquet source: When pseudo_shuffle=True, limit the number of shards we read from by @dsmilkov in #827

Full Changelog: v0.1.16...v0.1.17

v0.1.16

03 Nov 13:07

What's Changed

Other Changes

  • Update lilac version in deployer UI. Add tokens to HF API calls. by @nsthorat in #813
  • Update deployer lilac version to 0.1.15. by @nsthorat in #814
  • Pass token to deploy_project_operations. by @nsthorat in #816

Full Changelog: v0.1.14...v0.1.16

v0.1.14

02 Nov 13:37

Features

  • Add "not exist" filter when somebody clicks on "N/A" in the histogram by @dsmilkov in #809
  • Add a Lilac deployer UI that lets you deploy a dataset + Lilac from a streamlit UI. by @nsthorat in #812

Full Changelog: v0.1.13...v0.1.14

v0.1.13

31 Oct 20:49

What's Changed

Bug fixes

  • Fix topk on an indexed repeated field + metadata filter by @dsmilkov in #807

Full Changelog: v0.1.12...v0.1.13

v0.1.12

27 Oct 20:59

What's Changed

Other Changes

  • Sleep for 2 seconds after publishing tags in the publish pip script. by @nsthorat in #799
  • Add the video to readme and website. by @nsthorat in #800
  • Fix the deploy_website script. by @nsthorat in #803
  • Hardcode sentence_transformer batch size to 1024 for optimal length-sorting/padding. by @brilee in #804
  • Fix NaT bug by @brilee in #806

Full Changelog: v0.1.11...v0.1.12