Releases: databricks/lilac

v0.2.5

19 Jan 18:34

What's Changed

This release is mostly UI bug fixes.

We also added support for remote computation of GTE embeddings via Lilac Garden. If you are interested, please reach out to us.

Bug fixes

  • Always sort by rowid to make db results stable by @dsmilkov in #1086
  • Fix a couple of UI bugs by @dsmilkov in #1088
  • Fix some small UI bugs. by @nsthorat in #1084
  • Ignore folders that don't have a manifest.json and make project config source of truth for dataset listing by @brilee in #1083
  • Fix bug with keyword search highlighting every field. by @nsthorat in #1081
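
As a rough illustration of why a rowid tiebreaker stabilizes results (first fix above), here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for the real database, and the `docs` table and `top_rows` helper are hypothetical, not Lilac's actual code. Without the secondary sort key, rows with equal scores may come back in any order.

```python
import sqlite3

# In-memory stand-in database; the table and column names are made up.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE docs (text TEXT, score REAL)')
conn.executemany(
    'INSERT INTO docs VALUES (?, ?)',
    [('a', 0.5), ('b', 0.9), ('c', 0.5), ('d', 0.9)],
)


def top_rows(limit):
    """Return the top rows by score, breaking ties by rowid."""
    # Ties on `score` are resolved by `rowid`, so repeated calls (and
    # paginated calls) always see the rows in the same order.
    query = 'SELECT rowid, text FROM docs ORDER BY score DESC, rowid ASC LIMIT ?'
    return conn.execute(query, (limit,)).fetchall()


page = top_rows(3)
print(page)
```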

Garden

Other Changes

Full Changelog: v0.2.4...v0.2.5

v0.2.4

17 Jan 16:16

What's Changed

This release is mostly bug fixes and small changes to the upcoming clustering UI.

Clustering

  • Optimize the cluster view page to reduce number of requests by @dsmilkov in #1062
  • Improve the cluster titling and fix a few client-side bugs by @dsmilkov in #1058
  • Add share-gpt specific format selectors. by @nsthorat in #1060
  • Cluster spec deployer by @brilee in #1063
  • Improve the cluster/pivot UI by @dsmilkov in #1068
  • Tiny UI fix for "Clusters of" and shave off 1 call to dataset.stats in clustering by @dsmilkov in #1074
  • Support input selectors from config files. by @nsthorat in #1076

Bug fixes

Docs

Other Changes

New Contributors

Full Changelog: v0.2.3...v0.2.4

v0.2.3

12 Jan 16:56

What's Changed

We now have two CLI commands for sharing Lilac datasets via HuggingFace. To upload a local dataset:

lilac upload local/Capybara --url_or_repo=lilacai/Capybara

To download the dataset to a local project directory:

lilac download lilacai/Capybara

For more details on sharing datasets, see the Sharing Guide.

With this change, we added a new environment variable, USE_TABLE_INDEX, which is useful for frozen demos. When it is set, queries run against a cached DuckDB table, which dramatically improves query performance. The trade-off is that labeling, or any other edit, becomes slower, because the table is recomputed on each change.
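
A minimal sketch of turning the flag on from Python before launching a frozen demo. The value 'true' is an assumption (the notes above do not say which values are accepted), and the commented-out project path is hypothetical.

```python
import os

# Assumption: the flag is read from the process environment at startup,
# and 'true' is an accepted value. Check the Lilac docs for the exact form.
os.environ['USE_TABLE_INDEX'] = 'true'

# With the variable set, start the server as usual, for example:
#   import lilac as ll
#   ll.start_server(project_dir='./my_demo_project')  # hypothetical path
print(os.environ['USE_TABLE_INDEX'])
```

Since edits trigger a recompute of the cached table, this setting is best left unset for projects that are still being labeled.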

Upload / Download

  • Add an upload dataset script. Some other cleanups. by @nsthorat in #1059

Bug fixes

  • Fix a bug with CSV source reader for TSV files, and named columns. by @nsthorat in #1040
  • Progress bar by @brilee in #1043
  • Fix bug with ItemMedia not rendering media fields that are deeply nested siblings. by @nsthorat in #1044
  • Fix clustering an enriched field by @dsmilkov in #1048
  • Propagate filters in the group by panel by @dsmilkov in #1041

Performance

  • Add indexing on database startup, flag-guarded by @brilee in #1052

UI

  • Add clustering in the UI by @dsmilkov in #1045
  • Add search to the cluster UI. Add some polish. by @nsthorat in #1054
  • Add clusters to the schema menu. Migrate to a custom carousel component so the page doesn't freeze. by @nsthorat in #1050

Clustering

  • Add dataset.cluster(input) where input can be any lambda func by @dsmilkov in #1042
  • dataset.cluster() flattens any repeated before clustering by @dsmilkov in #1051

Lilac Garden

Other Changes

  • Move the import of .env.local in publish_pip to the top of the file. by @nsthorat in #1039
  • fix: migrate embeddings by azure openai to openai > 1.0.0 by @dechantoine in #1053
  • Streamline lilac deployment by @brilee in #1057
  • Add a notebook for working with concepts from python. by @nsthorat in #1055

Full Changelog: v0.2.2...v0.2.3

v0.2.2

08 Jan 14:27

Bug fixes

  • Fix a bug with OpenAI embeddings after upgrading. by @nsthorat in #1038
  • Remove an extra temporary column at the end of clustering by @dsmilkov in #1035

Other Changes

  • Convert the pivot viewer to a bunch of carousels. by @nsthorat in #1034

Full Changelog: v0.2.1...v0.2.2

v0.2.1

05 Jan 22:17

Keyboard shortcuts are now available for deleting and labeling!

To delete a row: use backspace or delete.
To label, go to dataset settings, and configure key-bindings for each label.

[Video: keyboard_shortcuts.mp4]

What's Changed

Features

Bug fixes

Docs

  • Update documentation for labels, keyboard shortcuts, deleting rows. by @nsthorat in #1030
  • Add documentation that points to the lilac deployer UI. by @nsthorat in #1020

UI

  • Improve the UI around deleting. by @nsthorat in #1024
  • Add a 2-feature pivot view, allowing you to view a hierarchy of 2 features by @nsthorat in #1023

Other Changes

New Contributors

Full Changelog: v0.2.0...v0.2.1

v0.2.0

03 Jan 22:12

What's Changed

The UI now supports deleting rows, viewing the trash, and undeleting. Exports now automatically drop deleted rows.

Breaking changes

  • Merge output_column and nest_under --> dataset.map(output_path=...) by @dsmilkov in #1001

UI

Performance

Bug fixes

Clustering (coming soon)

Map & signal changes

Other Changes

  • Update the Dockerfile to use port 80 so we can use it on GCE. by @nsthorat in #992
  • Make OpenAI calls threaded with exponential backoff by @dsmilkov in #1005
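
For readers unfamiliar with the pattern named in the last item, here is a generic sketch of threaded calls with exponential backoff. It is not Lilac's implementation: `call_with_backoff`, `map_threaded`, and the `flaky_embed` stand-in for a rate-limited API are all hypothetical names, and the delays are kept tiny so the example runs quickly.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def call_with_backoff(fn, arg, max_retries=5, base_delay=0.01):
    """Call fn(arg), sleeping exponentially longer after each failure."""
    for attempt in range(max_retries):
        try:
            return fn(arg)
        except Exception:
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the last error.
            # The delay doubles on every attempt; random jitter keeps
            # many threads from retrying at the same instant.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))


def map_threaded(fn, items, num_workers=4):
    """Apply fn to each item on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(lambda item: call_with_backoff(fn, item), items))


# Stand-in for a remote call that fails twice per input before succeeding.
attempts = {}


def flaky_embed(text):
    attempts[text] = attempts.get(text, 0) + 1
    if attempts[text] < 3:
        raise RuntimeError('rate limited')
    return len(text)


results = map_threaded(flaky_embed, ['a', 'bb', 'ccc'])
print(results)
```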

Full Changelog: v0.1.26...v0.2.0

v0.1.26

19 Dec 03:32

This release adds a markdown code block extractor signal, highlighting markdown code blocks and their languages.

What's Changed

Bug fixes

  • Emit membership prob in HDBScan, and fix "group by" UI bugs by @dsmilkov in #976
  • Fix ll.start_server() and add a test for full end-to-end server startup by @dsmilkov in #984
  • Add CLI integration tests. by @nsthorat in #985
  • Make ll.start_server() blocking outside an event loop by @dsmilkov in #986

Other Changes

Full Changelog: v0.1.25...v0.1.26

v0.1.25

18 Dec 17:55

This release drops dask in favor of a thin multi-processing client and comes with many performance improvements, most notably to the slow import time of lilac.

We have also added a simple API for loading from HuggingFace:

import lilac as ll
from datasets import load_dataset
hf_ds = load_dataset('Open-Orca/SlimOrca-Dedup')
ds = ll.from_huggingface(hf_ds)

And a simple API for getting embeddings:

answer_emb = ds.get_embeddings('jina-v2-small', rowid, 'answer')[0]['vector']

We've also added some color to the UI and organized components a little better.

Features

Performance

Bug fixes

  • Fix memory leak caused by Iterable/Iterator mixups by @brilee in #974
  • Fix broken doc links. by @nsthorat in #964
  • Add color scales for semantic / concept search. Add openchat format. by @nsthorat in #975

Other Changes

Full Changelog: v0.1.24...v0.1.25

v0.1.24

12 Dec 19:29

This release changes the text media visualizer to Monaco (the engine that powers VSCode).

Monaco allows us to:

  • Deep-link to any line within a document.
  • Add right click menus to text.
  • Add "thumbs up" and "thumbs down" to concepts from the menu, for any text.
  • Search any text from the right click-menu, with semantic similarity or keyword search.

Here is a video explaining the changes: https://www.youtube.com/watch?v=83Rj006tVIk

This release also adds custom support for the ShareGPT format in the UI.

Features

  • Add special support for a DELETED label by @brilee in #951
  • Switch to monaco for the main viewer. by @nsthorat in #952
  • Simplify monaco viewer. Add support for deep linking. by @nsthorat in #956
  • Infer dataset formats. Start with just ShareGPT. by @nsthorat in #948
  • Add UI for title slots for ShareGPT. by @nsthorat in #950

Bug fixes

  • Make the signal "try it" page work for signals w/o schema by @dsmilkov in #944
  • Fix UI bugs: monaco scroll, hash state forgotten, compare non-media fields by @nsthorat in #946
  • Eliminate setup count call from parquet_source. by @brilee in #959
  • Fix a bug where we highlighted all concept spans regardless of their score. by @nsthorat in #958
  • Fix bug with loading dataset and settings. by @nsthorat in #957

Docs

Other Changes

Full Changelog: v0.1.23...v0.1.24

v0.1.23

07 Dec 16:25

High-level

Lilac is now moving towards editing data directly in the tool. The first vehicle for this is Dataset.map.

New blog post on curating data with the new Dataset.map feature:
https://docs.lilacml.com/blog/curate-coding-dataset.html

Documentation on Dataset.map:
https://docs.lilacml.com/datasets/dataset_edit.html

Features

  • Add dataset.map support for limit/filter by @brilee in #933
  • Add support for arbitrary value type v in map<k, v> in parquet by @dsmilkov in #935
  • Add batch size support and collapse transform impl by @brilee in #934

Improvements

  • Improve the UI for repeated values. by @nsthorat in #904
  • Small ergonomic fixes while writing the "code formatting" blog post by @dsmilkov in #909
  • Merge multiple shards of the same task into the same progress bar. by @nsthorat in #910
  • Add threaded task execution. by @nsthorat in #920
  • Fix css style for markdown tables by @dsmilkov in #931
  • Fix tqdm progress bars by separating report_progress from show_progress. by @nsthorat in #929
  • Make parquet the default source by @dsmilkov in #941

Bug fixes

  • Fix keyword search to work with apostrophe ' by @dsmilkov in #907
  • Make sure the results of dataset.map() always returns an iterable. by @nsthorat in #925
  • Remove position= in tqdm. by @nsthorat in #913

Docs

Other

  • Refactor dataset/signal endpoints into separate module by @brilee in #900
  • Add memray dep and instructions by @brilee in #917
  • Add spec for select options by @brilee in #918
  • Simplify helper methods to closer align to API for select options by @brilee in #919
  • Start writing the query options compiler by @brilee in #924

Coming soon

Full Changelog: v0.1.22...v0.1.23