Releases: databricks/lilac
v0.2.5
What's Changed
This release is mostly UI bug fixes.
We also added support for remote computation of GTE embeddings via Lilac Garden. If you are interested, please reach out to us.
Bug fixes
- Always sort by rowid to make db results stable by @dsmilkov in #1086
- Fix a couple of UI bugs by @dsmilkov in #1088
- Fix some small UI bugs. by @nsthorat in #1084
- Ignore folders that don't have a manifest.json and make project config source of truth for dataset listing by @brilee in #1083
- Fix bug with keyword search highlighting every field. by @nsthorat in #1081
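The rowid sort fix in the list above addresses a general problem: when rows tie on the primary sort key, a database may return them in any order, so paginated results can shuffle between requests. The following is a minimal sketch of the tie-breaker idea using the standard-library sqlite3 module as a stand-in. Lilac itself uses DuckDB, and the table and column names here (docs, row_key, score) are hypothetical.

```python
import sqlite3

# In-memory stand-in database; Lilac uses DuckDB, sqlite3 is only used here
# because it ships with Python.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE docs (row_key TEXT PRIMARY KEY, score REAL)')
conn.executemany(
    'INSERT INTO docs VALUES (?, ?)',
    [('c', 0.5), ('a', 0.5), ('b', 0.9)],
)


def top_k(conn, k):
    """Return the k best rows, breaking score ties by the unique row key."""
    query = (
        'SELECT row_key, score FROM docs '
        'ORDER BY score DESC, row_key ASC LIMIT ?'
    )
    return conn.execute(query, (k,)).fetchall()


first_page = top_k(conn, 2)
```

Because the secondary key is unique, the two rows that tie on score always come back in the same order.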
Garden
Other Changes
- Remove the media path x preferred-embedding logic. by @nsthorat in #1079
- Deploy dataset by @brilee in #1085
- Enable deploying at HEAD for demo as well as staging. by @brilee in #1089
Full Changelog: v0.2.4...v0.2.5
v0.2.4
What's Changed
This release is mostly bug fixes and small changes to the upcoming clustering UI.
Clustering
- Optimize the cluster view page to reduce number of requests by @dsmilkov in #1062
- Improve the cluster titling and fix a few client-side bugs by @dsmilkov in #1058
- Add share-gpt specific format selectors. by @nsthorat in #1060
- Cluster spec deployer by @brilee in #1063
- Improve the cluster/pivot UI by @dsmilkov in #1068
- Tiny UI fix for "Clusters of" and shave off 1 call to dataset.stats in clustering by @dsmilkov in #1074
- Support input selectors from config files. by @nsthorat in #1076
Bug fixes
- fix npm deps? by @brilee in #1070
- Fix a couple issues with the export menu. by @nsthorat in #1069
- Create tasks api to ensure exceptions caught by @brilee in #1071
- Override the OpenAPI base url when lilac is not being served from / by @nsthorat in #1073
- Make the entire svelte app use relative links by @dsmilkov in #1075
Docs
- Add documentation for sharing datasets. by @nsthorat in #1061
- Fix broken link in README by @albertvillanova in #1064
- Fix broken links in HuggingFaceSpaceWelcome web component by @albertvillanova in #1065
Other Changes
New Contributors
- @albertvillanova made their first contribution in #1064
Full Changelog: v0.2.3...v0.2.4
v0.2.3
What's Changed
We now have two CLI commands for sharing Lilac datasets (via HuggingFace). To upload a local dataset:
lilac upload local/Capybara --url_or_repo=lilacai/Capybara
To download the dataset to a local project directory:
lilac download lilacai/Capybara
For more details on sharing datasets, see the Sharing Guide
With this change, we added a new environment variable, USE_TABLE_INDEX, useful for frozen demos. It dramatically improves query performance because queries run against a cached DuckDB table. The trade-off is slower labeling and other edits, since the table is re-computed on each change.
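The trade-off behind USE_TABLE_INDEX can be illustrated with a toy cache. This is only a sketch of the general pattern (cheap reads from a materialized structure, full rebuild on every write) and not Lilac's implementation. The class and field names are hypothetical.

```python
class CachedTable:
    """Toy stand-in for a materialized table that is rebuilt after every edit."""

    def __init__(self, rows):
        self._rows = [dict(row) for row in rows]
        self.rebuilds = 0
        self._rebuild()

    def _rebuild(self):
        # Expensive step: materialize an index so later queries are cheap.
        self._by_label = {}
        for row in self._rows:
            self._by_label.setdefault(row['label'], []).append(row)
        self.rebuilds += 1

    def query(self, label):
        # Cheap: answered from the cached index, no rebuild needed.
        return self._by_label.get(label, [])

    def edit(self, rowid, label):
        # Every edit invalidates the cache and triggers a full rebuild.
        for row in self._rows:
            if row['rowid'] == rowid:
                row['label'] = label
        self._rebuild()


table = CachedTable([
    {'rowid': 1, 'label': 'good'},
    {'rowid': 2, 'label': 'bad'},
])
```

Queries never increase the rebuild count, while each edit does, which is why the setting suits read-only demos better than active labeling sessions.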
Upload / Download
Bug fixes
- Fix a bug with CSV source reader for TSV files, and named columns. by @nsthorat in #1040
- Progress bar by @brilee in #1043
- Fix bug with ItemMedia not rendering media fields that are deeply nested siblings. by @nsthorat in #1044
- Fix clustering an enriched field by @dsmilkov in #1048
- Propagate filters in the group by panel by @dsmilkov in #1041
Performance
UI
- Add clustering in the UI by @dsmilkov in #1045
- Add search to the cluster UI. Add some polish. by @nsthorat in #1054
- Add clusters to the schema menu. Migrate to a custom carousel component so the page doesn't freeze. by @nsthorat in #1050
Clustering
- Add dataset.cluster(input) where input can be any lambda func by @dsmilkov in #1042
- dataset.cluster() flattens any repeated before clustering by @dsmilkov in #1051
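To make the two clustering changes above concrete, here is a plain-Python sketch of what "input can be any lambda" and "flattens any repeated" could mean for the texts that reach the clustering step. It does not call Lilac. The rows, field names, and the flatten helper are hypothetical illustrations.

```python
def flatten(value):
    """Yield leaf values from arbitrarily nested lists (a 'repeated' field)."""
    if isinstance(value, list):
        for item in value:
            yield from flatten(item)
    else:
        yield value


rows = [
    {'question': 'What is DuckDB?',
     'answers': [['An embedded database.'], ['A SQL engine.']]},
    {'question': 'What is HDBSCAN?',
     'answers': [['A clustering algorithm.']]},
]

# Case 1: the input is a function that turns each row into one text.
input_fn = lambda row: row['question'].lower()
texts_from_fn = [input_fn(row) for row in rows]

# Case 2: the input is a repeated (nested list) field, flattened so that
# every leaf string becomes its own item to cluster.
texts_from_repeated = [text for row in rows for text in flatten(row['answers'])]
```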
Lilac Garden
Other Changes
- Move the import of .env.local in publish_pip to the top of the file. by @nsthorat in #1039
- fix: migrate embeddings by azure openai to openai > 1.0.0 by @dechantoine in #1053
- Streamline lilac deployment by @brilee in #1057
- Add a notebook for working with concepts from python. by @nsthorat in #1055
Full Changelog: v0.2.2...v0.2.3
v0.2.2
v0.2.1
Keyboard shortcuts are now available for deleting and labeling!
To delete a row: use backspace or delete.
To label, go to dataset settings, and configure key-bindings for each label.
What's Changed
Features
Bug fixes
- Allows non folder exports by @hynky1999 in #1026
- Fixes incorrect destructuring by @hynky1999 in #1025
- Improve auto-binning, and sorting of histograms. by @nsthorat in #1033
- Fix lilac deployer for slashed datasets. by @nsthorat in #1021
Docs
- Update documentation for labels, keyboard shortcuts, deleting rows. by @nsthorat in #1030
- Add documentation that points to the lilac deployer UI. by @nsthorat in #1020
UI
- Improve the UI around deleting. by @nsthorat in #1024
- Add a 2-feature pivot view, allowing you to view a hierarchy of 2 features by @nsthorat in #1023
Other Changes
- Improve the title generation in clustering by @dsmilkov in #1022
- Fix some map(overwrite=True) bugs by @dsmilkov in #1031
- Add superclusters (categories) by @dsmilkov in #1032
New Contributors
- @hynky1999 made their first contribution in #1026
Full Changelog: v0.2.0...v0.2.1
v0.2.0
What's Changed
The UI now supports deleting row(s), viewing the trash & undeleting. Exporting will now automatically drop deleted rows.
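One common way to get delete, trash, and undelete behavior is a soft delete: rows are tagged rather than removed, and exports filter out tagged rows. The sketch below shows that pattern in plain Python. It is an assumption about the general approach, not Lilac's code, and the ToyDataset class and label name are hypothetical.

```python
DELETED_LABEL = 'deleted'  # hypothetical label name, chosen for this sketch


class ToyDataset:
    """Toy dataset where deleting a row only attaches a label to it."""

    def __init__(self, rows):
        self.rows = {row['rowid']: dict(row) for row in rows}
        self.labels = {}  # rowid -> set of label names

    def _is_deleted(self, rowid):
        return DELETED_LABEL in self.labels.get(rowid, set())

    def delete(self, rowid):
        self.labels.setdefault(rowid, set()).add(DELETED_LABEL)

    def restore(self, rowid):
        self.labels.get(rowid, set()).discard(DELETED_LABEL)

    def trash(self):
        # Rows currently marked as deleted.
        return [r for rid, r in self.rows.items() if self._is_deleted(rid)]

    def export(self):
        # Exports silently drop rows marked as deleted.
        return [r for rid, r in self.rows.items() if not self._is_deleted(rid)]


ds = ToyDataset([
    {'rowid': 'a', 'text': 'keep me'},
    {'rowid': 'b', 'text': 'remove me'},
])
```

Since nothing is physically removed, restoring a row is just dropping the label again.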
Breaking changes
UI
- Add the ability to delete and restore rows from the UI. by @nsthorat in #1011
- Fix signal configs to use ClassVar by @dsmilkov in #1016
Performance
- Fix jina to also run on CUDA if available by @dsmilkov in #996
- Use CUDA when available for sentence transformers. by @nsthorat in #991
- Use the yaml CLoader loader if it's available. by @nsthorat in #995
- Use cuml for clustering when possible by @dsmilkov in #997
- Fix map by @brilee in #994
- Add Jina (Small) on Garden signal by @dsmilkov in #1009
Bug fixes
- Fix some small UI bugs. by @nsthorat in #987
- Fix issue with repeated of string rendering. by @nsthorat in #1015
- Load datasets in a separate thread from the UI. by @nsthorat in #1014
- Fix issue where we don't block on the server thread from the CLI. by @nsthorat in #1013
Clustering (coming soon)
- Make ds.cluster() have resumable title generation by @dsmilkov in #1000
- dataset.cluster() now uses transform() which uses map() by @dsmilkov in #1002
- Add topic clustering in dataset.cluster() by @dsmilkov in #993
- Allow clustering of a nested path by @dsmilkov in #1007
- Add dataset.cluster(remote=True) bit by @dsmilkov in #1010
Map & signal changes
- Add signal.map customization by @brilee in #1004
- Allow map to be called for arbitrary depth by @dsmilkov in #998
- remove VectorCompute path in dispatch_workers by @brilee in #1008
- Implement signals on top of the map infrastructure by @brilee in #1006
- dataset.map can now nest_under any repeated by @dsmilkov in #999
- Remove TaskShardId by @brilee in #1003
Other Changes
- Update the Dockerfile to use port 80 so we can use it on GCE. by @nsthorat in #992
- Make OpenAI calls threaded with exponential backoff by @dsmilkov in #1005
Full Changelog: v0.1.26...v0.2.0
v0.1.26
This release adds a markdown code block extractor signal, highlighting markdown code blocks and their languages.
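As a rough idea of what such an extractor does, the sketch below finds fenced markdown code blocks with a regular expression and reports each block's language and character span. It is not the signal's actual implementation. The function name and output keys are hypothetical, and the fence string is built programmatically only so that this example does not contain a literal fence.

```python
import re

FENCE = '`' * 3  # three backticks, built in code to avoid a literal fence here

# Opening fence, optional language tag, then the shortest body up to the
# next closing fence.
CODE_BLOCK_RE = re.compile(
    FENCE + r'(?P<lang>[\w+-]*)[ \t]*\n(?P<code>.*?)\n?' + FENCE,
    re.DOTALL,
)


def extract_code_blocks(text):
    """Return one dict per fenced block: language, code, and (start, end) span."""
    return [
        {
            'lang': match.group('lang') or None,  # None when no language is given
            'code': match.group('code'),
            'span': match.span(),
        }
        for match in CODE_BLOCK_RE.finditer(text)
    ]
```

The span is what a UI would need in order to highlight the block inside the original text.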
What's Changed
Bug fixes
- Emit membership prob in HDBScan, and fix "group by" UI bugs by @dsmilkov in #976
- Fix ll.start_server() and add a test for full end-to-end server startup by @dsmilkov in #984
- Add CLI integration tests. by @nsthorat in #985
- Make ll.start_server() blocking outside an event loop by @dsmilkov in #986
Other Changes
Full Changelog: v0.1.25...v0.1.26
v0.1.25
This release drops dask for a thin multi-processing client, and comes with lots of performance improvements, notably fixing the slow import time of lilac.
We have also added a simple API for loading from HuggingFace:
import lilac as ll
from datasets import load_dataset
hf_ds = load_dataset('Open-Orca/SlimOrca-Dedup')
ds = ll.from_huggingface(hf_ds)
And a simple API for getting embeddings:
answer_emb = ds.get_embeddings('jina-v2-small', rowid, 'answer')[0]['vector']
We've also added some color to the UI, and organized components a little better.
Features
- Add Jina V2 embeddings by @dsmilkov in #966
- Add sugar for ll.from_huggingface() by @dsmilkov in #962
- Improve the row header to give us space for deleting. by @nsthorat in #965
Performance
- Reduce import times by @brilee in #961
- Using loky (thin wrapper around multiprocessing) instead of dask by @dsmilkov in #947
- fix iterable robustness by @brilee in #977
Bug fixes
- Fix memory leak caused by Iterable/Iterator mixups by @brilee in #974
- Fix broken doc links. by @nsthorat in #964
- Add color scales for semantic / concept search. Add openchat format. by @nsthorat in #975
Other Changes
Full Changelog: v0.1.24...v0.1.25
v0.1.24
This release changes the text media visualizer to Monaco (the engine that powers VSCode).
Monaco allows us to:
- Deep-link to any line within a document.
- Add right click menus to text.
- Add "thumbs up" and "thumbs down" to concepts from the menu, for any text.
- Search any text from the right click-menu, with semantic similarity or keyword search.
Here is a video explaining the changes: https://www.youtube.com/watch?v=83Rj006tVIk
This release also has custom support for the ShareGPT format in the UI.
Features
- Add special support for a DELETED label by @brilee in #951
- Switch to monaco for the main viewer. by @nsthorat in #952
- Simplify monaco viewer. Add support for deep linking. by @nsthorat in #956
- Infer dataset formats. Start with just ShareGPT. by @nsthorat in #948
- Add UI for title slots for ShareGPT. by @nsthorat in #950
Bug fixes
- Make the signal "try it" page work for signals w/o schema by @dsmilkov in #944
- Fix UI bugs: monaco scroll, hash state forgotten, compare non-media fields by @nsthorat in #946
- Eliminate setup count call from parquet_source. by @brilee in #959
- Fix a bug where we highlighted all concept spans regardless of their score. by @nsthorat in #958
- Fix bug with loading dataset and settings. by @nsthorat in #957
Docs
Other Changes
- Add youtube video for the blog post by @dsmilkov in #942
- Drive-by cleanup of schema.py code by @brilee in #955
Full Changelog: v0.1.23...v0.1.24
v0.1.23
High-level
Lilac is now moving towards editing data directly in the tool. The first vehicle for this is Dataset.map.
New blog post on curating data with the new Dataset.map feature:
https://docs.lilacml.com/blog/curate-coding-dataset.html
Documentation on Dataset.map:
https://docs.lilacml.com/datasets/dataset_edit.html
Features
- Add dataset.map support for limit/filter by @brilee in #933
- Add support for arbitrary value type v in map<k, v> in parquet by @dsmilkov in #935
- Add batch size support and collapse transform impl by @brilee in #934
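The limit, filter, and batch size options listed above can be pictured with a small generator over plain dictionaries. This is a toy sketch, not the Dataset.map API. The function name map_rows and its parameters (where, limit, batch_size) are hypothetical and may not match the library's real argument names.

```python
from itertools import islice


def map_rows(fn, rows, where=None, limit=None, batch_size=None):
    """Lazily apply fn to rows, optionally filtered, limited, and batched.

    Without batch_size, fn receives one row and returns one value.
    With batch_size, fn receives a list of rows and returns a list of values.
    """
    selected = (row for row in rows if where is None or where(row))
    selected = islice(selected, limit)  # limit=None means no limit
    if batch_size is None:
        for row in selected:
            yield fn(row)
        return
    while True:
        batch = list(islice(selected, batch_size))
        if not batch:
            return
        yield from fn(batch)


rows = [{'text': 'a'}, {'text': 'bb'}, {'text': 'ccc'}, {'text': 'dddd'}]

# Filter first, then cap the number of processed rows.
lengths = list(map_rows(
    lambda row: len(row['text']),
    rows,
    where=lambda row: len(row['text']) > 1,
    limit=2,
))

# Batched variant: fn sees up to three rows at a time.
batched_lengths = list(map_rows(
    lambda batch: [len(row['text']) for row in batch],
    rows,
    batch_size=3,
))
```

A limit is handy for trying a map function on a few rows before committing to a full run, and batching helps when fn wraps something that is cheaper per item in bulk.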
Improvements
- Improve the UI for repeated values. by @nsthorat in #904
- Small ergonomic fixes while writing the "code formatting" blog post by @dsmilkov in #909
- Merge multiple shards of the same task into the same progress bar. by @nsthorat in #910
- Add threaded task execution. by @nsthorat in #920
- Fix css style for markdown tables by @dsmilkov in #931
- Fix tqdm progress bars by separating report_progress from show_progress. by @nsthorat in #929
- Make parquet the default source by @dsmilkov in #941
Bug fixes
- Fix keyword search to work with apostrophe ' by @dsmilkov in #907
- Make sure the results of dataset.map() always returns an iterable. by @nsthorat in #925
- Remove position= in tqdm. by @nsthorat in #913
Docs
- Add a guide for iterating on dataset by @dsmilkov in #923
- Add blog post for diffing and dataset.map by @dsmilkov in #912
- Redo the docs.lilacml.com landing page by @dsmilkov in #932
- Small tweaks to improve the glaive dataset blog post. by @nsthorat in #938
- Rename the guide to edit a dataset by @dsmilkov in #930
- Revamp welcome/intro pages by @brilee in #908
Other
- Refactor dataset/signal endpoints into separate module by @brilee in #900
- Add memray dep and instructions by @brilee in #917
- Add spec for select options by @brilee in #918
- Simplify helper methods to closer align to API for select options by @brilee in #919
- Start writing the query options compiler by @brilee in #924
Coming soon
- Add server-side RAG python code. by @nsthorat in #911
- Migrate the UI to the server-side python RAG. by @nsthorat in #914
- Improve the RAG UI by @nsthorat in #916
Full Changelog: v0.1.22...v0.1.23