Releases: databricks/lilac
v0.1.1
Overview
- Embedding computation can now be larger-than-RAM! Computing lots of embeddings will iteratively write to a vector store.
- JSON and CSV sources are heavily optimized and go through duckdb for parsing.
- Clustering now supports semantic clustering with embeddings, using DBSCAN.
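The larger-than-RAM behavior noted above can be pictured as streaming batches from an iterable into a store, rather than building every vector in memory first. The following is a minimal sketch of that general pattern, not Lilac's implementation: `batched`, `embed_batch`, `InMemoryVectorStore`, and `write_embeddings` are hypothetical names, and the toy "embedding" only counts characters and spaces.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Vector = List[float]


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
  """Yield lists of at most `batch_size` items without reading the whole input."""
  iterator = iter(items)
  while True:
    batch = list(islice(iterator, batch_size))
    if not batch:
      return
    yield batch


def embed_batch(texts: List[str]) -> List[Vector]:
  """Hypothetical stand-in for a real embedding model."""
  return [[float(len(t)), float(t.count(' '))] for t in texts]


class InMemoryVectorStore:
  """Hypothetical store. A real one would append each batch to disk."""

  def __init__(self) -> None:
    self.rows: List[Tuple[int, Vector]] = []
    self.num_writes = 0

  def add(self, start_index: int, vectors: List[Vector]) -> None:
    for offset, vector in enumerate(vectors):
      self.rows.append((start_index + offset, vector))
    self.num_writes += 1


def write_embeddings(texts: Iterable[str],
                     store: InMemoryVectorStore,
                     batch_size: int = 2) -> int:
  """Embed and write one batch at a time; this function holds one batch of vectors."""
  written = 0
  for batch in batched(texts, batch_size):
    store.add(written, embed_batch(batch))
    written += len(batch)
  return written


store = InMemoryVectorStore()
# A generator stands in for a dataset that is too large to load at once.
total = write_embeddings((f'doc {i}' for i in range(5)), store, batch_size=2)
```

Because the input is consumed lazily, peak memory in `write_embeddings` depends on `batch_size` rather than on the number of documents.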
New features
- Add SQLite source and optimize the JSON and CSV sources by @dsmilkov in #710
- Add a dict source and convert LangSmith source to use it by @dsmilkov in #716
- Add clustering signal by @dsmilkov in #711
Performance
- Use iterables for compute_signal and compute_embedding. by @nsthorat in #706
- Write embeddings to the vector store iteratively by @nsthorat in #709
- Add SQLite source and optimize the JSON and CSV sources by @dsmilkov in #710
- Speed up the docker image build step by installing lilac from pip before installing the local wheel. by @nsthorat in #714
- Improve perf of server by removing UUID sort by @dsmilkov in #715
Bug fixes
- Fix semantic search on repeated by @dsmilkov in #704
- Fix syntax error with keyword search by @dsmilkov in #705
- Fix bug with span highlighting a repeated field by @nsthorat in #713
- Change the bootup load to be during the new FastAPI lifecycle API. by @nsthorat in #717
Full Changelog: v0.1.0...v0.1.1
v0.1.0
New Features
Lilac now supports labeling! For a detailed guide, see "Labeling a dataset".
Labels can be added for individual rows:
dataset.add_labels(
  'good',
  row_ids=['0003076800f1471f8f4c8a1b2deda742'])
Or for slices of the data:
dataset.add_labels(
  'short',
  filters=[
    (('text', 'text_statistics', 'num_characters'), 'less', 1000)
  ]
)
They can then be exported:
short_rows = list(
  dataset.select_rows(
    ['*', 'short'],
    filters=[
      (('short', 'label'), 'exists')
    ]
  )
)
# Print the first row.
print(short_rows[0])
Output:
{
  '__rowid__': '0003076800f1471f8f4c8a1b2deda742',
  'text': 'If you want to truly experience the magic (?) of Don Dohler, then check out "Alien Factor" or maybe "Fiend", but not this. Alien Factor is actually rather imaginative considering the low budget and it\'s fairly creepy, but "Nightbeast", which I guess is sort of an updating of Alien Factor, is just plain dumb. Actors sleepwalk through their roles, especially Mr. Monotone sheriff, and the monster is some dumb Halloween-mask kind of thing instead of the wildly imaginative (but kind of stupid) looking critters from Alien Factor. A spaceship crashes on Earth and there\'s a critter inside, of course, who runs around vaporizing people. And ripping off arms, etc. And he has a cool ray gun that he uses to vaporize people too, until it gets shot out of his hand. And that\'s really about it. "Alien Factor" beats this mess hands down, if you really want to see a good Don Dohler movie, check that out instead. And RIP Don Dohler, 12/2/06.',
  'label': 'neg',
  '__hfsplit__': 'test',
  'good': {
    'label': 'true',
    'created': datetime.datetime(2023, 9, 20, 10, 16, 15, 545277)
  }
}
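The filters in the examples above are tuples of a field path, an operator, and an optional value. As a rough illustration of how such a tuple could be read, here is a minimal sketch that evaluates the two operators shown ('less' and 'exists') against plain nested dicts. This is not Lilac's implementation: `get_path`, `matches`, and the row layout are hypothetical and chosen only for the example.

```python
from typing import Any, Dict, Tuple

Path = Tuple[str, ...]


def get_path(row: Dict[str, Any], path: Path) -> Any:
  """Follow `path` through nested dicts; return None if any key is missing."""
  value: Any = row
  for key in path:
    if not isinstance(value, dict) or key not in value:
      return None
    value = value[key]
  return value


def matches(row: Dict[str, Any], row_filter: tuple) -> bool:
  """Evaluate one (path, op[, value]) tuple. Only 'less' and 'exists' are sketched."""
  path, op = row_filter[0], row_filter[1]
  value = get_path(row, path)
  if op == 'exists':
    return value is not None
  if op == 'less':
    return value is not None and value < row_filter[2]
  raise ValueError(f'Unsupported op: {op}')


# Hypothetical rows: nested dicts that mirror the paths used in the filters.
rows = [
  {'text': {'text_statistics': {'num_characters': 120}}, 'short': {'label': 'true'}},
  {'text': {'text_statistics': {'num_characters': 4500}}},
]
short_filter = (('text', 'text_statistics', 'num_characters'), 'less', 1000)
label_filter = (('short', 'label'), 'exists')

short_matches = [matches(row, short_filter) for row in rows]
label_matches = [matches(row, label_filter) for row in rows]
```

In this toy data only the first row is under 1000 characters, and only that row carries the 'short' label, so both filters select the same row.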
Labels can also be added via the UI.
What's changed
Bug fixes
- Allow add_labels and remove_labels without selection by @dsmilkov in #698
- Fix UI regression and empty lilac.yml (no datasets) by @dsmilkov in #700
Full Changelog: v0.0.20...v0.1.0
v0.0.20
Features
- Add "More like this" button in the item viewer by @dsmilkov in #676
- Add simple labeling functionality in the item viewer by @dsmilkov in #679
- Add removing labels, and add row_ids to add labels. by @nsthorat in #680
- Improving the label download by @dsmilkov in #682
- Expose LangSmithSource to the public API and docs by @dsmilkov in #684
- Add UI to clear labels. by @nsthorat in #686
- Add a 'label all' button to label all results in view by @nsthorat in #687
- Add docs for labeling. Fix some labeling issues. by @nsthorat in #692
Bug fixes
- Tiny CSS fixes to make mobile not terrible by @nsthorat in #677
- Fix REST API with new labels API. by @nsthorat in #681
- Fix issue with overflow on text by @nsthorat in #683
- Fix upload scripts so we can push to a staging directory without uploading data. by @nsthorat in #689
- Add better error messaging when inferring schema by @dsmilkov in #691
- Fix the huggingface deploy script. by @nsthorat in #695
- Fix bug with UDFs after metadata separation by @nsthorat in #696
Full Changelog: v0.0.19...v0.0.20
v0.0.19
What's Changed
New Features 🎉
- Improve the project API and documentation. by @nsthorat in #668
- Add the python API for adding labels. by @nsthorat in #667
- Add UI for viewing labels. by @nsthorat in #670
Other Changes
- Update homepage with short 10sec videos by @dsmilkov in #663
- Optional Outputs by @hinthornw in #666
- Fix bug with the selectRows cache not being cleared when labeling concepts in HF. by @nsthorat in #671
- Add a large guide for querying datasets. by @nsthorat in #669
- Re-design the items by @dsmilkov in #674
New Contributors
- @hinthornw made their first contribution in #666
Full Changelog: v0.0.18...v0.0.19
v0.0.18
New Features
- Add first version for Dataset Insights by @dsmilkov in #641
- Add a compute concept modal. by @nsthorat in #657
- Add expandable metadata by @dsmilkov in #644
- Expand parts of metadata according to the search context by @dsmilkov in #659
Other Changes
- Fix the huggingface deploy script. by @nsthorat in #638
- Fix bug with concept labeler not returning refreshed results. by @nsthorat in #639
- Improve documentation around GCS paths. by @nsthorat in #647
- When merging floats, check for closeness to avoid precision issues. Pin pandas version. by @nsthorat in #655
- Fix RuntimeError in HNSW index by @dsmilkov in #656
- Fix negative-sentiment and legal-terminal concepts due to missing top-level version field by @dsmilkov in #658
- Fix italics for N/A by @nsthorat in #662
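The float-merging change in #655 describes comparing values by closeness rather than exact equality. A minimal sketch of that general idea follows, using Python's standard `math.isclose`. `merge_floats` is a hypothetical helper, not Lilac's code, and the tolerance is an assumed value.

```python
import math
from typing import Optional


def merge_floats(a: Optional[float],
                 b: Optional[float],
                 rel_tol: float = 1e-6) -> Optional[float]:
  """Merge two values for the same field; raise only if they truly disagree."""
  if a is None:
    return b
  if b is None:
    return a
  if math.isclose(a, b, rel_tol=rel_tol):
    return a
  raise ValueError(f'Conflicting values: {a} vs {b}')


# 0.1 + 0.2 is not exactly 0.3 in binary floating point.
exact_equal = (0.1 + 0.2) == 0.3
merged = merge_floats(0.1 + 0.2, 0.3)
```

With exact equality the two inputs would be treated as a conflict; with a tolerance they merge cleanly.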
Full Changelog: v0.0.17...v0.0.18
v0.0.17
What's Changed
- Fix bug in load script where we try to use the task manager when none is passed. by @nsthorat in #627
- Various bug fixes by @dsmilkov in #629
- Fix the async bug when starting the server by @dsmilkov in #636
- Fix bug with non-serializable schema in the concept labeler. by @nsthorat in #632
- Update the global project config during changes. by @nsthorat in #631
- Remove the explicit cache directory for sentence transformers. by @nsthorat in #637
Full Changelog: v0.0.16...v0.0.17
v0.0.16
Other Changes
- Improve memory usage of lilac load to unblock mosaic datasets by @dsmilkov in #620
- Add a project_path to lilac_start. by @nsthorat in #621
- Allow tanstack query result to contain non-serializable data by @dsmilkov in #625
- Fix auth bugs with concepts. Pip install lilac[all] in the dockerfile. by @nsthorat in #622
- Add ability to make concepts public. by @nsthorat in #624
Full Changelog: v0.0.15...v0.0.16
v0.0.15
v0.0.14
What's Changed
- Updated from HuggingFaceDataset to HuggingFaceSource by @Contributorrandom in #611
- Use pip for the HuggingFace demos. by @nsthorat in #609
A bug with JavaScript not getting built for the pip package was fixed and released with this version. This includes the change to the searchbox: #603
New Contributors
- @Contributorrandom made their first contribution in #611
Full Changelog: v0.0.13...v0.0.14