switch to TextAnalysis #7

drizk1 · 2025-01-09T15:25:06Z

at some point i would like to switch to TextAnalysis.jl as the underlying tokenizer for a variety of reasons... just not sure there are actually any users at this point that warrant it

drizk1 · 2025-01-09T15:43:50Z

although maybe not? because i think the tokenizer i wrote might actually be faster..?

julia> @benchmark courses |>
           df -> DataFrame(
               id = [id for (id, tokens) in zip(df.rownames, tokenize.(Ref(Languages.English()), df.course)) for _ in 1:length(tokens)],
               token = [token for tokens in tokenize.(Ref(Languages.English()), df.course) for token in tokens]
           )
BenchmarkTools.Trial: 4243 samples with 1 evaluation.
 Range (min … max):  1.022 ms … 236.110 ms  ┊ GC (min … max): 0.00% … 99.33%
 Time  (median):     1.078 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.176 ms ±   3.611 ms  ┊ GC (mean ± σ):  4.96% ±  2.40%

  ▂ ▄▂ ▇█▅▂▅▅▄▂▂▂▁▂▁  ▁▁      ▁                               ▁
  █▇██▆████████████████████▇▇▇███▇███▇▆▆▇▆▆▅▄▆▆▆▄▆▄▅▂▅▆▇▆▅▄▄▄ █
  1.02 ms      Histogram: log(frequency) by time      1.49 ms <

 Memory estimate: 139.05 KiB, allocs estimate: 1949.

julia> @benchmark @chain courses @select(id = rownames, course) @unnest_tokens(word, course)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  111.829 μs …  1.463 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     119.353 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   133.804 μs ± 31.699 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▆██▇▅▃▂▁▁▁▂▁▁▁▂▂▂▂▁  ▁▂▂▁▁▁▁▁▂▂▁▁▁▁▁▁▂▂▂▁▁   ▁▁             ▂
  ███████████████████████████████████████████▇▇█████▇▇▆▆▅▅▆▄▅▆ █
  112 μs        Histogram: log(frequency) by time       228 μs <

 Memory estimate: 21.46 KiB, allocs estimate: 254.

drizk1 · 2025-01-09T18:00:04Z

a more efficient TextAnalysis version (tokenizes once instead of twice):


julia> @benchmark courses |> df -> begin
           tokens = tokenize.(Ref(Languages.English()), df.course)
           
           id_iter = Iterators.flatten([fill(id, length(tok)) for (id, tok) in zip(df.rownames, tokens)])
           
           token_column = collect(Iterators.flatten(tokens))
           
           id_column = collect(id_iter)
           
           DataFrame(id = id_column, token = token_column)
       end
BenchmarkTools.Trial: 8924 samples with 1 evaluation.
 Range (min … max):  519.428 μs …  93.626 ms  ┊ GC (min … max): 0.00% … 99.33%
 Time  (median):     538.861 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   558.674 μs ± 986.208 μs  ┊ GC (mean ± σ):  1.87% ±  1.05%

  ▃▇█▅▄█▇▄▃▄▂▁▁▂▂▂▁▁▂▂▁▁                                        ▂
  ███████████████████████▆▆█▇▇▇▇▇▇▆▆▆▇▇▇▇▇▆▆▆▆▆▆▆▇▆▆▅▅▄▆▃▅▅▄▄▇▆ █
  519 μs        Histogram: log(frequency) by time        731 μs <

 Memory estimate: 73.87 KiB, allocs estimate: 1001.
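
for reference, a minimal sketch of the same id/token unnesting that tokenizes once and fills preallocated columns in a single pass. `unnest_sketch` is a hypothetical name, `split` is only a stand-in tokenizer (not the TextAnalysis one or the one in this package), and the sample rows are made up. this has not been benchmarked here, so it says nothing about which approach is faster.

```julia
# Hypothetical helper, not part of TextAnalysis.jl or this package.
# `tokenizer` defaults to Base.split as a placeholder (an assumption).
function unnest_sketch(ids, texts; tokenizer = split)
    # Tokenize every row exactly once.
    tokens = [tokenizer(t) for t in texts]

    # Size both output columns up front from the token counts.
    n = sum(length, tokens; init = 0)
    id_column = Vector{eltype(ids)}(undef, n)
    token_column = Vector{String}(undef, n)

    # Single pass: repeat each id once per token in its row.
    i = 0
    for (id, toks) in zip(ids, tokens)
        for tok in toks
            i += 1
            id_column[i] = id
            token_column[i] = tok  # SubString is converted to String on assignment
        end
    end
    return (id = id_column, token = token_column)
end

# Made-up sample rows standing in for `courses`.
cols = unnest_sketch([1, 2], ["intro to statistics", "linear algebra"])
```

the returned named tuple of columns can be passed straight to `DataFrame(cols)` if DataFrames.jl is loaded.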
