
Requesting process_dataframe() #167

Open

maxschmitt opened this issue Mar 25, 2024 · 13 comments
Labels: enhancement (New feature or request)

Comments

@maxschmitt (Contributor)

Problem

Currently, it is not possible to apply a processing function of audinterface.Process or audinterface.Feature to a DataFrame object.
Such a method would be useful, because process_index() cannot be used efficiently when a Segment object is passed as the segment argument: the labels need to be carried over to the resulting dataframe (the index after segmentation typically has additional rows).

Solution

A new method process_dataframe() could solve this.

  • If no segmentation is required, the behaviour is very similar to process_index(), but all labels are kept and attached to the output.
  • If segmentation is required, the labels of each input row are duplicated and attached to all corresponding rows in the output (see the sketch below).
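
A rough, purely illustrative sketch of the intended label duplication with plain pandas (the data below are made up):

import pandas as pd

# Original dataframe with one label per file
df = pd.DataFrame(
    {"emotion": ["happiness", "anger"]},
    index=pd.Index(["a.wav", "b.wav"], name="file"),
)

# Segmented index as a Segment object might return it
segments = pd.MultiIndex.from_tuples(
    [
        ("a.wav", pd.Timedelta(0, unit="s"), pd.Timedelta(1, unit="s")),
        ("b.wav", pd.Timedelta(0, unit="s"), pd.Timedelta(1, unit="s")),
        ("b.wav", pd.Timedelta(1, unit="s"), pd.Timedelta(2, unit="s")),
    ],
    names=["file", "start", "end"],
)

# Duplicate the label of each original row
# to all segments belonging to the same file
df_segmented = pd.DataFrame(index=segments)
df_segmented["emotion"] = (
    df["emotion"]
    .reindex(segments.get_level_values("file"))
    .values
)
print(df_segmented)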

@hagenw added the enhancement (New feature or request) label on Mar 25, 2024
@hagenw (Member) commented Mar 26, 2024

Is the idea that the processing function should also have access to the labels in the dataframe, or just that you can use the newly segmented dataframe, with the original label columns, as ground truth afterwards? The latter could most likely also be solved by a function that takes the original dataframe and the new index as input and then assigns the original labels to the new segments.
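
Purely as an illustration, such a helper could look roughly like this (the name assign_labels is made up; it assumes a filewise original dataframe with one label per file):

import pandas as pd


def assign_labels(df, index):
    # Hypothetical helper: copy the label columns of the original
    # (filewise) dataframe to every new segment of the same file
    df_new = pd.DataFrame(index=index)
    for column in df.columns:
        df_new[column] = (
            df[column]
            .reindex(index.get_level_values("file"))
            .values
        )
    return df_new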

The first part has so far been handled by the special processing arguments idx, file, root, see https://audeering.github.io/audinterface/usage.html#special-processing-function-arguments.
We introduced those based on an earlier issue, #25, which proposed to add process_table().

But I agree that it might be more elegant to just add something like process_dataframe() or process_table().

@maxschmitt (Contributor, Author)

Processing the original labels was not something I had in mind so far, so the "external" function might be sufficient.
However, doing this is relatively time-consuming for a large table, so the "elegant" solution would be preferred.

@hagenw (Member) commented Mar 26, 2024

Having access to the labels is also not that easy, as we usually provide a processing function that also works for process_file(), which means we cannot assume that it has access to the labels. For that I would stick to the solution introduced with https://audeering.github.io/audinterface/usage.html#special-processing-function-arguments.

This means the new process_dataframe() can be restricted to updating the index and assigning the labels accordingly.
One challenge we might face here is that there might be a naming clash between the original labels of the dataframe and the new ones added by process_func. Maybe it would be better if process_dataframe() returned two dataframes then: one with the original labels, and a second with the new labels? In the case of audinterface.Process it should also return a series, not a dataframe, for the second object. And we might also want to support providing a series instead of a dataframe. So I guess we also need a better name for the method. Maybe process_table()?

@maxschmitt (Contributor, Author)

process_table() sounds good

@hagenw (Member) commented Mar 26, 2024

Great, @maxschmitt would you be able to try to work on it?

@maxschmitt (Contributor, Author)

> able to try

sounds reasonable ;) challenge accepted

@maxschmitt (Contributor, Author)

@hagenw

Thinking about it, would it actually be necessary to have process_table() also in Process and Feature?
I mean, we need it in Process as the segmentation is handled there, but I'm not sure whether it should be an API function.

My idea would be to have process_table() only in Segment.

For Process, it does not make sense imho, and it should be aligned with Feature. If we added it, we would end up with the "dirty" solution that two objects are returned. Moreover, there might rarely be cases where we want to segment, keep labels, and compute new features at the same time.

@maxschmitt (Contributor, Author)

I am not sure if I got this right, but:

The solution from #25 will not work as we would need the idx in the segmentation function, which won't be able to handle idx or labels (consider the case where we have "external" segmentation functions).
Moreover, Segment.process_index() cannot return a Series or DataFrame object.

Only Process.process_func has access to the labels, but there we do not have access to the original index anymore.

The idea of having process_table() somehow clashes with the whole framework concept, as the outputs of the processing functions would no longer be consistent within each interface (Segment, Process, Feature).

@hagenw (Member) commented Apr 26, 2024

> The solution from #25 will not work as we would need the idx in the segmentation function, which won't be able to handle idx or labels (consider the case where we have "external" segmentation functions).

Yes and no. If your starting point is a filewise index, you can use the special argument file with the current implementation:

import audb
import audinterface
import auvad


# Prepare data
media = [ 
    "wav/03a01Fa.wav",
    "wav/03a01Nc.wav",
    "wav/16b10Wb.wav",
]
db = audb.load(
    "emodb",
    version="1.4.1",
    media=media,
    full_path=False,
    verbose=False,
)
df = db.get("emotion")


def access_label(signal, sampling_rate, file, df, label="emotion"):
    # The special argument `file` gives access to the label
    # of the currently processed file
    return df.loc[file, label]


# Segment with a VAD and look up the original label for each new segment
vad = auvad.Vad(max_turn_length=1)
interface = audinterface.Feature(
    "emotion",
    process_func=access_label,
    process_func_args={"df": df},
    segment=vad,
)
df_segmented = interface.process_index(db.files, root=db.root)
print(df_segmented)

which returns

                                                                 emotion
file            start                  end                              
wav/03a01Fa.wav 0 days 00:00:00.120000 0 days 00:00:01.760000  happiness
wav/03a01Nc.wav 0 days 00:00:00.060000 0 days 00:00:01.390000    neutral
wav/16b10Wb.wav 0 days 00:00:00.040000 0 days 00:00:01.450000      anger
                0 days 00:00:01.540000 0 days 00:00:02.380000      anger

But you are right: if the starting dataframe already contains a segmented index, we cannot handle it with the current solution. There you would need to first run the VAD, then create a new dataframe from the index returned by the VAD and assign the labels accordingly. Afterwards, you can use that dataframe together with idx in audinterface.
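
A rough sketch of that two-step workaround, reusing df, vad, and db from the example above, and assuming that vad can be used like an audinterface.Segment object (it is passed as segment above) and that each file carries a single label:

import pandas as pd

# 1. Run the VAD to get the new, segmented index
index_segmented = vad.process_index(df.index, root=db.root)

# 2. Create a new dataframe on that index
#    and assign the label of the corresponding file to each segment
labels_per_file = df["emotion"].groupby(level="file").first()
df_new = pd.DataFrame(index=index_segmented)
df_new["emotion"] = (
    labels_per_file
    .reindex(index_segmented.get_level_values("file"))
    .values
)

# 3. df_new can now be used together with the special `idx` argument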

At the moment, I'm not sure how easy/complicated it will be to change audinterface to support this out-of-the-box.

@hagenw (Member) commented Apr 28, 2024

One straightforward fix for also supporting segmented indices would be to introduce start and end as special arguments as well. Then you would have access to file, start, end and could access the original segment, just as I access the original file in the above example by using only file for a filewise index.
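
Purely as an illustration of this proposal (start and end are not special arguments yet), a processing function could then pick the original segment that contains the current one, assuming the new segments nest inside the original ones:

def access_label(signal, sampling_rate, file, start, end, df, label="emotion"):
    # All original segments of the current file ...
    rows = df.loc[file]
    # ... and the one that contains the current segment
    match = rows[
        (rows.index.get_level_values("start") <= start)
        & (rows.index.get_level_values("end") >= end)
    ]
    return match[label].iloc[0]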

@maxschmitt (Contributor, Author)

But wouldn't the access to the labels be very inefficient, especially for large tables, as we need to get the labels for each row separately?

My (current) idea is to add a method Segment.process_table() that differs from process_index() only in the loop where the new segments are generated, by also attaching the labels:

for (file, start, _), index in y.items():

If I am not completely wrong, this would result in only a minor change (the new Segment.process_table()) without affecting any existing code.
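
Purely as an illustration of this idea (not the actual audinterface internals; y is assumed to map each original row to the index of its new segments, and table is assumed to be the input dataframe):

import pandas as pd

frames = []
for (file, start, end), segments in y.items():
    # Repeat the labels of the original row for all of its new segments
    row = table.loc[(file, start, end)]
    frames.append(pd.DataFrame([row] * len(segments), index=segments))
result = pd.concat(frames)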

The drawback of this method is, of course, that we do not have this new method for the Process and Feature interfaces (not sure whether it would also be straightforward to integrate it there), but as I said before, I am not sure whether it makes sense to support this at all (given the "multiple-columns" issue).

@hagenw (Member) commented Apr 29, 2024

I also see the point in adding a process_table() method, since using the special arguments is always complicated to understand anyway. And if an extra process_table() method is also more efficient, all the better.

But I'm not so sure if we could add it only to Segment. Then you would first have to run a Segment object on your dataframe and afterwards run Feature.process_index() on its index. I would prefer to instantiate Feature with the Segment object provided via the segment argument; when running Feature.process_table(), it would then automatically call Segment.process_table() under the hood.
But maybe I also misunderstand your suggestion. If you like, you could create a pull request showing how you would solve the issue.

@maxschmitt (Contributor, Author)

To be honest, I usually do not feel too comfortable mixing two independent steps (segmentation, feature extraction) into a single function/method, because it makes the package more complex and less transparent.
Is there any disadvantage other than having an additional line of code?

Generally, when doing segmentation and feature extraction, there are two cases:

  1. segmentation -> features
  2. features -> segmentation

At the moment, only 1. is supported, but it might also be relevant to have 2., which requires calling audinterface twice anyway.
Just as a thought, I don't want to "ruin" the concept of audinterface, of course.

I implemented a first version of Segment.process_table() here:
fd35a83

Please have a look, and we can see whether it makes sense and whether we should also have it in Process and Feature.

Test:

import audb
import audinterface
import numpy as np
import pandas as pd


def rms(signal, sampling_rate):
    # Root-mean-square energy of the signal in dB
    return 20 * np.log10(np.sqrt(np.mean(signal ** 2)))


def segment(signal, sampling_rate):
    # Split the signal into consecutive chunks of at most 0.7 s
    # and return them as a segmented index
    duration = signal.shape[-1] / sampling_rate
    chunk_len = 0.7
    chunks = []
    for i in range(int(duration // chunk_len) + 1):
        chunks.append(
            (i * chunk_len, np.min([(i + 1) * chunk_len, duration]))
        )
    index = pd.MultiIndex.from_tuples(
        [
            (
                pd.Timedelta(start, unit="s"),
                pd.Timedelta(end, unit="s"),
            )
            for start, end in chunks
        ],
        names=["start", "end"],
    )
    return index


media = [
    "wav/03a01Fa.wav",
    "wav/03a01Nc.wav",
    "wav/16b10Wb.wav",
]
db = audb.load(
    "emodb",
    version="1.3.0",
    media=media,
    verbose=False,
)

index = db["emotion"].index

# Compute RMS
interface = audinterface.Process(process_func=rms)
table_series = interface.process_index(index)
print(table_series)

# Segmentation with Series
seg_interface = audinterface.Segment(process_func=segment)
print(seg_interface.process_table(table_series))

# Segmentation with Dataframe
table_df = pd.DataFrame(
    {
        "RMS": table_series.values,
        "RMSx2": table_series.values * 2,
    },
    index=table_series.index,
)
print(seg_interface.process_table(table_df))
