
Requesting process_dataframe() #167

Open

maxschmitt opened this issue Mar 25, 2024 · 13 comments
Labels: enhancement (New feature or request)

Comments

@maxschmitt (Contributor)

Problem

Currently, it is not possible to apply a processing function of audinterface.Process or audinterface.Feature to a DataFrame object.
Such a method would be useful, because process_index() cannot be used efficiently when a Segment object is passed as the segment argument: the labels need to be carried over to the resulting dataframe (the index after segmentation typically has additional rows).

Solution

A new method process_dataframe() could solve this.

  • If no segmentation is required, the behaviour is very similar to process_index(), but all labels are kept and attached to the output.
  • If segmentation is required, the labels of each input row are duplicated and attached to all corresponding rows in the output (see the sketch below).
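
A rough, purely illustrative sketch of the intended label duplication with plain pandas (the data below are made up):

import pandas as pd

# Original dataframe with one label per file
df = pd.DataFrame(
    {"emotion": ["happiness", "anger"]},
    index=pd.Index(["a.wav", "b.wav"], name="file"),
)

# Segmented index as a Segment object might return it
segments = pd.MultiIndex.from_tuples(
    [
        ("a.wav", pd.Timedelta(0, unit="s"), pd.Timedelta(1, unit="s")),
        ("b.wav", pd.Timedelta(0, unit="s"), pd.Timedelta(1, unit="s")),
        ("b.wav", pd.Timedelta(1, unit="s"), pd.Timedelta(2, unit="s")),
    ],
    names=["file", "start", "end"],
)

# Duplicate the label of each original row
# to all segments belonging to the same file
df_segmented = pd.DataFrame(index=segments)
df_segmented["emotion"] = (
    df["emotion"]
    .reindex(segments.get_level_values("file"))
    .values
)
print(df_segmented)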

@hagenw added the enhancement (New feature or request) label on Mar 25, 2024
@hagenw (Member) commented Mar 26, 2024

Is the idea that the processing function should also have access to the labels in the dataframe, or just that you can use the newly segmented dataframe, with the original label columns, as ground truth afterwards? The latter could most likely also be solved by a function that takes the original dataframe and the new index as input and then assigns the original labels to the new segments.
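
Purely as an illustration, such a helper could look roughly like this (the name assign_labels is made up; it assumes a filewise original dataframe with one label per file):

import pandas as pd


def assign_labels(df, index):
    # Hypothetical helper: copy the label columns of the original
    # (filewise) dataframe to every new segment of the same file
    df_new = pd.DataFrame(index=index)
    for column in df.columns:
        df_new[column] = (
            df[column]
            .reindex(index.get_level_values("file"))
            .values
        )
    return df_new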

The first part has so far been handled by the special processing arguments idx, file, root, see https://audeering.github.io/audinterface/usage.html#special-processing-function-arguments.
We introduced those based on an earlier issue, #25, which proposed to add process_table().

But I agree that it might be more elegant to just add something like process_dataframe() or process_table().

@maxschmitt (Contributor, Author)

Processing the original labels was not something I had in mind so far, so the "external" function might be sufficient.
However, doing this is relatively time-consuming for a large table, so the "elegant" solution would be preferred.

@hagenw (Member) commented Mar 26, 2024

Having access to the labels is also not that easy, as we usually provide a processing function that also works for process_file(), which means we cannot assume that it has access to the labels. For that I would stick to the solution introduced with https://audeering.github.io/audinterface/usage.html#special-processing-function-arguments.

This means the new process_dataframe() can be restricted to updating the index and assigning the labels accordingly.
One challenge we might face here is that there might be a naming clash between the original labels of the dataframe and the new ones added by process_func. Maybe it would be better if process_dataframe() returned two dataframes then: one with the original labels, and a second with the new labels? In the case of audinterface.Process it should also return a series, not a dataframe, for the second object. And we might also want to support providing a series instead of a dataframe. So I guess we also need a better name for the method. Maybe process_table()?

@maxschmitt (Contributor, Author)

process_table() sounds good

@hagenw (Member) commented Mar 26, 2024

Great, @maxschmitt would you be able to try to work on it?

@maxschmitt (Contributor, Author)

> able to try

sounds reasonable ;) challenge accepted

@maxschmitt (Contributor, Author)

@hagenw

Thinking about it, would it actually be necessary to have process_table() also in Process and Feature?
I mean, we need it in Process as the segmentation is handled there, but I'm not sure whether it should be an API function.

My idea would be to have process_table() only in Segment.

For Process, it does not make sense imho, and it should be aligned with Feature. If we added it, we would end up with the "dirty" solution that two objects are returned. Moreover, there might rarely be cases where we want to segment, keep labels, and compute new features at the same time.

@maxschmitt (Contributor, Author)

I am not sure if I got this right, but:

The solution from #25 will not work as we would need the idx in the segmentation function, which won't be able to handle idx or labels (consider the case where we have "external" segmentation functions).
Moreover, Segment.process_index() cannot return a Series or DataFrame object.

Only Process.process_func has access to the labels, but there we do not have access to the original index anymore.

The idea of having process_table() somehow clashes with the whole framework concept, as the outputs of the processing functions would no longer be consistent within each interface (Segment, Process, Feature).

@hagenw (Member) commented Apr 26, 2024

> The solution from #25 will not work as we would need the idx in the segmentation function, which won't be able to handle idx or labels (consider the case where we have "external" segmentation functions).

Yes and no. If your starting point is a filewise index, you can use the special argument file with the current implementation:

import audb
import audinterface
import auvad


# Prepare data
media = [ 
    "wav/03a01Fa.wav",
    "wav/03a01Nc.wav",
    "wav/16b10Wb.wav",
]
db = audb.load(
    "emodb",
    version="1.4.1",
    media=media,
    full_path=False,
    verbose=False,
)
df = db.get("emotion")


def access_label(signal, sampling_rate, file, df, label="emotion"):
    # The special argument `file` gives access to the label
    # of the currently processed file
    return df.loc[file, label]


# Segment with a VAD and look up the original label for each new segment
vad = auvad.Vad(max_turn_length=1)
interface = audinterface.Feature(
    "emotion",
    process_func=access_label,
    process_func_args={"df": df},
    segment=vad,
)
df_segmented = interface.process_index(db.files, root=db.root)
print(df_segmented)

which returns

                                                                 emotion
file            start                  end                              
wav/03a01Fa.wav 0 days 00:00:00.120000 0 days 00:00:01.760000  happiness
wav/03a01Nc.wav 0 days 00:00:00.060000 0 days 00:00:01.390000    neutral
wav/16b10Wb.wav 0 days 00:00:00.040000 0 days 00:00:01.450000      anger
                0 days 00:00:01.540000 0 days 00:00:02.380000      anger

But you are right: if the starting dataframe already contains a segmented index, we cannot handle it with the current solution. There you would need to first run the VAD, then create a new dataframe from the index returned by the VAD and assign the labels accordingly. Afterwards, you can use that dataframe together with idx in audinterface.
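
A rough sketch of that two-step workaround, reusing df, vad, and db from the example above, and assuming that vad can be used like an audinterface.Segment object (it is passed as segment above) and that each file carries a single label:

import pandas as pd

# 1. Run the VAD to get the new, segmented index
index_segmented = vad.process_index(df.index, root=db.root)

# 2. Create a new dataframe on that index
#    and assign the label of the corresponding file to each segment
labels_per_file = df["emotion"].groupby(level="file").first()
df_new = pd.DataFrame(index=index_segmented)
df_new["emotion"] = (
    labels_per_file
    .reindex(index_segmented.get_level_values("file"))
    .values
)

# 3. df_new can now be used together with the special `idx` argument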

At the moment, I'm not sure how easy/complicated it will be to change audinterface to support this out-of-the-box.

@hagenw (Member) commented Apr 28, 2024

One straightforward fix for also supporting segmented indices would be to introduce start and end as special arguments as well. Then you would have access to file, start, end and could access the original segment, just as I access the original file in the above example by using only file for a filewise index.
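
Purely as an illustration of this proposal (start and end are not special arguments yet), a processing function could then pick the original segment that contains the current one, assuming the new segments nest inside the original ones:

def access_label(signal, sampling_rate, file, start, end, df, label="emotion"):
    # All original segments of the current file ...
    rows = df.loc[file]
    # ... and the one that contains the current segment
    match = rows[
        (rows.index.get_level_values("start") <= start)
        & (rows.index.get_level_values("end") >= end)
    ]
    return match[label].iloc[0]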

@maxschmitt (Contributor, Author)

But wouldn't the access to the labels be very inefficient, especially for large tables, as we need to get the labels for each row separately?

My (current) idea is to add a method Segment.process_table() that differs from process_index() only in the loop where the new segments are generated, by also attaching the labels:

for (file, start, _), index in y.items():

If I am not completely wrong, this would result in only a minor change (the new Segment.process_table()) without affecting any existing code.
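
Purely as an illustration of this idea (not the actual audinterface internals; y is assumed to map each original row to the index of its new segments, and table is assumed to be the input dataframe):

import pandas as pd

frames = []
for (file, start, end), segments in y.items():
    # Repeat the labels of the original row for all of its new segments
    row = table.loc[(file, start, end)]
    frames.append(pd.DataFrame([row] * len(segments), index=segments))
result = pd.concat(frames)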

The drawback of this method is, of course, that we do not have this new method for the Process and Feature interfaces (not sure whether it would also be straightforward to integrate it there), but as I said before, I am not sure whether it makes sense to support this at all (given the "multiple-columns" issue).

@hagenw (Member) commented Apr 29, 2024

I also see the point in adding a process_table() method, since using the special arguments is always complicated to understand anyway. And if an extra process_table() method is also more efficient, all the better.

But I'm not so sure if we could add it only to Segment. Then you would first have to run a Segment object on your dataframe and afterwards run Feature.process_index() on its index. I would prefer to instantiate Feature with the Segment object provided via the segment argument; when running Feature.process_table(), it would then automatically call Segment.process_table() under the hood.
But maybe I also misunderstand your suggestion. If you like, you could create a pull request showing how you would solve the issue.

@maxschmitt (Contributor, Author)

To be honest, I usually do not feel too comfortable mixing two independent steps (segmentation, feature extraction) into a single function/method, because it makes the package more complex and less transparent.
Is there any disadvantage other than having an additional line of code?

Generally, when doing segmentation and feature extraction, there are two cases:

  1. segmentation -> features
  2. features -> segmentation

At the moment, only 1. is supported, but it might also be relevant to have 2., which requires calling audinterface twice anyway.
Just as a thought, I don't want to "ruin" the concept of audinterface, of course.

I implemented a first version of Segment.process_table() here:
fd35a83

Please have a look, and we can see whether it makes sense and whether we should also have it in Process and Feature.

Test:

import audb
import audinterface
import numpy as np
import pandas as pd


def rms(signal, sampling_rate):
    # Root-mean-square energy of the signal in dB
    return 20 * np.log10(np.sqrt(np.mean(signal ** 2)))


def segment(signal, sampling_rate):
    # Split the signal into consecutive chunks of at most 0.7 s
    # and return them as a segmented index
    duration = signal.shape[-1] / sampling_rate
    chunk_len = 0.7
    chunks = []
    for i in range(int(duration // chunk_len) + 1):
        chunks.append(
            (i * chunk_len, np.min([(i + 1) * chunk_len, duration]))
        )
    index = pd.MultiIndex.from_tuples(
        [
            (
                pd.Timedelta(start, unit="s"),
                pd.Timedelta(end, unit="s"),
            )
            for start, end in chunks
        ],
        names=["start", "end"],
    )
    return index


media = [
    "wav/03a01Fa.wav",
    "wav/03a01Nc.wav",
    "wav/16b10Wb.wav",
]
db = audb.load(
    "emodb",
    version="1.3.0",
    media=media,
    verbose=False,
)

index = db["emotion"].index

# Compute RMS
interface = audinterface.Process(process_func=rms)
table_series = interface.process_index(index)
print(table_series)

# Segmentation with Series
seg_interface = audinterface.Segment(process_func=segment)
print(seg_interface.process_table(table_series))

# Segmentation with Dataframe
table_df = pd.DataFrame(
    {
        "RMS": table_series.values,
        "RMSx2": table_series.values * 2,
    },
    index=table_series.index,
)
print(seg_interface.process_table(table_df))
