
Something is off in the pitch / NCCF computing functions #43

Open
FredrikKarlssonSpeech opened this issue Sep 28, 2022 · 1 comment

@FredrikKarlssonSpeech

I just wanted to point out that there is something off in the NCCF calculations in the R torchaudio package that is not present in Python's torchaudio, and that therefore seems not to have been carried over from there. I ran some code that uses Python torchaudio to compute pitch using the Kaldi method:

import torch
import torchaudio
import torchaudio.functional as F
import torchaudio.transforms as T

SPEECH_WAVEFORM, SAMPLE_RATE = torchaudio.load("/Users/frkkan96/Desktop/a1.wav")

pitch_feature = F.compute_kaldi_pitch(waveform=SPEECH_WAVEFORM,
                                      sample_rate=SAMPLE_RATE,
                                      frame_length=25.0,
                                      frame_shift=10.0,
                                      min_f0=50,
                                      max_f0=400,
                                      soft_min_f0=10.0,
                                      penalty_factor=0.1,
                                      lowpass_cutoff=1000,
                                      resample_frequency=4000,
                                      delta_pitch=0.005,
                                      nccf_ballast=7000,
                                      lowpass_filter_width=1,
                                      upsample_filter_width=5,
                                      max_frames_latency=0,
                                      frames_per_chunk=0,
                                      simulate_first_pass_online=False,
                                      recompute_frame=500,
                                      snip_edges=True)
pitch, nccf = pitch_feature[..., 0], pitch_feature[..., 1]

and when I then look at the output, I am convinced that what I actually got was values computed from windowed portions of the signal:

pitch.size()
torch.Size([1, 402])
nccf.size()
torch.Size([1, 402])
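For reference, a frame count like 402 is what Kaldi-style framing produces. A minimal sketch of the frame-count arithmetic, assuming (purely for illustration, these values are not stated in the issue) a mono file of roughly 4.035 s at 44.1 kHz:

```python
# Sketch of Kaldi-style frame counting; the 44.1 kHz rate and ~4.035 s
# duration are assumptions for illustration, not values from the issue.
def kaldi_num_frames(num_samples, sample_rate, frame_length_ms=25.0,
                     frame_shift_ms=10.0, snip_edges=True):
    frame_length = int(sample_rate * frame_length_ms / 1000)
    frame_shift = int(sample_rate * frame_shift_ms / 1000)
    if snip_edges:
        # only frames that fit entirely inside the signal are emitted
        if num_samples < frame_length:
            return 0
        return 1 + (num_samples - frame_length) // frame_shift
    # snip_edges=False counts frames by rounding to the nearest frame centre
    return (num_samples + frame_shift // 2) // frame_shift

print(kaldi_num_frames(int(4.035 * 44100), 44100))  # 402
```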

Kaldi pitch extraction is not exposed by the R torchaudio package, but you can get NCCFs using the functional__compute_nccf function. But then I get confused, because with this code (using the native pitch detection) I get nothing like the Python output for the NCCFs:

origSoundFile <- "/Users/frkkan96/Desktop/a1.wav"
# beginTime, endTime, windowSize, minF, and maxF are defined elsewhere in the script
audio <- transform_to_tensor(audiofile_loader(filepath = origSoundFile,
                                              offset = beginTime,
                                              duration = (endTime - beginTime), # a duration of 0 seems to be interpreted as the complete file
                                              unit = "time"))
waveform <- audio[[1]]
sample_rate <- audio[[2]]
windowShift <- 10 # milliseconds

pitch <- functional_detect_pitch_frequency(waveform,
                                           sample_rate = sample_rate,
                                           frame_time = windowShift / 1000, # expects seconds
                                           win_length = windowSize,
                                           freq_low = minF,
                                           freq_high = maxF)

nccf <- functional__compute_nccf(waveform,
                                 sample_rate = sample_rate,
                                 frame_time = windowShift / 1000,
                                 freq_low = minF)
> str(pitch)
Float [1:1, 1:389]
> str(nccf)
Float [1:1, 1:404, 1:630]

Ideally, these two R functions should correspond in dimensions with their Python-interface counterparts, and with identical window shift lengths (10 ms in this case), the frame dimensions from functional_detect_pitch_frequency and functional__compute_nccf should be the same, right?
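As an aside, the extra third dimension of the NCCF tensor is the number of candidate lags searched per frame, which the pitch detector then collapses by picking a best lag; the two extra dimensions therefore need not agree exactly. A hedged sketch of the frame/lag bookkeeping, modelled on the arithmetic in Python torchaudio's internal _compute_nccf helper (the 44.1 kHz sample rate and freq_low = 70 here are assumptions chosen only for illustration, not values stated in the issue):

```python
# Sketch of NCCF output-shape bookkeeping, assuming a 44.1 kHz mono file
# and freq_low = 70 Hz; these values are illustrative assumptions.
import math

def nccf_shape(num_samples, sample_rate, frame_time=0.01, freq_low=70):
    # candidate lags searched, down to the lowest frequency of interest
    lags = math.ceil(sample_rate / freq_low)
    # consecutive analysis frames of frame_time seconds each
    frame_size = math.ceil(sample_rate * frame_time)
    num_frames = math.ceil(num_samples / frame_size)
    return (1, num_frames, lags)

# A ~4.035 s mono file at 44.1 kHz would give (1, 404, 630):
print(nccf_shape(int(4.035 * 44100), 44100))
```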

@skeydan
Collaborator

skeydan commented Feb 9, 2023

Hi, sorry for the late response!

From looking at the source(s), I would assume that

  • in F.compute_kaldi_pitch, frame_length (in milliseconds) would correspond to frame_time (in seconds) in the other two functions,
  • frame_shift in F.compute_kaldi_pitch would indicate overlap (as in spectrogram calculation), and
  • win_length in functional_detect_pitch_frequency relates only to median smoothing (but I didn't look into that in detail)

But I might be wrong :-)
Let me know if that helps?
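If the first guess above is right, the unit conversion would look like this (purely illustrative, using the values from the original call):

```python
# Hypothetical mapping between the Kaldi-style arguments (milliseconds)
# and frame_time in the other two functions (seconds).
kaldi_args = {"frame_length": 25.0, "frame_shift": 10.0}  # ms
frame_time = kaldi_args["frame_length"] / 1000            # s
print(frame_time)  # 0.025
```

That is, passing windowShift / 1000 (10 ms) as frame_time may be comparing against a 25 ms Kaldi frame.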

By the way, detect_pitch_frequency is available in Python, too, in case you'd like to cross-test the function directly (https://pytorch.org/audio/stable/generated/torchaudio.functional.detect_pitch_frequency.html#torchaudio.functional.detect_pitch_frequency).

Oh and please install the most recent torchaudio from CRAN :-)
You would now load a sound like this:

audio <- transform_to_tensor(torchaudio_load(origSoundFile))
