This project provides a Dockerized Python script to manage audio transcription and PII redaction using AWS services. The script handles the following tasks:
- Transcribes audio files from an S3 bucket using Amazon Transcribe.
- Redacts PII (Personally Identifiable Information) from the transcriptions using Amazon Comprehend.
- Saves redacted transcriptions to a separate S3 bucket.
- Concurrency Control: Limits the number of concurrent transcription jobs.
- Error Handling: Includes basic error handling for AWS API calls.
- Queue Management: Uses queues to manage transcription and redaction tasks.
Prerequisites:
- AWS Account: Ensure you have an AWS account with access to the S3, Transcribe, and Comprehend services.
- AWS Permissions: AmazonS3FullAccess, AmazonTranscribeFullAccess, and ComprehendFullAccess. Attach these policies to the AWS user whose credentials you pass to the container.
- Docker: Make sure Docker is installed on your machine.
- Environment Variables:
  - AUDIO_INPUT_BUCKET: S3 bucket containing audio files to transcribe.
  - AUDIO_TRANSCRIPTION_BUCKET: S3 bucket where transcriptions will be stored.
  - AUDIO_TRANSCRIPTION_REDACTION_BUCKET: S3 bucket for storing redacted transcriptions.
  - AUDIO_LANGUAGE_SUPPORT: Comma-separated list of supported languages (default: "en-IN,hi-IN").
  - THREAD_COUNT: Number of threads for concurrent processing (default: 4).
  - MAX_CONCURRENT_JOBS: Maximum number of concurrent transcription jobs (default: 5).
  - AWS_ACCESS_KEY_ID: AWS access key ID.
  - AWS_SECRET_ACCESS_KEY: AWS secret access key.
  - AWS_DEFAULT_REGION: AWS region (default: "us-east-1").
- Dockerfile: A Dockerfile is provided to build the Docker image for this script.
To run the prebuilt image:
docker run -e AUDIO_INPUT_BUCKET=your-input-bucket \
  -e AUDIO_TRANSCRIPTION_BUCKET=your-transcription-bucket \
  -e AUDIO_TRANSCRIPTION_REDACTION_BUCKET=your-redaction-bucket \
  -e AUDIO_LANGUAGE_SUPPORT=en-IN,hi-IN \
  -e THREAD_COUNT=4 \
  -e MAX_CONCURRENT_JOBS=5 \
  -e AWS_ACCESS_KEY_ID=your-access-key-id \
  -e AWS_SECRET_ACCESS_KEY=your-secret-access-key \
  -e AWS_DEFAULT_REGION=us-east-1 \
  rvizsatiz/aws-transcribe-redact:v1
To build the image locally and run it instead:
docker build -t aws-transcribe-redact .
docker run -e AUDIO_INPUT_BUCKET=your-input-bucket \
  -e AUDIO_TRANSCRIPTION_BUCKET=your-transcription-bucket \
  -e AUDIO_TRANSCRIPTION_REDACTION_BUCKET=your-redaction-bucket \
  -e AUDIO_LANGUAGE_SUPPORT=en-IN,hi-IN \
  -e THREAD_COUNT=4 \
  -e MAX_CONCURRENT_JOBS=5 \
  -e AWS_ACCESS_KEY_ID=your-access-key-id \
  -e AWS_SECRET_ACCESS_KEY=your-secret-access-key \
  -e AWS_DEFAULT_REGION=us-east-1 \
  aws-transcribe-redact
Replace the placeholder values with your actual bucket names, AWS credentials, and configuration.
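This README does not show how the script reads its configuration. One way it could be done, using the variable names and defaults documented above, is sketched below. The `Settings` class and `load_settings` function are hypothetical names chosen for illustration, not taken from the script.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    input_bucket: str
    transcription_bucket: str
    redaction_bucket: str
    languages: list
    thread_count: int
    max_concurrent_jobs: int
    region: str


def load_settings(env=None):
    """Build Settings from a mapping (os.environ by default), applying the documented defaults."""
    env = os.environ if env is None else env
    raw_languages = env.get("AUDIO_LANGUAGE_SUPPORT", "en-IN,hi-IN")
    return Settings(
        # Bucket names have no documented default, so a missing one raises KeyError.
        input_bucket=env["AUDIO_INPUT_BUCKET"],
        transcription_bucket=env["AUDIO_TRANSCRIPTION_BUCKET"],
        redaction_bucket=env["AUDIO_TRANSCRIPTION_REDACTION_BUCKET"],
        # Split the comma-separated list and drop empty entries.
        languages=[code.strip() for code in raw_languages.split(",") if code.strip()],
        thread_count=int(env.get("THREAD_COUNT", "4")),
        max_concurrent_jobs=int(env.get("MAX_CONCURRENT_JOBS", "5")),
        region=env.get("AWS_DEFAULT_REGION", "us-east-1"),
    )
```

The AWS credential variables are not read explicitly in this sketch because boto3 picks up AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION from the environment on its own.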
- Place your audio files in the specified input S3 bucket.
- Run the Docker container. The script will start transcription jobs for all audio files in the input bucket.
- Transcriptions will be stored in the specified transcription bucket.
- The script will then redact PII from the transcriptions and save the redacted files to the redaction bucket.
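The redaction step above can be illustrated with a short sketch. It assumes the script calls Comprehend's `detect_pii_entities` operation, whose response lists entities with `Type`, `BeginOffset`, and `EndOffset` fields; the README does not confirm which Comprehend call is used or what the redacted output looks like. The function names and the `[TYPE]` placeholder format are hypothetical.

```python
def redact_text(text, entities):
    """Replace each PII span with its entity type in brackets, e.g. [NAME]."""
    redacted = text
    # Work from the end of the string so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        start, end = entity["BeginOffset"], entity["EndOffset"]
        redacted = redacted[:start] + "[" + entity["Type"] + "]" + redacted[end:]
    return redacted


def redact_with_comprehend(text, language_code="en", region="us-east-1"):
    """Detect PII with Amazon Comprehend and return the redacted text.

    language_code and region are assumptions for this sketch.
    """
    import boto3  # imported lazily so redact_text works without boto3 installed

    client = boto3.client("comprehend", region_name=region)
    response = client.detect_pii_entities(Text=text, LanguageCode=language_code)
    return redact_text(text, response["Entities"])
```

Keeping the offset handling in `redact_text`, separate from the AWS call, makes it possible to test the replacement logic locally without credentials.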
Contributions are welcome! Please open an issue or submit a pull request for any improvements, bug fixes, or feature requests.
Possible improvements:
- Make redaction optional via a parameter.
- Make the queue distributed using Redis, RabbitMQ, or an equivalent.
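The queue item above concerns the in-process queue and concurrency cap described earlier in this README. As a reference point for contributors, that pattern could be sketched as follows. This illustrates the general technique (a `queue.Queue` drained by worker threads, with a semaphore capping jobs in flight) and is not the script's actual code. `run_jobs` and `handle` are hypothetical names, and `handle` stands in for whatever starts a transcription job and waits for it to finish.

```python
import queue
import threading


def run_jobs(items, handle, thread_count=4, max_concurrent_jobs=5):
    """Process items on worker threads, with at most max_concurrent_jobs running at once."""
    tasks = queue.Queue()
    for item in items:
        tasks.put(item)

    slots = threading.BoundedSemaphore(max_concurrent_jobs)
    lock = threading.Lock()
    results, errors = [], []

    def worker():
        while True:
            try:
                item = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            try:
                with slots:  # blocks while max_concurrent_jobs are in flight
                    outcome = handle(item)
                with lock:
                    results.append((item, outcome))
            except Exception as exc:  # record the failure and keep processing
                with lock:
                    errors.append((item, exc))
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, errors
```

A distributed version would replace `queue.Queue` with a broker-backed queue so that several containers could share the work. The concurrency cap would then need to be enforced across processes and not by a local semaphore.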
For questions or support, contact [email protected]