Skip to content

MLSpeech/FALCON

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

39 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FALCON — Forced Alignment through Contrastive Optimization Networks

arXiv Project Page Hugging Face Spaces Python 3.8+ License: MIT

FALCON (Forced Alignment through Contrastive Optimization Networks) is an end-to-end, fully differentiable neural system for phoneme-level and word-level (and also multilingual) forced alignment — given a speech waveform and a known phoneme/words transcript sequence, it predicts precise phoneme boundary timestamps.

Rotem Rousso, Eyal Cohen, Joseph Keshet
"Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming"
Preprint, 2026 — arXiv:2606.25460
Project page: mlspeech.github.io/FALCON


Citation

If you use FALCON in your research, please cite:

@article{rousso2026falcon,
  title     = {Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming},
  author    = {Rousso, Rotem and Cohen, Eyal and Keshet, Joseph},
  journal   = {arXiv preprint arXiv:2606.25460},
  year      = {2026}
}

Highlights

  • Phoneme-level precision — operates at ~10 ms frame resolution, the finest granularity among neural forced aligners.
  • Fully differentiable — a Soft Dynamic Programming decoder enables gradient flow through the entire alignment pipeline.
  • Zero-shot multilingual — trained on English, generalizes to unseen languages at inference (tested on Dutch, German, and Hebrew, among others) without any additional training.
  • Word-level generalization — word boundaries derived from phoneme predictions at test time, competitive with word-level neural aligners.

How It Works

The system has three jointly trained components:

Raw waveform (16 kHz)
        │
        ▼
 ┌──────────────────┐
 │ CNN Encoder f_θ  │  learns boundary-sensitive representations
 │ (5 strided conv) │  via MNCE contrastive loss
 └───────┬──────────┘
         │ Z  (frame-level latent features)
         ├──────────────────────────────────────┐
         ▼                                      │ Z
 ┌────────────────────┐                         │
 │ BiLSTM Context g_ψ │  frame-wise phoneme     │
 │ (5 layers, 512 dim)│  posterior probs U      │
 └───────┬────────────┘                         │
         │ U                                    │
         ▼                                      ▼
 ┌───────────────────────────────────────────────────┐
 │        Soft-DP Decoder h_W                        │
 │  φ₁(Z): acoustic boundary sharpness              │
 │  φ₂(U,p): phoneme posterior consistency          │
 │  LogSumExp forward + soft-argmax backtrack       │
 └───────────────────────────────────────────────────┘
         │
         ▼
  Predicted boundary timestamps  ŷ = (ŷ₁, …, ŷₙ)

Example

FALCON aligns the example fasw0sa2.wav — a TIMIT sentence, "Don't ask me to carry an oily rag like that." — using the included English checkpoint. The same utterance is provided in every supported input format so you can try them in the web demo: .phn (phonemes), .wrd (words), and .txt (plain transcript).

Waveform with predicted phoneme boundaries:

FALCON phoneme alignment example

Under the hood — every representation on one shared time axis. Waveform, log-mel spectrogram, the phoneme posteriors, the Soft-DP cost matrix with its backtracked alignment path, and the contrastive boundary score with its derivative. The same predicted boundaries (crimson dashed) and ground-truth boundaries (charcoal dotted) line up across all panels, with the input phonemes written under each axis, so every component's contribution is visible at a glance:

FALCON aligned representations

Output (first rows of the phones tier; the full alignment is written to a Praat .TextGrid):

Phoneme Start (s) End (s)
h# 0.000 0.121
d 0.121 0.141
ow 0.141 0.313
n 0.313 0.333
q 0.333 0.434
ae 0.434 0.575
s 0.575 0.645
epi 0.645 0.726
m 0.726 0.776
iy 0.776 0.897

Reproduce:

# alignment + TextGrid
python generate_textgrids.py --wav assets/fasw0sa2.wav --mode phoneme --lang english --annotation phn

# the multi-panel "under the hood" figure
python falcon_viz.py

Results

Numbers are from the paper (arXiv:2606.25460). Specialist = trained on the target English corpus; joint = a single model jointly trained on TIMIT+Buckeye. Multilingual results are zero-shot — no target-language training data. Accuracy is the percentage of reference boundaries matched within the given ms tolerance.

Phone-level Alignment Accuracy [%]: MFA vs. FALCON (Ours)

Dataset Model t≤10 t≤25 t≤50 t≤100
TIMIT MFA 38.6 72.3 81.1 84.6
TIMIT FALCON specialist 37.66 83.88 94.85 98.62
TIMIT FALCON joint 34.70 82.62 94.91 98.60
Buckeye MFA 35.3 60.6 68.9 72.7
Buckeye FALCON specialist 29.69 69.93 90.07 97.40
Buckeye FALCON joint 28.87 69.40 89.53 97.13

Phoneme-Level: Unseen Multilingual Generalization Accuracy

Test set Model ≤10 ≤15 ≤20 ≤25 ≤50 ≤100
Dutch — IFA FALCON joint 26.85 36.16 44.56 51.17 69.94 84.11
Dutch — IFA FALCON specialist 26.86 35.79 43.85 50.34 68.68 83.22
Dutch — IFA MFA 11.01 14.70 19.05 21.80 33.90 51.02
German — PHONDAT FALCON joint 25.63 34.12 41.87 49.07 70.04 84.58
German — PHONDAT FALCON specialist 25.08 33.37 40.76 47.43 68.27 82.44
German — PHONDAT MFA 20.60 31.75 37.17 45.83 66.78 79.19
Hebrew FALCON joint 21.98 30.10 36.91 42.78 63.07 80.41
Hebrew FALCON specialist 21.03 27.78 34.30 39.79 59.38 77.76

Word-Level Alignment Accuracy [%]: Comparative Analysis

Dataset Model t≤10 t≤25 t≤50 t≤100
TIMIT FALCON spec (MFA-G2P) 49.22 81.79 93.04 98.37
TIMIT FALCON joint (MFA-G2P) 49.50 80.60 92.86 98.46
TIMIT MFA 41.60 72.80 89.40 97.40
TIMIT MMS 18.60 43.50 75.70 94.70
TIMIT WhisperX 22.40 52.70 82.40 94.20
TIMIT Nvidia-Canary-1b 9.23 23.11 44.23 72.81
Buckeye FALCON spec (MFA-G2P) 50.06 77.85 91.51 96.63
Buckeye FALCON joint (MFA-G2P) 50.42 77.98 91.01 96.55
Buckeye MFA 39.80 69.90 84.90 91.80
Buckeye MMS 25.00 52.70 75.00 87.90
Buckeye WhisperX 18.80 43.10 67.40 77.40
Buckeye Nvidia-Canary-1b 8.06 18.83 36.31 63.29

Word-Level: Unseen Multilingual Generalization Accuracy

Dataset Model t≤10 t≤25 t≤50 t≤100
German — PHONDAT FALCON (MFA-G2P) 44.20 68.48 86.12 95.11
German — PHONDAT MFA 29.9 65.4 82.1 94.3
German — PHONDAT MMS 21.8 44.3 74.9 91.8
Dutch — IFA FALCON (MFA-G2P) 26.38 45.15 61.16 76.49
Dutch — IFA MFA 4.7 7.3 11.6 19.0
Dutch — IFA MMS 16.0 37.9 62.9 76.6
Hebrew FALCON 31.91 56.72 75.18 87.89
Hebrew MMS 14.3 41.3 76.5 94.7

Web Demo (Interactive Inference)

Try it live — no install: huggingface.co/spaces/MLSpeech/FALCON (free CPU Space; the first alignment downloads a model).

FALCON includes an interactive Gradio web interface for zero-shot forced alignment. Upload audio of any sample rate along with a plain-text transcript (or standard annotation file), and instantly visualize predicted boundaries, the Soft-DP path, and download a TextGrid.

conda activate falcon
python app.py

This launches a local server (default http://localhost:7860). Inputs (audio, transcript, options) are on the left; the Alignment Data and Visualizations tabs are on the right.

FALCON web demo — alignment data (boundary tables + TextGrid download)

FALCON web demo — time-aligned visualizations

How to use it — upload audio (any sample rate; resampled to 16 kHz internally) and a transcript, set the options, and click Run Alignment:

Option What it does
Mode phoneme — align the phonemes; word — align words (boundaries derived from the predicted phonemes).
Language english — transcript is already TIMIT-39 phonemes (no G2P); multilingual — any other language (runs the G2P path). Auto-selects the matching checkpoint.
Pretrained checkpoint Read English (TIMIT), Spontaneous English (Buckeye), or Multilingual (joint). Follows Language by default; override it, or upload your own .pt.
Word G2P (word mode) Auto (recommended — best backend for the language), espeak, MFA-like, or none (romanization — only for languages with no G2P model). If the chosen backend lacks the language it falls back automatically.
Input language (word mode) Optional but recommended — lets Auto pick the right G2P and set the voice/dictionary behind the scenes.
Annotation .phn (phoneme timestamps), .wrd (word timestamps), or .txt (plain token sequence, no timestamps).

Outputs: the Alignment Data tab gives the boundary tables and a downloadable Praat .TextGrid; the Visualizations tab shows the time-aligned panels (waveform · spectrogram · phoneme posteriors · Soft-DP path · contrastive score).


Clone Repository

Note: the official public release will be hosted at github.com/MLSpeech/FALCON (as referenced in the paper). This repository (github.com/RotemRousso/FDNFA) is the active development version.

git clone https://github.com/RotemRousso/FDNFA.git
cd FDNFA

Setup Environment

Requirements

  • Python 3.8+
  • CUDA-capable GPU (recommended)

Option 1 — Conda (recommended)

conda create -n falcon python=3.8
conda activate falcon
pip install -r requirements.txt

Option 2 — pip only

pip install -r requirements.txt

Note: torch and torchaudio are pinned to 2.4.1. For a different CUDA version, install them separately first following PyTorch's official instructions before running pip install -r requirements.txt.


Data Format

FALCON expects data in TIMIT-style format:

<dataset_root>/
├── train/
│   ├── speaker1_utt1.wav
│   ├── speaker1_utt1.phn     ← required
│   └── ...
├── val/                       ← only for Buckeye-style splits
└── test/
    ├── speaker2_utt1.wav
    ├── speaker2_utt1.phn
    └── ...

Each .phn file contains one phoneme segment per line:

<start_sample>  <end_sample>  <phoneme_label>
0     3050  h#
3050  4559  sh
4559  5723  ix
...

Audio must be 16 kHz mono WAV. Phoneme labels should use the TIMIT 61-phoneme set (automatically mapped to Lee-Hon 39 internally).

Supported datasets:

  • TIMITdata: timit in config (standard train/test split, 10% of train used for validation)
  • Buckeyedata: buckeye in config (explicit train/, val/, test/ dirs)

Training

Edit conf/config.yaml to set your dataset paths and output directory:

# Data
data: timit                                   # or 'buckeye'
timit_path: /path/to/your/timit/
buckeye_path: /path/to/your/buckeye/

# Output run directory (Hydra timestamped)
hydra:
  run:
    dir: /path/to/your/runs/my_run/${now:%Y-%m-%d_%H-%M-%S}-${exp_name}

Then run:

conda activate falcon
python main.py

Checkpoints are saved every epoch as {epoch}_best_model.pt in the Hydra run directory.

Key hyperparameters

Parameter Default Description
epochs 200 Total training epochs
batch_size 8 Per-GPU batch size
lr 3e-4 Learning rate (AdamW)
devices [7] GPU index(es) to train on
num_classes 39 Phoneme set size (Lee-Hon 39)
z_dim 256 CNN encoder output dimension
z_proj 64 Projection head output dimension

Pretrained Checkpoints

Three checkpoints are used by FALCON (under pretrained_models/):

File Trained on Best for
falcon_timit_english.pt TIMIT (read English) English phoneme alignment
falcon_buckeye_english.pt Buckeye (spontaneous English) Spontaneous / conversational English
falcon_joint_multilingual.pt Joint TIMIT+Buckeye Best for cross-lingual / multilingual zero-shot alignment (Dutch, German, Hebrew, ...) at both phoneme and word level — the joint model generalizes better to unseen languages than either single-corpus model.

The CLI tools (predict.py, generate_textgrids.py, test_results.py) auto-pick a checkpoint from the --lang flag — --lang english → TIMIT, --lang multilingual → joint — or pass --ckpt /path/to/your.pt to use any other (e.g. the Buckeye model). The web demo (app.py) exposes all three as selectable options.

The .pt files are not committed to git; place them under pretrained_models/, or set HF_MODEL_REPO (and HF_TOKEN if private) so app.py / HuggingFace Spaces fetch them from that model repo on first use.


TextGrid Generation (CLI Inference)

The main entry point for generating alignments is generate_textgrids.py. This script processes your audio and annotations, and outputs standard Praat .TextGrid files ready for linguistic analysis (similar to tools like Montreal Forced Aligner).

--ckpt is optional — if omitted, the bundled TIMIT checkpoint is used for --lang english and the joint TIMIT+Buckeye checkpoint for --lang multilingual.

Option 1: Single File

conda activate falcon
python generate_textgrids.py \
  --wav  /path/to/audio.wav \
  --mode phoneme \
  --lang english \
  --annotation phn

Option 2: Full Dataset Directory

conda activate falcon
python generate_textgrids.py \
  --wav_dir /path/to/dataset/test/ \
  --mode    word \
  --lang    english \
  --annotation wrd

Flags:

Flag Choices Default Description
--wav file path Path to a single input .wav file (16 kHz mono)
--wav_dir directory path Path to a directory containing .wav files to process
--ckpt file path bundled Path to a trained checkpoint (.pt). Defaults to pretrained_models/falcon_timit_english.pt (or falcon_joint_multilingual.pt when --lang multilingual).
--mode phoneme, word phoneme Alignment granularity. phoneme = phoneme-level (default). word = word-level alignment (zero-shot).
--lang english, multilingual english Language setting. english = trained English phoneme alignment. multilingual = any non-English language (zero-shot cross-lingual).
--annotation any string phn Annotation file extension to look for. Use txt for plain text transcripts, or standard extensions (e.g. phn, wrd).

The .wav file must have a paired annotation file in the same directory, providing the phoneme or word sequence to align to.

Examples:

# English phoneme-level (default — uses TIMIT checkpoint)
python generate_textgrids.py --wav_dir my_dataset/

# English word-level (zero-shot)
python generate_textgrids.py --wav_dir my_dataset/ --mode word --annotation wrd

# Multilingual phoneme-level (zero-shot — uses joint TIMIT+Buckeye checkpoint)
python generate_textgrids.py --wav_dir my_dataset/ --lang multilingual --annotation phn

# Plain-text transcript (no timestamps needed)
python generate_textgrids.py --wav_dir my_dataset/ --mode word --annotation txt

Output:

  • A standard .TextGrid file saved next to each input .wav file.
  • If --mode word is used, the .TextGrid will contain two tiers: words and phones.
  • Visualizations saved next to the input WAV: <basename>_probs.png, <basename>_logits.png, <basename>_boundaries.png

Evaluation — Batch

Evaluate a checkpoint over a directory of .wav files and report precision at multiple time thresholds:

conda activate falcon
python test_results.py \
  --wav  /path/to/test/wavs/ \
  --mode phoneme \
  --lang english \
  --annotation phn

Flags:

Flag Choices Default Description
--wav directory path Directory containing .wav files to evaluate
--ckpt path bundled A single .pt file or a directory (sweeps {idx}_best_model.pt). Defaults to the bundled TIMIT/Buckeye checkpoint matching --lang.
--mode phoneme, word phoneme Alignment granularity — same meaning as in predict.py
--lang english, multilingual english Language setting — same meaning as in predict.py
--annotation any string phn Annotation file extension — same meaning as in predict.py
--w-phi float 0.5 Acoustic ↔ linguistic feature weight
--out-plot file path If set (and sweeping a directory), save a precision-vs-checkpoint plot
--no-plots flag off Skip per-file diagnostic plots — much faster on large datasets

Reports precision at 10, 15, 20, 25, 50, and 100 ms tolerances.


Multilingual / Cross-Lingual Alignment

FALCON can align speech in unseen languages (Dutch, German, Hebrew, and others) without any additional training. Phonemes from the target language are automatically mapped to the Lee-Hon 39 set used during English training via articulatory feature distance (PanPhon).

Just pass --lang multilingual — this both selects the joint TIMIT+Buckeye checkpoint (which generalizes better cross-lingually) and routes the input through the G2P/articulatory-mapping pipeline:

python generate_textgrids.py \
  --wav_dir /path/to/dutch_audio_dir/ \
  --lang multilingual \
  --annotation phn

.phn / .wrd files can use any standard phoneme/word notation — dutch_preprocess.aligner_pipeline() handles IFA → IPA → LH39 conversion.


Latent Space Visualization

Visualize how the CNN encoder learns phoneme-boundary-sensitive representations:

conda activate falcon
python visualize_latent_representation.py \
  --wav     /path/to/audio.wav \
  --run-dir /path/to/training/run/directory/

Saves four heatmap plots to <run_dir>/latent_representations/:

  • latent_epoch0_untrained.png — CNN features before training
  • latent_best_trained.png — CNN features at the best epoch
  • latent_comparison_epoch0_vs_best.png — side-by-side comparison
  • latent_difference_best_minus_epoch0.png — difference map

Project Structure

FALCON/
├── app.py                           # Gradio web demo
├── falcon_viz.py                    # Time-aligned alignment visualizations (web demo + README)
├── generate_textgrids.py            # MFA-style CLI for TextGrid output
├── main.py                          # Training entry point
├── predict.py                       # Low-level single-file inference
├── test_results.py                  # Batch evaluation
├── solver.py                        # Train/val/test loop
├── next_frame_classifier.py         # Model + loss functions
├── dataloader.py                    # Dataset classes
├── utils.py                         # Soft-DP, metrics, phoneme maps
├── dutch_preprocess.py              # Cross-lingual phoneme mapping (IPA → LH39)
├── word_g2p.py                      # Word-level G2P front-end (espeak / char)
├── mfa_g2p.py                       # Word-level G2P front-end (MFA-style)
├── visualize_latent_representation.py
├── pretrained_models/               # Checkpoints (gitignored; via HF Hub at runtime)
│   ├── falcon_timit_english.pt
│   ├── falcon_buckeye_english.pt
│   └── falcon_joint_multilingual.pt
├── conf/
│   └── config.yaml                  # All hyperparameters (Hydra)
├── scripts/                         # Data preparation & utilities
├── assets/                          # README screenshots & example figures
├── requirements.txt

License

This project is released under the MIT License.

About

Based on the paper - Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages