FALCON — Forced Alignment through Contrastive Optimization Networks

FALCON (Forced Alignment through Contrastive Optimization Networks) is an end-to-end, fully differentiable neural system for forced alignment at the phoneme and word level, including zero-shot use on languages unseen in training. Given a speech waveform and its known phoneme or word transcript, it predicts the boundary timestamp of each unit.

Rotem Rousso, Eyal Cohen, Joseph Keshet
"Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming"
Preprint, 2026 — arXiv:2606.25460
Project page: mlspeech.github.io/FALCON

Citation

If you use FALCON in your research, please cite:

@article{rousso2026falcon,
  title     = {Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming},
  author    = {Rousso, Rotem and Cohen, Eyal and Keshet, Joseph},
  journal   = {arXiv preprint arXiv:2606.25460},
  year      = {2026}
}

Highlights

Phoneme-level precision — operates at ~10 ms frame resolution, the finest granularity among neural forced aligners.
Fully differentiable — a Soft Dynamic Programming decoder enables gradient flow through the entire alignment pipeline.
Zero-shot multilingual — trained on English, generalizes to unseen languages at inference (tested on Dutch, German, and Hebrew, among others) without any additional training.
Word-level generalization — word boundaries derived from phoneme predictions at test time, competitive with word-level neural aligners.
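
To make the last highlight concrete, here is a minimal sketch of how word boundaries can be read off phoneme boundaries. It is an illustration, not the repository's code: the function name, the `(label, start, end)` tuple layout, the per-word phoneme counts (which would come from a G2P front-end), and the example timings are all assumptions.

```python
from typing import List, Tuple

Interval = Tuple[str, float, float]  # (label, start_s, end_s)


def words_from_phonemes(phones: List[Interval],
                        words: List[Tuple[str, int]]) -> List[Interval]:
    """Collapse consecutive phoneme intervals into word intervals.

    `words` pairs each word with the number of phonemes it spans
    (hypothetical input, e.g. produced by a G2P front-end).
    """
    out, i = [], 0
    for word, n_phones in words:
        span = phones[i:i + n_phones]
        if n_phones == 0 or len(span) != n_phones:
            raise ValueError(f"bad phoneme count for word {word!r}")
        # A word starts where its first phoneme starts and ends where
        # its last phoneme ends.
        out.append((word, span[0][1], span[-1][2]))
        i += n_phones
    return out


# Made-up timings, for illustration only.
phones = [("k", 0.10, 0.18), ("ae", 0.18, 0.30), ("t", 0.30, 0.36),
          ("s", 0.36, 0.45), ("ae", 0.45, 0.58), ("t", 0.58, 0.66)]
word_intervals = words_from_phonemes(phones, [("cat", 3), ("sat", 3)])
print(word_intervals)
```

Because the word tier is a pure function of the phoneme tier, no word-level training is needed for this step.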

How It Works

The system has three jointly trained components:

Raw waveform (16 kHz)
        │
        ▼
 ┌──────────────────┐
 │ CNN Encoder f_θ  │  learns boundary-sensitive representations
 │ (5 strided conv) │  via MNCE contrastive loss
 └───────┬──────────┘
         │ Z  (frame-level latent features)
         ├──────────────────────────────────────┐
         ▼                                      │ Z
 ┌────────────────────┐                         │
 │ BiLSTM Context g_ψ │  frame-wise phoneme     │
 │ (5 layers, 512 dim)│  posterior probs U      │
 └───────┬────────────┘                         │
         │ U                                    │
         ▼                                      ▼
 ┌───────────────────────────────────────────────────┐
 │        Soft-DP Decoder h_W                        │
 │  φ₁(Z): acoustic boundary sharpness              │
 │  φ₂(U,p): phoneme posterior consistency          │
 │  LogSumExp forward + soft-argmax backtrack       │
 └───────────────────────────────────────────────────┘
         │
         ▼
  Predicted boundary timestamps  ŷ = (ŷ₁, …, ŷₙ)
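
The LogSumExp forward pass in the diagram can be illustrated with a small pure-Python sketch. This is not the FALCON decoder: the monotonic stay-or-advance topology, the per-frame cost matrix, and the temperature `gamma` are assumptions chosen to show why replacing `min` with a smooth minimum keeps the recursion differentiable.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma)))."""
    finite = [v for v in values if v != math.inf]
    if not finite:
        return math.inf
    m = min(finite)  # subtract the minimum for numerical stability
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in finite))


def soft_dp_cost(cost, gamma=1.0):
    """Soft cost of aligning T frames to K phonemes, in order.

    cost[t][k] is the (assumed) cost of assigning frame t to phoneme k.
    Each frame either stays on the current phoneme or advances by one.
    """
    T, K = len(cost), len(cost[0])
    D = [[math.inf] * K for _ in range(T)]
    D[0][0] = cost[0][0]
    for t in range(1, T):
        for k in range(K):
            stay = D[t - 1][k]
            advance = D[t - 1][k - 1] if k > 0 else math.inf
            D[t][k] = cost[t][k] + soft_min([stay, advance], gamma)
    return D[T - 1][K - 1]


# Toy costs: frames 0-1 fit phoneme 0, frames 2-3 fit phoneme 1.
cost = [[0, 5], [0, 5], [5, 0], [5, 0]]
print(soft_dp_cost(cost, gamma=1.0))   # smooth value, slightly below 0
print(soft_dp_cost(cost, gamma=1e-3))  # approaches the hard-min cost of 0
```

As `gamma` shrinks, the soft cost approaches the ordinary dynamic-programming minimum; at larger `gamma`, every path contributes, which is what lets gradients reach all frames.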

Example

FALCON aligns the example fasw0sa2.wav — a TIMIT sentence, "Don't ask me to carry an oily rag like that." — using the included English checkpoint. The same utterance is provided in every supported input format so you can try them in the web demo: .phn (phonemes), .wrd (words), and .txt (plain transcript).

Waveform with predicted phoneme boundaries:

Under the hood: every representation on one shared time axis. The panels show the waveform, the log-mel spectrogram, the phoneme posteriors, the Soft-DP cost matrix with its backtracked alignment path, and the contrastive boundary score with its derivative. Predicted boundaries (crimson dashed) and ground-truth boundaries (charcoal dotted) line up across all panels, with the input phonemes written under each axis, so each component's contribution is visible at a glance:

Output (first rows of the phones tier; the full alignment is written to a Praat .TextGrid):

Phoneme	Start (s)	End (s)
h#	0.000	0.121
d	0.121	0.141
ow	0.141	0.313
n	0.313	0.333
q	0.333	0.434
ae	0.434	0.575
s	0.575	0.645
epi	0.645	0.726
m	0.726	0.776
iy	0.776	0.897
…

Reproduce:

# alignment + TextGrid
python generate_textgrids.py --wav assets/fasw0sa2.wav --mode phoneme --lang english --annotation phn

# the multi-panel "under the hood" figure
python falcon_viz.py

Results

Numbers are from the paper (arXiv:2606.25460). Specialist = trained on the target English corpus; joint = a single model jointly trained on TIMIT+Buckeye. Multilingual results are zero-shot — no target-language training data. Accuracy is the percentage of reference boundaries matched within the given ms tolerance.
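
The accuracy metric described above can be sketched in a few lines. This is an assumed formulation, not the evaluation script: it pairs reference and predicted boundaries by position (in forced alignment both lists come from the same transcript), and the function name and example values are hypothetical.

```python
def boundary_accuracy(ref_s, pred_s, tol_ms):
    """Percentage of reference boundaries matched within tol_ms.

    ref_s and pred_s are boundary times in seconds, paired by position.
    """
    if len(ref_s) != len(pred_s) or not ref_s:
        raise ValueError("need two non-empty boundary lists of equal length")
    hits = sum(abs(r - p) * 1000.0 <= tol_ms for r, p in zip(ref_s, pred_s))
    return 100.0 * hits / len(ref_s)


# Made-up boundaries with errors of 4 ms, 29 ms, and 13 ms.
ref = [0.121, 0.141, 0.313]
pred = [0.125, 0.170, 0.300]
for tol in (10, 25, 50):
    print(tol, round(boundary_accuracy(ref, pred, tol), 2))
```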

Phone-level Alignment Accuracy [%]: MFA vs. FALCON (Ours)

Dataset	Model	t≤10	t≤25	t≤50	t≤100
TIMIT	MFA	38.6	72.3	81.1	84.6
TIMIT	FALCON specialist	37.66	83.88	94.85	98.62
TIMIT	FALCON joint	34.70	82.62	94.91	98.60
Buckeye	MFA	35.3	60.6	68.9	72.7
Buckeye	FALCON specialist	29.69	69.93	90.07	97.40
Buckeye	FALCON joint	28.87	69.40	89.53	97.13

Phoneme-Level: Unseen Multilingual Generalization Accuracy [%]

Dataset	Model	t≤10	t≤15	t≤20	t≤25	t≤50	t≤100
Dutch — IFA	FALCON joint	26.85	36.16	44.56	51.17	69.94	84.11
Dutch — IFA	FALCON specialist	26.86	35.79	43.85	50.34	68.68	83.22
Dutch — IFA	MFA	11.01	14.70	19.05	21.80	33.90	51.02
German — PHONDAT	FALCON joint	25.63	34.12	41.87	49.07	70.04	84.58
German — PHONDAT	FALCON specialist	25.08	33.37	40.76	47.43	68.27	82.44
German — PHONDAT	MFA	20.60	31.75	37.17	45.83	66.78	79.19
Hebrew	FALCON joint	21.98	30.10	36.91	42.78	63.07	80.41
Hebrew	FALCON specialist	21.03	27.78	34.30	39.79	59.38	77.76

Word-Level Alignment Accuracy [%]: Comparative Analysis

Dataset	Model	t≤10	t≤25	t≤50	t≤100
TIMIT	FALCON spec (MFA-G2P)	49.22	81.79	93.04	98.37
TIMIT	FALCON joint (MFA-G2P)	49.50	80.60	92.86	98.46
TIMIT	MFA	41.60	72.80	89.40	97.40
TIMIT	MMS	18.60	43.50	75.70	94.70
TIMIT	WhisperX	22.40	52.70	82.40	94.20
TIMIT	Nvidia-Canary-1b	9.23	23.11	44.23	72.81
Buckeye	FALCON spec (MFA-G2P)	50.06	77.85	91.51	96.63
Buckeye	FALCON joint (MFA-G2P)	50.42	77.98	91.01	96.55
Buckeye	MFA	39.80	69.90	84.90	91.80
Buckeye	MMS	25.00	52.70	75.00	87.90
Buckeye	WhisperX	18.80	43.10	67.40	77.40
Buckeye	Nvidia-Canary-1b	8.06	18.83	36.31	63.29

Word-Level: Unseen Multilingual Generalization Accuracy [%]

Dataset	Model	t≤10	t≤25	t≤50	t≤100
German — PHONDAT	FALCON (MFA-G2P)	44.20	68.48	86.12	95.11
German — PHONDAT	MFA	29.9	65.4	82.1	94.3
German — PHONDAT	MMS	21.8	44.3	74.9	91.8
Dutch — IFA	FALCON (MFA-G2P)	26.38	45.15	61.16	76.49
Dutch — IFA	MFA	4.7	7.3	11.6	19.0
Dutch — IFA	MMS	16.0	37.9	62.9	76.6
Hebrew	FALCON	31.91	56.72	75.18	87.89
Hebrew	MMS	14.3	41.3	76.5	94.7

Web Demo (Interactive Inference)

Try it live — no install: huggingface.co/spaces/MLSpeech/FALCON (free CPU Space; the first alignment downloads a model).

FALCON includes an interactive Gradio web interface for zero-shot forced alignment. Upload audio at any sample rate along with a plain-text transcript (or a standard annotation file), then view the predicted boundaries and the Soft-DP path and download a TextGrid.

conda activate falcon
python app.py

This launches a local server (default http://localhost:7860). Inputs (audio, transcript, options) are on the left; the Alignment Data and Visualizations tabs are on the right.

How to use it — upload audio (any sample rate; resampled to 16 kHz internally) and a transcript, set the options, and click Run Alignment:

Option	What it does
Mode	`phoneme` — align the phonemes; `word` — align words (boundaries derived from the predicted phonemes).
Language	`english` — transcript is already TIMIT-39 phonemes (no G2P); `multilingual` — any other language (runs the G2P path). Auto-selects the matching checkpoint.
Pretrained checkpoint	Read English (TIMIT), Spontaneous English (Buckeye), or Multilingual (joint). Follows Language by default; override it, or upload your own `.pt`.
Word G2P (word mode)	Auto (recommended — best backend for the language), espeak, MFA-like, or none (romanization — only for languages with no G2P model). If the chosen backend lacks the language it falls back automatically.
Input language (word mode)	Optional but recommended — lets Auto pick the right G2P and set the voice/dictionary behind the scenes.
Annotation	`.phn` (phoneme timestamps), `.wrd` (word timestamps), or `.txt` (plain token sequence, no timestamps).

Outputs: the Alignment Data tab gives the boundary tables and a downloadable Praat .TextGrid; the Visualizations tab shows the time-aligned panels (waveform · spectrogram · phoneme posteriors · Soft-DP path · contrastive score).

Clone Repository

Note: the official public release will be hosted at github.com/MLSpeech/FALCON (as referenced in the paper). This repository (github.com/RotemRousso/FDNFA) is the active development version.

git clone https://github.com/RotemRousso/FDNFA.git
cd FDNFA

Setup Environment

Requirements

Python 3.8+
CUDA-capable GPU (recommended)

Option 1 — Conda (recommended)

conda create -n falcon python=3.8
conda activate falcon
pip install -r requirements.txt

Option 2 — pip only

pip install -r requirements.txt

Note: torch and torchaudio are pinned to 2.4.1. For a different CUDA version, install them separately following PyTorch's official instructions, then run pip install -r requirements.txt.

Data Format

FALCON expects data in TIMIT-style format:

<dataset_root>/
├── train/
│   ├── speaker1_utt1.wav
│   ├── speaker1_utt1.phn     ← required
│   └── ...
├── val/                       ← only for Buckeye-style splits
└── test/
    ├── speaker2_utt1.wav
    ├── speaker2_utt1.phn
    └── ...

Each .phn file contains one phoneme segment per line:

<start_sample>  <end_sample>  <phoneme_label>
0     3050  h#
3050  4559  sh
4559  5723  ix
...

Audio must be 16 kHz mono WAV. Phoneme labels should use the TIMIT 61-phoneme set (automatically mapped to Lee-Hon 39 internally).
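
As a quick check on your own files, the `.phn` layout above can be read with a few lines of Python. This is a sketch rather than the repository's loader: the function name is hypothetical, and it assumes whitespace-separated fields and 16 kHz audio, as stated above.

```python
def parse_phn(text, sample_rate=16000):
    """Parse .phn content into (start_s, end_s, label) tuples."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start, end, label = int(parts[0]), int(parts[1]), parts[2]
        # Sample indices become seconds by dividing by the sample rate.
        segments.append((start / sample_rate, end / sample_rate, label))
    return segments


phn_text = """0     3050  h#
3050  4559  sh
4559  5723  ix
"""
segments = parse_phn(phn_text)
for start_s, end_s, label in segments:
    print(f"{label}\t{start_s:.3f}\t{end_s:.3f}")
```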

Supported datasets:

TIMIT — data: timit in config (standard train/test split, 10% of train used for validation)
Buckeye — data: buckeye in config (explicit train/, val/, test/ dirs)

Training

Edit conf/config.yaml to set your dataset paths and output directory:

# Data
data: timit                                   # or 'buckeye'
timit_path: /path/to/your/timit/
buckeye_path: /path/to/your/buckeye/

# Output run directory (Hydra timestamped)
hydra:
  run:
    dir: /path/to/your/runs/my_run/${now:%Y-%m-%d_%H-%M-%S}-${exp_name}

Then run:

conda activate falcon
python main.py

Checkpoints are saved every epoch as {epoch}_best_model.pt in the Hydra run directory.
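
Since the epoch number is a filename prefix, a plain alphabetical listing puts `10_best_model.pt` before `2_best_model.pt`. The sketch below sorts by the numeric prefix instead; the helper name is hypothetical and only the `{epoch}_best_model.pt` pattern is taken from this README.

```python
import re

_CKPT_RE = re.compile(r"^(\d+)_best_model\.pt$")


def sort_checkpoints(filenames):
    """Return checkpoint filenames ordered by their numeric epoch prefix."""
    indexed = []
    for name in filenames:
        match = _CKPT_RE.match(name)
        if match:  # ignore anything that is not a checkpoint
            indexed.append((int(match.group(1)), name))
    return [name for _, name in sorted(indexed)]


names = ["10_best_model.pt", "2_best_model.pt", "config.yaml", "1_best_model.pt"]
ordered = sort_checkpoints(names)
print(ordered)
```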

Key hyperparameters

Parameter	Default	Description
`epochs`	200	Total training epochs
`batch_size`	8	Per-GPU batch size
`lr`	3e-4	Learning rate (AdamW)
`devices`	`[7]`	GPU index(es) to train on
`num_classes`	39	Phoneme set size (Lee-Hon 39)
`z_dim`	256	CNN encoder output dimension
`z_proj`	64	Projection head output dimension

Pretrained Checkpoints

Three checkpoints are used by FALCON (under pretrained_models/):

File	Trained on	Best for
`falcon_timit_english.pt`	TIMIT (read English)	English phoneme alignment
`falcon_buckeye_english.pt`	Buckeye (spontaneous English)	Spontaneous / conversational English
`falcon_joint_multilingual.pt`	Joint TIMIT+Buckeye	Cross-lingual / multilingual zero-shot alignment (Dutch, German, Hebrew, ...) at both phoneme and word level; the joint model generalizes better to unseen languages than either single-corpus model.

The CLI tools (predict.py, generate_textgrids.py, test_results.py) auto-pick a checkpoint from the --lang flag — --lang english → TIMIT, --lang multilingual → joint — or pass --ckpt /path/to/your.pt to use any other (e.g. the Buckeye model). The web demo (app.py) exposes all three as selectable options.
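
The selection rule just described amounts to a small lookup. The sketch below restates it in Python for clarity; it is not the code in the CLI tools, and the function name and `model_dir` parameter are assumptions. Only the two filenames and the `--lang` values come from this README.

```python
from pathlib import PurePosixPath

# Default checkpoint per --lang value, as described in this README.
DEFAULT_CKPTS = {
    "english": "falcon_timit_english.pt",
    "multilingual": "falcon_joint_multilingual.pt",
}


def resolve_checkpoint(lang="english", ckpt=None, model_dir="pretrained_models"):
    """Return the checkpoint path: an explicit ckpt wins over the --lang default."""
    if ckpt is not None:
        return str(ckpt)
    if lang not in DEFAULT_CKPTS:
        raise ValueError(f"unknown lang {lang!r}")
    return str(PurePosixPath(model_dir) / DEFAULT_CKPTS[lang])


print(resolve_checkpoint("english"))
print(resolve_checkpoint("multilingual"))
print(resolve_checkpoint("english", ckpt="pretrained_models/falcon_buckeye_english.pt"))
```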

The .pt files are not committed to git; place them under pretrained_models/, or set HF_MODEL_REPO (and HF_TOKEN if private) so app.py / HuggingFace Spaces fetch them from that model repo on first use.

TextGrid Generation (CLI Inference)

The main entry point for generating alignments is generate_textgrids.py. This script processes your audio and annotations, and outputs standard Praat .TextGrid files ready for linguistic analysis (similar to tools like Montreal Forced Aligner).

--ckpt is optional — if omitted, the bundled TIMIT checkpoint is used for --lang english and the joint TIMIT+Buckeye checkpoint for --lang multilingual.

Option 1: Single File

conda activate falcon
python generate_textgrids.py \
  --wav  /path/to/audio.wav \
  --mode phoneme \
  --lang english \
  --annotation phn

Option 2: Full Dataset Directory

conda activate falcon
python generate_textgrids.py \
  --wav_dir /path/to/dataset/test/ \
  --mode    word \
  --lang    english \
  --annotation wrd

Flags:

Flag	Choices	Default	Description
`--wav`	file path	—	Path to a single input `.wav` file (16 kHz mono)
`--wav_dir`	directory path	—	Path to a directory containing `.wav` files to process
`--ckpt`	file path	bundled	Path to a trained checkpoint (`.pt`). Defaults to `pretrained_models/falcon_timit_english.pt` (or `falcon_joint_multilingual.pt` when `--lang multilingual`).
`--mode`	`phoneme`, `word`	`phoneme`	Alignment granularity. `phoneme` = phoneme-level (default). `word` = word-level alignment (zero-shot).
`--lang`	`english`, `multilingual`	`english`	Language setting. `english` = trained English phoneme alignment. `multilingual` = any non-English language (zero-shot cross-lingual).
`--annotation`	any string	`phn`	Annotation file extension to look for. Use `txt` for plain text transcripts, or standard extensions (e.g. `phn`, `wrd`).

Each .wav file must have a paired annotation file in the same directory, providing the phoneme or word sequence to align to.

Examples:

# English phoneme-level (default — uses TIMIT checkpoint)
python generate_textgrids.py --wav_dir my_dataset/

# English word-level (zero-shot)
python generate_textgrids.py --wav_dir my_dataset/ --mode word --annotation wrd

# Multilingual phoneme-level (zero-shot — uses joint TIMIT+Buckeye checkpoint)
python generate_textgrids.py --wav_dir my_dataset/ --lang multilingual --annotation phn

# Plain-text transcript (no timestamps needed)
python generate_textgrids.py --wav_dir my_dataset/ --mode word --annotation txt

Output:

A standard .TextGrid file saved next to each input .wav file.
If --mode word is used, the .TextGrid will contain two tiers: words and phones.
Visualizations saved next to the input WAV: <basename>_probs.png, <basename>_logits.png, <basename>_boundaries.png
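
For readers unfamiliar with the output format, the sketch below builds a minimal Praat long-format TextGrid from boundary tuples. It is an illustration of the file layout, not the writer used by generate_textgrids.py; the function name is hypothetical, and it assumes each tier's intervals are contiguous and span the whole file, as Praat interval tiers expect.

```python
def textgrid_text(tiers, xmax):
    """Render interval tiers as a Praat long-format TextGrid string.

    tiers maps a tier name to a list of (start_s, end_s, label) tuples.
    """
    lines = ['File type = "ooTextFile"', 'Object class = "TextGrid"', '',
             'xmin = 0', f'xmax = {xmax}', 'tiers? <exists>',
             f'size = {len(tiers)}', 'item []:']
    for i, (name, intervals) in enumerate(tiers.items(), start=1):
        lines += [f'    item [{i}]:',
                  '        class = "IntervalTier"',
                  f'        name = "{name}"',
                  '        xmin = 0',
                  f'        xmax = {xmax}',
                  f'        intervals: size = {len(intervals)}']
        for j, (start, end, label) in enumerate(intervals, start=1):
            escaped = label.replace('"', '""')  # Praat doubles inner quotes
            lines += [f'        intervals [{j}]:',
                      f'            xmin = {start}',
                      f'            xmax = {end}',
                      f'            text = "{escaped}"']
    return "\n".join(lines) + "\n"


# First three rows of the example phones tier shown earlier in this README.
phones_tier = [(0.0, 0.121, "h#"), (0.121, 0.141, "d"), (0.141, 0.313, "ow")]
tg = textgrid_text({"phones": phones_tier}, xmax=0.313)
print(tg)
```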

Evaluation — Batch

Evaluate a checkpoint over a directory of .wav files and report precision at multiple time thresholds:

conda activate falcon
python test_results.py \
  --wav  /path/to/test/wavs/ \
  --mode phoneme \
  --lang english \
  --annotation phn

Flags:

Flag	Choices	Default	Description
`--wav`	directory path	—	Directory containing `.wav` files to evaluate
`--ckpt`	path	bundled	A single `.pt` file or a directory (sweeps `{idx}_best_model.pt`). Defaults to the bundled checkpoint matching `--lang` (TIMIT for `english`, joint TIMIT+Buckeye for `multilingual`).
`--mode`	`phoneme`, `word`	`phoneme`	Alignment granularity — same meaning as in `predict.py`
`--lang`	`english`, `multilingual`	`english`	Language setting — same meaning as in `predict.py`
`--annotation`	any string	`phn`	Annotation file extension — same meaning as in `predict.py`
`--w-phi`	float	`0.5`	Acoustic ↔ linguistic feature weight
`--out-plot`	file path	—	If set (and sweeping a directory), save a precision-vs-checkpoint plot
`--no-plots`	flag	off	Skip per-file diagnostic plots — much faster on large datasets

Reports precision at 10, 15, 20, 25, 50, and 100 ms tolerances.

Multilingual / Cross-Lingual Alignment

FALCON can align speech in unseen languages (Dutch, German, Hebrew, and others) without any additional training. Phonemes from the target language are automatically mapped to the Lee-Hon 39 set used during English training via articulatory feature distance (PanPhon).
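
The idea of mapping by articulatory feature distance can be shown with a toy nearest-neighbour lookup. Everything in this sketch is an assumption made for illustration: the six binary features, the ten-phoneme inventory, and the Hamming distance are stand-ins, not PanPhon's feature vectors or the mapping FALCON actually applies.

```python
# Toy features: (voiced, nasal, labial, coronal, dorsal, continuant)
FEATURES = {
    "p": (0, 0, 1, 0, 0, 0), "b": (1, 0, 1, 0, 0, 0),
    "t": (0, 0, 0, 1, 0, 0), "d": (1, 0, 0, 1, 0, 0),
    "k": (0, 0, 0, 0, 1, 0), "g": (1, 0, 0, 0, 1, 0),
    "s": (0, 0, 0, 1, 0, 1), "z": (1, 0, 0, 1, 0, 1),
    "m": (1, 1, 1, 0, 0, 0), "n": (1, 1, 0, 1, 0, 0),
}


def nearest_phoneme(target, inventory=FEATURES):
    """Return the inventory phoneme whose features differ least from target."""
    def distance(a, b):
        return sum(x != y for x, y in zip(a, b))  # Hamming distance
    # Sorting first makes tie-breaking deterministic (alphabetical).
    return min(sorted(inventory), key=lambda p: distance(inventory[p], target))


# A voiceless velar fricative (not in the toy inventory) lands on "k",
# and its voiced counterpart lands on "g".
voiceless_velar_fricative = (0, 0, 0, 0, 1, 1)
voiced_velar_fricative = (1, 0, 0, 0, 1, 1)
print(nearest_phoneme(voiceless_velar_fricative))
print(nearest_phoneme(voiced_velar_fricative))
```

A real feature set is much richer, but the principle is the same: an unseen phoneme is replaced by the closest phoneme the English-trained model already knows.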

Just pass --lang multilingual — this both selects the joint TIMIT+Buckeye checkpoint (which generalizes better cross-lingually) and routes the input through the G2P/articulatory-mapping pipeline:

python generate_textgrids.py \
  --wav_dir /path/to/dutch_audio_dir/ \
  --lang multilingual \
  --annotation phn

.phn / .wrd files can use any standard phoneme/word notation — dutch_preprocess.aligner_pipeline() handles IFA → IPA → LH39 conversion.

Latent Space Visualization

Visualize how the CNN encoder learns phoneme-boundary-sensitive representations:

conda activate falcon
python visualize_latent_representation.py \
  --wav     /path/to/audio.wav \
  --run-dir /path/to/training/run/directory/

Saves four heatmap plots to <run_dir>/latent_representations/:

latent_epoch0_untrained.png — CNN features before training
latent_best_trained.png — CNN features at the best epoch
latent_comparison_epoch0_vs_best.png — side-by-side comparison
latent_difference_best_minus_epoch0.png — difference map

Project Structure

FALCON/
├── app.py                           # Gradio web demo
├── falcon_viz.py                    # Time-aligned alignment visualizations (web demo + README)
├── generate_textgrids.py            # MFA-style CLI for TextGrid output
├── main.py                          # Training entry point
├── predict.py                       # Low-level single-file inference
├── test_results.py                  # Batch evaluation
├── solver.py                        # Train/val/test loop
├── next_frame_classifier.py         # Model + loss functions
├── dataloader.py                    # Dataset classes
├── utils.py                         # Soft-DP, metrics, phoneme maps
├── dutch_preprocess.py              # Cross-lingual phoneme mapping (IPA → LH39)
├── word_g2p.py                      # Word-level G2P front-end (espeak / char)
├── mfa_g2p.py                       # Word-level G2P front-end (MFA-style)
├── visualize_latent_representation.py
├── pretrained_models/               # Checkpoints (gitignored; via HF Hub at runtime)
│   ├── falcon_timit_english.pt
│   ├── falcon_buckeye_english.pt
│   └── falcon_joint_multilingual.pt
├── conf/
│   └── config.yaml                  # All hyperparameters (Hydra)
├── scripts/                         # Data preparation & utilities
├── assets/                          # README screenshots & example figures
└── requirements.txt

License

This project is released under the MIT License.
