A sentence segmentation library written in Rust with wide language support, optimized for speed and utility.
Besides native Rust, bindings for the following programming languages are available:
However, the sentence terminator is not a period in many languages. The library therefore uses a list of known punctuation marks that can cause a sentence break, covering as many languages as possible. It also collects a list of common, well-known abbreviations in as many languages as possible.
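The interplay of these two lists can be sketched as follows. This is an illustrative toy in Python, not sentencex's actual implementation: a terminator character is treated as a sentence break only when the word before it is not a known abbreviation. The punctuation set and abbreviation entries here are small samples chosen for the example.

```python
# Sample cross-language terminators and period-free abbreviation stems.
SENTENCE_TERMINATORS = {".", "!", "?", "\u3002", "\u061f"}
KNOWN_ABBREVIATIONS = {"Dr", "Mr", "etc", "vs"}

def naive_segment(text):
    """Split text at terminators, skipping ones that follow abbreviations."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in SENTENCE_TERMINATORS:
            words = text[start:i].split()
            if words and words[-1] in KNOWN_ABBREVIATIONS:
                continue  # terminator belongs to an abbreviation, not a break
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(naive_segment("Dr. Smith arrived. He was late."))
# -> ['Dr. Smith arrived.', 'He was late.']
```

The real library layers many more rules on top of this idea, but the abbreviation check is what prevents "Dr." from ending a sentence in the example above.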
Sometimes it is very hard to get the segmentation right. In such cases this library is opinionated: it prefers not segmenting over segmenting wrongly. If two sentences accidentally stay joined, that is acceptable; it is better than a sentence being split in the middle. The library avoids over-engineering in pursuit of 100% linguistic accuracy. This approach suits applications like text-to-speech and machine translation.
Consider this example: We make a good team, you and I. Did you see Albert I. Jones yesterday?
The accurate segmentation of this text is:
["We make a good team, you and I." ,"Did you see Albert I. Jones yesterday?"]
However, achieving this level of precision requires complex rules, and those rules can create side effects elsewhere. Instead, if we simply do not segment between "I." and "Did", the result is acceptable for most downstream applications.
The sentence segmentation in this library is non-destructive: if the sentences are concatenated, the original text is reconstructed exactly. Line breaks, punctuation, and whitespace are preserved in the output.
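The non-destructive invariant means concatenating the segments reproduces the input byte-for-byte. The snippet below demonstrates the property with a precomputed result; the exact split points shown are an assumption for illustration, and with the library installed the `sentences` list would come from `sentencex.segment("en", text)` instead.

```python
# Whitespace and line breaks stay attached to the segments, so joining
# them reconstructs the original text exactly.
text = "It is a test.  Line two.\nLine three."
sentences = ["It is a test.  ", "Line two.\n", "Line three."]  # assumed split

assert "".join(sentences) == text
print("Original text reconstructed exactly.")
```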
Install the library using
cargo add sentencex
Then, any text can be segmented as follows.
use sentencex::segment;

fn main() {
    let text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development.";
    let sentences = segment("en", text);
    for (i, sentence) in sentences.iter().enumerate() {
        println!("{}. {}", i + 1, sentence);
    }
}
The first argument is the language code and the second is the text to segment. The segment function returns the identified sentences.
Install from PyPI:
pip install sentencex
import sentencex

text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development."

# Segment text into sentences
sentences = sentencex.segment("en", text)
for i, sentence in enumerate(sentences, 1):
    print(f"{i}. {sentence}")

# Get sentence boundaries with indices
boundaries = sentencex.get_sentence_boundaries("en", text)
for boundary in boundaries:
    print(f"Sentence: '{boundary['text']}' (indices: {boundary['start_index']}-{boundary['end_index']})")
See bindings/python/example.py for more examples.
Install from npm:
npm install sentencex
import { segment, get_sentence_boundaries } from 'sentencex';

const text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development.";

// Segment text into sentences
const sentences = segment("en", text);
sentences.forEach((sentence, i) => {
  console.log(`${i + 1}. ${sentence}`);
});

// Get sentence boundaries with indices
const boundaries = get_sentence_boundaries("en", text);
boundaries.forEach(boundary => {
  console.log(`Sentence: '${boundary.text}' (indices: ${boundary.start_index}-${boundary.end_index})`);
});
For CommonJS usage:
const { segment, get_sentence_boundaries } = require('sentencex');
See bindings/nodejs/example.js for more examples.
Install from npm:
npm install sentencex-wasm
or use a CDN like https://esm.sh/sentencex-wasm
import init, { segment, get_sentence_boundaries } from 'https://esm.sh/sentencex-wasm';

async function main() {
  // Initialize the WASM module
  await init();

  const text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development.";

  // Segment text into sentences
  const sentences = segment("en", text);
  sentences.forEach((sentence, i) => {
    console.log(`${i + 1}. ${sentence}`);
  });

  // Get sentence boundaries with indices
  const boundaries = get_sentence_boundaries("en", text);
  boundaries.forEach(boundary => {
    console.log(`Sentence: '${boundary.text}' (indices: ${boundary.start_index}-${boundary.end_index})`);
  });
}

main();
The aim is to support every language that has a Wikipedia. Instead of falling back to English for languages not defined in the library, a fallback chain is used: the closest language that is defined in the library is selected. Fallbacks are defined for ~244 languages.
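A fallback lookup of this kind can be sketched as below. This is a hypothetical illustration, not the library's shipped data or code: the `FALLBACKS` entries and `SUPPORTED` set are examples, and the behavior for a language with no usable fallback is an assumption.

```python
# Example fallback chains (illustrative entries, not sentencex's real table).
FALLBACKS = {
    "bho": ["hi"],        # Bhojpuri falls back to Hindi
    "nds": ["de", "nl"],  # Low German tries German, then Dutch
}
SUPPORTED = {"en", "hi", "de"}  # languages with rules defined, for illustration

def resolve_language(lang):
    """Return the closest language with defined rules, following fallbacks."""
    if lang in SUPPORTED:
        return lang
    for candidate in FALLBACKS.get(lang, []):
        resolved = resolve_language(candidate)  # chains may be transitive
        if resolved in SUPPORTED:
            return resolved
    return lang  # no fallback found; left to the caller (assumed behavior)

print(resolve_language("bho"))  # -> hi
print(resolve_language("nds"))  # -> de
```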
Below is sample output from segmenting The Complete Works of William Shakespeare, a 5.29 MB file. As shown, segmentation took about 178 milliseconds.
$ curl https://www.gutenberg.org/files/100/100-0.txt | ./target/release/sentencex -l en > /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 5295k 100 5295k 0 0 630k 0 0:00:08 0:00:08 --:--:-- 1061k
Time taken for segment(): 178.745108ms
Total sentences: 150254
Accuracy is measured on the English Golden Rule Set (GRS) using the mean F1 score across 60 test cases; list cases are excluded.
The benchmark script is at benchmarks/compare.py and can be run with uv run benchmarks/compare.py.
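The mean-F1 scoring can be sketched as follows. This is an assumed methodology for illustration, not a transcription of benchmarks/compare.py: per test case, predicted sentences are compared against the gold sentences by exact match, F1 is computed from precision and recall, and the per-case scores are averaged.

```python
def f1(predicted, gold):
    """F1 over exact-match sentences for a single test case."""
    matches = len(set(predicted) & set(gold))
    if not matches:
        return 0.0
    precision = matches / len(predicted)
    recall = matches / len(gold)
    return 2 * precision * recall / (precision + recall)

# Two toy cases: one perfect segmentation, one under-segmented.
cases = [
    (["Hello there.", "How are you?"], ["Hello there.", "How are you?"]),
    (["Hello there. How are you?"], ["Hello there.", "How are you?"]),
]
mean_f1 = sum(f1(p, g) for p, g in cases) / len(cases)
print(f"Mean F1: {mean_f1 * 100:.2f}")  # -> Mean F1: 50.00
```

Under this scoring, a segmenter that never splits wrongly but occasionally under-segments loses less than one that splits in the middle of sentences, which matches the library's opinionated design.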
The following libraries are compared:
| Tokenizer | English GRS F1 Score |
|---|---|
| sentencex | 100.00 |
| pysbd | 93.00 |
| blingfire | 91.67 |
| syntok | 85.67 |
| spacy | 81.67 |
| mwtokenizer | 78.00 |
| nltk | 72.33 |
MIT license. See License.txt