Trankit training for an unsupported language and with a custom embedding model

Hi! I only recently came across Trankit, even though it has been around for three years. I'm surprised I hadn't seen it before, although I've worked quite a lot with classic NLP.
We are currently working on a corpus of a low-resource language (Faroese) and have some hand-labeled data (about 1,200 sentences, ~15,000 tokens) in CoNLL-U format. XLM-RoBERTa does not support Faroese, so Faroese is not listed among the languages supported by your trainable pipeline, but there is another BERT-like model that does have Faroese data. I have several questions:
1. Can we expect Trankit to perform better than other tools such as Stanza or spaCy when trained on a small dataset?
2. Does the trainable pipeline allow specifying a custom BERT model as the embedding model?
3. How can I specify a language that is not natively supported?
4. How were the models trained for ancient languages such as Ancient Greek, Old Russian, or Old French? Were the corresponding modern languages (Greek, Russian, and French) specified for them?
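
In case it helps to reproduce the dataset figures above, here is a minimal sketch of how sentence and token counts can be computed from CoNLL-U text. It assumes the standard tab-separated layout with the token ID in the first column, and it skips comment lines, multiword token ranges (IDs like `1-2`), and empty nodes (IDs like `1.1`). The function name `conllu_stats` and the toy `SAMPLE` data are hypothetical and only for illustration; this is not part of Trankit.

```python
def conllu_stats(text):
    """Return (sentence_count, token_count) for CoNLL-U formatted text."""
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are not tokens.
            continue
        token_id = line.split("\t", 1)[0]
        if "-" in token_id or "." in token_id:
            # Skip multiword token ranges (1-2) and empty nodes (1.1).
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
    return sentences, tokens


# Toy data with placeholder word forms; all annotation columns are left as "_".
BLANK_COLUMNS = "\t".join(["_"] * 8)
SAMPLE = "\n".join([
    "# sent_id = 1",
    "1\tw1\t" + BLANK_COLUMNS,
    "2\tw2\t" + BLANK_COLUMNS,
    "",
    "# sent_id = 2",
    "1-2\tw3w4\t" + BLANK_COLUMNS,
    "1\tw3\t" + BLANK_COLUMNS,
    "2\tw4\t" + BLANK_COLUMNS,
    "3\tw5\t" + BLANK_COLUMNS,
    "",
])

print(conllu_stats(SAMPLE))  # (2, 5)
```

To use it on a real file, the text can be read with `open(path, encoding="utf-8").read()` and passed to the function.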

I will be very grateful for your answers!

Trankit training for an unsupported language and with a custom embedding model #89
