This project began as a course project for CMSC 473, the Natural Language Processing class at UMBC. It compares two Seq2Seq NLP models, each intended to take a Japanese sentence and return the same sentence with its kanji converted to the correct phonetic readings in hiragana. Both are implemented in PyTorch, and both tokenize at the Unicode codepoint level (i.e., individual Python characters). So far, neither has been successful.
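Codepoint-level tokenization just means treating each Python character as its own token. A minimal sketch of what that looks like (the vocabulary construction and special tokens here are illustrative, not the project's actual names):

```python
# Sketch of codepoint-level tokenization. The <pad>/<eos> tokens and
# helper names are illustrative assumptions, not the project's code.
def build_vocab(sentences):
    """Map every distinct Unicode codepoint to an integer id."""
    chars = sorted({ch for s in sentences for ch in s})
    vocab = {"<pad>": 0, "<eos>": 1}  # reserve ids for special tokens
    vocab.update({ch: i + 2 for i, ch in enumerate(chars)})
    return vocab

def encode(sentence, vocab):
    """A sentence becomes a list of codepoint ids plus <eos>."""
    return [vocab[ch] for ch in sentence] + [vocab["<eos>"]]

vocab = build_vocab(["私は学生です", "わたしはがくせいです"])
print(encode("私は学生です", vocab))
```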
The training data is a file of roughly 245k Japanese sentence pairs, with the input sentence in the first column and the desired output in the second. The source sentences come from Tatoeba, a corpus released under a CC-BY license. The output column was generated with Sudachi, a morphological analyzer and lattice-based tokenizer for Japanese that produces the correct readings without machine learning.
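For reference, here is a rough sketch of how such an output column can be produced with Sudachi's Python bindings. It assumes `sudachipy` with a bundled dictionary (`pip install sudachipy sudachidict_core`) and is not necessarily the exact script used to build the training file:

```python
# Sketch: deriving hiragana readings with Sudachi (sudachipy).
# Assumption: this approximates, not reproduces, the data pipeline.
from sudachipy import dictionary, tokenizer

tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C  # longest-unit segmentation

def kata_to_hira(text):
    """Sudachi returns katakana readings; shift them to hiragana.
    Katakana U+30A1..U+30F6 sits 0x60 above the hiragana block."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )

def to_reading(sentence):
    morphemes = tokenizer_obj.tokenize(sentence, mode)
    return kata_to_hira("".join(m.reading_form() for m in morphemes))

print(to_reading("私は学生です"))  # わたしはがくせいです
```

Because Sudachi disambiguates readings with a lattice over its dictionary rather than a learned model, its output serves as a deterministic reference target for the Seq2Seq models to imitate.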