
Custom Data Notebook: Spaces in file paths can cause issues with bash commands #137

@cdleong

For example, a path such as /content/drive/My Drive/masakhane/$src-$tgt-$tag can cause issues because of the space in "My Drive". The following situation also caused an error for me:

source_file = f"/content/drive/My Drive/Research/Hani Machine Translation/hni_story_corpus/v2/hani_story_corpus_train.{source_language}"
target_file = f"/content/drive/My Drive/Research/Hani Machine Translation/hni_story_corpus/v2/hani_story_corpus_train.{target_language}"

# They should both have the same length.
! wc -l $source_file
! wc -l $target_file

Mitigations we could do:

"MyDrive" instead of "My Drive" helps

It seems you can simply switch from My Drive to MyDrive paths. This helps a lot as long as there are no spaces elsewhere in the path. In my case, Hani Machine Translation was also in the path to train.eng and train.hni, so the problem remained.

Add quotes around bash variables

For example:
! wc -l "$source_file" instead of ! wc -l $source_file

and

! head "$source_file"* instead of ! head $source_file*

But this doesn't completely solve the problem, and it gets complicated in some of the more complex cases later in the notebook, like

!cp -r joeynmt/models/${src}${tgt}_transformer/* "$gdrive_path/models/${src}${tgt}_transformer/"

or within the yaml file:

#load_model: "{gdrive_path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
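For the simpler commands, one way to sidestep shell quoting altogether is to pass the arguments as a Python list, so no shell ever splits them on spaces. This is only a sketch of the idea, not something the notebook does today; run_without_shell is a hypothetical helper name and the path below is made up for illustration.

```python
import subprocess
import sys


def run_without_shell(args):
    """Run a command given as a list of arguments and return its stdout.

    No shell is involved, so an argument containing spaces stays one argument.
    """
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


# Demonstration: the child process receives the whole path as a single argument.
path_with_spaces = "/content/drive/My Drive/Research/example file.txt"  # hypothetical path
child_code = "import sys; print(len(sys.argv) - 1)"  # number of extra arguments received
print(run_without_shell([sys.executable, "-c", child_code, path_with_spaces]).strip())
```

In the notebook, the equivalent of the wc call would be something like run_without_shell(["wc", "-l", source_file]). Glob patterns such as * are not expanded this way, so cases like the cp -r line above would still need separate handling.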

Warn the user about whitespace

Add a section that checks all the paths for whitespace and warns the user, suggesting that it may be easier to simply remove the spaces.
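A check like that could look roughly like the following. The function names are hypothetical, and the list of paths to check would have to be collected from wherever the notebook defines them.

```python
import warnings


def find_paths_with_whitespace(paths):
    """Return the paths (as strings) that contain any whitespace character."""
    return [str(p) for p in paths if any(ch.isspace() for ch in str(p))]


def warn_about_whitespace(paths):
    """Emit one warning per offending path and return the offending paths."""
    offenders = find_paths_with_whitespace(paths)
    for p in offenders:
        warnings.warn(
            f"Path contains whitespace and may break unquoted bash commands: {p!r}. "
            "Consider renaming it or using a path without spaces."
        )
    return offenders
```

Calling warn_about_whitespace([source_file, target_file]) near the top of the notebook would flag both paths from the example above.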

Do all our file manipulations with Python

We could rewrite a lot of these commands to use pathlib.
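As a rough illustration of what that might look like, here are Python stand-ins for the wc -l and cp -r calls. Both function names are hypothetical, and the behavior is close to, but not identical with, the shell versions: count_lines also counts a final line that lacks a trailing newline, and the copy includes hidden files that a * glob would skip.

```python
import shutil
from pathlib import Path


def count_lines(path):
    """Count the lines in a file, roughly like `wc -l`."""
    with Path(path).open("rb") as f:
        return sum(1 for _ in f)


def copy_directory_contents(src_dir, dst_dir):
    """Copy everything inside src_dir into dst_dir, roughly like `cp -r src/* dst/`.

    dst_dir is created if it does not exist (requires Python 3.8+ for dirs_exist_ok).
    """
    shutil.copytree(Path(src_dir), Path(dst_dir), dirs_exist_ok=True)
```

Since the paths never pass through a shell, spaces in them need no quoting, for example count_lines(source_file) == count_lines(target_file) for the length check above.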

See also pjreddie/darknet#1672 and https://stackoverflow.com/questions/56640534/cannot-open-train-txt-with-white-space-my-drivehe

Originally posted this on masakhane-io/masakhane-community#25, whoops.
