The Stanford NER tagger labels each word individually rather than whole names, so every token is tagged as part of an SMO (ORGANIZATION) or not. For example, Occupy Wall Street is returned as [('Occupy', 'ORGANIZATION'), ('Wall', 'ORGANIZATION'), ('Street', 'ORGANIZATION')].
To parse this into a single string, I've assumed that all consecutive ORGANIZATION tags belong to the same SMO. Does this seem like a reasonably robust approach, or should we try to come up with something else?
It seems to work as long as punctuation is kept as separate tokens (i.e. the commas separating a list of SMOs are tagged as non-organization), but I probably haven't thought about all edge cases.
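
For reference, here is a minimal sketch of the grouping rule described above, assuming the tagger output is a list of (token, tag) tuples and that non-entity tokens carry the tag 'O'. The function name merge_organizations and the sample sentence are made up for illustration; this is not necessarily how the existing code does it.

```python
from itertools import groupby


def merge_organizations(tagged_tokens, target_tag="ORGANIZATION"):
    """Join each run of consecutive target-tagged tokens into one string.

    Hypothetical helper: assumes tagged_tokens is a list of (token, tag) pairs.
    """
    merged = []
    # groupby starts a new group every time the tag changes, so each
    # uninterrupted run of ORGANIZATION tokens becomes one group.
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == target_tag:
            merged.append(" ".join(token for token, _ in group))
    return merged


# Made-up sample: the comma and "and" are tagged 'O', so they split the runs.
tagged = [
    ("Occupy", "ORGANIZATION"), ("Wall", "ORGANIZATION"), ("Street", "ORGANIZATION"),
    (",", "O"),
    ("Greenpeace", "ORGANIZATION"),
    ("and", "O"),
    ("Amnesty", "ORGANIZATION"), ("International", "ORGANIZATION"),
]

print(merge_organizations(tagged))
# ['Occupy Wall Street', 'Greenpeace', 'Amnesty International']
```

The obvious edge case is two different SMOs that appear next to each other with no non-organization token in between (for example, if the tokenizer leaves a comma attached to a word, or a headline drops the punctuation). In that case this rule merges them into one string.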