How to transform an open-source dataset for NLP training