Reuters-21578 text classification with Gensim and Keras

Reuters-21578 is a collection of about 20K news-lines (see reference for more information, downloads and copyright notice), structured using SGML and categorized with 672 labels. The labels are divided into five main categories:

- Topics
- Places
- People
- Organizations
- Exchanges

However, most of the labels are unused and, looking at the distribution, the remaining ones are far from homogeneous. These are the 20 most frequent labels, with the number of related news-lines (each name is prefixed with the first two letters of its main category):

ID   Name             Category       Newslines
161  pl_usa           Places         12542
533  to_earn          Topics         3987
498  to_acq           Topics         2448
158  pl_uk            Places         1489
84   pl_japan         Places         1138
31   pl_canada        Places         1104
571  to_money-fx      Topics         801
526  to_crude         Topics         634
543  to_grain         Topics         628
167  pl_west-germany  Places         567
624  to_trade         Topics         552
553  to_interest      Topics         513
56   pl_france        Places         469
185  or_ec            Organizations  349
23   pl_brazil        Places         332
628  to_wheat         Topics         […]
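As a rough illustration of how prefixed label names and their news-line counts like those listed above could be derived, here is a minimal sketch in Python. It is not the code used for this post. It assumes the usual Reuters-21578 SGML layout, in which each document is a REUTERS element whose TOPICS, PLACES, PEOPLE, ORGS and EXCHANGES elements wrap every label in a D tag. The `pe` and `ex` prefixes are inferred from the two-letter rule, and the inline sample, `extract_labels` and `count_labels` are hypothetical names introduced only for this example.

```python
import re
from collections import Counter

# SGML tag names mapped to two-letter prefixes (first two letters of each
# main category). "pe" and "ex" are inferred from that rule.
CATEGORY_PREFIXES = {
    "TOPICS": "to",
    "PLACES": "pl",
    "PEOPLE": "pe",
    "ORGS": "or",
    "EXCHANGES": "ex",
}

# Hypothetical two-document sample, used only to keep the sketch
# self-contained; the real files carry more attributes and fields.
SAMPLE_SGML = """
<REUTERS NEWID="1">
<TOPICS><D>earn</D></TOPICS>
<PLACES><D>usa</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<TEXT><BODY>Sample body one.</BODY></TEXT>
</REUTERS>
<REUTERS NEWID="2">
<TOPICS><D>grain</D><D>wheat</D></TOPICS>
<PLACES><D>usa</D><D>brazil</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS><D>ec</D></ORGS>
<EXCHANGES></EXCHANGES>
<TEXT><BODY>Sample body two.</BODY></TEXT>
</REUTERS>
"""

DOC_RE = re.compile(r"<REUTERS.*?</REUTERS>", re.DOTALL)
ITEM_RE = re.compile(r"<D>(.*?)</D>")


def extract_labels(document):
    """Return the prefixed labels (e.g. 'pl_usa') found in one document."""
    labels = []
    for tag, prefix in CATEGORY_PREFIXES.items():
        block = re.search(rf"<{tag}>(.*?)</{tag}>", document, re.DOTALL)
        if block:
            names = ITEM_RE.findall(block.group(1))
            labels.extend(f"{prefix}_{name}" for name in names)
    return labels


def count_labels(sgml_text):
    """Count how many documents carry each prefixed label."""
    counts = Counter()
    for document in DOC_RE.findall(sgml_text):
        counts.update(extract_labels(document))
    return counts


counts = count_labels(SAMPLE_SGML)
for label, n in counts.most_common(3):
    print(label, n)
```

Regular expressions keep the sketch dependency-free; for the full dataset a proper SGML or HTML parser would likely be more robust against malformed markup.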