Programmatically collect normalized news from (almost) any website. – kotartemiy/newscatcher

Filter by topic, country, or language.
By newscatcherapi.com. This package is fully self-sufficient: you can use it as is,
with no dependency on external services or APIs.
Demo
Motivation
While working on newscatcherapi, a JSON API for querying news articles,
I came up with the idea of making a simple Python package that would make it
easy to grab live news data.
When I was a junior data scientist working on my own side projects,
it was difficult for me to work with external data sources. I knew Python
quite well, but in most cases that was not enough to build proper data pipelines
that required gathering data on my own. I hope that this package will help you
with your next project.
Even though I do not recommend using this package in production systems,
I believe that it should be enough to test your assumptions and build some MVPs.
Installation
pip install newscatcher --upgrade
Quick Start
from newscatcher import Newscatcher
Get the latest news from the nytimes.com main news feed
(we support thousands of news websites; try it yourself!)
nc = Newscatcher(website='nytimes.com')
results = nc.get_news()

# results.keys()
# 'url', 'topic', 'language', 'country', 'articles'

# Get the articles
articles = results['articles']
first_article_summary = articles[0]['summary']
first_article_title = articles[0]['title']
Get the latest news from the nytimes.com politics feed
nc = Newscatcher(website='nytimes.com', topic='politics')
results = nc.get_news()
articles = results['articles']
There is a limited set of topics that you might find:
'tech', 'news', 'business', 'science', 'finance', 'food', 'politics', 'economics', 'travel', 'entertainment', 'music', 'sport', 'world'
However, not all topics are supported by every newspaper.
How to check which topics are supported by which newspaper:
from newscatcher import describe_url

describe = describe_url('nytimes.com')
print(describe['topics'])
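Because not every newspaper supports every topic, it can help to fall back to the main feed when a topic is missing. The following is a minimal sketch that assumes only the dictionary shape documented for describe_url(). The helper `choose_topic` is hypothetical (it is not part of newscatcher), and the sample dictionary is made up for illustration.

```python
def choose_topic(description, preferred):
    """Return `preferred` if the website lists it, else None (main feed).

    `description` is assumed to look like the dictionary returned by
    describe_url(): it has a 'topics' key holding a list of topic names.
    """
    topics = description.get('topics') or []
    return preferred if preferred in topics else None


# Made-up sample data in the documented shape (not real output).
sample_description = {
    'url': 'example.com',
    'topics': ['news', 'politics', 'tech'],
    'language': 'en',
    'country': 'US',
    'main_topic': 'news',
}

topic = choose_topic(sample_description, 'politics')
fallback = choose_topic(sample_description, 'music')
```

The chosen value could then be passed as the topic argument of Newscatcher; when it is None, the main feed is requested.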
Get the list of all news feeds by topic/language/country
If you want to find the full list of supported news websites,
you can always do so using the urls() function
from newscatcher import urls

# URLs by TOPIC
politic_urls = urls(topic='politics')

# URLs by COUNTRY
american_urls = urls(country='US')

# URLs by LANGUAGE
english_urls = urls(language='en')

# Combine any of topic, country, language
american_english_politics_urls = urls(country='US', topic='politics', language='en')

# Note: some websites do not explicitly declare their language;
# as a result, they are excluded from queries based on language.
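To make the filter semantics concrete, here is a small sketch of how combining topic, language, and country narrows a list of feeds, and why a website with no declared language drops out of language queries. It is an illustration over made-up records, not the way urls() is implemented; `filter_feeds` and the `.example` hosts are hypothetical.

```python
def filter_feeds(feeds, topic=None, language=None, country=None):
    """Return the URLs of feeds matching every filter that is not None."""
    wanted = {'topic': topic, 'language': language, 'country': country}
    matches = []
    for feed in feeds:
        # A feed whose field is None never equals a requested value,
        # so sites with no declared language drop out of language queries.
        if all(value is None or feed.get(key) == value
               for key, value in wanted.items()):
            matches.append(feed['url'])
    return matches


# Made-up records for illustration only.
sample_feeds = [
    {'url': 'a.example', 'topic': 'politics', 'language': 'en', 'country': 'US'},
    {'url': 'b.example', 'topic': 'politics', 'language': None, 'country': 'US'},
    {'url': 'c.example', 'topic': 'tech', 'language': 'en', 'country': 'GB'},
]

by_country = filter_feeds(sample_feeds, country='US')
by_language = filter_feeds(sample_feeds, language='en')
combined = filter_feeds(sample_feeds, country='US', topic='politics', language='en')
```

Here `b.example` matches the country query but is missing from both language-based results, which mirrors the note about undeclared languages.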
Documentation
Newscatcher Class
from newscatcher import Newscatcher

Newscatcher(website, topic=None)
Use the base form of the website's URL: without www., without https://, and without a / at the end.
For example: nytimes.com, news.ycombinator.com or theverge.com.
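If your input comes as full URLs, a small helper can reduce them to this base form before they are passed to Newscatcher. The sketch below is a hypothetical convenience function (`to_base_form` is not part of newscatcher) and handles only the cases named above: a scheme, a leading www., and a trailing slash or path.

```python
def to_base_form(url):
    """Strip the scheme, a leading 'www.', and any path from a URL string."""
    cleaned = url.strip().lower()
    # Drop the scheme if present (http:// or https://).
    if '://' in cleaned:
        cleaned = cleaned.split('://', 1)[1]
    # Keep only the host part, which also removes a trailing slash.
    cleaned = cleaned.split('/', 1)[0]
    # Drop a leading 'www.' but keep other subdomains such as 'news.'.
    if cleaned.startswith('www.'):
        cleaned = cleaned[len('www.'):]
    return cleaned


base = to_base_form('https://www.nytimes.com/')
```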
Newscatcher.get_news() – Get the latest news from the website of interest.
Allowed topics:
tech, news, business, science, finance, food,
politics, economics, travel, entertainment,
music, sport, world
If no topic is provided, the main feed is returned.
Returns a dictionary of 5 elements:

  1. url – URL of the website
  2. topic – topic of the returned feed
  3. language – language of returned feed
  4. country – country of returned feed
  5. articles – articles of the feed. Feedparser object

Newscatcher.get_headlines() – Returns only the headlines
Newscatcher.print_headlines(n) – Print top n headlines
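As an illustration of working with the dictionary described above, the sketch below pulls the first few titles out of a get_news()-style result. `top_titles` is a hypothetical helper, and the sample dictionary is made-up data in the documented shape, so the example runs without network access.

```python
def top_titles(results, n=5):
    """Return up to `n` article titles from a get_news()-style dictionary."""
    articles = results.get('articles') or []
    # Entries are accessed like dictionaries, as in the Quick Start.
    return [article['title'] for article in articles[:n] if 'title' in article]


# Made-up sample data in the documented shape (not real output).
sample_results = {
    'url': 'example.com',
    'topic': 'news',
    'language': 'en',
    'country': 'US',
    'articles': [
        {'title': 'First headline', 'summary': 'Summary one.'},
        {'title': 'Second headline', 'summary': 'Summary two.'},
        {'title': 'Third headline', 'summary': 'Summary three.'},
    ],
}

titles = top_titles(sample_results, n=2)
```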
describe_url() & urls()
These functions help you navigate this package.
from newscatcher import describe_url
describe_url(website) – Get the main info on the website.
Returns a dictionary of 5 elements:

  1. url – URL of the website
  2. topics – list of all supported topics
  3. language – language of website
  4. country – country of the website
  5. main_topic – main topic of a website

from newscatcher import urls
urls(topic=None, language=None, country=None) – Get a list of all supported
news websites for any combination of topic, language, and country.
Returns a list of the websites that match your combination of topic, language, and country.
Supported topics:
tech, news, business, science, finance, food,
politics, economics, travel, entertainment,
music, sport, world
Supported countries:
US, GB, DE, FR, IN, RU, ES, BR, IT, CA, AU, NL, PL, NZ, PT, RO, UA, JP, AR, IR, IE, PH, IS, ZA, AT, CL, HR, BG, HU, KR, SZ, AE, EG, VE, CO, SE, CZ, ZH, MT, AZ, GR, BE, LU, IL, LT, NI, MY, TR, BM, NO, ME, SA, RS, BA
Supported languages:
EL, IT, ZH, EN, RU, CS, RO, FR, JA, DE, PT, ES, AR, HE, UK, PL, NL, TR, VI, KO, TH, ID, HR, DA, BG, NO, SK, FA, ET, SV, BN, GU, MK, PA, HU, SL, FI, LT, MR, HI
Tech/framework used
The package itself is nothing more than a SQLite database that stores
the RSS feed endpoints for each website, plus a basic wrapper around
feedparser.
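The idea of a database of RSS endpoints can be sketched with the standard library alone. The table name, columns, and URLs below are assumptions made up for illustration; they are not the schema the package actually ships, and `find_rss_url` is a hypothetical function.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical feed table."""
    conn = sqlite3.connect(':memory:')
    conn.execute(
        'CREATE TABLE feeds ('
        'website TEXT, topic TEXT, is_main INTEGER, rss_url TEXT)'
    )
    conn.executemany(
        'INSERT INTO feeds VALUES (?, ?, ?, ?)',
        [
            ('example.com', 'news', 1, 'https://example.com/rss/home.xml'),
            ('example.com', 'politics', 0, 'https://example.com/rss/politics.xml'),
        ],
    )
    return conn


def find_rss_url(conn, website, topic=None):
    """Return the RSS endpoint for a website, or None if nothing matches."""
    if topic is None:
        # No topic requested: look up the feed flagged as the main one.
        query = 'SELECT rss_url FROM feeds WHERE website = ? AND is_main = 1'
        params = (website,)
    else:
        query = 'SELECT rss_url FROM feeds WHERE website = ? AND topic = ?'
        params = (website, topic)
    row = conn.execute(query, params).fetchone()
    return row[0] if row else None


conn = build_demo_db()
main_feed = find_rss_url(conn, 'example.com')
politics_feed = find_rss_url(conn, 'example.com', topic='politics')
missing_feed = find_rss_url(conn, 'example.com', topic='music')
```

In a full pipeline, the URL found this way would be handed to a parser such as feedparser.parse(rss_url). That step is left out here so the sketch needs no third-party dependency or network access.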
About Us
We are the Newscatcher API team. We are glad that you like our package.
If you want to search for any news data, consider using our API.
Artem Bugara – co-founder of Newscatcher, made v.0.1.0
Maksym Sugonyaka – co-founder of Newscatcher, made v.0.1.0
Becket Trotter – Python Developer, made v.0.2.0
Licence
MIT