Python and the NYTimes API

After installing Python v2.7 and registering a key for the New York Times Article Search API, I was ready to start writing code.

The first thing we have to do is find out how to request data from the NYT API. This means reading the documentation at http://developer.nytimes.com/docs and looking for tutorials on the net:

http://data-gov.tw.rpi.edu/wiki/How_to_use_New_York_Times_Article_Search_API

There is also a tool for generating URL request strings: http://prototype.nytimes.com/gst/apitool/index.html

These URL strings can simply be typed into the address bar of your browser, which will then show you the results. The results are a single line of text and symbols and are not very human-readable. The format I used was JSON; XML is also available in some of the APIs.

Our weapon of choice for sorting through these data formats will be Python. Python 2.7 ships with modules for requesting web pages and decoding JSON and XML.

First off, we want to request the data from the API. We use the urllib2 module for this:


import urllib2

offset = 2  # page number; each page holds ten articles
request_string = 'http://api.nytimes.com/svc/search/v1/article?format=json&query=germany+finances&offset='+str(offset)+'&api-key=####'
response = urllib2.urlopen(request_string)
content = response.read()

Here we build our request URL. In this case I’m searching for German finances with an offset of 2, which will deliver the 21st to the 30th article. The urllib2.urlopen function sends an HTTP request to the URL and returns the response, which we store in the “response” variable. Calling response.read() then gives us the content as a string.
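Instead of gluing the URL together by hand, the parameters can also be encoded with urlencode from the standard library. This is only a sketch: build_request_url and BASE_URL are names made up for this example, and 'YOUR-KEY' is a placeholder for your own key. The import is written so that it works on Python 2.7 as well as Python 3.

```python
try:
    from urllib import urlencode          # Python 2, as used in this post
except ImportError:
    from urllib.parse import urlencode    # Python 3

# Endpoint taken from the request string above.
BASE_URL = 'http://api.nytimes.com/svc/search/v1/article'

def build_request_url(query, offset, api_key):
    """Return the request URL for one page of search results (hypothetical helper)."""
    # A list of pairs keeps the parameters in a fixed order.
    params = [
        ('format', 'json'),
        ('query', query),       # spaces are encoded as '+'
        ('offset', offset),
        ('api-key', api_key),
    ]
    return BASE_URL + '?' + urlencode(params)

print(build_request_url('germany finances', 2, 'YOUR-KEY'))
```

The advantage is that search terms with spaces or special characters are escaped for us.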
We know that the result is a string in JSON format. This means that we can use the json module to decode and traverse the data sent to us:

import urllib2
import json
# ... request code from above ...
decoded = json.loads(content)
date_of_first_article = decoded['results'][0]['date']

This decodes the JSON string into Python dictionaries and lists, which are much easier to work with. You can visualise this data as a tree that starts with the ‘results’ node at the top, which has the articles as children.
To understand the structure of the JSON string, you can simply look at the string itself and read the API documentation. In this case, the results node contains a list of articles (accessed by [0] to [9]), and every article node has data such as its text, date, title and the url of the original story.
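To make the tree structure concrete, here is a small sketch that decodes a shortened, hand-written response. The sample string is an assumption for illustration: it only contains the three fields used in this post, while a real response carries more.

```python
import json

# Hypothetical, shortened response with two articles.
sample = '''
{"results": [
  {"date": "20130508",
   "title": "Backing Grows for European Bank Plan",
   "url": "http://www.nytimes.com/2013/05/08/business/global/08iht-euro08.html"},
  {"date": "20130504",
   "title": "Euro Area Recession Is Expected to Deepen",
   "url": "http://www.nytimes.com/2013/05/04/business/global/04iht-euro04.html"}
]}
'''

decoded = json.loads(sample)     # the top of the tree is a dictionary
articles = decoded['results']    # the 'results' node is a list of articles
first = articles[0]              # every article is a dictionary again

print(len(articles))
print(first['date'])
for article in articles:
    print(article['title'])
```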
In this example, we will make multiple queries in a for-loop and extract the date, title and url.
Here is what such a for-loop could look like:
for offset in xrange(10):
    request_string = 'http://api.nytimes.com/svc/search/v1/article?format=json&query=germany+finances&offset='+str(offset)+'&api-key=####'
    response = urllib2.urlopen(request_string)
    content = response.read()

This for-loop assigns the values 0 to 9 to the “offset” variable and runs the indented code once for each value. For now, it only makes queries with different offsets; we do not do anything with the data yet. We can now use the decoded JSON objects to extract data and write it to a file:
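To see which articles each offset covers, the arithmetic can be sketched as follows. It assumes ten articles per response, as described above. article_range is a helper name invented for this example.

```python
PAGE_SIZE = 10  # assumption: ten articles per response

def article_range(offset, page_size=PAGE_SIZE):
    """Return the 1-based positions of the first and last article on a page."""
    first = offset * page_size + 1
    last = (offset + 1) * page_size
    return first, last

for offset in range(3):
    print(offset, article_range(offset))
```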

import urllib2
import json

# Dates arrive as strings like '20130523' (year, month, day).
def year(d):
    return d[0:4]

def month(d):
    return d[4:6]

def day(d):
    return d[6:8]

f = open('newfile', 'w')
for offset in xrange(2):
    print "-"
    request_string = 'http://api.nytimes.com/svc/search/v1/article?format=json&query=germany+finances&offset='+str(offset)+'&api-key=####'
    response = urllib2.urlopen(request_string)
    content = response.read()
    decoded = json.loads(content)
    # Write one tab-separated line per article: date, title, url.
    for x in decoded['results']:
        string = x['date']
        f.write(year(string) + " " + month(string) + " " + day(string) + "\t")
        # "\xe2\x80\x99" and "\xe2\x80\x98" are the UTF-8 bytes of the typographic single quotes.
        f.write(x['title'].encode('utf-8').replace("\xe2\x80\x99", "'").replace("\xe2\x80\x98", "'") + "\t")
        f.write(x['url'] + "\n")
f.close()
raw_input("Press enter to quit")

I’ll walk you through this code bit by bit.

The first two lines import modules so that we can use them in our program. We need urllib2 to make HTTP requests and json to decode the JSON string we get from the NYT API.

Then there are three function definitions that make it a little easier to work with dates. In the JSON objects, dates are just strings like 20130523, which is the 23rd day of the 5th month of 2013. The syntax d[0:4] takes the string stored in the variable d and returns its first four characters as a new string, in this example 2013.
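As an alternative to slicing by hand, the standard library can parse such a date string. This is a sketch, not what the program above does. parse_nyt_date is a name made up here, and it assumes the first eight characters are always year, month and day.

```python
from datetime import datetime

def parse_nyt_date(d):
    """Turn a string like '20130523' into a date object (hypothetical helper)."""
    # Only the first eight characters (YYYYMMDD) are used.
    return datetime.strptime(d[0:8], '%Y%m%d').date()

parsed = parse_nyt_date('20130523')
print(parsed.year, parsed.month, parsed.day)
print(parsed.strftime('%Y %m %d'))   # same layout as the file we write
```

A date object has the advantage that invalid strings raise an error instead of silently producing wrong pieces.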

The next line creates a new file, which we can ‘w’rite to.

The for-loop then assigns offset the values 0 and 1.

We request data for every offset and decode the answer. The next for-loop is a little more complicated: it assigns each article in the current result to the variable x in turn. It reads: for each value in the list of results, assign the value to x and run the indented code.

We then read the article's date, title and url and write them to the file we created earlier. We have to encode the title string to UTF-8, because the json module returns unicode strings and Python 2 fails when it has to write a non-ASCII character such as the typographic apostrophe (’) to a file without an explicit encoding. We also replace these typographic quotes with plain apostrophes.
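The quote replacement and the encoding step can be sketched on their own like this. clean_title is a name invented for this example. The sketch replaces the quotes before encoding, using unicode escapes, so it does not depend on the encoding of the source file.

```python
def clean_title(title):
    """Replace typographic single quotes with plain apostrophes, then encode to UTF-8 bytes."""
    # u'\u2019' is the right single quote (’), u'\u2018' the left one (‘).
    plain = title.replace(u'\u2019', u"'").replace(u'\u2018', u"'")
    return plain.encode('utf-8')

encoded = clean_title(u'Italy\u2019s New Premier Puts Stimulus First')
print(encoded)
```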

Notice that we use “\t” to separate date, title and url. This way we can later use the resulting text file as a table and, for example, load it into Google Fusion Tables.
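A minimal sketch of the tab-separated layout: one function builds a line, another splits it back into its fields. to_row and from_row are names invented for this example. It assumes that titles and urls never contain a tab themselves.

```python
def to_row(date, title, url):
    """Join the three fields with tabs and end the line."""
    return '\t'.join([date, title, url]) + '\n'

def from_row(line):
    """Split one line of the file back into its fields."""
    return line.rstrip('\n').split('\t')

row = to_row('2013 05 08',
             'Backing Grows for European Bank Plan',
             'http://www.nytimes.com/2013/05/08/business/global/08iht-euro08.html')
print(from_row(row))
```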

Also notice that the last line is NOT inside the for-loops anymore. It simply waits for the user to press enter while the program is running. This way, we can print text on the screen without the window disappearing instantly (on Windows).

 

After saving this code as a .py file, I ran it and got the following output:

2013 05 08 Backing Grows for European Bank Plan http://www.nytimes.com/2013/05/08/business/global/08iht-euro08.html
2013 05 04 Euro Area Recession Is Expected to Deepen http://www.nytimes.com/2013/05/04/business/global/04iht-euro04.html
2013 05 01 In Continuing Sign of Weakness, Unemployment Hits New High in the Euro Zone http://www.nytimes.com/2013/05/01/business/global/european-unemployment-sets-another-record.html
2013 04 30 Italy's New Premier Puts Stimulus First http://www.nytimes.com/2013/04/30/world/europe/enrico-letta-italys-new-premier-puts-stimulus-first.html
2013 04 30 Italy's New Premier Puts Stimulus First http://query.nytimes.com/gst/fullpage.html?res=9E07E3DB1439F933A05757C0A9659D8B63
2013 04 28 Germans' Dominance Is Peak of a Long Climb http://www.nytimes.com/2013/04/28/sports/soccer/germanys-champions-league-dominance-is-peak-of-a-long-climb-back.html
2013 04 23 German Soccer Hero Faces Prison Amid Record Run and Scandal http://www.nytimes.com/2013/04/23/world/europe/bayern-president-faces-prison-in-tax-evasion-case.html
2013 04 23 MEMO FROM EUROPE; Shrinking Europe Military Spending Stirs Concern http://www.nytimes.com/2013/04/23/world/europe/europes-shrinking-military-spending-under-scrutiny.html
2013 04 19 Plan for Cyprus Bailout Wins Easy Approval in Germany http://www.nytimes.com/2013/04/19/business/global/german-lawmakers-back-cyprus-bailout.html
2013 04 16 Path to Growth Splits Europe, With Austerity A Central Issue http://www.nytimes.com/2013/04/16/business/global/europe-split-over-austerity-as-a-path-to-growth.html
2013 04 13 Bailout Terms Are Eased for Ireland and Portugal http://www.nytimes.com/2013/04/13/business/global/euro-zone-finance-ministers-gather-in-ireland.html
2013 04 12 OP-ED CONTRIBUTOR; Cybersecurity: A View From the Front http://www.nytimes.com/2013/04/12/opinion/global/cybersecurity-a-view-from-the-front.html
2013 04 11 Economies of France and Italy Are Risks to the Euro Zone, a Report Says http://www.nytimes.com/2013/04/11/business/global/italy-and-france-are-risks-to-euro-zone-report-says.html
2013 04 11 Hollande Creates a Prosecutor for Fraud and Vows to End Tax Havens http://query.nytimes.com/gst/fullpage.html?res=9C06EFDF1F3FF932A25757C0A9659D8B63
2013 04 11 Hollande Creates a Prosecutor for Fraud and Vows to End Tax Havens http://www.nytimes.com/2013/04/11/business/global/european-countries-move-to-toughen-stance-on-tax-evasion.html
2013 04 10 Europe's Data on Wealth Needs a Grain of Salt http://www.nytimes.com/2013/04/10/business/global/germans-are-poor-and-italians-are-frugal-huh.html
2013 04 07 FUNDAMENTALLY; Europe's Markets, No Longer in Lock Step http://query.nytimes.com/gst/fullpage.html?res=9D05E3DF1F3CF934A35757C0A9659D8B63
2013 04 07 FUNDAMENTALLY; Europe's Markets, No Longer in Lock Step http://www.nytimes.com/2013/04/07/your-money/europes-markets-no-longer-in-lock-step.html
2013 04 06 WEALTH MATTERS; Overseas Finances Can Trip Up Americans Abroad http://www.nytimes.com/2013/04/06/your-money/rules-aimed-at-tax-evasion-abroad-trip-up-average-americans.html
2013 04 02 INSIDE EUROPE; Germany Appoints Itself Parent to Restive Euro Children http://www.nytimes.com/2013/04/02/business/global/germany-appoints-itself-parent-to-restive-euro-children.html
Posted in Uncategorized | Leave a comment

Sources for data-journalism

Reading a lot about data-journalism. Looking for examples and guidelines.

http://www.guardian.co.uk/news/datablog/2011/jul/28/data-journalism
http://datajournalismhandbook.org/1.0/en/introduction.html
http://datajournalismhandbook.org/1.0/en/getting_data_0.html
http://datajournalismhandbook.org/1.0/en/understanding_data_6.html

examples:
http://www.guardian.co.uk/profile/davidmccandless
http://jonathanstray.com/a-full-text-visualization-of-the-iraq-war-logs

 

very important:

http://datajournalismhandbook.org/1.0/en/understanding_data_7.html

Posted in Uncategorized | Leave a comment

As a journalist, it is important to get data from reliable sources. One such source is Danmarks Statistik: http://www.dst.dk/da

 

In this post, we are going to collect data from dst.dk and import it into Google Fusion Tables.

Also, we are going to encounter lots of problems, and we have to learn how to be flexible and how to deal with them.

First, click on “Find Statistik” and then “Statistikbanken”. Here you can find a wealth of information. I chose the following table: http://www.statistikbanken.dk/BOP6

Clicking on this link will show you the following site:

Image

 

Here you can choose exactly which data you want to see.

I was interested in the transport sector, especially in relation to Germany over the course of the last decade. If you hold down the control key on your keyboard, you can choose multiple entries in each list. My selection looked like this:

Image

Then you can click “Vis Tabel” to see the data set and download it as a file.

Google Fusion Tables understands .xls and tab-separated .csv files. It also supports other formats, but I chose the tab-separated file this time around.

Image

 

We now want to import the file into Fusion Tables. Go to Fusion Tables at http://www.google.com/drive/start/apps.html#fusiontables, create a new table, choose to upload the file and select “tab” as your separator character.

Image

After the file is uploaded, you will notice that the table does not look reasonable at all!

Image

Apparently, we have to clean our data set first so that Fusion Tables can understand it. Use a text editor, e.g. Notepad++ on Windows or gedit on Linux, to open the file.

Here is what the file looks like for me:

Image

 

We can see that the file has multiple parts, one for “Tyskland” (Germany) and one for “Lande i alt” (countries in total), plus some more information, like the name of the table. To fix this, I am going to create two new, empty files and copy and paste the data for Germany and the other countries into them separately. Then I add the column names (years) with tabs in between, make sure there are tabs between the numbers and delete unnecessary quotes.
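The same cleaning could also be scripted instead of done by hand. The following is only a sketch under assumptions: the raw lines, the header names and the numbers are invented placeholders, not the real download from Danmarks Statistik, and clean_line is a hypothetical helper. It removes double quotes, drops empty fields and joins the rest with tabs.

```python
def clean_line(line):
    """Remove double quotes and empty fields; return the fields joined by tabs."""
    fields = [field.strip() for field in line.replace('"', '').split('\t')]
    return '\t'.join(field for field in fields if field)

# Hypothetical raw lines; the numbers are placeholders, not real statistics.
raw_lines = [
    '"Tyskland"\t\t"100"\t"200"',
    '"Lande i alt"\t"1000"\t"2000"',
]

# Column names (years) separated by tabs, as described above.
header = '\t'.join(['Land', '2003', '2004'])
cleaned = [header] + [clean_line(line) for line in raw_lines]
print('\n'.join(cleaned))
```

Dropping empty fields only works if a value is never missing in the middle of a row. Otherwise the columns would shift, so check the result before uploading it.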

The file for the other countries now looks like this:

Image

I can now upload this file to Fusion Tables, which understands it properly this time.

Image

 

On the next page we attribute the data to “Danmarks Statistik” and provide a link to where the data can be found.

Image

We end up with a nice table:

Image

I apply the exact same process to the table with the Germany data.

You can now press the + button and select “new chart”. Here you can play around with different visualizations of the data. When I did this, I found that none of the visualizations really showed what I wanted. Then it hit me: the organization of the data in the table did not make a lot of sense.

It is much more useful to have years organized in rows instead of having one column for each year.
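What this reorganization means can be sketched in a few lines of Python. The example turns a table with one column per year into rows with one year each. wide_to_long is a name invented here, and the header and numbers are placeholders, not real statistics. The layout I ended up with may differ in detail.

```python
def wide_to_long(header, rows):
    """Turn one-column-per-year rows into (label, year, value) rows."""
    years = header[1:]                      # first column holds the label
    long_rows = []
    for row in rows:
        label, values = row[0], row[1:]
        for year, value in zip(years, values):
            long_rows.append((label, year, value))
    return long_rows

# Placeholder table: one row per country, one column per year.
header = ['Land', '2003', '2004']
rows = [['Tyskland', 100, 200]]

for entry in wide_to_long(header, rows):
    print(entry)
```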

I went back to the Danmarks Statistik web page and looked for options to change the format of the data. It looks like we are in luck: the page offers an option to pivot the data set. After some fiddling around, I came up with the following organization:

Image

Now I downloaded the xls file, which can be opened in Microsoft Excel or LibreOffice / OpenOffice.

After a bit of copying and pasting into text files, I had this:

Image

 

Overall, the use of the xls file format made it a lot easier to create this file.

Upload the new file to Fusion Tables and check out the charts again. Now the graphs make a lot more sense.

Image

 

My point with this post is that we are going to run into lots of trouble when we work with tables and data. Formatting and organization of data are important, and we have to learn how to get them right and which tools can support us with these tasks.

Aside | Posted on by | Leave a comment

As a prerequisite, you should have a Google account. If you do not have one, go to https://accounts.google.com/ServiceLogin?hl=da&continue=https://www.google.dk/ and create your account.

Once you have registered your account and logged in, you can get access to your Google Drive by using the menu at the top of the page, which should look like this:

Image

There you can download Google Drive for PC, which will make matters easier for you in the long run.

Now we want to use Google Fusion Tables to create a very simple dataset and a map.

You can find Fusion Tables here: http://www.google.com/drive/start/apps.html#fusiontables

If you’re using the Chrome browser, you can also install the corresponding app.

Click on “Create a new table” and you will see a screen similar to this:

Image

I already entered some keywords I want to look for. Currently I am interested in how games sell in different countries, so I look for “game sales per country” and hit the search button.

Fusion Tables tries to find tables it understands, and the first result is to my liking.

Image

You can tell Fusion Tables to import the tables it finds. This will save the table in your Google Drive folder and sync the file to your computer ( if you have Google Drive for PC installed ).

In the online view of Drive, the file shows up:

Image

You now have the possibility to open the file in Google Fusion Tables. Depending on when you read this blog post, there may be a notification on the right inviting you to use the new look or view. Click that too. This will take you to the Fusion Tables view of the data we just imported.

Image

Google Fusion Tables automatically recognizes the “Countries” column, so just click on “Map of Countries” and it will generate a simple map showing the data points.

Image

We are now on track to learn more about data gathering and visualization.


Multiple Users for Blogs

Apparently, you can add multiple users to your blog. This means that each group can have one shared blog, instead of everyone using an individual one.


Initial commit

This is the first entry. It only takes five minutes to set up a WordPress blog.

Graphs and tables will follow shortly.
