
3.3 Data Wrangling Project

July 2, 2017

0.1 3.3 Data Wrangling Project


In [2]: import pandas as pd
import json
from pandas.io.json import json_normalize

In [198]: m_data = pd.read_json('data_wrangling_json/data/world_bank_projects.json')
m_data.head()

Out[198]: _id approvalfy board_approval_month


0 {'$oid': '52b213b38594d8a2be17c780'} 1999 November
1 {'$oid': '52b213b38594d8a2be17c781'} 2015 November
2 {'$oid': '52b213b38594d8a2be17c782'} 2014 November
3 {'$oid': '52b213b38594d8a2be17c783'} 2014 October
4 {'$oid': '52b213b38594d8a2be17c784'} 2014 October

boardapprovaldate borrower \
0 2013-11-12T00:00:00Z FEDERAL DEMOCRATIC REPUBLIC OF ETHIOPIA
1 2013-11-04T00:00:00Z GOVERNMENT OF TUNISIA
2 2013-11-01T00:00:00Z MINISTRY OF FINANCE AND ECONOMIC DEVEL
3 2013-10-31T00:00:00Z MIN. OF PLANNING AND INT'L COOPERATION
4 2013-10-31T00:00:00Z MINISTRY OF FINANCE

closingdate country_namecode \
0 2018-07-07T00:00:00Z Federal Democratic Republic of Ethiopia!$!ET
1 NaN Republic of Tunisia!$!TN
2 NaN Tuvalu!$!TV
3 NaN Republic of Yemen!$!RY
4 2019-04-30T00:00:00Z Kingdom of Lesotho!$!LS

countrycode countryname countryshortnam


0 ET Federal Democratic Republic of Ethiopia Ethiopi
1 TN Republic of Tunisia Tunisi
2 TV Tuvalu Tuval
3 RY Republic of Yemen Yemen, Republic o
4 LS Kingdom of Lesotho Lesoth

... sectorcode source

0 ... ET,BS,ES,EP IBRD
1 ... BZ,BS IBRD
2 ... TI IBRD
3 ... JB IBRD
4 ... FH,YW,YZ IBRD

status supplementprojectflg \
0 Active N
1 Active N
2 Active Y
3 Active N
4 Active N

theme1 \
0 {'Name': 'Education for all', 'Percent': 100}
1 {'Name': 'Other economic management', 'Percent...
2 {'Name': 'Regional integration', 'Percent': 46}
3 {'Name': 'Participation and civic engagement',...
4 {'Name': 'Export development and competitivene...

theme_namecode themecode total


0 [{'name': 'Education for all', 'code': '65'}] 65 130000
1 [{'name': 'Other economic management', 'code':... 54,24
2 [{'name': 'Regional integration', 'code': '47'... 52,81,25,47 6060
3 [{'name': 'Participation and civic engagement'... 59,57
4 [{'name': 'Export development and competitiven... 41,45 13100

totalcommamt url
0 130000000 http://www.worldbank.org/projects/P129828/ethi...
1 4700000 http://www.worldbank.org/projects/P144674?lang=en
2 6060000 http://www.worldbank.org/projects/P145310?lang=en
3 1500000 http://www.worldbank.org/projects/P144665?lang=en
4 13100000 http://www.worldbank.org/projects/P144933/seco...

[5 rows x 50 columns]

In [183]: m_data.columns

Out[183]: Index(['_id', 'approvalfy', 'board_approval_month', 'boardapprovaldate',


'borrower', 'closingdate', 'country_namecode', 'countrycode',
'countryname', 'countryshortname', 'docty', 'envassesmentcategoryc
'grantamt', 'ibrdcommamt', 'id', 'idacommamt', 'impagency',
'lendinginstr', 'lendinginstrtype', 'lendprojectcost',
'majorsector_percent', 'mjsector_namecode', 'mjtheme',
'mjtheme_namecode', 'mjthemecode', 'prodline', 'prodlinetext',
'productlinetype', 'project_abstract', 'project_name', 'projectdoc
'projectfinancialtype', 'projectstatusdisplay', 'regionname', 'sec
'sector1', 'sector2', 'sector3', 'sector4', 'sector_namecode',

'sectorcode', 'source', 'status', 'supplementprojectflg', 'theme1'
'theme_namecode', 'themecode', 'totalamt', 'totalcommamt', 'url'],
dtype='object')

0.1.1 1. Find the 10 countries with most projects


In [177]: m_data[m_data.countryname !=""][m_data.project_name != ""]["countryname"]

Out[177]: People's Republic of China 19


Republic of Indonesia 19
Socialist Republic of Vietnam 17
Republic of India 16
Republic of Yemen 13
People's Republic of Bangladesh 12
Kingdom of Morocco 12
Nepal 12
Republic of Mozambique 11
Africa 11
Name: countryname, dtype: int64
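
The cell above is cut off at the page edge in this export, so the exact call that produced the ranking cannot be recovered. The following is a minimal sketch of one way to get such a ranking. It assumes a small made-up frame (`toy` and its rows are hypothetical, not the World Bank data) and combines both filters into a single boolean mask before counting:

```python
import pandas as pd

# Hypothetical toy data standing in for the World Bank projects frame.
toy = pd.DataFrame({
    "countryname": ["Nepal", "Nepal", "Republic of India", "", "Republic of India", "Nepal"],
    "project_name": ["A", "B", "C", "D", "", "E"],
})

# One combined mask avoids chaining two boolean selections.
mask = (toy["countryname"] != "") & (toy["project_name"] != "")

# value_counts sorts by frequency (descending); head(10) keeps at most ten rows.
top_countries = toy.loc[mask, "countryname"].value_counts().head(10)
print(top_countries)
```

Note that the real output includes "Africa" as a `countryname`, so regional projects are counted alongside countries unless they are filtered out first.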

0.1.2 2. Find the top 10 major project themes (using column mjtheme_namecode)
In [199]: theme = m_data['mjtheme_namecode']
theme[0]

Out[199]: [{'code': '8', 'name': 'Human development'}, {'code': '11', 'name': ''}]

In [203]: new_theme = pd.DataFrame(columns=['code','name'])
for item in theme:
    new_theme = new_theme.append(json_normalize(item))
new_theme.head(10)

Out[203]: code name


0 8 Human development
1 11
0 1 Economic management
1 6 Social protection and risk management
0 5 Trade and integration
1 2 Public sector governance
2 11 Environment and natural resources management
3 6 Social protection and risk management
0 7 Social dev/gender/inclusion
1 7 Social dev/gender/inclusion
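
`DataFrame.append` was removed in pandas 2.0, and `json_normalize` is now available as `pd.json_normalize`. Below is a sketch of the same flattening on newer pandas, assuming a hypothetical two-project stand-in for `m_data['mjtheme_namecode']`. It also counts themes by name, which is the ranking the heading asks for:

```python
import pandas as pd

# Hypothetical stand-in for m_data['mjtheme_namecode']: one list of dicts per project.
theme = pd.Series([
    [{"code": "8", "name": "Human development"}, {"code": "11", "name": ""}],
    [{"code": "1", "name": "Economic management"},
     {"code": "8", "name": "Human development"}],
])

# Flatten each project's list and stack the pieces into one frame.
new_theme = pd.concat(
    [pd.json_normalize(item) for item in theme],
    ignore_index=True,  # clean 0..n-1 index instead of repeated 0, 1, ...
)

# Rank themes by frequency; blank names still form their own group at this stage.
top_themes = new_theme["name"].value_counts().head(10)
print(top_themes)
```

Because blank names are counted as a separate group, the ranking is only meaningful once the missing names have been filled in.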

0.1.3 3. In 2. above you will notice that some entries have only the code and the name is
missing. Create a dataframe with the missing names filled in.
In [185]: # create a reference mapping from code to name
complete_theme = new_theme[new_theme.name != ""]
uniq_theme = complete_theme.drop_duplicates()
name_dict = uniq_theme.set_index("code").to_dict()["name"] # code as index
name_dict

Out[185]: {'1': 'Economic management',


'10': 'Rural development',
'11': 'Environment and natural resources management',
'2': 'Public sector governance',
'3': 'Rule of law',
'4': 'Financial and private sector development',
'5': 'Trade and integration',
'6': 'Social protection and risk management',
'7': 'Social dev/gender/inclusion',
'8': 'Human development',
'9': 'Urban development'}

In [192]: name_dict["1"]

Out[192]: 'Economic management'
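
The lookup only works if each code carries exactly one name. Here is a minimal sketch of that check, assuming a hypothetical five-row `new_theme` (names taken from the dictionary above):

```python
import pandas as pd

# Hypothetical miniature of new_theme; row 1 has a blank name.
new_theme = pd.DataFrame({
    "code": ["8", "11", "8", "11", "1"],
    "name": ["Human development", "", "Human development",
             "Environment and natural resources management",
             "Economic management"],
})

# Keep rows that have a name and collapse exact (code, name) repeats.
named = new_theme[new_theme["name"] != ""].drop_duplicates()

# If a code still appears twice here, it is paired with two different names.
assert named["code"].is_unique

name_dict = named.set_index("code")["name"].to_dict()
print(name_dict)
```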

In [223]: new_theme.index = range(len(new_theme)) # re-label index


new_theme.loc[0]["name"] # use loc to find row

Out[223]: 'Human development'

In [233]: # fill in the blank values
for item in new_theme.itertuples(): # get a named tuple for each row
    if item[2] == "":
        new_theme.set_value(item[0], 'name', name_dict[item[1]]) # since
[new_theme.head(10), new_theme.tail(10)]

Out[233]: [ code name


0 8 Human development
1 11 Environment and natural resources management
2 1 Economic management
3 6 Social protection and risk management
4 5 Trade and integration
5 2 Public sector governance
6 11 Environment and natural resources management
7 6 Social protection and risk management
8 7 Social dev/gender/inclusion
9 7 Social dev/gender/inclusion,
code name
1489 8 Human development
1490 10 Rural development
1491 6 Social protection and risk management
1492 10 Rural development
1493 10 Rural development
1494 10 Rural development

1495 9 Urban development
1496 8 Human development
1497 5 Trade and integration
1498 4 Financial and private sector development]
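
`DataFrame.set_value` was removed in pandas 1.0. On newer versions, `new_theme.at[row, 'name'] = value` is the per-cell equivalent. The loop can also be replaced by a vectorized lookup. The following sketch assumes a hypothetical four-row `new_theme` and a three-entry `name_dict` copied from the dictionary shown earlier:

```python
import pandas as pd

# Hypothetical miniature of new_theme; row 1 has a blank name.
new_theme = pd.DataFrame({
    "code": ["8", "11", "1", "11"],
    "name": ["Human development", "", "Economic management",
             "Environment and natural resources management"],
})
name_dict = {
    "1": "Economic management",
    "8": "Human development",
    "11": "Environment and natural resources management",
}

# Map codes to names for the blank rows only and write the results back.
blank = new_theme["name"] == ""
new_theme.loc[blank, "name"] = new_theme.loc[blank, "code"].map(name_dict)

print(new_theme)
```

Restricting the assignment to the `blank` mask leaves existing names untouched, which mirrors the `if item[2] == ""` test in the loop.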
