
Crawling Data by Calling an API

12 Feb 2020

Contents 🧾

  • The Guardian API
  • Inspect all sections and search for technology-based sections
  • Manual query on whole API
  • Parsing the JSON
  • Verifying the status code
  • Listing the results

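Main Text 📝

The API-based crawling our lecturer covered in the ADS class really is far more convenient than BeautifulSoup! (Although not every website offers an API, hahaha.) Also, I unexpectedly discovered that Jupyter Notebook can convert an .ipynb file directly to Markdown! 🥳🥳🥳

Source file link 🔗: API data crawling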

The Guardian API

In the beautiful_soup.ipynb notebook, I showed how BeautifulSoup can be used to parse messy HTML, to extract information, and to act as a rudimentary web crawler. I used The Guardian as an illustrative example of how this can be achieved. The reason for choosing The Guardian was that they provide a REST API to their servers. With this API it is possible to perform specific queries on their servers and to receive current information in the format described in their API guide (i.e. JSON):

http://open-platform.theguardian.com/

In order to use their API, you will need to register for an API key. At the time of writing (Feb 1, 2017) this was an automated process that could be completed at

http://open-platform.theguardian.com/access/

The API is documented here:

http://open-platform.theguardian.com/documentation/

and Python bindings to their API are available here

https://github.com/prabhath6/theguardian-api-python

and these can easily be integrated into a web-crawler based on API calls, rather than being based on HTML parsing, etc.

We use four parameters in our queries here:

  1. section: the section of the newspaper that we are interested in querying. In this case I’m looking in the technology section

  2. order-by: I have specified that the newest items should be closer to the front of the query list

  3. api-key: I have left this as test (which works here), but for real deployment of such a spider a real API key should be specified

  4. page-size: The number of results to return.

from __future__ import print_function

import requests 
import json 

Inspect all sections and search for technology-based sections

url = 'https://content.guardianapis.com/sections?api-key=02b16042-1f5f-41e9-8d28-52f17e3982da'
req = requests.get(url)
src = req.text 

Convert the data to a Python data structure and check the status:

json.loads(src)['response']['status']
'ok'

json.loads: converts JSON into a Python data structure

sections = json.loads(src)['response']

print(sections.keys())
dict_keys(['status', 'userTier', 'total', 'results'])

json.dumps: converts a Python data structure into JSON

print(json.dumps(sections['results'][0], indent=2, sort_keys=True))
{
  "apiUrl": "https://content.guardianapis.com/about",
  "editions": [
    {
      "apiUrl": "https://content.guardianapis.com/about",
      "code": "default",
      "id": "about",
      "webTitle": "About",
      "webUrl": "https://www.theguardian.com/about"
    }
  ],
  "id": "about",
  "webTitle": "About",
  "webUrl": "https://www.theguardian.com/about"
}
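As a small offline illustration of the two calls used above, here is a minimal sketch that round-trips a hand-written JSON string. The string `raw` is a made-up, shortened stand-in for one section record, not real API output.

```python
import json

# Hypothetical, shortened stand-in for one section record returned by the API.
raw = '{"webTitle": "About", "id": "about", "editions": [{"code": "default"}]}'

# json.loads: JSON text -> Python dict / list / str / ...
section = json.loads(raw)
print(type(section).__name__)
print(section['editions'][0]['code'])

# json.dumps: Python object -> JSON text; indent and sort_keys only affect layout.
pretty = json.dumps(section, indent=2, sort_keys=True)
print(pretty)
```

With sort_keys=True the keys come out in alphabetical order, which is why apiUrl is printed first in the output above.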

lower(): converts all uppercase letters to lowercase

for result in sections['results']: 
    if 'tech' in result['id'].lower(): 
        print(result['webTitle'], result['apiUrl'])
Technology https://content.guardianapis.com/technology

Manual query on whole API

# Specify the arguments
args = {
    'section': 'technology', 
    'order-by': 'newest', 
    'api-key': '02b16042-1f5f-41e9-8d28-52f17e3982da', 
    'page-size': '100',
    'q' : 'privacy%20AND%20data'
}

# Construct the URL
base_url = 'http://content.guardianapis.com/search'
url = '{}?{}'.format(
    base_url, 
    '&'.join(["{}={}".format(kk, vv) for kk, vv in args.items()])
)
print(url)
# Make the request and extract the source
req = requests.get(url) 
src = req.text
http://content.guardianapis.com/search?section=technology&order-by=newest&api-key=02b16042-1f5f-41e9-8d28-52f17e3982da&page-size=100&q=privacy%20AND%20data
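The query string above was joined by hand, with the space in the q value already written as %20. A minimal alternative sketch, assuming only the standard library, lets urllib.parse.urlencode do the percent-encoding. The helper name build_search_url and the placeholder key 'test' are illustrative choices, not part of the original notebook.

```python
from urllib.parse import urlencode, quote


def build_search_url(base_url, args):
    """Join base_url with a percent-encoded query string built from args."""
    # quote_via=quote encodes spaces as %20 (the default, quote_plus, would use '+').
    return '{}?{}'.format(base_url, urlencode(args, quote_via=quote))


example_args = {
    'section': 'technology',
    'order-by': 'newest',
    'api-key': 'test',          # placeholder; substitute your own key
    'page-size': '100',
    'q': 'privacy AND data',    # written with plain spaces, encoded by urlencode
}

example_url = build_search_url('https://content.guardianapis.com/search', example_args)
print(example_url)
```

The advantage is that query values can be written in their natural form, and any reserved characters in them are escaped consistently.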

Print the size of the response (len(src) counts the characters of the decoded text):

print('Number of characters received:', len(src))
Number of characters received: 59506

The API returns JSON, so we parse this using the built-in json library. The API specifies that all data are returned within the response key, even under failure. Therefore, I have immediately descended to the response field.

Parsing the JSON

response = json.loads(src)['response']
print('The following are available:\n ', sorted(response.keys()))
The following are available:
  ['currentPage', 'orderBy', 'pageSize', 'pages', 'results', 'startIndex', 'status', 'total', 'userTier']

Verifying the status code

It is important to verify that the status message is ok before continuing - if it is not ok no ‘real’ data will have been received.

assert response['status'] == 'ok'
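# When developing a program, it is better to have it fail as soon as an error condition appears than to let it crash later at run time.
# This is where assert is very useful.
# If this runs without raising an error, nothing is wrong.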
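An assert is convenient in a notebook, but assert statements are skipped when Python runs with the -O flag, and a bare AssertionError says little about what went wrong. A minimal sketch of an explicit check follows. The helper name extract_response is hypothetical, and the 'message' field on failed responses is an assumption about the error payload, so the code falls back to a default if it is absent.

```python
import json


def extract_response(src):
    """Parse raw JSON text and return its 'response' dictionary, or raise."""
    payload = json.loads(src)
    response = payload.get('response', {})
    status = response.get('status')
    if status != 'ok':
        # Assumption: failed responses describe the problem in a 'message' field.
        detail = response.get('message', 'no message given')
        raise ValueError('API status is {!r}: {}'.format(status, detail))
    return response


# Hand-written sample payloads, not real API output.
good = extract_response('{"response": {"status": "ok", "results": []}}')
print(good['status'])

try:
    extract_response('{"response": {"status": "error", "message": "Invalid API key"}}')
except ValueError as err:
    print(err)
```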

Listing the results

The API standard states that the results will be found in the results field under the response field. Furthermore, the URLs will be found in the webUrl field, and the title will be found in the webTitle field.

First let’s look to see what a single result looks like in full, and then I will print a restricted set of parameters on the full set of results.

print(json.dumps(response['results'][0], indent=2, sort_keys=True))
# indent=2: indent nested levels by two spaces
{
  "apiUrl": "https://content.guardianapis.com/technology/2020/feb/08/fears-over-sale-anonymous-nhs-patient-data",
  "id": "technology/2020/feb/08/fears-over-sale-anonymous-nhs-patient-data",
  "isHosted": false,
  "pillarId": "pillar/news",
  "pillarName": "News",
  "sectionId": "technology",
  "sectionName": "Technology",
  "type": "article",
  "webPublicationDate": "2020-02-08T21:03:47Z",
  "webTitle": "Revealed: how drugs giants can access your health records",
  "webUrl": "https://www.theguardian.com/technology/2020/feb/08/fears-over-sale-anonymous-nhs-patient-data"
}
for result in response['results']: 
    print(result['webUrl'][:70], result['webTitle'][:20])
# [:70] shows only the first 70 characters; remove it to get the full URL (the same applies to the title)
https://www.theguardian.com/technology/2020/feb/08/fears-over-sale-ano Revealed: how drugs 
https://www.theguardian.com/technology/2020/feb/08/tories-concern-huaw Tories express conce
https://www.theguardian.com/technology/2020/feb/05/welfare-surveillanc Welfare surveillance
https://www.theguardian.com/technology/2020/feb/04/google-software-gli Google software glit
https://www.theguardian.com/technology/2020/jan/30/mike-pompeo-restate Mike Pompeo restates
https://www.theguardian.com/technology/2020/jan/30/facebook-pays-550m- Facebook pays $550m 
https://www.theguardian.com/technology/2020/jan/28/boris-johnson-gets- Boris Johnson gets f
https://www.theguardian.com/technology/commentisfree/2020/jan/25/facia Quick, cheap to make
https://www.theguardian.com/technology/2020/jan/25/peter-diamandis-fut Peter Diamandis: ‘In
https://www.theguardian.com/technology/2020/jan/24/met-police-begin-us Met police to begin 
https://www.theguardian.com/technology/2020/jan/22/tell-us-about-the-w Tell us about the we
https://www.theguardian.com/technology/2020/jan/22/un-investigators-to Bezos hack: UN to ad
https://www.theguardian.com/technology/2020/jan/22/tech-firms-fail-pro Watchdog cracks down
https://www.theguardian.com/technology/2020/jan/18/1-trillion-dollars- $1tn is just the sta
https://www.theguardian.com/technology/2020/jan/17/google-owner-alphab Google owner Alphabe
https://www.theguardian.com/technology/2020/jan/16/instagram-my-data-c Was anyone ever so y
https://www.theguardian.com/technology/2020/jan/16/google-nest-mini-re Google Nest Mini rev
https://www.theguardian.com/technology/2020/jan/15/twitter-drops-grind Twitter drops Grindr
https://www.theguardian.com/technology/2020/jan/12/anger-over-use-faci Anger over use of fa
https://www.theguardian.com/technology/2020/jan/10/skype-audio-graded- Skype audio graded b
https://www.theguardian.com/technology/2020/jan/08/facial-recognition- Facial recognition a
https://www.theguardian.com/technology/2020/jan/08/travelex-hack-staff Travelex hack: staff
https://www.theguardian.com/technology/2020/jan/03/metoobots-scientist Rise of #MeTooBots: 
https://www.theguardian.com/technology/2020/jan/03/technology-2050-sav Technology in 2050: 
https://www.theguardian.com/technology/2019/dec/31/get-cybersecure-for Get yourself cyberse
https://www.theguardian.com/technology/2019/dec/29/lack-of-guidance-le Lack of guidance lea
https://www.theguardian.com/technology/askjack/2019/dec/19/how-can-i-g How can I get better
https://www.theguardian.com/technology/shortcuts/2019/dec/16/alexa-can 'Mind your own busin
https://www.theguardian.com/technology/2019/dec/14/twenty-tech-trends- Twenty tech trends f
https://www.theguardian.com/technology/2019/dec/13/ring-hackers-report Ring hackers are rep
https://www.theguardian.com/technology/askjack/2019/dec/12/duckduckgo- Can DuckDuckGo repla
https://www.theguardian.com/technology/2019/dec/12/ring-alarm-review-a Ring Alarm review: A
https://www.theguardian.com/technology/askjack/2019/nov/28/security-so What sort of securit
https://www.theguardian.com/technology/2019/nov/24/tim-berners-lee-unv Tim Berners-Lee unve
https://www.theguardian.com/technology/2019/nov/23/facebook-google-hum Tech giants watch ou
https://www.theguardian.com/technology/2019/nov/21/google-project-nigh Warren and group of 
https://www.theguardian.com/technology/2019/nov/19/technology-laws-are Technology laws are 
https://www.theguardian.com/technology/2019/nov/17/firefox-mozilla-fig Firefox’s fight for 
https://www.theguardian.com/technology/2019/nov/17/porn-public-transpo Porn, public transpo
https://www.theguardian.com/technology/2019/nov/14/google-healthcare-d Will Google get away
https://www.theguardian.com/technology/2019/nov/12/google-medical-data Google's secret cach
https://www.theguardian.com/technology/2019/nov/08/the-rise-of-microch The rise of microchi
https://www.theguardian.com/technology/2019/nov/06/google-nest-hub-max Google Nest Hub Max 
https://www.theguardian.com/technology/2019/nov/04/uber-los-angeles-pe LA suspends Uber’s s
https://www.theguardian.com/technology/2019/nov/01/whatsapp-hack-is-se WhatsApp 'hack' is s
https://www.theguardian.com/technology/2019/oct/30/apple-lets-users-op Apple lets users opt
https://www.theguardian.com/technology/2019/oct/30/facebook-agrees-to- Facebook agrees to p
https://www.theguardian.com/technology/2019/oct/29/labour-calls-for-ha Labour calls for hal
https://www.theguardian.com/technology/2019/oct/29/google-pixel-4-xl-r Google Pixel 4 XL re
https://www.theguardian.com/technology/2019/oct/26/china-technology-so Why you should worry
https://www.theguardian.com/technology/2019/oct/24/mind-reading-tech-p Mind-reading tech? H
https://www.theguardian.com/technology/2019/oct/22/oneplus-7t-pro-revi OnePlus 7T Pro revie
https://www.theguardian.com/technology/2019/oct/21/google-eye-detectio Google to add eye de
https://www.theguardian.com/technology/2019/oct/18/how-the-wheels-came How the wheels came 
https://www.theguardian.com/culture/2019/oct/16/uk-drops-plans-for-onl UK drops plans for o
https://www.theguardian.com/technology/2019/oct/16/digital-welfare-sta ‘Digital welfare sta
https://www.theguardian.com/society/2019/oct/15/alexa-do-you-recall-th Alexa, do you recall
https://www.theguardian.com/technology/2019/oct/15/google-launches-che Google launches chea
https://www.theguardian.com/technology/2019/oct/11/elizabeth-warren-fa Elizabeth Warren tro
https://www.theguardian.com/technology/2019/oct/10/tim-cook-apple-hong Tim Cook defends App
https://www.theguardian.com/technology/2019/oct/09/alexa-are-you-invad 'Alexa, are you inva
https://www.theguardian.com/technology/2019/oct/08/what-does-peter-dut What does Peter Dutt
https://www.theguardian.com/technology/2019/oct/08/us-whistleblower-th US whistleblower bla
https://www.theguardian.com/technology/2019/oct/05/facial-recognition- 'We are hurtling tow
https://www.theguardian.com/technology/2019/oct/03/facebook-surveillan US, UK and Australia
https://www.theguardian.com/technology/2019/oct/03/google-data-harvest Google reportedly ta
https://www.theguardian.com/technology/2019/oct/01/mark-zuckerberg-fac Zuckerberg: I'll 'go
https://www.theguardian.com/technology/2019/oct/01/iphone-11-review-ip iPhone 11 review: an
https://www.theguardian.com/technology/2019/sep/29/plan-for-massive-fa Plan for massive fac
https://www.theguardian.com/technology/2019/sep/26/amazon-launches-ale Amazon launches Alex
https://www.theguardian.com/technology/2019/sep/26/pulp-diction-samuel Pulp diction: Samuel
https://www.theguardian.com/technology/2019/sep/24/firefox-no-uk-plans Firefox: 'no UK plan
https://www.theguardian.com/technology/2019/sep/18/facebook-portal-sma Facebook to launch n
https://www.theguardian.com/technology/2019/sep/17/tech-climate-change To decarbonize we mu
https://www.theguardian.com/technology/2019/sep/17/imagenet-roulette-a The viral selfie app
https://www.theguardian.com/technology/2019/sep/17/youtube-fine-and-ch YouTube’s fine and c
https://www.theguardian.com/technology/2019/sep/13/google-facebook-ama Google, Facebook, Am
https://www.theguardian.com/technology/askjack/2019/sep/12/can-i-still Can I still use my C
https://www.theguardian.com/technology/2019/sep/06/facebook-google-ant US states to launch 
https://www.theguardian.com/technology/2019/sep/06/apple-rewrote-siri- Apple made Siri defl
https://www.theguardian.com/technology/2019/sep/04/facebook-users-phon Facebook confirms 41
https://www.theguardian.com/technology/2019/sep/04/police-use-of-facia Police use of facial
https://www.theguardian.com/technology/2019/sep/04/a-deep-fake-app-wil A ‘deep fake’ app wi
https://www.theguardian.com/technology/2019/sep/04/android-10-released Android 10 released:
https://www.theguardian.com/technology/2019/aug/29/apple-apologises-li Apple apologises for
https://www.theguardian.com/technology/2019/aug/24/alexa-nhs-future-am Does Amazon have ans
https://www.theguardian.com/technology/2019/aug/22/apple-card-wallet-p Apple warns new cred
https://www.theguardian.com/technology/2019/aug/18/manchester-city-fac Manchester City warn
https://www.theguardian.com/technology/2019/aug/16/privacy-campaigners Privacy campaigners 
https://www.theguardian.com/technology/2019/aug/15/ico-opens-investiga ICO opens investigat
https://www.theguardian.com/technology/2019/aug/13/facebook-messenger- Facebook admits cont
https://www.theguardian.com/technology/2019/aug/13/people-at-kings-cro People at King’s Cro
https://www.theguardian.com/technology/2019/aug/07/south-wales-police- South Wales police t
https://www.theguardian.com/technology/2019/aug/05/alexa-allows-users- Alexa users can now 
https://www.theguardian.com/technology/2019/aug/04/facial-recognition- Facial recognition… 
https://www.theguardian.com/technology/2019/aug/04/innocence-lost-what Innocence lost: What
https://www.theguardian.com/technology/2019/aug/02/apple-halts-practic Apple halts practice
https://www.theguardian.com/technology/2019/jul/29/what-is-facial-reco What is facial recog
https://www.theguardian.com/technology/2019/jul/26/apple-contractors-r Apple contractors 'r
https://www.theguardian.com/technology/2019/jul/24/facebook-revenue-fi Facebook revenues so
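The response keys listed earlier include currentPage and pages, so a single request only returns one page of the matching articles. A minimal sketch of walking through all pages follows. It assumes a caller-supplied function fetch_page(page_number) that returns the parsed response dictionary for that page, for example by adding a page argument to the query (an assumption to check against the API documentation). Here a hypothetical in-memory stand-in, FAKE_PAGES, replaces the network call so the loop can be run offline.

```python
def iter_results(fetch_page):
    """Yield every result across all pages.

    fetch_page(page_number) must return the parsed 'response' dictionary
    for that page, including its 'results' list and 'pages' count.
    """
    page = 1
    while True:
        response = fetch_page(page)
        for result in response['results']:
            yield result
        if page >= response['pages']:
            break
        page += 1


# Hypothetical stand-in for the network call: three results spread over two pages.
FAKE_PAGES = {
    1: {'pages': 2, 'results': [{'webTitle': 'first'}, {'webTitle': 'second'}]},
    2: {'pages': 2, 'results': [{'webTitle': 'third'}]},
}

titles = [result['webTitle'] for result in iter_results(FAKE_PAGES.get)]
print(titles)
```

Keeping the fetching step as a parameter means the paging logic can be tested without touching the live API or spending request quota.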

Let’s now request a specific piece of content from the API.

We select the ith result from the above response and get its apiUrl and id:

i = 0
api_url = response['results'][i]['apiUrl']
api_id = response['results'][i]['id']

print(api_url)
print(api_id)
https://content.guardianapis.com/technology/2020/feb/08/fears-over-sale-anonymous-nhs-patient-data
technology/2020/feb/08/fears-over-sale-anonymous-nhs-patient-data

We then use the id to construct a search URL string to request this piece of content from the API.

(Note that you need to include the api-key in the search; this is what I forgot in the lecture. You also need to specify any data fields you want beyond the article metadata, e.g. body and headline are requested in the example below.)

base_url = "https://content.guardianapis.com/search?"
search_string = "ids=%s&api-key=02b16042-1f5f-41e9-8d28-52f17e3982da&show-fields=headline,body" %api_id

url = base_url + search_string
print(url)
https://content.guardianapis.com/search?ids=technology/2020/feb/08/fears-over-sale-anonymous-nhs-patient-data&api-key=02b16042-1f5f-41e9-8d28-52f17e3982da&show-fields=headline,body

Check the status:

req = requests.get(url) 
src = req.text
response = json.loads(src)['response']
assert response['status'] == 'ok'

Print the headline:

print(response['results'][0]['fields']['headline'])
Revealed: how drugs giants can access your health records

Print the body:

body = response['results'][0]['fields']['body']
print(body)
<p>The Department of Health and Social Care has been selling the medical data of millions of NHS patients to American and other international drugs companies having misled the public into believing the information would be “anonymous”, according to leading experts in the field.</p> <p>Senior NHS figures have told the <em>Observer</em> that patient data compiled from GP surgeries and hospitals – and then sold for huge sums for research – can routinely be linked back to individual patients’ medical records via their GP surgeries. They say there is clear evidence this is already being done by companies and organisations that have bought data from the DHSC, having identified individuals whose medical histories are of particular interest.</p> <p>Concerns that the data is not truly “anonymous” have been raised by senior NHS officials, who believe the public are not being told the full truth. But the DHSC insists it only sells on information after thorough measures have been taken to ensure the complete anonymity and confidentiality of patients’ personal information.</p> <p>In December, <a href="https://www.theguardian.com/politics/2019/dec/07/nhs-medical-data-sales-american-pharma-lack-transparency" title="">the </a><em><a href="https://www.theguardian.com/politics/2019/dec/07/nhs-medical-data-sales-american-pharma-lack-transparency" title="">Observer</a></em><a href="https://www.theguardian.com/politics/2019/dec/07/nhs-medical-data-sales-american-pharma-lack-transparency" title=""> revealed </a>that the government had raised £10m in 2018 by granting licences to commercial and academic organisations across the world that wanted access to so-called anonymised data. 
If patients do not want their data to be used for research they have to actively “opt out” of the system at their GP surgery.</p> <p>Access to NHS data is increasingly sought by researchers and global drugs companies because it is one of the largest and most centralised public organisations of its kind in the world, with unique data resources.</p> <p>Washington has already made clear it wants unrestricted access to Britain’s 55 million health records – estimated to have a total value of £10bn a year – as part of any post-Brexit trade agreement. Leaked details of meetings between US and UK trade officials late last year showed that the acquisition of as much UK medical data as possible is a top priority for the US drugs industry.</p> <p>Now the DHSC and the agencies responsible for handling and selling data are increasingly under pressure to tighten up controls, to protect patient privacy and prevent information being misused.</p> <p>Asked if it was right to say that the patient data was anonymous, as claimed, Professor Eerke Boiten, director of the Cyber Technology Institute at De Montfort University in Leicester, said: “The answer is no, it is not anonymous.</p> <p>“If it is rich medical data about individuals then the richer that data is, the easier it is for people who are experts to reconstruct it and re-identify individuals.”</p> <p>Boiten believes more thought should be given to controlling and limiting the sale of data to prevent it potentially being sold on by the initial purchaser to companies with huge information stores and global reach. 
“If Google, for instance, were to use this data and end up finding a cure for cancer, and then sold the cure back to the NHS for huge sums of money, then I think we could say we had missed a trick,” he said.</p> <p>The NHS has previously faced claims that medical data from millions of patients has been <a href="https://www.theguardian.com/society/2014/jan/19/nhs-patient-data-available-companies-buy" title="">sold to insurance companies</a>.</p> <p>Phil Booth, coordinator of medConfidential, which campaigns for the privacy of health data, said the public was being betrayed by claims that the information could not be linked back to individuals. “Removing or obscuring a few obvious identifiers, like someone’s name or NHS number from the data, doesn’t make their medical history anonymous,” he said. “Indeed, the unique combination of medical events that makes individuals’ health data so ripe for exploitation is precisely what makes it so identifiable. Your medical record is like a fingerprint of your whole life.</p> <p>“Patients must know how their data is used, and by who. Alleging their data is anonymous when it isn’t, then selling it to drugs and tech companies – or, through intermediaries, to heaven knows who – is a gross betrayal of trust. People who are rightly concerned about such guile and lack of respect have every right to opt out, if they want their and their family’s medical information kept confidential and for their own care.”</p> <p>Licences to buy data are issued by the Clinical Practice Research Datalink (CPRD), part of the Medicines and Healthcare Products Regulatory Agency (MHRA). 
A spokesman said any information sold had been “anonymised in accordance with the Information Commissioner’s Office (ICO) anonymisation code of practice”.</p> <p>Until early December, the CPRD said on its website the data it made available for research was “anonymous” but, following the <em>Observer’s</em> story, it changed the wording to say that the data from GPs and hospitals had been “anonymised” – meaning only that some measures had been taken to de-identify it.</p> <p>Booth added: “Following the ICO’s code of practice does not mean that data is necessarily anonymous. The law now recognises that one of the most common methods of ‘anonymisation’ – the use of pseudonyms to obscure some bits of information – means that data is still identifiable. Indeed, the information commissioner herself says it must be considered personal data.”</p> <p>A spokesman for the MHRA said the wording on the website had changed – but only to be consistent: “We have replaced the word ‘anonymous’ with ‘anonymised’ to be in line with the ICO terminology ‘anonymised,’ which is the term we use throughout our website. We have done this to be consistent and to avoid any confusion.”</p> <p>Information disclosed by some of CPRD’s customers clearly suggests they can link the information back to individual patient records via their GP surgeries. The Boston Collaborative Drug Surveillance Program in the US, which uses DHSC data, says <a href="http://www.bu.edu/bcdsp/gprd/" title="">on its website</a>: “Anonymized information from the CPRD on demographics, outpatient visits, hospitalizations and prescriptions dispensed is available to [our] researchers. 
Validation of diagnoses, reports of diagnostic tests and anonymized notes from hospitalizations and referrals can be obtained from the general practitioner upon request.”</p> <p>If the data were truly anonymous it would be impossible to retrieve an individual patient’s medical notes.Neil Bhatia, a GP who is Information Governance Lead and data protection 0fficer in Hampshire, said: “Truly anonymous data – utterly incapable of being traced back to an individual – is very hard to achieve, given that there is so much information about us in the public domain and held by companies such as Facebookand Google, because so much of our personal data is out there thanks to the massive data breaches over the last few years. In fact, it’s almost impossible for record-level data (where each line of the dataset corresponds to an individual) to be made truly anonymous.”</p> <p>• This article was amended on 11 February 2020 to include a response from the MHRA that was received after the initial publication deadline.</p>
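The body comes back as an HTML fragment, so tags such as <em> and <a href=...> are still embedded in the text. A minimal sketch of reducing such a fragment to plain text with the standard library's html.parser follows. The class TextExtractor, the helper html_to_text and the sample string are illustrative and not part of the original notebook; BeautifulSoup would be another option.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_endtag(self, tag):
        # Keep paragraphs apart even when the markup has no space between them.
        if tag == 'p':
            self.chunks.append(' ')


def html_to_text(fragment):
    """Return the visible text of an HTML fragment with whitespace normalised."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return ' '.join(''.join(parser.chunks).split())


# Hand-written sample in the style of the article body.
sample = ('<p>Told the <em>Observer</em> that '
          '<a href="https://example.com" title="">data</a> is sold.</p>'
          '<p>Second paragraph.</p>')
print(html_to_text(sample))
```

Calling html_to_text(body) on the article would give text in which every whitespace-separated token is a word rather than a piece of markup.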

We can now do some simple text processing on the article text, e.g. count the word frequencies:

split: splits the text into individual words

list(set(words)): removes duplicates

words = body.replace('<p>','').replace('</p>','').split()
# split breaks the text into individual words
print(words)
print(len(words))
unique_words = list(set(words))
# list(set(words)) removes duplicates
print(len(unique_words))
#count_dictionary = {word: count for word, count in zip(words, [words.count(w) for w in words])}
count_dictionary = {'word': unique_words, 'count': [words.count(w) for w in unique_words]}
# count how many times each unique word occurs in words
print(count_dictionary)
['The', 'Department', 'of', 'Health', 'and', 'Social', 'Care', 'has', 'been', 'selling', 'the', 'medical', 'data', 'of', 'millions', 'of', 'NHS', 'patients', 'to', 'American', 'and', 'other', 'international', 'drugs', 'companies', 'having', 'misled', 'the', 'public', 'into', 'believing', 'the', 'information', 'would', 'be', '“anonymous”,', 'according', 'to', 'leading', 'experts', 'in', 'the', 'field.', 'Senior', 'NHS', 'figures', 'have', 'told', 'the', '<em>Observer</em>', 'that', 'patient', 'data', 'compiled', 'from', 'GP', 'surgeries', 'and', 'hospitals', '–', 'and', 'then', 'sold', 'for', 'huge', 'sums', 'for', 'research', '–', 'can', 'routinely', 'be', 'linked', 'back', 'to', 'individual', 'patients’', 'medical', 'records', 'via', 'their', 'GP', 'surgeries.', 'They', 'say', 'there', 'is', 'clear', 'evidence', 'this', 'is', 'already', 'being', 'done', 'by', 'companies', 'and', 'organisations', 'that', 'have', 'bought', 'data', 'from', 'the', 'DHSC,', 'having', 'identified', 'individuals', 'whose', 'medical', 'histories', 'are', 'of', 'particular', 'interest.', 'Concerns', 'that', 'the', 'data', 'is', 'not', 'truly', '“anonymous”', 'have', 'been', 'raised', 'by', 'senior', 'NHS', 'officials,', 'who', 'believe', 'the', 'public', 'are', 'not', 'being', 'told', 'the', 'full', 'truth.', 'But', 'the', 'DHSC', 'insists', 'it', 'only', 'sells', 'on', 'information', 'after', 'thorough', 'measures', 'have', 'been', 'taken', 'to', 'ensure', 'the', 'complete', 'anonymity', 'and', 'confidentiality', 'of', 'patients’', 'personal', 'information.', 'In', 'December,', '<a', 'href="https://www.theguardian.com/politics/2019/dec/07/nhs-medical-data-sales-american-pharma-lack-transparency"', 'title="">the', '</a><em><a', 'href="https://www.theguardian.com/politics/2019/dec/07/nhs-medical-data-sales-american-pharma-lack-transparency"', 'title="">Observer</a></em><a', 'href="https://www.theguardian.com/politics/2019/dec/07/nhs-medical-data-sales-american-pharma-lack-transparency"', 
'title="">', 'revealed', '</a>that', 'the', 'government', 'had', 'raised', '£10m', 'in', '2018', 'by', 'granting', 'licences', 'to', 'commercial', 'and', 'academic', 'organisations', 'across', 'the', 'world', 'that', 'wanted', 'access', 'to', 'so-called', 'anonymised', 'data.', 'If', 'patients', 'do', 'not', 'want', 'their', 'data', 'to', 'be', 'used', 'for', 'research', 'they', 'have', 'to', 'actively', '“opt', 'out”', 'of', 'the', 'system', 'at', 'their', 'GP', 'surgery.', 'Access', 'to', 'NHS', 'data', 'is', 'increasingly', 'sought', 'by', 'researchers', 'and', 'global', 'drugs', 'companies', 'because', 'it', 'is', 'one', 'of', 'the', 'largest', 'and', 'most', 'centralised', 'public', 'organisations', 'of', 'its', 'kind', 'in', 'the', 'world,', 'with', 'unique', 'data', 'resources.', 'Washington', 'has', 'already', 'made', 'clear', 'it', 'wants', 'unrestricted', 'access', 'to', 'Britain’s', '55', 'million', 'health', 'records', '–', 'estimated', 'to', 'have', 'a', 'total', 'value', 'of', '£10bn', 'a', 'year', '–', 'as', 'part', 'of', 'any', 'post-Brexit', 'trade', 'agreement.', 'Leaked', 'details', 'of', 'meetings', 'between', 'US', 'and', 'UK', 'trade', 'officials', 'late', 'last', 'year', 'showed', 'that', 'the', 'acquisition', 'of', 'as', 'much', 'UK', 'medical', 'data', 'as', 'possible', 'is', 'a', 'top', 'priority', 'for', 'the', 'US', 'drugs', 'industry.', 'Now', 'the', 'DHSC', 'and', 'the', 'agencies', 'responsible', 'for', 'handling', 'and', 'selling', 'data', 'are', 'increasingly', 'under', 'pressure', 'to', 'tighten', 'up', 'controls,', 'to', 'protect', 'patient', 'privacy', 'and', 'prevent', 'information', 'being', 'misused.', 'Asked', 'if', 'it', 'was', 'right', 'to', 'say', 'that', 'the', 'patient', 'data', 'was', 'anonymous,', 'as', 'claimed,', 'Professor', 'Eerke', 'Boiten,', 'director', 'of', 'the', 'Cyber', 'Technology', 'Institute', 'at', 'De', 'Montfort', 'University', 'in', 'Leicester,', 'said:', '“The', 'answer', 'is', 'no,', 'it', 'is', 
'not', 'anonymous.', '“If', 'it', 'is', 'rich', 'medical', 'data', 'about', 'individuals', 'then', 'the', 'richer', 'that', 'data', 'is,', 'the', 'easier', 'it', 'is', 'for', 'people', 'who', 'are', 'experts', 'to', 'reconstruct', 'it', 'and', 're-identify', 'individuals.”', 'Boiten', 'believes', 'more', 'thought', 'should', 'be', 'given', 'to', 'controlling', 'and', 'limiting', 'the', 'sale', 'of', 'data', 'to', 'prevent', 'it', 'potentially', 'being', 'sold', 'on', 'by', 'the', 'initial', 'purchaser', 'to', 'companies', 'with', 'huge', 'information', 'stores', 'and', 'global', 'reach.', '“If', 'Google,', 'for', 'instance,', 'were', 'to', 'use', 'this', 'data', 'and', 'end', 'up', 'finding', 'a', 'cure', 'for', 'cancer,', 'and', 'then', 'sold', 'the', 'cure', 'back', 'to', 'the', 'NHS', 'for', 'huge', 'sums', 'of', 'money,', 'then', 'I', 'think', 'we', 'could', 'say', 'we', 'had', 'missed', 'a', 'trick,”', 'he', 'said.', 'The', 'NHS', 'has', 'previously', 'faced', 'claims', 'that', 'medical', 'data', 'from', 'millions', 'of', 'patients', 'has', 'been', '<a', 'href="https://www.theguardian.com/society/2014/jan/19/nhs-patient-data-available-companies-buy"', 'title="">sold', 'to', 'insurance', 'companies</a>.', 'Phil', 'Booth,', 'coordinator', 'of', 'medConfidential,', 'which', 'campaigns', 'for', 'the', 'privacy', 'of', 'health', 'data,', 'said', 'the', 'public', 'was', 'being', 'betrayed', 'by', 'claims', 'that', 'the', 'information', 'could', 'not', 'be', 'linked', 'back', 'to', 'individuals.', '“Removing', 'or', 'obscuring', 'a', 'few', 'obvious', 'identifiers,', 'like', 'someone’s', 'name', 'or', 'NHS', 'number', 'from', 'the', 'data,', 'doesn’t', 'make', 'their', 'medical', 'history', 'anonymous,”', 'he', 'said.', '“Indeed,', 'the', 'unique', 'combination', 'of', 'medical', 'events', 'that', 'makes', 'individuals’', 'health', 'data', 'so', 'ripe', 'for', 'exploitation', 'is', 'precisely', 'what', 'makes', 'it', 'so', 'identifiable.', 'Your', 'medical', 
'record', 'is', 'like', 'a', 'fingerprint', 'of', 'your', 'whole', 'life.', '“Patients', 'must', 'know', 'how', 'their', 'data', 'is', 'used,', 'and', 'by', 'who.', 'Alleging', 'their', 'data', 'is', 'anonymous', 'when', 'it', 'isn’t,', 'then', 'selling', 'it', 'to', 'drugs', 'and', 'tech', 'companies', '–', 'or,', 'through', 'intermediaries,', 'to', 'heaven', 'knows', 'who', '–', 'is', 'a', 'gross', 'betrayal', 'of', 'trust.', 'People', 'who', 'are', 'rightly', 'concerned', 'about', 'such', 'guile', 'and', 'lack', 'of', 'respect', 'have', 'every', 'right', 'to', 'opt', 'out,', 'if', 'they', 'want', 'their', 'and', 'their', 'family’s', 'medical', 'information', 'kept', 'confidential', 'and', 'for', 'their', 'own', 'care.”', 'Licences', 'to', 'buy', 'data', 'are', 'issued', 'by', 'the', 'Clinical', 'Practice', 'Research', 'Datalink', '(CPRD),', 'part', 'of', 'the', 'Medicines', 'and', 'Healthcare', 'Products', 'Regulatory', 'Agency', '(MHRA).', 'A', 'spokesman', 'said', 'any', 'information', 'sold', 'had', 'been', '“anonymised', 'in', 'accordance', 'with', 'the', 'Information', 'Commissioner’s', 'Office', '(ICO)', 'anonymisation', 'code', 'of', 'practice”.', 'Until', 'early', 'December,', 'the', 'CPRD', 'said', 'on', 'its', 'website', 'the', 'data', 'it', 'made', 'available', 'for', 'research', 'was', '“anonymous”', 'but,', 'following', 'the', '<em>Observer’s</em>', 'story,', 'it', 'changed', 'the', 'wording', 'to', 'say', 'that', 'the', 'data', 'from', 'GPs', 'and', 'hospitals', 'had', 'been', '“anonymised”', '–', 'meaning', 'only', 'that', 'some', 'measures', 'had', 'been', 'taken', 'to', 'de-identify', 'it.', 'Booth', 'added:', '“Following', 'the', 'ICO’s', 'code', 'of', 'practice', 'does', 'not', 'mean', 'that', 'data', 'is', 'necessarily', 'anonymous.', 'The', 'law', 'now', 'recognises', 'that', 'one', 'of', 'the', 'most', 'common', 'methods', 'of', '‘anonymisation’', '–', 'the', 'use', 'of', 'pseudonyms', 'to', 'obscure', 'some', 'bits', 'of', 'information', 
'–', 'means', 'that', 'data', 'is', 'still', 'identifiable.', 'Indeed,', 'the', 'information', 'commissioner', 'herself', 'says', 'it', 'must', 'be', 'considered', 'personal', 'data.”', 'A', 'spokesman', 'for', 'the', 'MHRA', 'said', 'the', 'wording', 'on', 'the', 'website', 'had', 'changed', '–', 'but', 'only', 'to', 'be', 'consistent:', '“We', 'have', 'replaced', 'the', 'word', '‘anonymous’', 'with', '‘anonymised’', 'to', 'be', 'in', 'line', 'with', 'the', 'ICO', 'terminology', '‘anonymised,’', 'which', 'is', 'the', 'term', 'we', 'use', 'throughout', 'our', 'website.', 'We', 'have', 'done', 'this', 'to', 'be', 'consistent', 'and', 'to', 'avoid', 'any', 'confusion.”', 'Information', 'disclosed', 'by', 'some', 'of', 'CPRD’s', 'customers', 'clearly', 'suggests', 'they', 'can', 'link', 'the', 'information', 'back', 'to', 'individual', 'patient', 'records', 'via', 'their', 'GP', 'surgeries.', 'The', 'Boston', 'Collaborative', 'Drug', 'Surveillance', 'Program', 'in', 'the', 'US,', 'which', 'uses', 'DHSC', 'data,', 'says', '<a', 'href="http://www.bu.edu/bcdsp/gprd/"', 'title="">on', 'its', 'website</a>:', '“Anonymized', 'information', 'from', 'the', 'CPRD', 'on', 'demographics,', 'outpatient', 'visits,', 'hospitalizations', 'and', 'prescriptions', 'dispensed', 'is', 'available', 'to', '[our]', 'researchers.', 'Validation', 'of', 'diagnoses,', 'reports', 'of', 'diagnostic', 'tests', 'and', 'anonymized', 'notes', 'from', 'hospitalizations', 'and', 'referrals', 'can', 'be', 'obtained', 'from', 'the', 'general', 'practitioner', 'upon', 'request.”', 'If', 'the', 'data', 'were', 'truly', 'anonymous', 'it', 'would', 'be', 'impossible', 'to', 'retrieve', 'an', 'individual', 'patient’s', 'medical', 'notes.Neil', 'Bhatia,', 'a', 'GP', 'who', 'is', 'Information', 'Governance', 'Lead', 'and', 'data', 'protection', '0fficer', 'in', 'Hampshire,', 'said:', '“Truly', 'anonymous', 'data', '–', 'utterly', 'incapable', 'of', 'being', 'traced', 'back', 'to', 'an', 'individual', '–', 'is', 
'very', 'hard', 'to', 'achieve,', 'given', 'that', 'there', 'is', 'so', 'much', 'information', 'about', 'us', 'in', 'the', 'public', 'domain', 'and', 'held', 'by', 'companies', 'such', 'as', 'Facebookand', 'Google,', 'because', 'so', 'much', 'of', 'our', 'personal', 'data', 'is', 'out', 'there', 'thanks', 'to', 'the', 'massive', 'data', 'breaches', 'over', 'the', 'last', 'few', 'years.', 'In', 'fact,', 'it’s', 'almost', 'impossible', 'for', 'record-level', 'data', '(where', 'each', 'line', 'of', 'the', 'dataset', 'corresponds', 'to', 'an', 'individual)', 'to', 'be', 'made', 'truly', 'anonymous.”', '•', 'This', 'article', 'was', 'amended', 'on', '11', 'February', '2020', 'to', 'include', 'a', 'response', 'from', 'the', 'MHRA', 'that', 'was', 'received', 'after', 'the', 'initial', 'publication', 'deadline.']
1128
535
{'word': ['surgeries.', 'must', 'identifiers,', 'mean', 'isn’t,', 'anonymity', 'Your', 'title="">the', 'taken', 'think', 'researchers.', 'makes', 'data,', 'what', 'suggests', 'evidence', 'anonymisation', 'complete', 'exploitation', 'the', 'Technology', 'Washington', 'year', 'claimed,', '[our]', 'available', 'considered', 'practitioner', 'knows', 'or,', 'licences', 'research', 'global', 'Facebookand', 'obscure', 'for', 'Leaked', 'Clinical', 'consistent:', 'rich', 'drugs', 'as', 'whole', 'American', '“The', 'histories', 'corresponds', 'US,', 'actively', 'Surveillance', 'estimated', 'care.”', 'Office', 'individual)', 'back', 'out', 'few', 'cancer,', 'officials', 'someone’s', 'Research', 'sells', 'responsible', 'Until', 'Montfort', 'said:', 'reports', 'unrestricted', 'Healthcare', 'one', 'years.', 'Health', 'website.', 'Britain’s', 'sought', 'This', 'href="https://www.theguardian.com/politics/2019/dec/07/nhs-medical-data-sales-american-pharma-lack-transparency"', 'easier', 'insists', 'answer', 'previously', 'academic', 'kind', 'UK', '</a>that', 'record', 'any', 'herself', 'at', 'sale', 'amended', 'officials,', 'he', 'Lead', 'initial', 'coordinator', 'anonymous,”', 'GP', '“Indeed,', 'University', 'rightly', 'still', 'dataset', 'Social', 'clear', 'campaigns', 'title="">on', 'leading', 'through', 'confidentiality', 'avoid', 'resources.', 'system', 'Cyber', 'researchers', 'reconstruct', 'post-Brexit', 'believes', 'We', 'via', 'I', 'thanks', 'The', 'would', 'surgery.', 'Alleging', 'diagnostic', 'throughout', 'money,', 'Eerke', 'acquisition', 'is,', 'meaning', 'following', 'this', 'now', 'visits,', 'raised', '<em>Observer’s</em>', 'told', 'can', 'Licences', 'over', 'people', 'If', 'purchaser', '“anonymous”,', 'by', 'agreement.', 'thought', 'Indeed,', 'ripe', 'were', 'privacy', 'from', 'after', 'top', 'Products', '<a', 'because', '(CPRD),', 'include', 'so-called', 'fingerprint', 'de-identify', 'individual', 'have', '(ICO)', 'referrals', 'agencies', '“Truly', 'industry.', 
'records', 'be', 'public', 'are', 'trick,”', 'each', 'Department', 'end', 'last', 're-identify', 'bits', 'href="http://www.bu.edu/bcdsp/gprd/"', 'then', 'clearly', 'said', 'individuals’', 'according', '‘anonymisation’', 'having', 'anonymous', 'data.”', 'into', 'doesn’t', 'Datalink', 'centralised', 'hospitals', 'showed', 'gross', 'Boston', '£10bn', 'Boiten', 'record-level', 'said.', 'international', 'given', 'anonymous.', 'which', 'full', 'it.', 'such', 'tests', 'Concerns', 'compiled', 'MHRA', 'if', 'Professor', 'field.', 'their', 'out,', 'Care', 'information.', 'Boiten,', 'practice”.', 'misused.', 'betrayed', 'methods', 'thorough', 'handling', '•', 'in', 'wants', 'buy', 'organisations', 'anonymous.”', 'Validation', 'made', 'consistent', 'link', 'data.', 'controlling', 'surgeries', 'measures', 'commissioner', '<em>Observer</em>', 'under', 'lack', 'part', 'confusion.”', 'replaced', 'term', 'was', 'is', 'controls,', 'already', 'Governance', 'health', 'issued', 'kept', 'Regulatory', 'ICO’s', 'People', 'means', 'value', 'recognises', 'ensure', 'so', 'used', 'patient', 'guile', 'individuals.”', 'up', 'information', 'a', 'Information', 'href="https://www.theguardian.com/society/2014/jan/19/nhs-patient-data-available-companies-buy"', 'spokesman', 'selling', 'other', 'December,', 'domain', 'prescriptions', 'history', 'insurance', 'unique', 'potentially', '“Patients', 'title="">', 'used,', 'pseudonyms', 'practice', 'we', 'reach.', 'family’s', 'GPs', 'notes.Neil', 'there', 'individuals', 'prevent', 'anonymous,', 'early', 'hospitalizations', '‘anonymised,’', 'very', 'identified', 'between', 'wanted', 'Access', 'claims', 'your', 'website', 'added:', 'obvious', 'confidential', 'utterly', 'only', 'like', 'companies</a>.', 'it’s', 'story,', 'and', 'do', 'millions', 'out”', 'wording', 'DHSC,', 'particular', 'senior', 'But', 'line', 'meetings', 'believing', 'pressure', 'identifiable.', 'right', 'individuals.', 'almost', 'our', 'terminology', 'medConfidential,', 'Commissioner’s', 
'revealed', 'A', '“opt', 'achieve,', 'world', 'want', 'done', 'but,', '“Anonymized', 'interest.', 'anonymised', 'Institute', 'instance,', 'Booth', 'that', '0fficer', 'million', 'routinely', 'total', 'dispensed', 'trade', 'possible', 'changed', 'request.”', 'Phil', 'Bhatia,', 'patients’', 'betrayal', 'necessarily', 'no,', 'Program', 'say', '</a><em><a', 'incapable', '“Removing', 'article', 'has', 'they', 'huge', '“anonymised”', '55', 'breaches', 'cure', 'world,', '“If', 'director', 'concerned', 'CPRD’s', 'protection', 'use', 'demographics,', 'companies', 'traced', 'across', 'bought', 'should', 'trust.', 'heaven', 'held', 'had', 'US', 'tech', 'how', 'ICO', 'obtained', 'Hampshire,', 'Now', 'or', 'linked', 'Practice', 'tighten', '2018', 'us', 'most', '2020', 'uses', 'Drug', 'Leicester,', 'response', 'does', 'anonymized', 'more', 'but', 'Google,', 'intermediaries,', 'They', 'to', 'missed', 'access', 'received', 'common', 'DHSC', 'opt', 'obscuring', 'website</a>:', 'Asked', 'NHS', 'personal', 'experts', 'word', 'February', 'not', 'of', 'impossible', 'truly', 'Agency', 'protect', '£10m', 'much', 'law', '–', 'increasingly', 'medical', 'disclosed', 'hard', '“anonymous”', 'life.', 'upon', 'who', '‘anonymised’', 'sums', 'details', 'respect', 'accordance', 'with', 'make', 'could', 'deadline.', 'CPRD', 'outpatient', 'In', 'about', 'fact,', 'code', 'largest', 'customers', 'truth.', 'own', 'believe', 'De', 'number', '“anonymised', '11', 'title="">sold', '“Following', 'every', 'precisely', 'who.', '“We', 'on', '‘anonymous’', 'data', '(MHRA).', 'it', 'priority', 'an', 'patient’s', 'stores', 'whose', 'figures', 'massive', '(where', 'faced', 'publication', 'some', 'commercial', 'notes', 'patients', 'sold', 'says', 'richer', 'being', 'combination', 'late', 'limiting', 'retrieve', 'finding', 'Medicines', 'know', 'Senior', 'misled', 'its', 'events', 'title="">Observer</a></em><a', 'government', 'general', 'Booth,', 'been', 'granting', 'name', 'Collaborative', 'diagnoses,', 'when'], 
'count': [2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 3, 1, 1, 1, 1, 1, 1, 62, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 3, 2, 1, 1, 15, 1, 1, 1, 1, 4, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 5, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 2, 1, 2, 3, 1, 1, 1, 2, 1, 1, 10, 1, 1, 1, 1, 2, 2, 9, 2, 1, 1, 3, 2, 1, 1, 1, 1, 1, 4, 9, 1, 1, 1, 1, 1, 3, 12, 5, 6, 1, 1, 1, 1, 2, 1, 1, 1, 5, 1, 4, 1, 1, 1, 2, 3, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 3, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 3, 1, 1, 3, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 6, 23, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 4, 1, 1, 2, 12, 10, 3, 1, 2, 3, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 3, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 3, 2, 1, 1, 1, 31, 1, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 17, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 4, 1, 1, 1, 1, 4, 3, 3, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 3, 1, 6, 1, 1, 1, 1, 1, 1, 1, 6, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 42, 1, 2, 1, 1, 3, 1, 1, 1, 1, 7, 3, 2, 1, 1, 6, 35, 2, 3, 1, 1, 1, 3, 1, 12, 2, 11, 1, 1, 2, 1, 1, 5, 1, 2, 1, 1, 1, 5, 1, 2, 1, 2, 1, 2, 3, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 1, 29, 1, 16, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 3, 4, 2, 1, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 7, 1, 1, 1, 1, 1]}

Import the pandas library and convert the word-count dictionary into a DataFrame:

import pandas as pd
df = pd.DataFrame(count_dictionary)
df.sort_values(by='count', ascending=False)
     word        count
19   the         62
428  to          42
444  of          35
321  and         31
493  data        29
...  ...         ...
268  used        1
91   officials,  1
266  ensure      1
265  recognises  1
534  when        1

535 rows × 2 columns

So we now have a DataFrame holding the occurrence frequency of each word in the article.
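As a side note, calling list.count once per unique word rescans the whole list every time. A minimal sketch of a one-pass alternative using collections.Counter from the standard library is shown here; sample_words is a hypothetical stand-in for the article's token list, not data from the API response.

```python
from collections import Counter

# Hypothetical sample standing in for the article's token list
sample_words = ['the', 'data', 'is', 'the', 'data', 'the']

# Counter tallies every token in a single pass over the list
word_counts = Counter(sample_words)

# most_common() returns (word, count) pairs, highest count first
sorted_counts = word_counts.most_common()
print(sorted_counts)
```

The resulting list of pairs can be handed to pd.DataFrame(sorted_counts, columns=['word', 'count']) to get the same kind of table as above, already sorted by count.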

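But punctuation attached to the words distorts these counts. For example, 'identifiers,' and 'surgeries.' are stored with their trailing punctuation, and 'data', 'data,' and 'data.' are counted as three different words.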

One option to fix this would be to strip out the punctuation using Python string manipulation. But you could also use regular expressions to remove the punctuation. Below is a hacky example, but you can probably find a better solution.

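import re  # regular expression library
# Drop the <p> tags, then remove every character that is neither a word character nor whitespace
body_without_p_tags = body.replace('<p>', '').replace('</p>', '')
words_wo_punctuation = re.sub(r'[^\w\s]', '', body_without_p_tags).split()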

Note that the regex r'[^\w\s]' replaces every character in body that is neither a word character (\w) nor whitespace (\s) with the empty string ''.
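To see what this pattern actually does, here is a quick check on a single sentence. The sample string is made up in the style of the article body (it is not taken from the API response), so treat it as an illustration only.

```python
import re

# Hypothetical sentence in the style of the article body
sample = '<p>“If it is rich data, it’s easier to re-identify.”</p>'

# Same two steps as above: drop the <p> tags, then remove non-word, non-space characters
cleaned = re.sub(r'[^\w\s]', '', sample.replace('<p>', '').replace('</p>', ''))
print(cleaned.split())
# → ['If', 'it', 'is', 'rich', 'data', 'its', 'easier', 'to', 'reidentify']
```

The quotes and the comma are gone, but so are the apostrophe in it’s and the hyphen in re-identify, which is part of why this approach is hacky.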

unique_words = list(set(words_wo_punctuation))
print(len(unique_words))
count_dictionary = {'word': unique_words, 'count': [words_wo_punctuation.count(w) for w in unique_words]}
479
df = pd.DataFrame(count_dictionary)
df.sort_values(by='count', ascending=False)
     word           count
15   the            62
372  to             42
390  of             35
436  data           34
283  and            31
...  ...            ...
183  Patients       1
181  full           1
178  international  1
176  Boston         1
478  when           1

479 rows × 2 columns
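The string-manipulation option mentioned earlier can be sketched roughly as follows. It trims punctuation only from the two ends of each token, so hyphens and apostrophes inside a word survive. The names strip_punctuation, PUNCTUATION and sample_tokens are hypothetical, and the extra characters added to string.punctuation are an assumption based on the curly quotes and dashes visible in the output above.

```python
import string

# ASCII punctuation plus the curly quotes, dash and bullet seen in the article text;
# this extended set is an assumption, adjust it to the text being cleaned
PUNCTUATION = string.punctuation + '“”‘’–•'

def strip_punctuation(tokens):
    """Trim punctuation from both ends of each token and drop empty results."""
    stripped = (token.strip(PUNCTUATION) for token in tokens)
    return [token for token in stripped if token]

# Hypothetical tokens in the style of the split article body
sample_tokens = ['“If', 'data,', 're-identify', 'it’s', '–', 'anonymous.”']
print(strip_punctuation(sample_tokens))
# → ['If', 'data', 're-identify', 'it’s', 'anonymous']
```

Whether this is better than the regex depends on what you want to count: here re-identify and it’s keep their original spelling, while a token that is pure punctuation (the lone dash) is dropped entirely.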


