How many pages do you have to read to know 90% of a book's vocabulary?

Author: Roman Kierzkowski

This article is part of the research we do at Vocapouch, where we work mainly for the benefit of language learners; the results below were also published on our blog.

My English teacher used to say that if I got through the first 20 pages of a book, the rest would come easily, because the vocabulary appearing on those 20 pages makes up 90% of the vocabulary of the whole book. So once you are past that hurdle, you no longer have to keep flipping back and forth to a dictionary.

Was he right?

Let's verify it with four books:

  • The Secret Adversary, a detective novel by Agatha Christie
  • Eve's Diary, a short story by Mark Twain
  • Ulysses, a novel by James Joyce
  • Finnegans Wake, another novel by James Joyce

Reading the books

%pylab inline

from __future__ import unicode_literals

import codecs
from collections import Counter

import spacy
import seaborn as sns
from matplotlib import pyplot


nlp = spacy.load('en')

Populating the interactive namespace from numpy and matplotlib

To analyze the text we use spaCy. We drop everything that is not a word and keep only the canonical form (lemma) of each word. For example, went becomes go, plays becomes play, and so on. All letters are lowercased, and every pronoun becomes '-PRON-'. As a result, each of these variants is counted as a single word.

def extract_words(path):
    with codecs.open(path, encoding='utf-8', mode="r") as book:
        content = book.read()
        doc = nlp(content)
        # Keep only alphabetic tokens, reduced to their lemmas.
        return [token.lemma_ for token in doc if token.is_alpha]
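As a quick sanity check (our own snippet, not part of the original notebook), you can run the pipeline on a short sentence and inspect the lemmas it produces:

doc = nlp("She went to the plays")
print([token.lemma_ for token in doc if token.is_alpha])
# With the 'en' model loaded above, this prints something along the lines of:
# ['-PRON-', 'go', 'to', 'the', 'play']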

ulysses = extract_words('ulysses.txt')
eves_diary = extract_words('eves_diary.txt')
the_secret_adversary = extract_words('the_secret_adversary.txt')
finnegans = extract_words('finnegans.txt')

Counting word coverage

To compute word coverage, we go through the book page by page and check what fraction of the entire book is made up of words that have already occurred up to that page. We also track the fraction of the book's unique words encountered so far.

WPP = 300  # assumed number of words per page

def count_coverage(words, wpp=WPP):
    coverage = []    # per page: fraction of the whole book covered by words seen so far
    uniqueness = []  # per page: fraction of the book's unique words seen so far
    occurances = Counter(words)  # occurrences of each word in the whole book
    counter = Counter()          # words encountered up to the current page
    total = float(len(words))
    total_uniq = float(len(occurances.keys()))
    for n in xrange(len(words) // wpp):
        page = words[n*wpp:(n+1)*wpp]
        counter.update(page)
        # How many of the book's word occurrences belong to words seen so far.
        s = sum((occurances[w] for w in counter.keys()))
        coverage.append(s/total)
        uniqueness.append(len(counter.keys())/total_uniq)
    return occurances, coverage, uniqueness
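A tiny worked example (ours, for illustration) makes the two returned series concrete. Here a twelve-word "book" is split into 4-word pages:

toy = "the cat sat on the mat the cat ran off the mat".split()
_, cov, uniq = count_coverage(toy, wpp=4)
print(cov)   # [0.666..., 0.833..., 1.0]: words seen so far cover this share of the book
print(uniq)  # [0.571..., 0.714..., 1.0]: share of the 7 unique words seen so far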

To get a sense of how hard each of the examined books really is, our measure is the number of pages you have to read to reach 90% word coverage.

def calculate_hardness(coverage):
    # Find the first page (0-based index) whose coverage exceeds 90%.
    for i in range(len(coverage)):
        if coverage[i] > 0.9:
            break
    # Express that page's position as a percentage of the whole book.
    hardness = (i / float(len(coverage))) * 100
    return i, hardness
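For instance (a made-up coverage curve, purely to illustrate the return value):

cov = [0.50, 0.70, 0.85, 0.92, 0.95]
print(calculate_hardness(cov))
# (3, 60.0): coverage first exceeds 90% on page index 3,
# which is 60% of the way through this five-page "book"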

Testing the books

def analyze_book(words, title):
    occurances, coverage, uniqueness = count_coverage(words)
    page, hardness = calculate_hardness(coverage)
    file_name = title.lower().replace(' ', '_').replace('\'', '') + '.png'

    print("Number of Pages: %.0f" % (len(words) / WPP))
    print("Number of Total Words: %s" % len(words))
    print("Number of Unique Words: %s" % len(occurances.keys()))
    print("You will know 90%% of words after %s pages which are %.2f%% of the book." % (page, hardness))
    print("At that page, you will know %.2f%% of unique words." % (uniqueness[page] * 100, ))

    # Plot both curves (fractions of 1.0) and save the figure next to the notebook.
    pyplot.plot(coverage, color='b', label="All words")
    pyplot.plot(uniqueness, color='g', label="Unique words")
    pyplot.legend(loc=4)
    pyplot.title(title)
    pyplot.xlabel('Page')
    pyplot.ylabel('Coverage')
    pyplot.savefig(file_name)

analyze_book(the_secret_adversary, title="The Secret Adversary")

Number of Pages: 250
Number of Total Words: 75208
Number of Unique Words: 5248
You will know 90% of words after 40 pages which are 16.00% of the book.
At that page, you will know 39.21% of unique words.

analyze_book(eves_diary, title="Eve's Diary")

Number of Pages: 22
Number of Total Words: 6858
Number of Unique Words: 1104
You will know 90% of words after 9 pages which are 40.91% of the book.
At that page, you will know 56.70% of unique words.

analyze_book(finnegans, title="Finnegans Wake")

Number of Pages: 729
Number of Total Words: 218793
Number of Unique Words: 50872
You will know 90% of words after 387 pages which are 53.09% of the book.
At that page, you will know 60.64% of unique words.
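The conclusion below also cites figures for Ulysses; that output is missing from this copy of the article, but the corresponding call (on the ulysses word list loaded earlier) would simply be:

analyze_book(ulysses, title="Ulysses")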

Conclusion

New words keep appearing the further you read, but once you are past the first pages, the words you have already seen cover most of the rest of the book. However, the exact number of pages you need depends on the length of the book and on how varied the author's language is. The hardest book was Ulysses: you need to read 221 pages of it (25% of the book). (Finnegans Wake, shown above, is a still more extreme outlier at 387 pages, 53% of the book.) So it is not 20 pages, but my teacher was right to a degree.
