Type: Package
Title: Semi-Supervised Model for Geographical Document Classification
Version: 0.9.2
Maintainer: Kohei Watanabe <watanabe.kohei@gmail.com>
Description: Semi-supervised model for geographical document classification (Watanabe 2018) <doi:10.1080/21670811.2017.1293487>. This package currently contains seed dictionaries in English, German, French, Spanish, Portuguese, Italian, Russian, Hebrew, Arabic, Turkish, Japanese and Chinese (Simplified and Traditional).
License: MIT + file LICENSE
URL: https://github.com/koheiw/newsmap
BugReports: https://github.com/koheiw/newsmap/issues
LazyData: TRUE
Encoding: UTF-8
Depends: R (≥ 3.5), methods
Imports: utils, Matrix, quanteda (≥ 2.1), quanteda.textstats, stringi
Suggests: testthat
Language: en-GB
RoxygenNote: 7.3.2
NeedsCompilation: no
Packaged: 2025-07-10 08:04:17 UTC; watan
Author: Kohei Watanabe [aut, cre, cph], Stefan Müller [aut], Dani Madrid-Morales [aut], Katerina Tertytchnaya [aut], Ke Cheng [aut], Chung-hong Chan [aut], Claude Grasland [aut], Giuseppe Carteny [aut], Elad Segev [aut], Dai Yamao [aut], Barbara Ellynes Zucchi Nobre Silva [aut], Lanabi la Lova [aut], Lungta Seki [aut]
Repository: CRAN
Date/Publication: 2025-07-10 12:50:12 UTC

Evaluate classification accuracy in precision and recall

Description

Evaluate classification accuracy in precision and recall

Usage

accuracy(x, y)

Arguments

x

vector of predicted classes

y

vector of true classes

Examples

class_pred <- c('US', 'GB', 'US', 'CN', 'JP', 'FR', 'CN') # prediction
class_true <- c('US', 'FR', 'US', 'CN', 'KP', 'EG', 'US') # true class
acc <- accuracy(class_pred, class_true)
print(acc)
summary(acc)
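
For readers who want to see what precision and recall mean for these vectors, the per-class counts can be reproduced in base R. This is a minimal sketch under the usual definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)); it does not call accuracy() and is not claimed to mirror its internals. The helper name count_class is hypothetical.

```r
class_pred <- c("US", "GB", "US", "CN", "JP", "FR", "CN") # prediction
class_true <- c("US", "FR", "US", "CN", "KP", "EG", "US") # true class

# Hypothetical helper: true positives, false positives and
# false negatives for a single class
count_class <- function(cl, pred, true) {
  c(tp = sum(pred == cl & true == cl),
    fp = sum(pred == cl & true != cl),
    fn = sum(pred != cl & true == cl))
}

classes <- sort(union(class_pred, class_true))
counts <- t(sapply(classes, count_class, pred = class_pred, true = class_true))

# Classes that are never predicted (or never true) give 0/0 = NaN
precision <- counts[, "tp"] / (counts[, "tp"] + counts[, "fp"])
recall <- counts[, "tp"] / (counts[, "tp"] + counts[, "fn"])
print(cbind(counts, precision, recall))
```

In this sketch "US" is predicted twice and both predictions are correct, but one true "US" document is predicted as "CN", so recall for "US" is lower than its precision.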

Compute average feature entropy (AFE)

Description

AFE computes the randomness of feature occurrences in labelled documents.

Usage

afe(x, y, smooth = 1)

Arguments

x

a dfm for features

y

a dfm for labels

smooth

a numeric value for smoothing to include all the features
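
The idea of "randomness of occurrences" can be illustrated with Shannon entropy over smoothed counts. The sketch below is an assumption made for illustration only: the exact formula used by afe() (log base, how values are averaged) is not stated in this section, and the helper smoothed_entropy and the counts are made up.

```r
# Hypothetical helper: Shannon entropy (in bits) of smoothed counts
smoothed_entropy <- function(counts, smooth = 1) {
  p <- (counts + smooth) / sum(counts + smooth)
  p <- p[p > 0]  # treat 0 * log(0) as 0
  -sum(p * log2(p))
}

# Made-up counts of two features across three labels
concentrated <- c(10, 0, 0)  # occurs under one label only
dispersed <- c(4, 3, 3)      # occurs under all labels

smoothed_entropy(concentrated)
smoothed_entropy(dispersed)
```

A feature that occurs under a single label has lower entropy than one spread across labels; the smoothing value keeps features with zero counts in the calculation.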


Coerce various objects to coefficients_textmodel

Description

Coerce various objects to coefficients_textmodel. This is a helper function used in summary.textmodel_*.

Usage

as.coefficients_textmodel(x)

Arguments

x

an object to be coerced


Coerce various objects to statistics_textmodel

Description

This is a helper function used in summary.textmodel_*.

Usage

as.statistics_textmodel(x)

Arguments

x

an object to be coerced


Assign the summary.textmodel class to a list

Description

Assign the summary.textmodel class to a list

Usage

as.summary.textmodel(x)

Arguments

x

a named list


Extract coefficients for features

Description

Extract coefficients for features

Usage

## S3 method for class 'textmodel_newsmap'
coef(object, n = 10, select = NULL, ...)

## S3 method for class 'textmodel_newsmap'
coefficients(object, n = 10, select = NULL, ...)

Arguments

object

a Newsmap model fitted by textmodel_newsmap().

n

the number of coefficients to extract.

select

returns the coefficients for the selected class; specify by the names of rows in object$model.

...

not used.
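
As a usage sketch, the calls below rebuild the small model from the textmodel_newsmap() example and extract coefficients with the documented arguments. It assumes the quanteda and newsmap packages are installed; the class names are read from the row names of the model rather than hard-coded.

```r
library(quanteda)
library(newsmap)

text_en <- c(text1 = "This is an article about Ireland.",
             text2 = "The South Korean prime minister was re-elected.")
toks_en <- tokens(text_en)
label_dfm_en <- dfm(tokens_lookup(toks_en, data_dictionary_newsmap_en, levels = 3))
feat_dfm_en <- dfm(toks_en, tolower = FALSE)
model_en <- textmodel_newsmap(feat_dfm_en, label_dfm_en)

# Top three coefficients for every class
coef(model_en, n = 3)

# Class names are the row names of model_en$model
classes <- rownames(model_en$model)
coef(model_en, n = 3, select = classes[1])
```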


Seed geographical dictionary in Arabic

Description

Seed geographical dictionary in Arabic

Author(s)

Dai Yamao daiyamao@scs.kyushu-u.ac.jp


Seed geographical dictionary in German

Description

Seed geographical dictionary in German

Author(s)

Stefan Müller mullers@tcd.ie


Seed geographical dictionary in English

Description

Seed geographical dictionary in English

Author(s)

Kohei Watanabe watanabe.kohei@gmail.com


Seed geographical dictionary in Spanish

Description

Seed geographical dictionary in Spanish

Author(s)

Dani Madrid-Morales dani.madrid@my.cityu.edu.hk


Seed geographical dictionary in French

Description

Seed geographical dictionary in French

Author(s)

Claude Grasland claude.grasland@parisgeo.cnrs.fr


Seed geographical dictionary in Hebrew

Description

Seed geographical dictionary in Hebrew

Author(s)

Elad Segev eladseg@gmail.com


Seed geographical dictionary in Italian

Description

Seed geographical dictionary in Italian

Author(s)

Giuseppe Carteny giuseppe.carteny@unimi.it


Seed geographical dictionary in Japanese

Description

Seed geographical dictionary in Japanese

Author(s)

Kohei Watanabe watanabe.kohei@gmail.com


Seed geographical dictionary in Portuguese

Description

Seed geographical dictionary in Portuguese

Author(s)

Barbara Ellynes Zucchi Nobre Silva barbara@zucchi.science


Seed geographical dictionary in Russian

Description

Seed geographical dictionary in Russian

Author(s)

Katerina Tertytchnaya katerina.tertytchnaya@gmail.com

Lanabi la Lova l.lalova@lse.ac.uk


Seed geographical dictionary in Turkish

Description

Seed geographical dictionary in Turkish

Author(s)

Lungta Seki yahoo.co.jp0409@gmail.com


Seed geographical dictionary in Chinese (simplified)

Description

Seed geographical dictionary in Chinese (simplified)

Author(s)

Ke Cheng kecheng.ac@gmail.com


Seed geographical dictionary in Chinese (traditional)

Description

Seed geographical dictionary in Chinese (traditional)

Author(s)

Chung-hong Chan chainsawtiney@gmail.com


Prediction method for textmodel_newsmap

Description

Predict document classes using a trained Newsmap model.

Usage

## S3 method for class 'textmodel_newsmap'
predict(
  object,
  newdata = NULL,
  confidence = FALSE,
  rank = 1L,
  type = c("top", "all"),
  rescale = FALSE,
  min_conf = -Inf,
  min_n = 0L,
  ...
)

Arguments

object

a fitted Newsmap textmodel.

newdata

dfm on which prediction should be made.

confidence

if TRUE, returns likelihood ratio scores.

rank

rank of the class to be predicted. Only used when type = "top".

type

if "top", returns the most likely class specified by rank; if "all", returns a matrix of likelihood ratio scores for all possible classes.

rescale

if TRUE, likelihood ratio scores are normalized using scale(). This affects both types of results.

min_conf

returns NA when confidence is lower than this value.

min_n

set the minimum number of polarity words in documents.

...

not used.
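
The arguments above can be tried on the small model from the textmodel_newsmap() example. This is a usage sketch that assumes the quanteda and newsmap packages are installed; it uses only the arguments documented in this section and makes no claim about the exact structure of the returned objects beyond what is stated there.

```r
library(quanteda)
library(newsmap)

text_en <- c(text1 = "This is an article about Ireland.",
             text2 = "The South Korean prime minister was re-elected.")
toks_en <- tokens(text_en)
label_dfm_en <- dfm(tokens_lookup(toks_en, data_dictionary_newsmap_en, levels = 3))
feat_dfm_en <- dfm(toks_en, tolower = FALSE)
model_en <- textmodel_newsmap(feat_dfm_en, label_dfm_en)

predict(model_en)                     # most likely class per document
predict(model_en, rank = 2L)          # second most likely class
predict(model_en, confidence = TRUE)  # request likelihood ratio scores

# Likelihood ratio scores for all classes, one row per document
scores <- predict(model_en, type = "all")
dim(scores)
```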


Print methods for textmodel features estimates

Description

Print methods for textmodel features estimates. This is a helper function used in print.summary.textmodel.

Usage

## S3 method for class 'coefficients_textmodel'
print(x, digits = max(3L, getOption("digits") - 3L), ...)

Arguments

x

a coefficients_textmodel object

digits

minimal number of significant digits, see print.default()

...

additional arguments not used


Implements print methods for textmodel_statistics

Description

Implements print methods for textmodel_statistics

Usage

## S3 method for class 'statistics_textmodel'
print(x, digits = max(3L, getOption("digits") - 3L), ...)

Arguments

x

a statistics_textmodel object

digits

minimal number of significant digits, see print.default()

...

further arguments passed to or from other methods


Print method for summary.textmodel

Description

Print method for summary.textmodel

Usage

## S3 method for class 'summary.textmodel'
print(x, digits = max(3L, getOption("digits") - 3L), ...)

Arguments

x

a summary.textmodel object

digits

minimal number of significant digits, see print.default()

...

additional arguments not used


Calculate micro and macro average measures of accuracy

Description

This function calculates micro-average precision (p) and recall (r) and macro-average precision (P) and recall (R) based on a confusion matrix from accuracy().

Usage

## S3 method for class 'textmodel_newsmap_accuracy'
summary(object, ...)

Arguments

object

output of accuracy()

...

not used.
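
The difference between the two kinds of average can be shown with a few lines of base R. This is a sketch under the standard definitions, not the package's own code. The counts are those implied by the class_pred and class_true vectors in the accuracy() example, and dropping classes with an undefined 0/0 ratio from the macro-average is an assumption.

```r
# Per-class true positives, false positives and false negatives
tp <- c(CN = 1, EG = 0, FR = 0, GB = 0, JP = 0, KP = 0, US = 2)
fp <- c(CN = 1, EG = 0, FR = 1, GB = 1, JP = 1, KP = 0, US = 0)
fn <- c(CN = 0, EG = 1, FR = 1, GB = 0, JP = 0, KP = 1, US = 1)

# Micro-average: pool the counts over classes, then take the ratio
micro_p <- sum(tp) / sum(tp + fp)
micro_r <- sum(tp) / sum(tp + fn)

# Macro-average: take the ratio per class, then average
# (assumption: classes with an undefined 0/0 ratio are dropped)
macro_p <- mean(tp / (tp + fp), na.rm = TRUE)
macro_r <- mean(tp / (tp + fn), na.rm = TRUE)

round(c(p = micro_p, r = micro_r, P = macro_p, R = macro_r), 3)
```

Micro-averages weight every document equally, so frequent classes dominate; macro-averages weight every class equally, so rare classes count as much as common ones.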


Semi-supervised Bayesian multinomial model for geographical document classification

Description

Train a Newsmap model to predict the geographical focus of documents with labels given by a dictionary.

Usage

textmodel_newsmap(
  x,
  y,
  label = c("all", "max"),
  smooth = 1,
  boolean = FALSE,
  drop_label = TRUE,
  verbose = quanteda_options("verbose"),
  entropy = c("none", "global", "local", "average"),
  ...
)

Arguments

x

a dfm or fcm created by quanteda::dfm() or quanteda::fcm()

y

a dfm or a sparse matrix that records class membership of the documents. It can be created by applying quanteda::dfm_lookup() to x.

label

if "max", uses only labels for the maximum value in each row of y.

smooth

a value added to the frequency of words to smooth likelihood ratios.

boolean

if TRUE, only considers the presence or absence of features in each document to limit the impact of words repeated in a few documents.

drop_label

if TRUE, drops empty columns of y and ignores their labels.

verbose

if TRUE, shows progress of training.

entropy

[experimental] the scheme to compute the entropy used to regularize likelihood ratios. The entropy of features is computed over labels if "global", or over documents with the same labels if "local". Local entropy is averaged if "average". See the details.

...

additional arguments passed to internal functions.

Details

Newsmap learns associations between words and classes as likelihood ratios based on the features in x and the labels in y. Large likelihood ratios tend to concentrate on a small number of features, but the entropy of their frequencies over labels or documents helps to disperse the distribution.
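
To make "likelihood ratios" concrete, the sketch below scores each word for a class against all other classes combined, using additive smoothing. It is a simplified reading for illustration: the counts are made up, the helper log_ratio is hypothetical, and the package's exact estimator (including the entropy regularisation) is not reproduced here.

```r
# Made-up word counts aggregated by class (rows) and feature (columns)
counts <- matrix(c(8, 1, 1,
                   1, 6, 3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("ie", "kr"),
                                 c("Dublin", "Seoul", "minister")))
smooth <- 1

# Hypothetical helper: smoothed log-likelihood ratio of each word
# for one class against all other classes combined
log_ratio <- function(cl) {
  inside <- counts[cl, ] + smooth
  outside <- colSums(counts[rownames(counts) != cl, , drop = FALSE]) + smooth
  log(inside / sum(inside)) - log(outside / sum(outside))
}

scores <- t(sapply(rownames(counts), log_ratio))
round(scores, 2)
```

Positive scores mark words that are relatively more frequent in a class than elsewhere; the smoothing value plays the role described for the smooth argument by keeping the ratio finite for words with zero counts.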

References

Kohei Watanabe. 2018. "Newsmap: semi-supervised approach to geographical news classification." Digital Journalism 6(3): 294-309.

Examples

library(quanteda)
library(newsmap)
text_en <- c(text1 = "This is an article about Ireland.",
             text2 = "The South Korean prime minister was re-elected.")

toks_en <- tokens(text_en)
label_toks_en <- tokens_lookup(toks_en, data_dictionary_newsmap_en, levels = 3)
label_dfm_en <- dfm(label_toks_en)

feat_dfm_en <- dfm(toks_en, tolower = FALSE)

model_en <- textmodel_newsmap(feat_dfm_en, label_dfm_en)
predict(model_en)