




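Transformers and LLMs are slow and expensive. Using OpenAI's API costs a lot of money and can be impractical because it is fairly slow. You can of course self-host a smaller transformer model, but it will still need substantial compute if you want things to be snappy and responsive, which is usually a requirement in production.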




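2.1 Data

To prove my point I am going to cheat a little: I will use a labelled dataset to demonstrate the effectiveness of the method I propose. I will only use the labels for evaluation, though. The whole process of building the pipeline will rely on unsupervised learning and our own human intuition. The dataset is 20 Newsgroups, which you can load easily with scikit-learn.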

pip install scikit-learn
from sklearn.datasets import fetch_20newsgroups
import numpy as np

newsgroups = fetch_20newsgroups(subset="all")
corpus = newsgroups.data

# Sklearn gives the labels back as integers, we have to map them back to
# the actual textual label.
group_labels = [newsgroups.target_names[label] for label in newsgroups.target]

array(['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
       'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
       'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles',
       'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt',
       'sci.electronics', 'sci.med', 'sci.space',
       'soc.religion.christian', 'talk.politics.guns',
       'talk.politics.mideast', 'talk.politics.misc',
       'talk.religion.misc'], dtype='<U24')


from sklearn.model_selection import train_test_split
is_about_space = np.array(group_labels) == "sci.space"

X_train, X_test, y_train, y_test = train_test_split(corpus, is_about_space)
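
The evaluation report later in this section shows 237 positive documents out of 4712 in the test set, so the "about space" class is small. One optional variation, which the split above does not use, is to pass `stratify` so that both halves keep the same share of positives. The sketch below illustrates this on a made-up stand-in corpus (`docs` and `labels` are hypothetical names, not part of the original code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the corpus: 100 documents, 20 of them positive.
docs = [f"document {i}" for i in range(100)]
labels = np.array([i < 20 for i in range(100)])

# stratify=labels keeps the positive rate the same in train and test.
docs_train, docs_test, labels_train, labels_test = train_test_split(
    docs, labels, stratify=labels, random_state=42
)

print("train positive rate:", labels_train.mean())
print("test positive rate:", labels_test.mean())
```

Whether stratification matters here is a judgment call. With thousands of test documents a plain random split is usually close enough.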

2.2 Unsupervised model


pip install topicwizard

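Creating a topic pipeline in topicwizard amounts to chaining a vectorizer and a decomposition model together.

As the vectorizer we will use scikit-learn's built-in CountVectorizer, set some reasonable-looking default frequency cutoffs, and filter out English stop words.

We will use non-negative matrix factorization (NMF) as our topic model, since it is very fast (both training and inference) and usually works reasonably well. I will go with 30 topics, which is a completely arbitrary number.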

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
from topicwizard.pipeline import make_topic_pipeline

# Setting up topic modelling pipeline
vectorizer = CountVectorizer(max_df=0.5, min_df=10, stop_words="english")
# NMF topic model with 30 topics
nmf = NMF(n_components=30)

topic_pipeline = make_topic_pipeline(vectorizer, nmf)
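
To get a feel for what such a chain produces, here is a minimal sketch that uses a plain scikit-learn pipeline on a tiny made-up corpus. It is only an illustration of the vectorizer-plus-decomposition idea, not topicwizard's own pipeline class; `toy_corpus` and `toy_pipeline` are hypothetical names, and the frequency cutoffs are left out because the corpus is so small.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Six made-up documents: three about space, three about sports.
toy_corpus = [
    "the rocket launch reached orbit",
    "nasa satellite launch orbit",
    "satellite orbit space nasa",
    "hockey team won the game",
    "the team scored in the hockey game",
    "baseball game season team",
]

# Bag-of-words counts followed by NMF with two topics.
toy_pipeline = make_pipeline(
    CountVectorizer(stop_words="english"),
    NMF(n_components=2, random_state=0, max_iter=500),
)

# One row per document, one non-negative weight per topic.
doc_topic = toy_pipeline.fit_transform(toy_corpus)
print(doc_topic.shape)
print(doc_topic.round(2))
```

Each row is a document and each column a topic. The larger a value, the more that topic contributes to the document, which is what the threshold rule later in this section relies on.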

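2.3 Model interpretation

topicwizard comes with a number of built-in visualizations for interpreting topic models. For now we are mainly interested in which topics are likely to contain most of the space-related words.

For this we will use topicwizard's figures API. First, we look at the words most relevant to each topic by creating a set of bar charts.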

from topicwizard.figures import topic_barcharts

topic_barcharts(X_train, pipeline=topic_pipeline, top_n=5)



from topicwizard.figures import word_map

word_map(X_train, pipeline=topic_pipeline, top_n=5)


We can also check which topics the words "space" and "astro" belong to, along with their 20 closest associations. We will only show the top 8 topics.

from topicwizard.figures import word_association_barchart

fig = word_association_barchart(
  ["space", "astro"],



import plotly.express as px


topic_df = topic_pipeline.transform(X_train)

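We can see that the vast majority of texts score below 0.1. Let's try setting the threshold to 0.05 and look at a random sample of the texts that pass it.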

topic_df["content"] = X_train
sample = topic_df[topic_df["23_satellite_space_launch_nasa"] > 0.05].content.sample(10)
for text in sample:
    # Print only the first 200 characters of each post
    print(text[:200])
From: CPKJP@vm.cc.latech.edu (Kevin Parker)
Subject: Insurance Rates on Performance Cars SUMMARY
Organization: Louisiana Tech University
Lines: 244
NNTP-Posting-Host: vm.cc.latech.edu
X-Newsreader: NN
From: pjs@euclid.JPL.NASA.GOV (Peter J. Scott)
Subject: Re: Did Microsoft buy Xhibition??
Organization: Jet Propulsion Laboratory, NASA/Caltech
Lines: 8
Distribution: world
Reply-To: pjs@euclid.jpl.na
From: ml@chiron.astro.uu.se (Mats Lindgren)
Subject: Re: Comet in Temporary Orbit Around Jupiter?
Organization: Uppsala University
Lines: 14
Distribution: world
NNTP-Posting-Host: chiron.astro.uu.se

From: mike@gordian.com (Michael A. Thomas)
Subject: Re: The Role of the National News Media in Inflaming Passions
Organization: Gordian; Costa Mesa, CA
Distribution: ca
Lines: 13

In article <1qjtmjIN
From: leech@cs.unc.edu (Jon Leech)
Subject: Space FAQ 04/15 - Calculations
Supersedes: <math_730956451@cs.unc.edu>
Organization: University of North Carolina, Chapel Hill
Lines: 334
Distribution: worl
From: wls@calvin.usc.edu (Bill Scheding)
Subject: Re: "Full page" PB screen
Organization: University of Southern California, Los Angeles, CA
Lines: 14
Distribution: world
NNTP-Posting-Host: calvin.usc
From: ghelf@violet.berkeley.edu (;;;;RD48)
Subject: Re: Soyuz and Shuttle Comparisons
Organization: University of California, Berkeley
Lines: 11
NNTP-Posting-Host: violet.berkeley.edu

Are you guys ta
From: gsh7w@fermi.clas.Virginia.EDU (Greg Hennessy)
Subject: Re: Keeping Spacecraft on after Funding Cuts.
Organization: University of Virginia
Lines: 13

In article <1r6aqr$dnv@access.digex.net> prb@
From: oeth6050@iscsvax.uni.edu
Subject: ****COMIC BOOK SALE****
Organization: University of Northern Iowa
Lines: 36

        my name is John and I have the following comic books for sale - plea
From: shafer@rigel.dfrf.nasa.gov (Mary Shafer)
Subject: Re: Inner Ear Problems from Too Much Flying?
Article-I.D.: rigel.SHAFER.93Apr6095951
Organization: NASA Dryden, Edwards, Cal.
Lines: 33
Hmm, some of these texts do not seem to have much to do with space. Let's set a higher threshold of 0.15.

From: gene@theporch.raider.net (Gene Wright)
Subject: NASA Special Publications for Voyager Mission?
Organization: The MacInteresteds of Nashville, Tn.
Lines: 12

I have two books, both NASA Special P
From: 18084TM@msu.edu (Tom)
Subject: Billsats
X-Added: Forwarded by Space Digest
Organization: [via International Space University]
Distribution: sci
From: pww@spacsun.rice.edu (Peter Walker)
Subject: Re: The Universe and Black Holes, was Re: 2000 years.....
Organization: I didn't do it, nobody saw me, you can't prove a thing.
Lines: 28

In article
From: da709@cleveland.Freenet.Edu (Stephen Amadei)
Subject: Project Help
Organization: Case Western Reserve University, Cleveland, Ohio (USA)
Lines: 17
NNTP-Posting-Host: hela.ins.cwru.edu

From: dbm0000@tm0006.lerc.nasa.gov (David B. Mckissock)
Subject: Washington Post Article on SSF Redesign
News-Software: VAX/VMS VNEWS 1.41    
Nntp-Posting-Host: tm0006.lerc.nasa.gov
Organization: NAS
From: u920496@daimi.aau.dk (Hans Erik Martino Hansen)
Subject: Commercials on the Moon
Organization: DAIMI: Computer Science Department, Aarhus University, Denmark
Lines: 16

I have often thought abou
From: wb8foz@skybridge.SCL.CWRU.Edu (David Lesher)
Subject: Re: No. Re: Space Marketing would be wonderfull.
Organization: NRK Clinic for habitual NetNews abusers - Beltway Annex
Lines: 11
Reply-To: w
From: 18084TM@msu.edu (Tom)
Subject: Solid state vs. tube/analog
X-Added: Forwarded by Space Digest
Organization: [via International Space University]
From: pgf@srl03.cacs.usl.edu (Phil G. Fraering)
Subject: Re: Gamma Ray Bursters. positional stuff.
Organization: Univ. of Southwestern Louisiana
Lines: 24

belgarath@vax1.mankato.msus.edu writes:

From: rnichols@cbnewsg.cb.att.com (robert.k.nichols)
Subject: Re: Permanaent Swap File with DOS 6.0 dbldisk
Summary: PageOverCommit=factor
Organization: AT&T
Lines: 50

In article <93059@hydra.gatech.

These do seem to be related to space, so let's keep 0.15 as the threshold.
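
Besides reading samples, it can help to see how many documents each candidate threshold lets through. The following is a small sketch on made-up topic weights; `toy_topic_df` and `count_above` are hypothetical names, and only the column name is taken from the code above.

```python
import pandas as pd

topic_column = "23_satellite_space_launch_nasa"

# Made-up topic weights for six documents, for illustration only.
toy_topic_df = pd.DataFrame({topic_column: [0.0, 0.02, 0.06, 0.12, 0.2, 0.4]})


def count_above(df, column, thresholds):
    """Return how many rows have a weight strictly above each threshold."""
    return {t: int((df[column] > t).sum()) for t in thresholds}


counts_by_threshold = count_above(toy_topic_df, topic_column, [0.05, 0.1, 0.15])
print(counts_by_threshold)  # {0.05: 4, 0.1: 3, 0.15: 2}
```

On the real data you would pass the topic DataFrame instead of the toy one. The counts only tell you how selective a threshold is, so they complement reading the sampled texts rather than replace it.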

2.4 Classification pipeline



To do this, we have to freeze the topic model so that it is not trained when fit() is called on the pipeline.

pip install human-learn
from hulearn.classification import FunctionClassifier
from sklearn.pipeline import make_pipeline

# Creating rule for classifying something as a space document
def space_rule(df, threshold=0.15):
    return df["23_satellite_space_launch_nasa"] > threshold

# Freezing topic pipeline
topic_pipeline.freeze = True
classifier = FunctionClassifier(space_rule)
cls_pipeline = make_pipeline(topic_pipeline, classifier).fit(X_train)
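
Conceptually, wrapping a rule as a classifier means that fitting learns nothing and predicting just applies the function. The sketch below shows that idea with a hypothetical `RuleClassifier` class. It is not human-learn's implementation, only a stand-in to make the mechanics visible, and the toy weights are made up.

```python
import numpy as np
import pandas as pd


class RuleClassifier:
    """Hypothetical stand-in: wraps a rule function as a classifier."""

    def __init__(self, rule):
        self.rule = rule

    def fit(self, X, y=None):
        # Nothing to learn: the rule is written by a human.
        return self

    def predict(self, X):
        # Apply the rule row-wise and return a plain boolean array.
        return np.asarray(self.rule(X))


def space_rule(df, threshold=0.15):
    return df["23_satellite_space_launch_nasa"] > threshold


# Made-up topic weights for four documents.
toy = pd.DataFrame({"23_satellite_space_launch_nasa": [0.01, 0.2, 0.15, 0.6]})
predictions = RuleClassifier(space_rule).fit(toy).predict(toy)
print(predictions)  # [False  True False  True]
```

Note that the comparison is strict, so a document sitting exactly at 0.15 is not classified as a space document.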




from sklearn.metrics import classification_report

y_pred = cls_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

       False       0.98      0.98      0.98      4475
        True       0.65      0.70      0.68       237

    accuracy                           0.97      4712
   macro avg       0.82      0.84      0.83      4712
weighted avg       0.97      0.97      0.97      4712
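
The row to focus on is `True`: a precision of 0.65 means that roughly two thirds of the documents flagged as space-related really are, and a recall of 0.70 means the rule finds about 70% of the actual space documents. As a reminder of how these numbers are derived, here is a minimal sketch that computes them by hand on made-up labels (`binary_metrics` and the toy arrays are hypothetical, not the tutorial's data):

```python
import numpy as np


def binary_metrics(y_true, y_pred):
    """Precision, recall and F1 for the positive (True) class."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))    # flagged and really positive
    fp = int(np.sum(~y_true & y_pred))   # flagged but actually negative
    fn = int(np.sum(y_true & ~y_pred))   # positive but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Made-up labels: 3 positives, of which 2 are found, plus 1 false alarm.
toy_true = [True, True, True, False, False, False, False, False]
toy_pred = [True, True, False, True, False, False, False, False]

precision, recall, f1 = binary_metrics(toy_true, toy_pred)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```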



