In the previous posts, we introduced topic modeling with the Latent Dirichlet Allocation (LDA) model, covering both its theory and its applications in finding similar documents and similar users. In this post, we show another application: categorizing unlabeled documents, illustrated on a real data set.
Introduction
These days we collect a huge amount of data from social networks, but it is not categorized; in other words, it is unlabeled. Our task is to categorize these documents and give each category a meaningful name.
Our solution is to build a topic model with LDA and treat each topic as one category. The LDA model also gives the top keywords for each topic; from them we can work out what each topic means and name it.
Practice
Input data
The input data consists of about 6 million records from social network sources, in JSON format. Below is an example:
Each record has 5 attributes: meta_description, meta_title, media_type, link and text. There are 5 kinds of media_type. The attributes meta_description, meta_title, link and text can be null, depending on media_type.
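As a minimal sketch of how the two kinds of content could be pulled out of one record, assuming each record is a JSON object with the five attributes above. The sample record, the integer media_type and the helper name `extract_contents` are hypothetical; they are not taken from the real data.

```python
import json


def extract_contents(record):
    """Return (media_type, text, meta) for one record; null fields become ''."""
    text = record.get("text") or ""
    title = record.get("meta_title") or ""
    description = record.get("meta_description") or ""
    # "meta" content is the title and the description joined together.
    meta = " ".join(part for part in (title, description) if part)
    return record.get("media_type"), text, meta


# Hypothetical record, for illustration only.
raw = (
    '{"meta_description": null, "meta_title": "A title", "media_type": 4, '
    '"link": "http://example.com", "text": "Some post text"}'
)
media_type, text, meta = extract_contents(json.loads(raw))
print(media_type, repr(text), repr(meta))
```

Treating null and missing fields the same way keeps the later grouping step simple, since a record without meta fields just contributes an empty meta string.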
Preprocessing
Each record carries a media_type attribute (one of 5 values, 1-5, each corresponding to one kind of data) and two types of content: text and meta (meta_title combined with meta_description). So we group the documents by content type and media type. We also create joint groups by merging across content types and media types (full = meta + text, ALL = 1 + 2 + 3 + 4 + 5). With 3 content types and 6 media-type groupings, this gives up to 18 groups:
full-0 meta-0 text-0
full-1 meta-1 text-1
full-2 meta-2 text-2
full-3 meta-3 text-3
full-4 meta-4 text-4
full-5 meta-5 text-5
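The grouping above could be sketched as follows. This is only an illustration under assumptions: records are taken to be (media_type, text, meta) tuples, the function name `group_documents` is hypothetical, and the merged media-type group is labeled ALL as in the query output later in this post.

```python
from collections import defaultdict

CONTENT_TYPES = ("full", "meta", "text")


def group_documents(records):
    """Map group names such as 'text-4' or 'full-ALL' to lists of documents."""
    groups = defaultdict(list)
    for media_type, text, meta in records:
        # full = meta + text
        full = " ".join(part for part in (meta, text) if part)
        contents = {"full": full, "meta": meta, "text": text}
        for content_type in CONTENT_TYPES:
            doc = contents[content_type]
            if not doc:
                continue  # e.g. a record without meta fields adds nothing to meta-*
            groups[f"{content_type}-{media_type}"].append(doc)
            groups[f"{content_type}-ALL"].append(doc)
    return dict(groups)


# Hypothetical (media_type, text, meta) tuples.
records = [(1, "hello world", ""), (4, "new album out", "Band announces tour")]
groups = group_documents(records)
print(sorted(groups))
```

Each document lands in its own media-type group and in the matching ALL group, so one pass over the data fills all the groups at once.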
To build an LDA model we first need some preprocessing: extracting the contents from the raw data, grouping them, and building a stop-word list for each group. We build the stop words with frequency thresholds: words that appear too often or too rarely only add noise to the topic modeling process, so we treat them as stop words.
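One possible way to build such a stop-word list is sketched below. It assumes the thresholds are applied to document frequency (the number of documents a word appears in); the threshold values and the function name `build_stop_words` are hypothetical and would need tuning per group.

```python
from collections import Counter


def build_stop_words(tokenized_docs, min_docs=2, max_ratio=0.5):
    """Words found in fewer than min_docs documents or in more than max_ratio of them."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    limit = max_ratio * len(tokenized_docs)
    return {word for word, df in doc_freq.items() if df < min_docs or df > limit}


# Tiny hypothetical group of tokenized documents.
docs = [
    "the band plays the new song".split(),
    "the team wins the game".split(),
    "the band wins a prize".split(),
    "the game has a new song".split(),
]
stop_words = build_stop_words(docs)
print(sorted(stop_words))
```

In this toy group, "the" is dropped for appearing in every document, and words seen in a single document are dropped as too rare.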
Build LDA Model
After preprocessing, we build one topic model for each group. After trying different numbers of topics, we found 30 to be a suitable number for this data. While building each model, we also export a report file, which helps us understand the meaning of each category.
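To make the model-building step concrete, here is a toy LDA trainer based on collapsed Gibbs sampling, written in plain Python. It is not the implementation behind this post (a real data set of this size needs an optimized library); the corpus, the hyperparameters `alpha` and `beta`, and the function names are assumptions for illustration, and the toy uses 2 topics instead of 30.

```python
import random


def train_lda(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; returns (vocab, topic_word, doc_topic)."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    doc_topic = [[0] * num_topics for _ in docs]                # topic counts per document
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []

    def update(d, v, k, delta):
        topic_word[k][v] += delta
        doc_topic[d][k] += delta
        topic_total[k] += delta

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(num_topics) for _ in doc])
        for word, k in zip(doc, assignments[d]):
            update(d, word_id[word], k, +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                update(d, v, assignments[d][i], -1)  # take the token out
                # Conditional probability of each topic for this token (unnormalized).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                update(d, v, k, +1)  # put it back under the sampled topic
    return vocab, topic_word, doc_topic


def top_keywords(vocab, topic_word, topic, n=5):
    """The n words most often assigned to one topic."""
    ranked = sorted(range(len(vocab)), key=lambda v: topic_word[topic][v], reverse=True)
    return [vocab[v] for v in ranked[:n]]


# Tiny hypothetical corpus, already tokenized and stop-word filtered.
docs = [
    "band songs concert guitar album".split(),
    "album songs band remix playlist".split(),
    "chicken cheese cake pizza bread".split(),
    "pizza cheese chicken salad soup".split(),
]
vocab, topic_word, doc_topic = train_lda(docs, num_topics=2)
for topic in range(2):
    print(topic, "; ".join(top_keywords(vocab, topic_word, topic)))
```

The top keywords printed per topic play the same role as the "Top keywords" column of the report shown below.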
[code language=”text” collapse=”true” light=”true”]
Index, Category Name, Score, Top keywords
0,,0.04704479651699264,jesus; christian; bible; faith; english; language; israel; pope; christ; peace; quotes; prayer; quiz; muslim; catholic; ancient; holy; francis; religion; relationships; happiness; fear; spiritual; religious; jewish; christians; lord; spirit; writing; pastor; lessons; soul; wisdom; teach; teaching; worship; puritan; positive; understanding; islam; pray; meditation; healing; knowledge; speak; personality; gender; muslims; meaning; joy;
1,,0.03185554976691602,patients; autism; diabetes; mental; medicine; cannabis; awareness; marijuana; kidney; drugs; chronic; zika; symptoms; researchers; virus; patient; depression; breast; stress; diagnosed; disorder; hiv; surgery; therapy; scientists; illness; anxiety; pregnancy; effects; addiction; syndrome; prevention; cells; vaccine; condition; prevent; diseases; abuse; physical; clinical; cases; diagnosis; adults; nurse; positive; studies; cure; transplant; recovery; reduce;
2,,0.026257795466838662,card; coupon; giveaway; deals; diy; coupons; shipping; cards; amazon; pack; target; cash; print; shopping; ends; printable; cleaning; credit; bag; walmart; purchase; stores; bags; prize; toys; grab; ticket; furniture; paper; plastic; savings; reg; grocery; storage; gifts; kit; van; score; prices; freebies; bottles; sample; winners; discount; subscription; rewards; stock; laundry; bottle; bed;
3,,0.009262196936968566,snow; rain; horoscope; forecast; storm; severe; forbes; uma; wind; accurate; winds; storms; showers; não; temperatures; tornado; dos; dailyhoroscopes.net; cold; mais; por; temple; issued; veja; hail; como; aries; neonstylish; brazil; warm; você; brasil; heavy; mph; foi; mas; sobre; sunny; radar; lollipop; são; conditions; pes; seu; thunderstorms; dia; fireworks; tem; thunderstorm; mai;
4,,0.024750862921251972,traffic; flight; airport; truck; plane; train; vehicle; accident; bus; drivers; highway; bridge; closed; firefighters; chase; passengers; passenger; uber; emergency; construction; airlines; crews; injuries; pilot; parking; një; route; struck; për; transportation; roads; speed; crashed; aircraft; flying; egyptair; crew; dhe; collision; avenue; lane; incident; helicopter; rail; a.m; tree; crashes; patrol; airline; aviation;
5,,0.05279408297697491,wisconsin; governor; democratic; supreme; gov; senate; abortion; lgbt; voters; trump’s; convention; poll; gay; senator; wage; candidates; debate; supporters; republicans; laws; rally; sen; delegates; minimum; religious; transgender; mayor; nomination; politics; freedom; lawmakers; ban; legislation; legal; discrimination; committee; voting; conservative; congress; liberal; trump’s; nominee; illegal; mississippi; kasich; democrats; marijuana; colorado; protest; speech;
6,,0.0277407314481256,tattoo; tattoos; celebrities; girslviralitty; che; promhub; celebs; makeup; personality; fails; selfies; mistakes; perfectly; sexy; moms; pixpus; surely; snapchat; una; shape; newvirral; pics; targets; mirrorsdol; achieving; diply; vinesfor; fantasticpictureshindig.net; photoshop; couples; ladies; embarrassing; bollywood; selfie; celebrity; della; weird; awkward; hottest; sarcasticfacts; reveal; kissing; lolzbeans; finger; shave; pixpus.org; boobs; gorgeous; positions; viralinfotoday;
7,,0.048689002374921145,campus; museum; programs; youth; volunteer; fair; opportunities; library; teachers; veterans; registration; volunteers; marathon; organization; charity; graduate; scholarship; leadership; attend; academy; institute; fund; boston; donate; deadline; classes; scholarships; mission; teacher; management; downtown; interested; professor; valley; communities; activities; camp; degree; faculty; alumni; resources; benefit; grant; engineering; application; michigan; workshop; applications; fundraising; a.m;
8,,0.028324853243852154,isis; army; brussels; nba; islamic; kobe; warriors; syrian; nuclear; terror; navy; bryant; syria; forces; russian; suicide; terrorist; capitol; korea; marine; defense; pakistan; iraq; soldiers; europe; terrorists; muslim; terrorism; base; golden; russia; refugees; turkey; troops; veteran; weapons; curry; bomb; soldier; playoffs; israel; playoff; guard; iran; lakers; nhl; hockey; afghanistan; veterans; israeli;
9,,0.024477243608732952,manchester; liverpool; madrid; trending; premier; chelsea; barcelona; champions; soccer; galleries; england; arsenal; ronaldo; van; olympic; leicester; sport; ireland; cristiano; boss; rugby; tennis; stadium; louis; rio; goals; newcastle; messi; clash; irish; striker; dortmund; klopp; derry; tottenham; transfer; paris; gaal; euro; goal.com; injury; galleries.com; midfielder; draw; jurgen; uefa; olympics; squad; defender; utd;
10,,0.04100689443142912,dress; shoes; makeup; dresses; designer; designs; colors; shirt; wearing; tutorial; paint; ring; gorgeous; beer; flowers; diy; clothing; jewelry; pink; nail; cheap; painting; trends; vintage; interior; inspiration; rings; silver; pieces; selling; pattern; stylish; craft; shirts; styles; shopping; prom; flower; accessories; trend; clothes; jeans; glass; pair; ladies; decor; designers; leather; hairstyles; bridal;
11,,0.010626409762327756,por; una; más; este; como; esta; sus; pero; sobre; años; earthquake; aquí; mundo; vida; qué; fotos; hoy; está; todo; fue; abril; ser; nuevo; sin; cuando; puerto; todos; día; dos; cómo; muy; nueva; desde; entre; tiene; tus; han; ver; esto; mira; noticiaslocas; rico; español; hacer; casa; hay; mejor; puede; gran; uno;
12,,0.007360182141229025,allu; sarrainodu; arjun; bollywood; jabardasth; chiranjeevi; khan; sreeja; singh; kalyan; laugh; preet; rakul; anchor; skit; pawan; rofl; actress; shocked; mla; laughing; megastar; stunned; pratyusha; omg; alluarjun; comedy; rakulpreet; tollywood; shah; sardaar; krishna; blockbuster; mahesh; kapoor; gabbar; arytube.tv; telusaa; suicide; hai; murali; posters; roja; một; của; baahubali; lol; actor; srinu; unseen;
13,,0.0256093852050514,golf; bike; racing; masters; ford; tesla; merle; auto; haggard; truck; motor; speed; engine; electric; jordan; speedway; muscle; vehicles; riding; horse; nascar; motorcycle; porsche; vehicle; augusta; bikes; cycling; chevrolet; gear; mustang; toyota; drag; bmw; sport; legend; custom; mercedes; dodge; trucks; spieth; tournament; formula; auction; wheels; honda; wheel; championship; walker; ebay; races;
14,,0.040908259871850476,songs; metal; hop; hip; remix; jazz; guitar; announces; playlist; tracks; debut; bands; studio; stream; roll; blues; vinyl; punk; perform; apr; spotify; tribute; dates; roses; guns; premiere; musical; albums; label; rolling; presents; lineup; solo; musicians; soul; bass; releases; notified; rapper; feat; july; sounds; concerts; ticket; jam; producer; performing; theatre; fame; stone;
15,,0.037777952031764025,baseball; ufc; draft; ang; ncaa; championship; tournament; jones; villanova; mga; stadium; champion; broncos; men’s; quarterback; sox; madness; mike; mma; espn; houston; jon; opener; patriots; defensive; cowboys; denver; mcgregor; giants; michigan; dallas; syracuse; nba; mlb; softball; conor; cubs; ohio; cleveland; coaches; tigers; boxing; oklahoma; roster; ers; diaz; dodgers; pittsburgh; fighter; trade;
16,,0.034529473131977885,rid; remove; stomach; teeth; belly; remedies; naturally; trick; ingredients; hacks; exercises; essential; exercise; dry; period; soda; blackheads; burn; coconut; tricks; habits; reduce; surgery; stress; ingredient; prevent; juice; english; turmeric; detox; remedy; homemade; nails; vinegar; drinking; routine; pounds; lemon; oils; sleeping; pimple; salt; baking; kerala; gross; honey; nose; tea; acne; malayalam;
17,,0.032824722892401556,gif; giphy; taylor; fools; gifs; jennifer; idol; swift; nick; animated; crazyarticles; celebrities; prank; celebrity; weirdthings; joke; melissa; jimmy; amy; dancing; ryan; pranks; lopez; kiss; jackson; mtv; iheartradio; actress; lady; rihanna; ellen; carrie; miley; leonardo; kelly; iggy; dirty; justin; mccarthy; dicaprio; carpet; adele; cyrus; jokes; russell; gay; fool’s; names; acm; blake;
18,,0.0067881222758409245,yang; dan; ini; untuk; dengan; dari; kompas.com; indonesia; dalam; akan; itu; malaysia; tidak; ada; pada; bisa; tak; jakarta; ahok; saat; kuala; lumpur; sudah; orang; menjadi; tahun; lebih; hari; juga; oleh; apa; baru; tersebut; anda; satu; sebagai; mereka; adalah; anak; karena; atau; jadi; dki; banyak; masih; kami; forum; kpk; telah; dia;
19,,0.024218658533974854,batman; wars; superman; thrones; harry; comic; les; potter; comics; dawn; prince; characters; des; marvel; character; captain; pour; anime; civil; duke; ben; superhero; awakens; rogue; sur; kate; zombie; fantasy; william; affleck; finale; squad; apocalypse; dans; une; teaser; duchess; jungle; films; reviews; est; suicide; patty; lego; strange; royal; spider; cinema; scenes; cambridge;
20,,0.006384814940396441,der; und; cricket; ipl; von; das; mit; indies; den; ist; kohli; virat; für; katy; auf; ein; tal; england; perry; icc; fil; yarn; mumbai; eine; nicht; dhoni; twenty; dem; sie; wir; berlin; sich; hat; aus; des; auch; logos; bowling; dolly; bahrain; als; bei; wie; germany; malta; pakistan; chick; tvm; german; sind;
21,,0.030664112511277628,climate; solar; wildlife; scientists; plant; farm; plants; planet; species; fish; forest; nasa; hunting; farmers; conservation; river; trees; environmental; birds; ocean; zoo; organic; moon; survey; gas; giant; deer; ice; researchers; tree; bird; humans; wind; environment; earthquake; carbon; soil; farmer; waste; farming; gardening; ancient; marine; mars; japan; drought; endangered; farms; nuclear; agriculture;
22,,0.023030308249050557,kardashian; kim; wwe; jenner; fitness; fashionmangotube; workout; yoga; justin; kylie; gym; papers; selena; bieber; placement; wrestlemania; rob; bikini; omgviralentertainment; kanye; gomez; blac; kendall; butt; sexy; chyna; wrestling; khloe; strength; answers; exercise; exercises; beyonce; kourtney; celebrity; beyoncé; gigi; beckham; workouts; pics; selfie; omgviralentertainment.com; drama; pregnant; engagement; nude; ivy; teacher; trainer; moves;
23,,0.042202697013190786,prison; allegedly; jail; arrest; assault; guilty; victim; brussels; incident; trial; abuse; killing; rape; violence; victims; investigating; teacher; alleged; sentenced; criminal; robbery; stolen; convicted; armed; connection; charge; lawsuit; cops; sheriff’s; suspects; facing; deputies; custody; deputy; cop; domestic; attorney; fired; enforcement; suspected; drugs; jury; crimes; theft; kill; orleans; filed; sexually; sheriff; investigators;
24,,0.05004646169259518,theatre; comedy; podcast; actor; musical; documentary; films; writer; theater; quot; production; lee; tom; broadway; robert; bbc; richard; casting; jim; television; george; martin; thomas; museum; anniversary; williams; founder; novel; scott; drama; peter; writers; amp; guest; writing; comedian; mary; poetry; tim; steve; tony; jones; editor; exhibition; hamilton; frank; episodes; producer; tribute; actors;
25,,0.04304910064282166,china; panama; papers; billion; economic; budget; prime; economy; trade; gas; countries; chinese; crisis; africa; income; taxes; offshore; nigeria; foreign; prices; debt; insurance; growth; investment; rate; housing; commission; firm; fund; leak; saudi; union; european; singapore; banks; african; dollars; parliament; documents; finance; rates; costs; sector; stock; europe; property; cent; investors; interest; average;
26,,0.0967160919977989,pet; cats; pregnant; puppy; girlfriend; babies; sees; boyfriend; kid; strange; bed; pets; text; weird; cheating; tiny; texts; tears; married; dating; unbelievable; reaction; pit; noticed; bull; homeless; forever; shocked; abandoned; lol; shelter; sad; puppies; saved; loves; omg; sister; sleeping; awkward; newborn; parent; bear; imagine; laugh; looked; twins; poor; sick; felt; boys;
27,,0.04590357379994951,vacation; lake; mountain; wine; beer; fishing; resort; river; restaurants; adventure; cruise; parks; holiday; boat; downtown; located; outdoor; trail; hotels; luxury; pool; ocean; valley; cities; square; destination; explore; historic; village; miami; dining; location; ski; palm; flights; vegas; destinations; activities; homes; springs; spots; museum; property; colorado; views; beaches; shark; mountains; estate; ship;
28,,0.0388214965742422,chocolate; chicken; egg; coffee; cheese; breakfast; eggs; cake; pizza; cooking; meal; cream; ice; sugar; chef; salad; vegan; butter; taste; ingredients; dish; milk; bread; meat; cook; lunch; gluten; rice; cookies; fruit; protein; flavor; sauce; menu; meals; dishes; soup; tasty; dessert; banana; wine; beef; burger; coconut; pot; pasta; fish; slow; pie; potato;
29,,0.04033416703925559,iphone; marketing; android; users; apps; software; computer; microsoft; customers; smart; device; windows; smartphone; sales; tools; ios; samsung; devices; bitcoin; engineering; virtual; galaxy; fbi; platform; customer; ipad; tool; startup; systems; phones; user; screen; cloud; netflix; management; amazon; engineer; edge; htc; solutions; battery; ceo; developer; streaming; gaming; advertising; launched; strategy; privacy; developers;
[/code]
Once we have the report, we name each category. (The name expresses the meaning of the category and is shown in query results.) The top keywords tell us what each category is about: in the report above, category 0 contains documents about “Religion”, category 1 is about “Health”, category 2 is about “Trading”, and so on.
After completing all the reports, we upload them to the server and start the service.
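Filling in the empty "Category Name" column can be done by hand or with a small script. The sketch below assumes the report is the comma-separated text shown above; the rows are abridged from that report (scores rounded, keywords truncated), and the function name `name_categories` is hypothetical.

```python
import csv
import io


def name_categories(report_text, names):
    """Fill the 'Category Name' column using an index -> name mapping."""
    reader = csv.reader(io.StringIO(report_text), skipinitialspace=True)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(next(reader))  # copy the header row
    for index, _old_name, score, keywords in reader:
        # Categories without a chosen name keep an empty name.
        writer.writerow([index, names.get(int(index), ""), score, keywords])
    return out.getvalue()


# Rows abridged from the report above.
report = (
    "Index, Category Name, Score, Top keywords\n"
    "0,,0.047,jesus; christian; bible; faith;\n"
    "1,,0.032,patients; autism; diabetes; mental;\n"
    "2,,0.026,card; coupon; giveaway; deals;\n"
)
names = {0: "Religion", 1: "Health", 2: "Trading"}
print(name_categories(report, names))
```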
Query
We built a service that takes a document, such as a post on a social network, and returns its category from each model together with a probability.
Below is a demo query:
[code language=”text” collapse=”true” light=”true”]
INPUT INFO:
Text: Lonestar, Where are you out tonight? This feeling I’m trying to fight It’s dark and I think that I would give anything For you to shine down on me How far you are? I just don’t know..
PREDICTED TOPICS:
Model: Type: meta, media type: ALL:
* Topic: 21 – with probability 71.04%
Represented words: band, songs, dance, listen, concert, singer, wwe, pop, hall, hop, hip, theatre, metal
Model: Type: meta, media type: 4:
* Topic: 24 – with probability 75.54%
Represented words: listen, hip, hop, songs, download, mix, band, remix, records, pop, dance, playlist, featuring
Model: Type: text, media type: ALL:
* Topic: 19 – with probability 55.60%
Represented words: vip, downtown, bunny, drinks, christmas, admission, brunch, booth, craft, pizza, brewing, irish, menu
Model: Type: text, media type: 1:
* Topic: 17 – with probability 27.06%
Represented words: racing, horse, wwe, champion, track, pro, golf, championship, speedway, bike, horses, wrestlemania, fight
Model: Type: text, media type: 2:
* Topic: 26 – with probability 59.51%
Represented words: i’m, lol, going, you’re, fun, guys, funny, wow, videos, credit, awesome, work, friends
Model: Type: text, media type: 4:
* Topic: 13 – with probability 40.16%
Represented words: album, band, song, rock, theatre, stage, radio, artist, performance, concert, release, featuring, david
Model: Type: text, media type: 5:
* Topic: 7 – with probability 54.94%
Represented words: event, join, friends, great, it’s, going, check, time, april, night, we’re, coming, weekend
Model: Type: full, media type: ALL:
* Topic: 16 – with probability 36.52%
Represented words: auto, wheels, motor, speedway, engine, tesla, wheel, bikes, mustang, vehicles, electric, toyota, motorcycle
* Topic: 7 – with probability 42.21%
Represented words: restaurants, marathon, monster, dining, drinks, brunch, expo, admission, cafe, brewing, booth, parking, chef
Model: Type: full, media type: 1:
* Topic: 14 – with probability 31.67%
Represented words: movie, batman, superman, wwe, cosplay, character, disney, wars, movies, episode, comic, amazon, dragon
Model: Type: full, media type: 2:
* Topic: 14 – with probability 69.20%
Represented words: i’m, lol, going, you’re, credit, funny, videos, wow, things, man, guys, baby, cute
Model: Type: full, media type: 4:
* Topic: 13 – with probability 54.71%
Represented words: golf, bike, racing, masters, ford, tesla, merle, auto, haggard, truck, motor, speed, engine
Model: Type: full, media type: 5:
* Topic: 27 – with probability 54.08%
Represented words: tonight, tickets, night, event, it’s, we’re, tomorrow, coming, saturday, going, time, great, party
[/code]
The input is part of the lyrics of “Lonestar” by Norah Jones, an American singer-songwriter. As you can see, both meta models assign it to a music topic with high probability (71.04% and 75.54%).
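The selection step behind such a result can be sketched as below, assuming each model has already produced a topic distribution for the input text. The distributions, the 0.25 cutoff and the function name `predict_topics` are hypothetical; the post does not state how the service picks which topics to report.

```python
def predict_topics(distributions, min_probability=0.25):
    """For each model, keep topics whose probability reaches min_probability."""
    results = {}
    for model_name, topic_probs in distributions.items():
        kept = [(t, p) for t, p in enumerate(topic_probs) if p >= min_probability]
        # Most probable topic first.
        results[model_name] = sorted(kept, key=lambda tp: tp[1], reverse=True)
    return results


# Hypothetical 4-topic distributions for one input text.
distributions = {
    "meta-ALL": [0.05, 0.71, 0.14, 0.10],
    "full-ALL": [0.37, 0.42, 0.11, 0.10],
}
for model_name, topics in predict_topics(distributions).items():
    for topic, probability in topics:
        print(f"{model_name}: topic {topic} with probability {probability:.2%}")
```

A cutoff rather than a plain argmax lets a model report more than one topic when the input sits between two of them.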
The service also performs well: 1M queries took 4,534,418 ms, which is about 220 rps (requests per second), with a latency of 54 ms. (Each query is one input text, evaluated against the 12 existing models.)
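As a quick sanity check, the throughput figure follows directly from the total time. The in-flight estimate uses Little's law and assumes that the 54 ms is the mean per-request latency under steady load, which the measurement above does not state explicitly.

```python
total_queries = 1_000_000
elapsed_ms = 4_534_418
latency_ms = 54

# Throughput = number of queries / elapsed time in seconds.
throughput_rps = total_queries / (elapsed_ms / 1000)

# Little's law: average requests in flight = throughput x latency.
in_flight = throughput_rps * (latency_ms / 1000)

print(f"{throughput_rps:.1f} rps, about {in_flight:.1f} requests in flight")
```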
In summary, the steps are:
- Preprocessing
- Building topic models with LDA
- Naming categories (optional)
- Running the service to classify documents
That’s all for today. We showed how to apply topic modeling with LDA to categorize documents.
In fact, this service is being used by vimp.co to categorize its data and help improve its search engine.