algorithm - Contextual Search: Classifying shopping products -


I have a new (non-traditional) task from my client that involves machine learning. I have never worked with machine learning, apart from a little data mining, so I need help.

My task is to classify the products on a shopping site by gender (whom the product is intended for), age group, etc. The training data can include the product's title, its keywords (available in the HTML of the product page), and the product description.

I did a lot of R&D. I found image recognition APIs (CloudSight, VuFind) that returned details about the product image, but they did not fulfil my need. I also used Google suggestqueries and searched through many machine learning algorithms, and finally...

I came to know about the "decision tree learning" algorithm, but I cannot figure out how it applies to my problem. I tried the "playingtennis" dataset but couldn't make sense of what to do with it.

Can you give me some direction to start this journey? Should I focus on the decision tree learning algorithm, or is there another algorithm you would suggest for categorizing products on the basis of context?

If needed, I can share in detail the things I searched for to solve this problem.

I suggest the following:

  1. Go through the items in your dataset and classify them manually (decide which gender each item is for). Store each decision so that you are able to link each item in the original dataset with its target class.
  2. Develop an algorithm for converting each item in your dataset into a feature vector. This algorithm should be able to convert each item in the original dataset into a vector of numbers (more on how to do this later).
  3. Convert your dataset, together with the appropriate classes, into a dataset like this:

feature_1, feature_2, feature_3, ..., gender

value_1, value_2, value_3, ..., male

It is a good decision to store it in a CSV file, since you will be able to load and process it in different machine learning tools (more on that later).
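As a rough illustration of that CSV layout, here is a minimal Python sketch using only the standard library. The column names (`has_heels`, `price`, `is_toy`) and the two rows are made up for the example, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Hypothetical feature vectors; the last column is the target class.
header = ["has_heels", "price", "is_toy", "gender"]
rows = [
    [1, 59.0, 0, "female"],
    [0, 24.5, 0, "male"],
]

# Write the dataset as CSV (an in-memory buffer stands in for a file on disk).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

# Read it back and split every row into a numeric feature vector and a label.
buffer.seek(0)
features, labels = [], []
for record in csv.DictReader(buffer):
    labels.append(record.pop("gender"))
    features.append([float(record[name]) for name in header[:-1]])

print(features)
print(labels)
```

Keeping the class in the last column is only a convention, but many tools assume it by default.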

  4. Load the dataset you've created at step 3 into the machine learning tool of your choice and try to come up with the best model that can classify the items in your dataset by gender.

  5. Store the model created at step 4. It will be part of your production system.

  6. Develop production code that can take an unclassified product, create a feature vector out of it, and pass this feature vector to the model you've saved at step 5. The result of this operation should be a predicted gender.
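The last two steps (store the model, then reuse it in production) could look roughly like the sketch below. Everything here is an assumption made for illustration: the "model" is a hand-written set of weights rather than something a learning algorithm produced, `to_feature_vector` and `predict` are hypothetical helpers, and `pickle` is just one way to persist a model in Python.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: hand-picked weights of a tiny linear scorer.
# A real model would come out of the training step instead.
model = {"weights": [2.0, -1.5], "bias": -0.25, "classes": ["male", "female"]}


def to_feature_vector(product):
    """Hypothetical feature extraction: two boolean features from the title."""
    words = product["title"].lower().split()
    return [1.0 if "dress" in words else 0.0, 1.0 if "tie" in words else 0.0]


def predict(model, vector):
    """Return the second class if the weighted score is positive, else the first."""
    score = model["bias"] + sum(w * x for w, x in zip(model["weights"], vector))
    return model["classes"][1] if score > 0 else model["classes"][0]


# Store the model once, then load it again the way production code would.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "gender_model.pkl")
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
    with open(path, "rb") as handle:
        loaded_model = pickle.load(handle)

prediction = predict(loaded_model, to_feature_vector({"title": "Summer Dress"}))
print(prediction)
```

The important point is that production code must build feature vectors with exactly the same logic that was used for the training data.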

Details

If there are many items (say, tens of thousands) in your original dataset, it may be impractical to classify them all yourself. You can use Amazon Mechanical Turk to simplify the task. If you are unable to use it (the last time I checked, you had to have a USA address to use it), you can classify a few hundred items to start working on your model, and then classify the rest to improve the accuracy of your classification (the more training data you use, the better the accuracy, up to a certain point).

How to extract features from the dataset

If a keyword has the form tag=true/false, it's a boolean feature. If a keyword has the form tag=42, it's a numerical or an ordinal one. For example, it can be a price value or a price range (0-10, 10-50, 50-100, etc.). If a keyword has the form tag=string_value, you can convert it into a categorical value. The class (gender) is simply a boolean value 0/1. You can experiment a bit with how you extract your features, since it may influence the resulting accuracy.
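A small Python sketch of these three kinds of conversion is shown below. The tag names (`waterproof`, `price`, `material`) and the list of known materials are invented for the example; real keywords from a product page will differ.

```python
def price_bucket(price, edges=(10, 50, 100)):
    """Map a price to the index of its range: 0-10, 10-50, 50-100, 100+."""
    for index, upper in enumerate(edges):
        if price < upper:
            return index
    return len(edges)


def one_hot(value, vocabulary):
    """Encode a categorical value as one 0/1 column per known category."""
    return [1 if value == known else 0 for known in vocabulary]


def keywords_to_vector(keywords, materials=("cotton", "leather", "silk")):
    """Turn a dict of tag=value keywords into a list of numbers."""
    vector = []
    # tag=true/false becomes a boolean feature.
    vector.append(1 if keywords.get("waterproof") == "true" else 0)
    # tag=42 becomes an ordinal feature (here: a price range index).
    vector.append(price_bucket(float(keywords.get("price", 0))))
    # tag=string_value becomes a categorical (one-hot) feature.
    vector.extend(one_hot(keywords.get("material"), materials))
    return vector


example = {"waterproof": "true", "price": "42", "material": "leather"}
print(keywords_to_vector(example))  # [1, 1, 0, 1, 0]
```

Whether to keep the raw price, the price range, or both is exactly the kind of choice worth experimenting with.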

How to extract features from the product description

There are different ways to convert text into a feature vector. Look at TF-IDF algorithms or similar.
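To make TF-IDF less abstract, here is a bare-bones Python version. It assumes whitespace tokenization and the plain `tf * log(N / df)` weighting; libraries usually add smoothing and normalization on top of this, and the three product descriptions are invented.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid dividing by zero for empty text
        vectors.append([counts[term] / length * idf[term] for term in vocabulary])
    return vocabulary, vectors


documents = [
    "red dress for women",
    "leather wallet for men",
    "red scarf for women",
]
vocabulary, vectors = tfidf_vectors(documents)
for term, weight in zip(vocabulary, vectors[0]):
    print(term, round(weight, 3))
```

A word such as "for", which occurs in every description, gets weight zero, while rarer words such as "dress" get the highest weights.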

Machine learning tools

You can use one of the existing machine learning libraries and hack together some code that loads your CSV dataset, trains a model and checks the accuracy, but at first I suggest using Weka. It has a more or less intuitive UI, and you can quickly start to experiment with different machine learning algorithms, convert different features in your dataset from strings to categories, or from real values to ordinal values, etc. The good thing about Weka is that it also has a Java API, so you can automate the process of data conversion, train models programmatically, etc.

What algorithms to choose

I suggest using decision tree algorithms such as C4.5. It's fast and shows good results on a wide range of machine learning tasks. Additionally, you can use an ensemble of classifiers. There are various algorithms that can combine several algorithms (google boosting or random forest to find out more) to give better results, but they work more slowly (since you need to run a single feature vector through several algorithms).
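If it is unclear how a decision tree decides which feature to split on, the sketch below may help. It computes entropy and the gain ratio, which is the criterion C4.5 uses to rank candidate splits on categorical features. It is not a full C4.5 implementation (no tree building, pruning, or numeric thresholds), and the two features, `has_heels` and `on_sale`, are made up.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(values).values()
    )


def gain_ratio(feature_values, labels):
    """Information gain of splitting on a feature, divided by its split info."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels that remains after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    information_gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return information_gain / split_info if split_info > 0 else 0.0


labels = ["female", "female", "male", "male"]
has_heels = [1, 1, 0, 0]  # separates the classes perfectly
on_sale = [1, 0, 1, 0]    # tells nothing about the class

print(gain_ratio(has_heels, labels))
print(gain_ratio(on_sale, labels))
```

The tree would split on `has_heels` first, because it has the higher gain ratio, and then repeat the same procedure on each resulting subset.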

One more trick you can use to make your algorithm more accurate is to use models that work on different sets of features (say, one algorithm uses features extracted from tags and another uses data extracted from the product description). You can then combine them using algorithms like stacking to come up with a final result.

For classification on the basis of features extracted from text, you can try the Naive Bayes algorithm or SVM. Both show good results in text classification.
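For a feel of how Naive Bayes works on text, here is a compact multinomial Naive Bayes classifier with add-one (Laplace) smoothing, written with the standard library only. The class name `NaiveBayesText` and the four training titles are invented for the example; in practice you would most likely use an existing library implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over whitespace-separated words."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            # Add-one smoothing keeps unseen words from zeroing the product.
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(doc_count / total_docs)  # log prior
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


texts = [
    "red dress for women",
    "leather wallet for men",
    "silk scarf for women",
    "men's running shoes",
]
labels = ["female", "male", "female", "male"]

classifier = NaiveBayesText().fit(texts, labels)
print(classifier.predict("silk dress"))
print(classifier.predict("leather shoes"))
```

Log-probabilities are summed instead of multiplying raw probabilities, which avoids numeric underflow on longer descriptions.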

