On Sat, Sep 12, 2009 at 12:51 AM, Manch <[address removed]> wrote:
> Thanks Ben. I think I may have seen StockMood in one of the NewTech meetups.
> If not StockMood then something very similar.
>
> Can you comment on how difficult it is to build your systems? The 2 systems
> you mentioned are interesting in different aspects: financial news has a
> relatively narrow domain. On the other hand, tweets are much shorter and
> much simpler in semantics.
Building the systems, hmm...
There's a scalable, properly-structured database aspect, then a
"getting the data in real-time" aspect ... those are standard but not
necessarily trivial...
You can use standard algorithms like SVM or GP for the actual classification...
You need to build a training corpus, by having experts or Mechanical
Turkers mark up some texts with sentiment ratings...
Then the real creativity comes in how you map texts into feature
vectors. Pure statistical word-frequency TF-IDF stuff? Tagging
sentences, extracting common noun phrases, etc.? This is the part that
can be quick or time-consuming depending on how good you need the
results to be...
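As a rough illustration of the plain word-frequency end of that range, here is a minimal TF-IDF sketch in Python. The function names, the crude tokenizer, and the sample headlines are all made up for illustration; this is not the code from either system, and a real pipeline would likely add stemming, stop-word handling, phrase features, and so on.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (deliberately crude)."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Map each document to a {term: weight} dict using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents each term appears in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical sample texts, invented for this example.
docs = [
    "Shares surge after strong earnings",
    "Shares plunge after weak earnings",
    "Company announces new product",
]
vecs = tfidf_vectors(docs)
```

Terms that show up in fewer documents ("surge") end up weighted higher than terms shared across documents ("shares"), which is the whole point of the IDF factor. The resulting sparse dicts are what you would then hand to the classifier.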
And text preprocessing. Do you separate headlines from article text?
How do you weight each? etc. Lots of small domain-dependent
decisions...
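One of those small decisions, sketched in Python: count terms in the headline and the body separately, then combine them with different weights. The 3:1 weighting, the function names, and the sample article are assumptions picked for illustration only; the right values are exactly the kind of thing you would tune per domain.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (deliberately crude)."""
    return re.findall(r"[a-z']+", text.lower())


def weighted_counts(headline, body, headline_weight=3.0, body_weight=1.0):
    """Combine headline and body term counts, scaling each by its weight."""
    combined = Counter()
    for term, n in Counter(tokenize(headline)).items():
        combined[term] += headline_weight * n
    for term, n in Counter(tokenize(body)).items():
        combined[term] += body_weight * n
    return dict(combined)


# Hypothetical article, invented for this example.
features = weighted_counts(
    "Acme profits soar",
    "Acme reported that quarterly profits rose sharply.",
)
```

Here a word appearing once in the headline counts as much as three appearances in the body, so "soar" outweighs "rose" even though each occurs only once.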
So, it's a standard datafeed-based, DB-backed Web product, with
integration of some existing machine learning code, plus an open-ended
amount of work on statistical and linguistics-based text processing...
To get good results ... anywhere from a few man-months up to infinity ;-)
ben g