Thursday, March 24, 2011

Dead Simple Data Mining with Data Science Toolkit - This is a cool tool!

The tools included at this time are:

  • Street Address to Coordinates - Calculates the latitude/longitude coordinates for a postal address.
  • File to Text - Converts PDFs, Word documents and Excel spreadsheets to text. Recovers text from JPEG, PNG or TIFF images of scanned documents.
  • Coordinates to Political Areas - Returns the country, region, state, county, constituencies and neighborhood a point is inside.
  • Geodict - Pulls country, city and region names from unstructured English text and returns their coordinates.
  • IP Address to Coordinates - Calculates the country, state, city and latitude/longitude coordinates for IP addresses.
  • Text to Sentences - Removes any parts of the text that look like boilerplate instead of real sentences.
  • HTML to Text - Returns the full text that would actually be displayed in the browser when an HTML document is rendered.
  • HTML to Story - Takes an HTML document representing a news article or similar page, and extracts just the story text.
  • Text to People - Spots text fragments that look like people's names or titles, and guesses their gender where possible.

You can learn about the sources of these tools here.

According to Pete, "It's essentially a specialized Linux distribution, with a lot of useful data software pre-installed and exposing a simple interface."

If you want to do intensive data mining, you'll probably want your own server. The Data Science Toolkit is available as either a VMware machine or an Amazon EC2 image. You can find out more about this here. Alternatively, you can find the source on GitHub.
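
To give a feel for the "simple interface", here is a rough sketch of how you might build a request URL for one of the tools on your own server. It assumes each tool is reachable as a plain HTTP GET of the form base URL, then endpoint name, then the URL-encoded input. The endpoint names (street2coordinates, ip2coordinates), the base URL http://localhost:8080 and the helper build_url are all illustrative assumptions, so check them against the documentation for the server you actually run. The sketch only builds the URL and does not contact any server.

```python
from urllib.parse import quote

# Assumed endpoint names, for illustration only.
# Verify them against the documentation of your own server.
ENDPOINTS = {
    "Street Address to Coordinates": "street2coordinates",
    "IP Address to Coordinates": "ip2coordinates",
}


def build_url(base_url, endpoint, value):
    """Return a GET URL shaped like <base_url>/<endpoint>/<url-encoded value>.

    build_url is a hypothetical helper, not part of the toolkit itself.
    """
    # Strip any trailing slash so the pieces join with exactly one "/".
    base = base_url.rstrip("/")
    # safe="" makes quote() escape "/" as well, so the input stays
    # inside a single path segment.
    return "{}/{}/{}".format(base, endpoint, quote(value, safe=""))


# Hypothetical base URL for a toolkit server running on your own machine.
BASE_URL = "http://localhost:8080"

address_url = build_url(
    BASE_URL, ENDPOINTS["Street Address to Coordinates"], "12 Main St"
)
ip_url = build_url(BASE_URL, ENDPOINTS["IP Address to Coordinates"], "8.8.8.8")

print(address_url)
print(ip_url)
```

From there you could fetch the URL with any HTTP client and parse whatever the server sends back. The response format is not covered here, so inspect a real response before relying on any field names.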

