A commercial quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005.
The eggdeps tool reports dependencies between the eggs in the working set. Dependencies are considered recursively, creating a directed graph. This graph is printed to standard output either as plain text or as an input file to the graphviz tools.

Usage::

    eggdeps [options] [specifications]

Specifications must follow the usual syntax for specifying distributions of Python packages, as defined by pkg_resources.

* If any specifications are given, the corresponding distributions will make up the roots of the dependency graph, and the graph will be restricted to their dependencies.
* If no specifications are given, the graph will map the possible dependencies between all eggs in the working set, and its roots will be those distributions that aren't dependencies of any other distributions.

Options:

-h, --help
    show this help message and exit
-i IGNORE, --ignore=IGNORE
    project names to ignore
-I RE_IGNORE, --re-ignore=RE_IGNORE
    regular expression for project names to ignore
-e DEAD_ENDS, --dead-end=DEAD_ENDS
    names of projects whose dependencies to ignore
-E RE_DEAD_ENDS, --re-dead-end=RE_DEAD_ENDS
    regular expression for project names whose dependencies to ignore
-x, --no-extras
    always omit extra dependencies
-n, --version-numbers
    print version numbers of active distributions
-1, --once
    in plain text output, include each distribution only once
-t, --terse
    in plain text output, omit any hints at unprinted distributions, such as ellipses
-d, --dot
    produce a dot graph
-c, --cluster
    in a dot graph, cluster direct dependencies of each root distribution
-r, --requirements
    produce a requirements list
-s, --version-specs
    in a requirements list, print loosest possible version specifications

The -i, -I, -e, and -E options may occur multiple times. If both the -d and -r options are given, the one listed last wins. When printing requirements lists, -v wins over -s.
The script entry point recognizes default values for all options, the variable names being the long option names with any dashes replaced by underscores (except for --no-extras, which translates to setting extras=False). This allows for setting defaults using the arguments option of the egg recipe in a buildout configuration, for example.

Details
~~~~~~~

The goal of eggdeps is to compute a directed dependency graph with nodes that represent egg distributions from the working set, and edges that represent either mandatory or extra dependencies between the eggs.

Working set
~~~~~~~~~~~

The working set eggdeps operates on is defined by the egg distributions available to the running Python interpreter. For example, these may be the distributions activated by easy_install or installed in a zc.buildout environment.

If the graph is to be calculated to such specifications that not all required distributions are in the working set, the missing ones will be marked in the output, and their dependencies cannot be determined. The same happens if any distribution that is either specified on the command line or required by any other distribution is available in the working set, but at a version incompatible with the specified requirement.

Graph building strategies
~~~~~~~~~~~~~~~~~~~~~~~~~

The dependency graph may be built following either of two strategies:

Analysing the whole working set:
    Nodes correspond exactly to the distributions in the working set. Edges corresponding to all conceivable dependencies between any active distributions are included, but only if the required distribution is active at the correct version. The roots of the graph correspond to those distributions no other active distribution depends upon.

Starting from one or more eggs:
    Nodes include all packages depended upon by the specified distributions and extras, as well as their deep dependencies.
They may cover only part of the working set, and may include nodes for distributions that are not active at the required versions or not active at all (so their dependencies cannot be followed). The roots of the graph correspond to the specified distributions.

Some information will be lost while building the graph:

* If a dependency occurs both mandatorily and by way of one or more extras, it will be recorded as a plain mandatory dependency.
* If a distribution A with installed extras is a dependency of multiple other distributions, they will all appear to depend on A with all its required extras, even if they individually require none or only a few of them.

Reducing the graph
~~~~~~~~~~~~~~~~~~

In order to reduce an otherwise big and tangled dependency graph, certain nodes and edges may be omitted.

Ignored nodes:
    Nodes may be ignored completely by exact name or regular expression matching. This is useful if a very basic distribution is a dependency of a lot of others; an example might be setuptools.

Dead ends:
    Distributions may be declared dead ends by exact name or regular expression matching. Dead ends are included in the graph, but their own dependencies will be ignored. This allows large subsystems of distributions to be blotted out except for their "entry points". As an example, one might declare the zope.app.* packages dead ends in the context of zope.* packages.

No extras:
    Reporting and following extra dependencies may be switched off completely. This will probably make most sense when analysing the working set rather than the dependencies of specified distributions.

Output
~~~~~~

There are two ways eggdeps can output the computed dependency graph: plain text (the default) and a dot file to be fed to the graphviz tools.

Plain text output
~~~~~~~~~~~~~~~~~

The graph is printed to standard output essentially one node per line, indented according to nesting depth, and annotated where appropriate.
The dependencies of each node are sorted according to the following criteria:

* Mandatory dependencies are printed before extra requirements.
* Dependencies of each set of extras are grouped, the groups being sorted alphabetically by the names of the extras.
* Dependencies which are either all mandatory or by way of the same set of extras are sorted alphabetically by name.

As an illustrative example, the following dependency graph was computed for two Zope packages, one of them required with a "test" extra depending on an uninstalled egg, and some graph reduction applied::

    zope.annotation
        zope.app.container *
        zope.component
            zope.deferredimport
                zope.proxy
            zope.deprecation
            zope.event
        zope.dublincore
            zope.annotation ...
      [test]
        (zope.app.testing) *

Brackets []:
    If one or more dependencies of a node are due to extra requirements only, the names of those extras are printed in square brackets above their dependencies, half-indented relative to the node which requires them.

Ellipsis ...:
    If a node with further dependencies occurs at several places in the graph, the subgraph is printed only once, the other occurrences being marked by an ellipsis. The place where the subgraph is printed is chosen such that

    * extra dependencies occur as late as possible in the path, if at all,
    * shallow nesting is preferred,
    * paths early in the alphabet are preferred.

Parentheses ():
    If a distribution is not in the working set, its name is parenthesised.

Asterisk *:
    Dead ends are marked by an asterisk.

Dot file output
~~~~~~~~~~~~~~~

In a dot graph, nodes and edges are not annotated with text but colored. These are the color codes for nodes, later ones overriding earlier ones in cases where more than one color is appropriate:

Green:
    Nodes corresponding to the roots of the graph.
Yellow:
    Direct dependencies of any root nodes, whether mandatory or through extras.
Lightgrey:
    Dead ends.
Red:
    Nodes for eggs installed at a version incompatible with some requirement, or not installed at all.
Edge colors:

Black:
    Mandatory dependencies.
Lightgrey:
    Extra dependencies.

Other than being highlighted by color, root nodes and their direct dependencies may be clustered. eggdeps tries to put each root node in its own cluster; however, if two or more root nodes share any direct dependencies, they will share a cluster as well.

Requirements list
~~~~~~~~~~~~~~~~~

All the distributions included in the graph may be output as the Python representation of a list of requirement specifications, either

* listing bare package names,
* including the exact versions as they occur in the working set, or
* specifying complex version requirements that take into account all version requirements made for the distribution in question (but disregard extras completely for the time being).

Complex version requirements always require at least the version that occurs in the working set: we cannot know the version requirements of past versions, but may reasonably assume that requirements stay the same for future versions. The list is sorted alphabetically by distribution name.
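The three flavors of requirements list described above can be sketched as Python lists. The package names and versions below are made up for illustration, not actual eggdeps output:

```python
# Bare package names:
bare = ['zope.annotation', 'zope.component', 'zope.dublincore']

# Exact versions as they occur in the working set (-n):
exact = ['zope.annotation==3.5.0', 'zope.component==3.9.5']

# Loosest possible specifications (-s): at least the working-set version.
loose = ['zope.annotation>=3.5.0', 'zope.component>=3.9.5']

# The list is sorted alphabetically by distribution name.
assert bare == sorted(bare)
```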
The transmogrify.dexterity package provides a transmogrifier pipeline section for updating field values of dexterity content objects. The blueprint name is transmogrify.dexterity.schemaupdater.

The schemaupdater section needs at least the path to the object to update. Paths to objects are always interpreted as being relative to the context. Any writable field whose id matches a key in the current item will be updated with the corresponding value.

Fields that do not get a value from the pipeline are initialized with their default value or get a missing_value marker. This functionality will be moved into a separate constructor pipeline...

The schemaupdater section can also handle fields defined in behaviors.
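A minimal pipeline section using this blueprint might look as follows. The blueprint name is taken from the text above; the section name and pipeline layout are illustrative assumptions, not from the package's documentation:

```ini
[transmogrifier]
pipeline =
    schemaupdater

[schemaupdater]
; hypothetical section name; only the blueprint id is from the text above
blueprint = transmogrify.dexterity.schemaupdater
```

Items flowing through this section would carry the path of the object to update (relative to the context) plus keys matching the writable fields to set.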
This Transmogrifier blueprint extracts text from within the element with the specified CSS id.
Transmogrifier source for reading files from the filesystem.

This package provides a Transmogrifier data source for reading files, images and directories from the filesystem. The output format is geared towards constructing Plone File, Image or Folder content. It is also possible to add arbitrary metadata (such as titles and descriptions) to the content items, by providing these in a separate CSV file.
Helpful transmogrifier blueprints to extract text or html out of html content.

transmogrify.htmlcontentextractor.auto
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This blueprint has a clustering algorithm that tries to automatically extract the content from the HTML template. This is slow and not always effective; often you will need to input your own template extraction rules.

In addition to extracting the Title, Description and Text of items, the blueprint will output the rules it generates to a logger with the same name as the blueprint. Setting debug mode on templateauto will give you details about the rules it uses::

    ...
    DEBUG:templateauto:'icft.html' discovered rules by clustering on 'http://...'
    Rules:
        text= html //div[@id = "dal_content"]//div[@class = "content"]//p
        title= text //div[@id = "dal_content"]//div[@class = "content"]//h3
    Text:
        TITLE: ...
        MAIN-10: ...
        MAIN-10: ...
        MAIN-10: ...

Options
-------

condition
    TAL Expression to control use of this blueprint
debug
    default is ''

transmogrify.htmlcontentextractor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This blueprint extracts the title, description and body from html, either via XPath, TAL, or by automatic cluster analysis.

Rules are of the form ::

    (title|description|text|anything) = (text|html|optional|tal) Expression

where the expression is either TAL or XPath. For example ::

    [template1]
    blueprint = transmogrify.htmlcontentextractor
    title = text //div[@class='body']//h1[1]
    _delete1 = optional //div[@class='body']//a[@class='headerlink']
    _delete2 = optional //div[contains(@class,'admonition-description')]
    description = text //div[contains(@class,'admonition-description')]//p[@class='last']
    text = html //div[@class='body']

Note that for a single template, e.g. template1, ALL of the XPaths need to match, otherwise that template will be skipped and the next template tried.
If you'd like a single XPath not to be necessary for the template to match, use the keyword `optional` or `optionaltext` instead of `text` or `html` before the XPath.

When an XPath is applied within a single template, the HTML it matches will be removed from the page, so another rule in that same template can't match the same HTML fragment. If a content part is not useful (e.g. redundant text, title or description), this is a way to effectively remove that HTML from the content.

To help debug your template rules you can set debug mode. For more information about XPath see

- http://www.w3schools.com/xpath/default.asp
- http://blog.browsermob.com/2009/04/test-your-selenium-xpath-easily-with-firebug/

HTMLContentExtractor
====================

This blueprint extracts fields from html, either via XPath rules or by automatic cluster analysis.

transmogrify.htmlcontentextractor
---------------------------------

You can define a series of rules which will be applied to the '_text' of the input item. Each rule uses an XPath expression or a TAL expression to extract html or text out of the html, and adds the result as a key to the outputted item. Each option of the blueprint is a rule of the following form ::

    (N-)field = (optional)(text|html|delete|optional) xpath

OR ::

    (N-)field = (optional)tal tal-expression

"field" is the attribute that will be set with the results of the xpath. "format" is what to do with the results of the xpath; "optional" means the same as "delete" but won't cause the group to not match. If the format is delete or optional, the field name doesn't matter, but it still needs to be unique. "xpath" is an XPath expression.

If the format is 'tal' then instead of an XPath you can use a TAL expression. The TAL expression is evaluated on the item object AFTER the XPath expressions have been applied.
For example ::

    [template]
    blueprint = transmogrify.htmlcontentextractor
    title = text //div[@class='body']//h1[1]
    _permalink = text //div[@class='body']//a[@class='headerlink']
    _text = html //div[@class='body']
    _label = optional //p[contains(@class,'admonition-title')]
    description = optional //div[contains(@class,'admonition-description')]/p[@class='last']/text()
    _remove_useless_links = optional //div[@id = 'indices-and-tables']
    mimetype = tal string:text/html
    text = tal python:item['_text'].replace('id="blah"','')

You can delete a number of parts of the html by extracting content to fields such as _permalink and _label. These items won't be used to set any properties on the final content, so they are effective as a means of deleting parts of the html. TAL expressions are evaluated after XPath expressions, so we can post-process the _text XPath result to produce text stripped of a certain id.

N is the group number. Groups are run in order of group number. If any rule doesn't match (unless it's marked optional), the next group will be tried instead. Group numbers are optional. Instead of groups you can also chain several blueprints together.

The blueprint will set '_template' on the item. If another blueprint finds the '_template' key in an item, it will ignore that item. The '_template' field is the remainder of the html once all the XPath expressions have been applied.

transmogrify.htmlcontentextractor.auto
--------------------------------------

This blueprint will analyse the html and attempt to discover the rules to extract the title, description and body of the html. If the logger output is in DEBUG mode, the XPaths used by the auto extractor will be output to the logger.
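Since the auto blueprint logs its discovered rules to a logger named after the blueprint, one way to see them is to raise that logger's level to DEBUG before running the pipeline. A minimal sketch, assuming the blueprint (and hence the logger) is named templateauto as in the example output above:

```python
import logging

# Route log records to stderr in the DEBUG:<name>:<message> style shown above.
logging.basicConfig(format='%(levelname)s:%(name)s:%(message)s')

# The logger shares its name with the blueprint section; 'templateauto' is
# an assumption matching the example output, not a fixed name.
logger = logging.getLogger('templateauto')
logger.setLevel(logging.DEBUG)

assert logger.isEnabledFor(logging.DEBUG)
```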
transmogrify.pathsorter is a blueprint for reordering items into tree-sorted order.
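Tree-sorted order means each parent precedes its children and subtrees stay contiguous. The idea can be sketched as follows; this is an illustration of the concept, not the package's actual implementation, and the '_path' key is only the usual transmogrifier convention:

```python
# Hypothetical pipeline items carrying the conventional '_path' key.
items = [
    {'_path': 'a/b/c'},
    {'_path': 'b'},
    {'_path': 'a'},
    {'_path': 'a/b'},
]

# Sorting by the tuple of path segments yields tree order: a folder sorts
# before its contents, and each subtree stays together.
items.sort(key=lambda item: item['_path'].split('/'))

assert [i['_path'] for i in items] == ['a', 'a/b', 'a/b/c', 'b']
```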
transmogrifier.ploneremote is a package of transmogrifier blueprints for uploading content to a Plone site via the Zope XML-RPC API. The Plone site does not need any modifications; only vanilla Zope XML-RPC is used.
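"Vanilla Zope XML-RPC" means the target site can be addressed with any standard XML-RPC client. A minimal sketch using Python's standard library; the URL and credentials are placeholders, and constructing the proxy performs no network I/O until a method is actually called:

```python
import xmlrpc.client

# Hypothetical portal URL with inline basic-auth credentials; adjust for
# your own Zope/Plone instance.
portal_url = 'http://admin:secret@localhost:8080/Plone/'

# Method calls on the proxy (e.g. proxy.some_method()) would be sent to
# Zope as XML-RPC requests; creating it does not connect.
proxy = xmlrpc.client.ServerProxy(portal_url)

assert isinstance(proxy, xmlrpc.client.ServerProxy)
```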
Note: As of version 1.3, Transmogrifier provides a similar feature via a blueprint called collective.transmogrifier.sections.logger.

This Transmogrifier blueprint is based on collective.transmogrifier.sections.tests.PrettyPrinter, which anyone can use in their project by creating a utility like so::

    <utility component="collective.transmogrifier.sections.tests.PrettyPrinter"
             name="print" />

Then adding a section to your pipeline like so::

    [transmogrifier]
    pipeline =
        …
        print

    [print]
    blueprint = print

transmogrify.print has two advantages over the above approach:

* It adds the utility for you.
* It allows you to specify a keys parameter to print individual keys. If no key is provided, it prints the entire item.
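Using transmogrify.print, the two steps above collapse into one pipeline section. The exact blueprint name and the keys syntax below are assumptions inferred from the package name and the description of the keys parameter, not confirmed documentation:

```ini
[transmogrifier]
pipeline =
    ...
    print

[print]
; hypothetical usage: blueprint and option names are assumed, not documented here
blueprint = transmogrify.print
keys = _path title
```

With no keys option, the section would print each item in full.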